Re: In-placre persistance change of a relation

2024-05-28 Thread Jelte Fennema-Nio
On Fri, 24 May 2024 at 00:09, Kyotaro Horiguchi  wrote:
> Along with rebasing, I changed the interface of XLogFsyncFile() to
> return a boolean instead of an error message.

Two notes after looking at this quickly during the advanced patch
feedback session:

1. I would maybe split 0003 into two separate patches. One to make SET
UNLOGGED fast, which seems quite easy to do because no WAL is needed.
And then a follow up to make SET LOGGED fast, which does all the
XLOG_FPI stuff.
2. When wal_level = minital, still some WAL logging is needed. The
pages that were changed since the last still need to be made available
for crash recovery.




Re: In-placre persistance change of a relation

2024-05-24 Thread Kyotaro Horiguchi
Rebased.

Along with rebasing, I changed the interface of XLogFsyncFile() to
return a boolean instead of an error message.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From bed74e638643d7491bbd86fe640c33db1e16f0e5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Mon, 15 Jan 2024 15:57:53 +0900
Subject: [PATCH v33 1/3] Export wal_sync_method related functions

Export several functions related to wal_sync_method for use in
subsequent commits. Since PG_O_DIRECT cannot be used in those commits,
the new function XLogGetSyncBit() will mask PG_O_DIRECT.
---
 src/backend/access/transam/xlog.c | 73 +--
 src/include/access/xlog.h |  2 +
 2 files changed, 52 insertions(+), 23 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 330e058c5f..492ababd9c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8592,21 +8592,29 @@ assign_wal_sync_method(int new_wal_sync_method, void *extra)
 	}
 }
 
+/*
+ * Exported version of get_sync_bit()
+ *
+ * Do not expose PG_O_DIRECT for uses outside xlog.c.
+ */
+int
+XLogGetSyncBit(void)
+{
+	return get_sync_bit(wal_sync_method) & ~PG_O_DIRECT;
+}
+
 
 /*
- * Issue appropriate kind of fsync (if any) for an XLOG output file.
+ * Issue appropriate kind of fsync (if any) according to wal_sync_method.
  *
- * 'fd' is a file descriptor for the XLOG file to be fsync'd.
- * 'segno' is for error reporting purposes.
+ * 'fd' is a file descriptor for the file to be fsync'd.
  */
-void
-issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
+const char *
+XLogFsyncFile(int fd)
 {
-	char	   *msg = NULL;
+	const char *msg = NULL;
 	instr_time	start;
 
-	Assert(tli != 0);
-
 	/*
 	 * Quick exit if fsync is disabled or write() has already synced the WAL
 	 * file.
@@ -8614,7 +8622,7 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 	if (!enableFsync ||
 		wal_sync_method == WAL_SYNC_METHOD_OPEN ||
 		wal_sync_method == WAL_SYNC_METHOD_OPEN_DSYNC)
-		return;
+		return NULL;
 
 	/* Measure I/O timing to sync the WAL file */
 	if (track_wal_io_timing)
@@ -8651,19 +8659,6 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 			break;
 	}
 
-	/* PANIC if failed to fsync */
-	if (msg)
-	{
-		char		xlogfname[MAXFNAMELEN];
-		int			save_errno = errno;
-
-		XLogFileName(xlogfname, tli, segno, wal_segment_size);
-		errno = save_errno;
-		ereport(PANIC,
-(errcode_for_file_access(),
- errmsg(msg, xlogfname)));
-	}
-
 	pgstat_report_wait_end();
 
 	/*
@@ -8677,7 +8672,39 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 		INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_sync_time, end, start);
 	}
 
-	PendingWalStats.wal_sync++;
+	if (msg != NULL)
+		PendingWalStats.wal_sync++;
+
+	return msg;
+}
+
+/*
+ * Issue appropriate kind of fsync (if any) for an XLOG output file.
+ *
+ * 'fd' is a file descriptor for the XLOG file to be fsync'd.
+ * 'segno' is for error reporting purposes.
+ */
+void
+issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
+{
+	const char	*msg;
+	
+	Assert(tli != 0);
+
+	msg = XLogFsyncFile(fd);
+
+	/* PANIC if failed to fsync */
+	if (msg)
+	{
+		char		xlogfname[MAXFNAMELEN];
+		int			save_errno = errno;
+
+		XLogFileName(xlogfname, tli, segno, wal_segment_size);
+		errno = save_errno;
+		ereport(PANIC,
+(errcode_for_file_access(),
+ errmsg(msg, xlogfname)));
+	}
 }
 
 /*
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 1a1f11a943..badfe4abd6 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -217,6 +217,8 @@ extern void xlog_redo(struct XLogReaderState *record);
 extern void xlog_desc(StringInfo buf, struct XLogReaderState *record);
 extern const char *xlog_identify(uint8 info);
 
+extern int XLogGetSyncBit(void);
+extern const char *XLogFsyncFile(int fd);
 extern void issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli);
 
 extern bool RecoveryInProgress(void);
-- 
2.43.0

>From c200b85c1311f97bdae2ed20e2746c44d5c4aadb Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 31 Aug 2023 11:49:10 +0900
Subject: [PATCH v33 2/3] Introduce undo log implementation

This patch adds a simple implementation of UNDO log feature.
---
 src/backend/access/transam/Makefile|   1 +
 src/backend/access/transam/meson.build |   1 +
 src/backend/access/transam/rmgr.c  |   4 +-
 src/backend/access/transam/simpleundolog.c | 362 +
 src/backend/access/transam/twophase.c  |   3 +
 src/backend/access/transam/xact.c  |  24 ++
 src/backend/access/transam/xlog.c  |  42 ++-
 src/backend/catalog/storage.c  | 171 ++
 src/backend/storage/file/reinit.c  |  78 +
 src/backend/storage/smgr/smgr.c|   9 +
 src/bin/initdb/initdb.c|  17 +
 src/bin/pg_rewind/parsexlog.c  |   2 +-
 src/bin/pg_waldump/rmgrdesc.c  |   2 +-
 

Re: In-placre persistance change of a relation

2024-01-22 Thread Kyotaro Horiguchi
At Mon, 22 Jan 2024 15:36:31 +1100, Peter Smith  wrote 
in 
> 2024-01 Commitfest.
> 
> Hi, This patch has a CF status of "Needs Review" [1], but it seems
> there was a CFbot test failure last time it was run [2]. Please have a
> look and post an updated version if necessary.

Thanks! I have added the necessary includes to the header file this
patch adds. With this change, "make headerscheck" now passes. However,
when I run "make cpluspluscheck" in my environment, it generates a
large number of errors in other areas, but I didn't find one related
to this patch.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 9a2b6fbda882587c127d3e50bccf89508837d1a5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Mon, 15 Jan 2024 15:57:53 +0900
Subject: [PATCH v32 1/3] Export wal_sync_method related functions

Export several functions related to wal_sync_method for use in
subsequent commits. Since PG_O_DIRECT cannot be used in those commits,
the new function XLogGetSyncBit() will mask PG_O_DIRECT.
---
 src/backend/access/transam/xlog.c | 73 +--
 src/include/access/xlog.h |  2 +
 2 files changed, 52 insertions(+), 23 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 478377c4a2..c5f51849ee 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8403,21 +8403,29 @@ assign_wal_sync_method(int new_wal_sync_method, void *extra)
 	}
 }
 
+/*
+ * Exported version of get_sync_bit()
+ *
+ * Do not expose PG_O_DIRECT for uses outside xlog.c.
+ */
+int
+XLogGetSyncBit(void)
+{
+	return get_sync_bit(wal_sync_method) & ~PG_O_DIRECT;
+}
+
 
 /*
- * Issue appropriate kind of fsync (if any) for an XLOG output file.
+ * Issue appropriate kind of fsync (if any) according to wal_sync_method.
  *
- * 'fd' is a file descriptor for the XLOG file to be fsync'd.
- * 'segno' is for error reporting purposes.
+ * 'fd' is a file descriptor for the file to be fsync'd.
  */
-void
-issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
+const char *
+XLogFsyncFile(int fd)
 {
-	char	   *msg = NULL;
+	const char *msg = NULL;
 	instr_time	start;
 
-	Assert(tli != 0);
-
 	/*
 	 * Quick exit if fsync is disabled or write() has already synced the WAL
 	 * file.
@@ -8425,7 +8433,7 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 	if (!enableFsync ||
 		wal_sync_method == WAL_SYNC_METHOD_OPEN ||
 		wal_sync_method == WAL_SYNC_METHOD_OPEN_DSYNC)
-		return;
+		return NULL;
 
 	/* Measure I/O timing to sync the WAL file */
 	if (track_wal_io_timing)
@@ -8460,19 +8468,6 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 			break;
 	}
 
-	/* PANIC if failed to fsync */
-	if (msg)
-	{
-		char		xlogfname[MAXFNAMELEN];
-		int			save_errno = errno;
-
-		XLogFileName(xlogfname, tli, segno, wal_segment_size);
-		errno = save_errno;
-		ereport(PANIC,
-(errcode_for_file_access(),
- errmsg(msg, xlogfname)));
-	}
-
 	pgstat_report_wait_end();
 
 	/*
@@ -8486,7 +8481,39 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 		INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_sync_time, end, start);
 	}
 
-	PendingWalStats.wal_sync++;
+	if (msg != NULL)
+		PendingWalStats.wal_sync++;
+
+	return msg;
+}
+
+/*
+ * Issue appropriate kind of fsync (if any) for an XLOG output file.
+ *
+ * 'fd' is a file descriptor for the XLOG file to be fsync'd.
+ * 'segno' is for error reporting purposes.
+ */
+void
+issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
+{
+	const char	*msg;
+	
+	Assert(tli != 0);
+
+	msg = XLogFsyncFile(fd);
+
+	/* PANIC if failed to fsync */
+	if (msg)
+	{
+		char		xlogfname[MAXFNAMELEN];
+		int			save_errno = errno;
+
+		XLogFileName(xlogfname, tli, segno, wal_segment_size);
+		errno = save_errno;
+		ereport(PANIC,
+(errcode_for_file_access(),
+ errmsg(msg, xlogfname)));
+	}
 }
 
 /*
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 301c5fa11f..2a0d65b537 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -217,6 +217,8 @@ extern void xlog_redo(struct XLogReaderState *record);
 extern void xlog_desc(StringInfo buf, struct XLogReaderState *record);
 extern const char *xlog_identify(uint8 info);
 
+extern int XLogGetSyncBit(void);
+extern const char *XLogFsyncFile(int fd);
 extern void issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli);
 
 extern bool RecoveryInProgress(void);
-- 
2.39.3

>From c464013071dedc15b838e573ae828f150b3b60f7 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 31 Aug 2023 11:49:10 +0900
Subject: [PATCH v32 2/3] Introduce undo log implementation

This patch adds a simple implementation of UNDO log feature.
---
 src/backend/access/transam/Makefile|   1 +
 src/backend/access/transam/meson.build |   1 +
 src/backend/access/transam/rmgr.c  |   4 +-
 src/backend/access/transam/simpleundolog.c | 362 +
 src/backend/access/transam/twophase.c  |   

Re: In-placre persistance change of a relation

2024-01-21 Thread Peter Smith
2024-01 Commitfest.

Hi, This patch has a CF status of "Needs Review" [1], but it seems
there was a CFbot test failure last time it was run [2]. Please have a
look and post an updated version if necessary.

==
[1] https://commitfest.postgresql.org/46/3461/
[2] https://cirrus-ci.com/task/6050020441456640

Kind Regards,
Peter Smith.




Re: In-placre persistance change of a relation

2024-01-14 Thread Kyotaro Horiguchi
At Tue, 9 Jan 2024 15:07:20 +0530, vignesh C  wrote in 
> CFBot shows compilation issues at [1] with:

Thanks!

The reason for those errors was that I didn't consider Meson at the
time. Additionally, the signature change of reindex_index() caused the
build failure. I fixed both issues. While addressing these issues, I
modified the simpleundolog module to honor
wal_sync_method. Previously, the sync method for undo logs was
determined independently, separate from xlog.c. However, I'm still not
satisfied with the method for handling PG_O_DIRECT.

In this version, I have added the changes to enable the use of
wal_sync_method outside of xlog.c as the first part of the patchset.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 40749357f24adf89dc79db9b34f5c053288489bb Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Mon, 15 Jan 2024 15:57:53 +0900
Subject: [PATCH v31 1/3] Export wal_sync_method related functions

Export several functions related to wal_sync_method for use in
subsequent commits. Since PG_O_DIRECT cannot be used in those commits,
the new function XLogGetSyncBit() will mask PG_O_DIRECT.
---
 src/backend/access/transam/xlog.c | 73 +--
 src/include/access/xlog.h |  2 +
 2 files changed, 52 insertions(+), 23 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 478377c4a2..c5f51849ee 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8403,21 +8403,29 @@ assign_wal_sync_method(int new_wal_sync_method, void *extra)
 	}
 }
 
+/*
+ * Exported version of get_sync_bit()
+ *
+ * Do not expose PG_O_DIRECT for uses outside xlog.c.
+ */
+int
+XLogGetSyncBit(void)
+{
+	return get_sync_bit(wal_sync_method) & ~PG_O_DIRECT;
+}
+
 
 /*
- * Issue appropriate kind of fsync (if any) for an XLOG output file.
+ * Issue appropriate kind of fsync (if any) according to wal_sync_method.
  *
- * 'fd' is a file descriptor for the XLOG file to be fsync'd.
- * 'segno' is for error reporting purposes.
+ * 'fd' is a file descriptor for the file to be fsync'd.
  */
-void
-issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
+const char *
+XLogFsyncFile(int fd)
 {
-	char	   *msg = NULL;
+	const char *msg = NULL;
 	instr_time	start;
 
-	Assert(tli != 0);
-
 	/*
 	 * Quick exit if fsync is disabled or write() has already synced the WAL
 	 * file.
@@ -8425,7 +8433,7 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 	if (!enableFsync ||
 		wal_sync_method == WAL_SYNC_METHOD_OPEN ||
 		wal_sync_method == WAL_SYNC_METHOD_OPEN_DSYNC)
-		return;
+		return NULL;
 
 	/* Measure I/O timing to sync the WAL file */
 	if (track_wal_io_timing)
@@ -8460,19 +8468,6 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 			break;
 	}
 
-	/* PANIC if failed to fsync */
-	if (msg)
-	{
-		char		xlogfname[MAXFNAMELEN];
-		int			save_errno = errno;
-
-		XLogFileName(xlogfname, tli, segno, wal_segment_size);
-		errno = save_errno;
-		ereport(PANIC,
-(errcode_for_file_access(),
- errmsg(msg, xlogfname)));
-	}
-
 	pgstat_report_wait_end();
 
 	/*
@@ -8486,7 +8481,39 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 		INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_sync_time, end, start);
 	}
 
-	PendingWalStats.wal_sync++;
+	if (msg != NULL)
+		PendingWalStats.wal_sync++;
+
+	return msg;
+}
+
+/*
+ * Issue appropriate kind of fsync (if any) for an XLOG output file.
+ *
+ * 'fd' is a file descriptor for the XLOG file to be fsync'd.
+ * 'segno' is for error reporting purposes.
+ */
+void
+issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
+{
+	const char	*msg;
+	
+	Assert(tli != 0);
+
+	msg = XLogFsyncFile(fd);
+
+	/* PANIC if failed to fsync */
+	if (msg)
+	{
+		char		xlogfname[MAXFNAMELEN];
+		int			save_errno = errno;
+
+		XLogFileName(xlogfname, tli, segno, wal_segment_size);
+		errno = save_errno;
+		ereport(PANIC,
+(errcode_for_file_access(),
+ errmsg(msg, xlogfname)));
+	}
 }
 
 /*
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 301c5fa11f..2a0d65b537 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -217,6 +217,8 @@ extern void xlog_redo(struct XLogReaderState *record);
 extern void xlog_desc(StringInfo buf, struct XLogReaderState *record);
 extern const char *xlog_identify(uint8 info);
 
+extern int XLogGetSyncBit(void);
+extern const char *XLogFsyncFile(int fd);
 extern void issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli);
 
 extern bool RecoveryInProgress(void);
-- 
2.39.3

>From 5c120b94c407b971485ab52133399305e5e81a88 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 31 Aug 2023 11:49:10 +0900
Subject: [PATCH v31 2/3] Introduce undo log implementation

This patch adds a simple implementation of UNDO log feature.
---
 src/backend/access/transam/Makefile|   1 +
 src/backend/access/transam/meson.build |   1 +
 src/backend/access/transam/rmgr.c  |   4 +-
 

Re: In-placre persistance change of a relation

2024-01-09 Thread vignesh C
On Mon, 4 Sept 2023 at 16:59, Kyotaro Horiguchi  wrote:
>
> At Thu, 24 Aug 2023 11:22:32 +0900 (JST), Kyotaro Horiguchi 
>  wrote in
> > I could turn this into something like undo longs in a simple form, but
> > I'd rather not craft a general-purpose undo log system for this unelss
> > it's absolutely necessary.
>
> This is a patch for a basic undo log implementation. It looks like it
> works well for some orphan-files-after-a-crash and data-loss-on-reinit
> cases.  However, it is far from complete and likely has issues with
> crash-safety and the durability of undo log files (and memory leaks
> and performance and..).
>
> I'm posting this to move the discussion forward.
>
> (This doesn't contain the third file "ALTER TABLE ..ALL IN TABLESPACE" part.)

CFBot shows compilation issues at [1] with:
09:34:44.987] /usr/bin/ld:
src/backend/postgres_lib.a.p/access_transam_twophase.c.o: in function
`FinishPreparedTransaction':
[09:34:44.987] 
/tmp/cirrus-ci-build/build/../src/backend/access/transam/twophase.c:1569:
undefined reference to `AtEOXact_SimpleUndoLog'
[09:34:44.987] /usr/bin/ld:
src/backend/postgres_lib.a.p/access_transam_xact.c.o: in function
`CommitTransaction':
[09:34:44.987] 
/tmp/cirrus-ci-build/build/../src/backend/access/transam/xact.c:2372:
undefined reference to `AtEOXact_SimpleUndoLog'
[09:34:44.987] /usr/bin/ld:
src/backend/postgres_lib.a.p/access_transam_xact.c.o: in function
`AbortTransaction':
[09:34:44.987] 
/tmp/cirrus-ci-build/build/../src/backend/access/transam/xact.c:2878:
undefined reference to `AtEOXact_SimpleUndoLog'
[09:34:44.987] /usr/bin/ld:
src/backend/postgres_lib.a.p/access_transam_xact.c.o: in function
`CommitSubTransaction':
[09:34:44.987] 
/tmp/cirrus-ci-build/build/../src/backend/access/transam/xact.c:5016:
undefined reference to `AtEOXact_SimpleUndoLog'
[09:34:44.987] /usr/bin/ld:
src/backend/postgres_lib.a.p/access_transam_xact.c.o: in function
`AbortSubTransaction':
[09:34:44.987] 
/tmp/cirrus-ci-build/build/../src/backend/access/transam/xact.c:5197:
undefined reference to `AtEOXact_SimpleUndoLog'
[09:34:44.987] /usr/bin/ld:
src/backend/postgres_lib.a.p/access_transam_xact.c.o:/tmp/cirrus-ci-build/build/../src/backend/access/transam/xact.c:6080:
more undefined references to `AtEOXact_SimpleUndoLog' follow

[1] - https://cirrus-ci.com/task/5916232528953344

Regards,
Vignesh




Re: In-placre persistance change of a relation

2023-09-04 Thread Kyotaro Horiguchi
At Thu, 24 Aug 2023 11:22:32 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> I could turn this into something like undo longs in a simple form, but
> I'd rather not craft a general-purpose undo log system for this unelss
> it's absolutely necessary.

This is a patch for a basic undo log implementation. It looks like it
works well for some orphan-files-after-a-crash and data-loss-on-reinit
cases.  However, it is far from complete and likely has issues with
crash-safety and the durability of undo log files (and memory leaks
and performance and..).

I'm posting this to move the discussion forward.

(This doesn't contain the third file "ALTER TABLE ..ALL IN TABLESPACE" part.)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From da5696b9026fa916ae991f7da616062c5b19e705 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 31 Aug 2023 11:49:10 +0900
Subject: [PATCH v29 1/2] Introduce undo log implementation

This patch adds a simple implementation of UNDO log feature.
---
 src/backend/access/transam/Makefile|   1 +
 src/backend/access/transam/rmgr.c  |   4 +-
 src/backend/access/transam/simpleundolog.c | 343 +
 src/backend/access/transam/twophase.c  |   3 +
 src/backend/access/transam/xact.c  |  24 ++
 src/backend/access/transam/xlog.c  |  20 +-
 src/backend/catalog/storage.c  | 171 ++
 src/backend/storage/file/reinit.c  |  78 +
 src/backend/storage/smgr/smgr.c|   9 +
 src/bin/initdb/initdb.c|  17 +
 src/bin/pg_rewind/parsexlog.c  |   2 +-
 src/bin/pg_waldump/rmgrdesc.c  |   2 +-
 src/include/access/rmgr.h  |   2 +-
 src/include/access/rmgrlist.h  |  44 +--
 src/include/access/simpleundolog.h |  36 +++
 src/include/catalog/storage.h  |   3 +
 src/include/catalog/storage_ulog.h |  35 +++
 src/include/catalog/storage_xlog.h |   9 +
 src/include/storage/reinit.h   |   2 +
 src/include/storage/smgr.h |   1 +
 src/tools/pgindent/typedefs.list   |   6 +
 21 files changed, 780 insertions(+), 32 deletions(-)
 create mode 100644 src/backend/access/transam/simpleundolog.c
 create mode 100644 src/include/access/simpleundolog.h
 create mode 100644 src/include/catalog/storage_ulog.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db..531505cbbd 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -21,6 +21,7 @@ OBJS = \
 	rmgr.o \
 	slru.o \
 	subtrans.o \
+	simpleundolog.o \
 	timeline.o \
 	transam.o \
 	twophase.o \
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 7d67eda5f7..840cbdecd3 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -35,8 +35,8 @@
 #include "utils/relmapper.h"
 
 /* must be kept in sync with RmgrData definition in xlog_internal.h */
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode) \
-	{ name, redo, desc, identify, startup, cleanup, mask, decode },
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode,undo) \
+	{ name, redo, desc, identify, startup, cleanup, mask, decode},
 
 RmgrData	RmgrTable[RM_MAX_ID + 1] = {
 #include "access/rmgrlist.h"
diff --git a/src/backend/access/transam/simpleundolog.c b/src/backend/access/transam/simpleundolog.c
new file mode 100644
index 00..ebbacce298
--- /dev/null
+++ b/src/backend/access/transam/simpleundolog.c
@@ -0,0 +1,343 @@
+/*-
+ *
+ * simpleundolog.c
+ *		Simple implementation of PostgreSQL transaction-undo-log manager
+ *
+ * In this module, procedures required during a transaction abort are
+ * logged. Persisting this information becomes crucial, particularly for
+ * ensuring reliable post-processing during the restart following a transaction
+ * crash. At present, in this module, logging of information is performed by
+ * simply appending data to a created file.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/transam/clog.c
+ *
+ *-
+ */
+
+#include "postgres.h"
+
+#include "access/simpleundolog.h"
+#include "access/twophase_rmgr.h"
+#include "access/parallel.h"
+#include "access/xact.h"
+#include "catalog/storage_ulog.h"
+#include "storage/fd.h"
+
+#define ULOG_FILE_MAGIC 0x12345678
+
+typedef struct UndoLogFileHeader
+{
+	int32 magic;
+	bool  prepared;
+} UndoLogFileHeader;
+
+typedef struct UndoDescData
+{
+	const char *name;
+	void	(*rm_undo) (SimpleUndoLogRecord *record, bool prepared);
+} UndoDescData;
+
+/* must be kept in sync with RmgrData definition in xlog_internal.h */
+#define 

Re: In-placre persistance change of a relation

2023-08-23 Thread Kyotaro Horiguchi
Thank you for looking this!

At Mon, 14 Aug 2023 12:38:48 -0700, Nathan Bossart  
wrote in 
> I think there are some good ideas here.  I started to take a look at the
> patches, and I've attached a rebased version of the patch set.  Apologies
> if I am repeating any discussions from upthread.
> 
> First, I tested the time difference in ALTER TABLE SET UNLOGGED/LOGGED with
> the patch applied, and the results looked pretty impressive.
> 
>   before:
>   postgres=# alter table test set unlogged;
>   ALTER TABLE
>   Time: 5108.071 ms (00:05.108)
>   postgres=# alter table test set logged;
>   ALTER TABLE
>   Time: 6747.648 ms (00:06.748)
> 
>   after:
>   postgres=# alter table test set unlogged;
>   ALTER TABLE
>   Time: 25.609 ms
>   postgres=# alter table test set logged;
>   ALTER TABLE
>   Time: 1241.800 ms (00:01.242)

Thanks for confirmation. The difference between the both directions is
that making a table logged requires to emit WAL records for the entire
content.

> My first question is whether 0001 is a prerequisite to 0002.  I'm assuming
> it is, but the reason wasn't immediately obvious to me.  If it's just

In 0002, if a backend crashes after creating an init fork file but
before the associated commit, a lingering fork file could result in
data loss on the next startup. Thus, an utterly reliable file cleanup
mechanism is essential. 0001 also addresses the orphan storage files
issue arising from ALTER TABLE and similar commands.

> nice-to-have, perhaps we could simplify the patch set a bit.  I see that
> Heikki had some general concerns with the marker file approach [0], so
> perhaps it is at least worth brainstorming some alternatives if we _do_
> need it.

The rationale behind the file-based implementation is that any
leftover init fork file from a crash needs to be deleted before the
reinit(INIT) process kicks in, which happens irrelevantly to WAL,
before the start of crash recovery. I could implement it separately
from the reinit module, but I didn't since that results in almost a
duplication.

As commented in xlog.c, the purpose of the pre-recovery reinit CLEANUP
phase is to ensure hot standbys don't encounter erroneous unlogged
relations.  Based on that requirement, we need a mechanism to
guarantee that additional crucial operations are executed reliably at
the next startup post-crash, right before recovery kicks in (or reinit
CLEANUP). 0001 persists this data on a per-operation basis tightly
bonded to their target objects.

I could turn this into something like undo longs in a simple form, but
I'd rather not craft a general-purpose undo log system for this unelss
it's absolutely necessary.


> [0] https://postgr.es/m/9827ebd3-de2e-fd52-4091-a568387b1fc2%40iki.fi

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: In-placre persistance change of a relation

2023-08-14 Thread Nathan Bossart
I think there are some good ideas here.  I started to take a look at the
patches, and I've attached a rebased version of the patch set.  Apologies
if I am repeating any discussions from upthread.

First, I tested the time difference in ALTER TABLE SET UNLOGGED/LOGGED with
the patch applied, and the results looked pretty impressive.

before:
postgres=# alter table test set unlogged;
ALTER TABLE
Time: 5108.071 ms (00:05.108)
postgres=# alter table test set logged;
ALTER TABLE
Time: 6747.648 ms (00:06.748)

after:
postgres=# alter table test set unlogged;
ALTER TABLE
Time: 25.609 ms
postgres=# alter table test set logged;
ALTER TABLE
Time: 1241.800 ms (00:01.242)

My first question is whether 0001 is a prerequisite to 0002.  I'm assuming
it is, but the reason wasn't immediately obvious to me.  If it's just
nice-to-have, perhaps we could simplify the patch set a bit.  I see that
Heikki had some general concerns with the marker file approach [0], so
perhaps it is at least worth brainstorming some alternatives if we _do_
need it.

[0] https://postgr.es/m/9827ebd3-de2e-fd52-4091-a568387b1fc2%40iki.fi

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 5a4fb063a8b5e8a731373c4d06e51ad7fbeebebd Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 2 Mar 2023 17:25:12 +0900
Subject: [PATCH v29 1/3] Introduce storage mark files

In specific scenarios, certain operations followed by a crash-restart
may generate orphaned storage files that cannot be removed through
standard procedures or cause the server to fail during restart. This
commit introduces 'mark files' to convey information about the storage
file. In particular, an "UNCOMMITTED" mark file is implemented to
identify uncommitted files for removal during the subsequent startup.
---
 src/backend/access/rmgrdesc/smgrdesc.c|  37 +++
 src/backend/access/transam/README |  10 +
 src/backend/access/transam/xact.c |   7 +
 src/backend/access/transam/xlogrecovery.c |  18 ++
 src/backend/backup/basebackup.c   |   9 +-
 src/backend/catalog/storage.c | 268 +++-
 src/backend/storage/file/fd.c |   4 +-
 src/backend/storage/file/reinit.c | 287 +++---
 src/backend/storage/smgr/md.c |  94 ++-
 src/backend/storage/smgr/smgr.c   |  32 +++
 src/backend/storage/sync/sync.c   |  26 +-
 src/bin/pg_rewind/parsexlog.c |  16 ++
 src/common/relpath.c  |  47 ++--
 src/include/catalog/storage.h |   3 +
 src/include/catalog/storage_xlog.h|  35 ++-
 src/include/common/relpath.h  |   9 +-
 src/include/storage/fd.h  |   1 +
 src/include/storage/md.h  |   8 +-
 src/include/storage/reinit.h  |   8 +-
 src/include/storage/smgr.h|  17 ++
 src/test/recovery/t/013_crash_restart.pl  |  21 ++
 src/tools/pgindent/typedefs.list  |   6 +
 22 files changed, 834 insertions(+), 129 deletions(-)

diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index bd841b96e8..f8187385c4 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,37 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
 		 xlrec->blkno, xlrec->flags);
 		pfree(path);
 	}
+	else if (info == XLOG_SMGR_UNLINK)
+	{
+		xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+		char	   *path = relpathperm(xlrec->rlocator, xlrec->forkNum);
+
+		appendStringInfoString(buf, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_MARK)
+	{
+		xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+		char	   *path = GetRelationPath(xlrec->rlocator.dbOid,
+		   xlrec->rlocator.spcOid,
+		   xlrec->rlocator.relNumber,
+		   InvalidBackendId,
+		   xlrec->forkNum, xlrec->mark);
+		char	   *action = "";
+
+		switch (xlrec->action)
+		{
+			case XLOG_SMGR_MARK_CREATE:
+action = "CREATE";
+break;
+			case XLOG_SMGR_MARK_UNLINK:
+action = "DELETE";
+break;
+		}
+
+		appendStringInfo(buf, "%s %s", action, path);
+		pfree(path);
+	}
 }
 
 const char *
@@ -55,6 +86,12 @@ smgr_identify(uint8 info)
 		case XLOG_SMGR_TRUNCATE:
 			id = "TRUNCATE";
 			break;
+		case XLOG_SMGR_UNLINK:
+			id = "UNLINK";
+			break;
+		case XLOG_SMGR_MARK:
+			id = "MARK";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 22c8ae9755..e10f6af0e3 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -741,6 +741,16 @@ we must panic and abort recovery.  The DBA will have to manually clean up
 then restart recovery.  This is part of the reason for not writing a WAL
 entry until we've successfully done the original action.
 
+
+Smgr MARK files

Re: In-placre persistance change of a relation

2023-04-27 Thread Kyotaro Horiguchi
(I find the misspelled subject makes it difficult to find the thread..)

At Thu, 27 Apr 2023 14:47:41 +0200, Jakub Wartak 
 wrote in 
> On Tue, Apr 25, 2023 at 9:55 AM Kyotaro Horiguchi
>  wrote:
> >
> > Rebased.
> >
> > I fixed some code comments and commit messages. I fixed the wrong
> > arrangement of some changes among patches.  Most importantly, I fixed
> > the a bug based on a wrong assumption that init-fork is not resides on
> > shared buffers. Now smgrDoPendingCleanups drops buffer for a init-fork
> > to be removed.
> >
> > The new fourth patch is a temporary fix for recently added code, which
> > will soon be no longer needed.

This is no longer needed. Thank you, Thomas!

> Hi Kyotaro,
> 
> I've retested v28 of the patch with everything that came to my mind
> (basic tests, --enable-tap-tests, restarts/crashes along adding the
> data, checking if there were any files left over and I've checked for
> stuff that earlier was causing problems: GiST on geometry[PostGIS]).

Maybe it's fixed by dropping buffers.

> The only thing I've not tested this time were the performance runs
> done earlier. The patch passed all my very limited tests along with
> make check-world. Patch looks good to me on the surface from a
> usability point of view. I haven't looked at the code, so the patch
> might still need an in-depth review.

Thank you for conducting a thorough test. In this patchset, the first
one might be useful on its own and it is the most complex part. I'll
recheck it.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


Re: In-placre persistance change of a relation

2023-04-27 Thread Jakub Wartak
On Tue, Apr 25, 2023 at 9:55 AM Kyotaro Horiguchi
 wrote:
>
> Rebased.
>
> I fixed some code comments and commit messages. I fixed the wrong
> arrangement of some changes among patches.  Most importantly, I fixed
> the a bug based on a wrong assumption that init-fork is not resides on
> shared buffers. Now smgrDoPendingCleanups drops buffer for a init-fork
> to be removed.
>
> The new fourth patch is a temporary fix for recently added code, which
> will soon be no longer needed.
>

Hi Kyotaro,

I've retested v28 of the patch with everything that came to my mind
(basic tests, --enable-tap-tests, restarts/crashes along adding the
data, checking if there were any files left over and I've checked for
stuff that earlier was causing problems: GiST on geometry[PostGIS]).
The only thing I've not tested this time were the performance runs
done earlier. The patch passed all my very limited tests along with
make check-world. Patch looks good to me on the surface from a
usability point of view. I haven't looked at the code, so the patch
might still need an in-depth review.

Regards,
-Jakub Wartak.




Re: In-placre persistance change of a relation

2023-04-25 Thread Kyotaro Horiguchi
At Fri, 17 Mar 2023 15:16:34 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> Mmm. It took longer than I said, but this is the patch set that
> includes all three parts.
> 
> 1. "Mark files" to prevent orphan storage files for in-transaction
>   created relations after a crash.
> 
> 2. In-place persistence change: For ALTER TABLE SET LOGGED/UNLOGGED
>   with wal_level minimal, and ALTER TABLE SET UNLOGGED with other
>   wal_levels, the commands don't require a file copy for the relation
>   storage. ALTER TABLE SET LOGGED with non-minimal wal_level emits
>   bulk FPIs instead of a bunch of individual INSERTs.
> 
> 3. An extension to ALTER TABLE SET (UN)LOGGED that can handle all
>   tables in a tablespace at once.
> 
> 
> As a side note, I quickly go over the behavior of the mark files
> introduced by the first patch, particularly what happens when deletion
> fails.
> 
> (1) The mark file for MAIN fork (".u") corresponds to all forks,
> while the mark file for INIT fork ("_init.u") corresponds to
> INIT fork alone.
> 
> (2) The mark file is created just before the the corresponding storage
> file is made. This is always logged in the WAL.
> 
> (3) The mark file is deleted after removing the corresponding storage
> file during the commit and rollback. This action is logged in the
> WAL, too. If the deletion fails, an ERROR is output and the
> transaction aborts.
> 
> (4) If a crash leaves a mark file behind, server will try to delete it
> after successfully removing the corresponding storage file during
> the subsequent startup that runs a recovery. If deletion fails,
> server leaves the mark file alone with emitting a WARNING. (The
> same behavior for non-mark files.)
> 
> (5) If the deletion of the mark file fails, the leftover mark file
> prevents the creation of the corresponding storage file (causing
> an ERROR).  The leftover mark files don't result in the removal of
> the wrong files due to that behavior.
> 
> (6) The mark file for an INIT fork is created only when ALTER TABLE
> SET UNLOGGED is executed (not for CREATE UNLOGGED TABLE) to signal
> the crash-cleanup code to remove the INIT fork. (Otherwise the
> cleanup code removes the main fork instead. This is the main
> objective of introducing the mark files.)

Rebased.

I fixed some code comments and commit messages. I fixed the wrong
arrangement of some changes among patches.  Most importantly, I fixed
the a bug based on a wrong assumption that init-fork is not resides on
shared buffers. Now smgrDoPendingCleanups drops buffer for a init-fork
to be removed.

The new fourth patch is a temporary fix for recently added code, which
will soon be no longer needed.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From e645e4782c4a1562aa932f87f3932b4e54beac11 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 2 Mar 2023 17:25:12 +0900
Subject: [PATCH v28 1/4] Introduce storage mark files

In specific scenarios, certain operations followed by a crash-restart
may generate orphaned storage files that cannot be removed through
standard procedures or cause the server to fail during restart. This
commit introduces 'mark files' to convey information about the storage
file. In particular, an "UNCOMMITTED" mark file is implemented to
identify uncommitted files for removal during the subsequent startup.
---
 src/backend/access/rmgrdesc/smgrdesc.c|  37 +++
 src/backend/access/transam/README |  10 +
 src/backend/access/transam/xact.c |   7 +
 src/backend/access/transam/xlogrecovery.c |  18 ++
 src/backend/backup/basebackup.c   |   9 +-
 src/backend/catalog/storage.c | 268 +-
 src/backend/storage/file/fd.c |   4 +-
 src/backend/storage/file/reinit.c | 313 +++---
 src/backend/storage/smgr/md.c |  94 ++-
 src/backend/storage/smgr/smgr.c   |  32 +++
 src/backend/storage/sync/sync.c   |  26 +-
 src/bin/pg_rewind/parsexlog.c |  16 ++
 src/common/relpath.c  |  47 ++--
 src/include/catalog/storage.h |   3 +
 src/include/catalog/storage_xlog.h|  35 ++-
 src/include/common/relpath.h  |   9 +-
 src/include/storage/fd.h  |   1 +
 src/include/storage/md.h  |   8 +-
 src/include/storage/reinit.h  |   8 +-
 src/include/storage/smgr.h|  17 ++
 src/test/recovery/t/013_crash_restart.pl  |  21 ++
 src/tools/pgindent/typedefs.list  |   6 +
 22 files changed, 847 insertions(+), 142 deletions(-)

diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index bd841b96e8..f8187385c4 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,37 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
 		 xlrec->blkno, xlrec->flags);
 		pfree(path);
 	}
+	else if 

Re: In-placre persistance change of a relation

2023-03-17 Thread Kyotaro Horiguchi
At Fri, 03 Mar 2023 18:03:53 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> Correctly they are three parts. The attached patch is the first part -
> the storage mark files, which are used to identify storage files that
> have not been committed and should be removed during the next
> startup. This feature resolves the issue of orphaned storage files
> that may result from a crash occurring during the execution of a
> transaction involving the creation of a new table.
> 
> I'll post all of the three parts shortly.

Mmm. It took longer than I said, but this is the patch set that
includes all three parts.

1. "Mark files" to prevent orphan storage files for in-transaction
  created relations after a crash.

2. In-place persistence change: For ALTER TABLE SET LOGGED/UNLOGGED
  with wal_level minimal, and ALTER TABLE SET UNLOGGED with other
  wal_levels, the commands don't require a file copy for the relation
  storage. ALTER TABLE SET LOGGED with non-minimal wal_level emits
  bulk FPIs instead of a bunch of individual INSERTs.

3. An extension to ALTER TABLE SET (UN)LOGGED that can handle all
  tables in a tablespace at once.


As a side note, I quickly go over the behavior of the mark files
introduced by the first patch, particularly what happens when deletion
fails.

(1) The mark file for MAIN fork (".u") corresponds to all forks,
while the mark file for INIT fork ("_init.u") corresponds to
INIT fork alone.

(2) The mark file is created just before the the corresponding storage
file is made. This is always logged in the WAL.

(3) The mark file is deleted after removing the corresponding storage
file during the commit and rollback. This action is logged in the
WAL, too. If the deletion fails, an ERROR is output and the
transaction aborts.

(4) If a crash leaves a mark file behind, server will try to delete it
after successfully removing the corresponding storage file during
the subsequent startup that runs a recovery. If deletion fails,
server leaves the mark file alone with emitting a WARNING. (The
same behavior for non-mark files.)

(5) If the deletion of the mark file fails, the leftover mark file
prevents the creation of the corresponding storage file (causing
an ERROR).  The leftover mark files don't result in the removal of
the wrong files due to that behavior.

(6) The mark file for an INIT fork is created only when ALTER TABLE
SET UNLOGGED is executed (not for CREATE UNLOGGED TABLE) to signal
the crash-cleanup code to remove the INIT fork. (Otherwise the
cleanup code removes the main fork instead. This is the main
objective of introducing the mark files.)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From ba4b8140fe582ceec4ea810621e17d6a1fe9c408 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 2 Mar 2023 17:25:12 +0900
Subject: [PATCH v27 1/3] Storage mark files

In certain situations, specific operations followed by a crash-restart
can result in orphaned storage files.  These files cannot be removed
through standard methods.  To address this issue, this commit
implements 'mark files' that conveys information about the storage
file. Specifically, the "UNCOMMITED" mark file is introduced to denote
files that have not been committed and should be removed during the
next startup.
---
 src/backend/access/rmgrdesc/smgrdesc.c|  37 +++
 src/backend/access/transam/README |  10 +
 src/backend/access/transam/xact.c |   7 +
 src/backend/access/transam/xlogrecovery.c |  18 ++
 src/backend/backup/basebackup.c   |   9 +-
 src/backend/catalog/storage.c | 270 ++-
 src/backend/storage/file/fd.c |   4 +-
 src/backend/storage/file/reinit.c | 313 +++---
 src/backend/storage/smgr/md.c |  95 ++-
 src/backend/storage/smgr/smgr.c   |  32 +++
 src/backend/storage/sync/sync.c   |  26 +-
 src/bin/pg_rewind/parsexlog.c |  16 ++
 src/common/relpath.c  |  47 ++--
 src/include/catalog/storage.h |   3 +
 src/include/catalog/storage_xlog.h|  35 ++-
 src/include/common/relpath.h  |   9 +-
 src/include/storage/fd.h  |   1 +
 src/include/storage/md.h  |   8 +-
 src/include/storage/reinit.h  |   8 +-
 src/include/storage/smgr.h|  17 ++
 src/test/recovery/t/013_crash_restart.pl  |  21 ++
 src/tools/pgindent/typedefs.list  |   6 +
 22 files changed, 848 insertions(+), 144 deletions(-)

diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index bd841b96e8..f8187385c4 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,37 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
 		 xlrec->blkno, xlrec->flags);
 		pfree(path);
 	}
+	else if (info == XLOG_SMGR_UNLINK)
+	{
+		xl_smgr_unlink *xlrec = 

Re: In-placre persistance change of a relation

2023-03-03 Thread Kyotaro Horiguchi
At Wed, 1 Mar 2023 14:56:25 -0500, "Gregory Stark (as CFM)" 
 wrote in 
> On Mon, 6 Feb 2023 at 23:48, Kyotaro Horiguchi  
> wrote:
> >
> > Thank you for the comment!
> >
> > At Fri, 3 Feb 2023 08:42:52 +0100, Heikki Linnakangas  
> > wrote in
> > > I want to call out this part of this patch:
> 
> Looks like this patch has received some solid feedback from Heikki and
> you have a path forward. It's not currently building in the build farm
> either.
> 
> I'll set the patch to Waiting on Author for now.

Correctly they are three parts.

Correctly they are three parts. The attached patch is the first part -
the storage mark files, which are used to identify storage files that
have not been committed and should be removed during the next
startup. This feature resolves the issue of orphaned storage files
that may result from a crash occurring during the execution of a
transaction involving the creation of a new table.

I'll post all of the three parts shortly.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 1665e3428b9d777989864ea302eef8368a739e7e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 2 Mar 2023 17:25:12 +0900
Subject: [PATCH v26] Storage mark files

In certain situations, specific operations followed by a crash-restart
can result in orphaned storage files.  These files cannot be removed
through standard methods.  To address this issue, this commit
implements 'mark files' that conveys information about the storage
file. Specifically, the "UNCOMMITED" mark file is introduced to denote
files that have not been committed and should be removed during the
next startup.
---
 src/backend/access/rmgrdesc/smgrdesc.c|  37 +++
 src/backend/access/transam/README |  10 +
 src/backend/access/transam/xact.c |   7 +
 src/backend/access/transam/xlogrecovery.c |  18 ++
 src/backend/backup/basebackup.c   |   9 +-
 src/backend/catalog/storage.c | 270 ++-
 src/backend/storage/file/fd.c |   4 +-
 src/backend/storage/file/reinit.c | 313 +++---
 src/backend/storage/smgr/md.c |  95 ++-
 src/backend/storage/smgr/smgr.c   |  32 +++
 src/backend/storage/sync/sync.c   |  21 +-
 src/bin/pg_rewind/parsexlog.c |  16 ++
 src/common/relpath.c  |  47 ++--
 src/include/catalog/storage.h |   3 +
 src/include/catalog/storage_xlog.h|  35 ++-
 src/include/common/relpath.h  |   9 +-
 src/include/storage/fd.h  |   1 +
 src/include/storage/md.h  |   8 +-
 src/include/storage/reinit.h  |   8 +-
 src/include/storage/smgr.h|  17 ++
 src/test/recovery/t/013_crash_restart.pl  |  21 ++
 src/tools/pgindent/typedefs.list  |   6 +
 22 files changed, 843 insertions(+), 144 deletions(-)

diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index bd841b96e8..f8187385c4 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,37 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
 		 xlrec->blkno, xlrec->flags);
 		pfree(path);
 	}
+	else if (info == XLOG_SMGR_UNLINK)
+	{
+		xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+		char	   *path = relpathperm(xlrec->rlocator, xlrec->forkNum);
+
+		appendStringInfoString(buf, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_MARK)
+	{
+		xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+		char	   *path = GetRelationPath(xlrec->rlocator.dbOid,
+		   xlrec->rlocator.spcOid,
+		   xlrec->rlocator.relNumber,
+		   InvalidBackendId,
+		   xlrec->forkNum, xlrec->mark);
+		char	   *action = "";
+
+		switch (xlrec->action)
+		{
+			case XLOG_SMGR_MARK_CREATE:
+action = "CREATE";
+break;
+			case XLOG_SMGR_MARK_UNLINK:
+action = "DELETE";
+break;
+		}
+
+		appendStringInfo(buf, "%s %s", action, path);
+		pfree(path);
+	}
 }
 
 const char *
@@ -55,6 +86,12 @@ smgr_identify(uint8 info)
 		case XLOG_SMGR_TRUNCATE:
 			id = "TRUNCATE";
 			break;
+		case XLOG_SMGR_UNLINK:
+			id = "UNLINK";
+			break;
+		case XLOG_SMGR_MARK:
+			id = "MARK";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 22c8ae9755..bf83d19abd 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -741,6 +741,16 @@ we must panic and abort recovery.  The DBA will have to manually clean up
 then restart recovery.  This is part of the reason for not writing a WAL
 entry until we've successfully done the original action.
 
+
+Smgr MARK files
+
+
+An smgr mark file is an empty file that is created alongside a new
+relation storage file to signal that the storage file must be cleaned
+up during recovery.  In contrast to the four actions above, failing to
+remove 

Re: In-placre persistance change of a relation

2023-03-01 Thread Gregory Stark (as CFM)
On Mon, 6 Feb 2023 at 23:48, Kyotaro Horiguchi  wrote:
>
> Thank you for the comment!
>
> At Fri, 3 Feb 2023 08:42:52 +0100, Heikki Linnakangas  wrote 
> in
> > I want to call out this part of this patch:

Looks like this patch has received some solid feedback from Heikki and
you have a path forward. It's not currently building in the build farm
either.

I'll set the patch to Waiting on Author for now.


-- 
Gregory Stark
As Commitfest Manager




Re: In-placre persistance change of a relation

2023-02-06 Thread Kyotaro Horiguchi
Thank you for the comment!

At Fri, 3 Feb 2023 08:42:52 +0100, Heikki Linnakangas  wrote 
in 
> I want to call out this part of this patch:
> 
> > Also this allows for the cleanup of files left behind in the crash of
> > the transaction that created it.
> 
> This is interesting to a lot wider audience than ALTER TABLE SET
> LOGGED/UNLOGGED. It also adds most of the complexity, with the new
> marker files. Can you please split the first patch into two:
> 
> 1. Cleanup of newly created relations on crash
> 
> 2. ALTER TABLE SET LOGGED/UNLOGGED changes
> 
> Then we can review the first part independently.

Ah, indeed.  I'll do that.

> Regarding the first part, I'm not sure the marker files are the best
> approach to implement it. You need to create an extra file for every
> relation, just to delete it at commit. It feels a bit silly, but maybe

Agreed. (But I didn't come up with better idea..)

> it's OK in practice. The undo log patch set solved this problem with
> the undo log, but it looks like that patch set isn't going
> anywhere. Maybe invent a very lightweight version of the undo log for
> this?

I didn't thought on that line. Yes, indeed the marker files are a kind
of undo log.

Anyway, I'll split the current patch to two parts as suggested.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: In-placre persistance change of a relation

2023-02-02 Thread Heikki Linnakangas

I want to call out this part of this patch:


Also this allows for the cleanup of files left behind in the crash of
the transaction that created it.


This is interesting to a lot wider audience than ALTER TABLE SET 
LOGGED/UNLOGGED. It also adds most of the complexity, with the new 
marker files. Can you please split the first patch into two:


1. Cleanup of newly created relations on crash

2. ALTER TABLE SET LOGGED/UNLOGGED changes

Then we can review the first part independently.

Regarding the first part, I'm not sure the marker files are the best 
approach to implement it. You need to create an extra file for every 
relation, just to delete it at commit. It feels a bit silly, but maybe 
it's OK in practice. The undo log patch set solved this problem with the 
undo log, but it looks like that patch set isn't going anywhere. Maybe 
invent a very lightweight version of the undo log for this?


- Heikki





Re: In-placre persistance change of a relation

2022-11-18 Thread Kyotaro Horiguchi
At Tue, 08 Nov 2022 11:33:53 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> Indeed, thanks!  I'll do that in a few days.

Got too late, but rebased.. The contents of the two patches in the
last version was a bit shuffled but they are fixed.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From b8bcd1e9dc7c52c277de7f13bc21900efd2030dc Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Tue, 19 Jul 2022 13:23:01 +0900
Subject: [PATCH v25 1/2] In-place table persistence change

Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, currently it runs heap rewrite which causes large amount of
file I/O.  This patch makes the command run without heap rewrite.
Addition to that, SET LOGGED while wal_level > minimal emits WAL using
XLOG_FPI instead of massive number of HEAP_INSERT's, which should be
smaller.

Also this allows for the cleanup of files left behind in the crash of
the transaction that created it.
---
 src/backend/access/rmgrdesc/smgrdesc.c|  49 ++
 src/backend/access/transam/README |  10 +
 src/backend/access/transam/xact.c |   7 +
 src/backend/access/transam/xlogrecovery.c |  18 +
 src/backend/backup/basebackup.c   |   9 +-
 src/backend/catalog/storage.c | 559 +-
 src/backend/commands/tablecmds.c  | 267 +--
 src/backend/storage/buffer/bufmgr.c   |  87 
 src/backend/storage/file/fd.c |   4 +-
 src/backend/storage/file/reinit.c | 340 +
 src/backend/storage/smgr/md.c |  95 +++-
 src/backend/storage/smgr/smgr.c   |  32 ++
 src/backend/storage/sync/sync.c   |  21 +-
 src/bin/pg_rewind/parsexlog.c |  22 +
 src/bin/pg_rewind/pg_rewind.c |   1 -
 src/common/relpath.c  |  47 +-
 src/include/catalog/storage.h |   3 +
 src/include/catalog/storage_xlog.h|  43 +-
 src/include/common/relpath.h  |   9 +-
 src/include/storage/bufmgr.h  |   2 +
 src/include/storage/fd.h  |   1 +
 src/include/storage/md.h  |   8 +-
 src/include/storage/reinit.h  |   8 +-
 src/include/storage/smgr.h|  17 +
 src/tools/pgindent/typedefs.list  |   7 +
 25 files changed, 1483 insertions(+), 183 deletions(-)

diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index e0ee8a078a..2f92c06f70 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,46 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
 		 xlrec->blkno, xlrec->flags);
 		pfree(path);
 	}
+	else if (info == XLOG_SMGR_UNLINK)
+	{
+		xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+		char	   *path = relpathperm(xlrec->rlocator, xlrec->forkNum);
+
+		appendStringInfoString(buf, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_MARK)
+	{
+		xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+		char	   *path = GetRelationPath(xlrec->rlocator.dbOid,
+		   xlrec->rlocator.spcOid,
+		   xlrec->rlocator.relNumber,
+		   InvalidBackendId,
+		   xlrec->forkNum, xlrec->mark);
+		char	   *action = "";
+
+		switch (xlrec->action)
+		{
+			case XLOG_SMGR_MARK_CREATE:
+action = "CREATE";
+break;
+			case XLOG_SMGR_MARK_UNLINK:
+action = "DELETE";
+break;
+		}
+
+		appendStringInfo(buf, "%s %s", action, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_BUFPERSISTENCE)
+	{
+		xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+		char	   *path = relpathperm(xlrec->rlocator, MAIN_FORKNUM);
+
+		appendStringInfoString(buf, path);
+		appendStringInfo(buf, " persistence %d", xlrec->persistence);
+		pfree(path);
+	}
 }
 
 const char *
@@ -55,6 +95,15 @@ smgr_identify(uint8 info)
 		case XLOG_SMGR_TRUNCATE:
 			id = "TRUNCATE";
 			break;
+		case XLOG_SMGR_UNLINK:
+			id = "UNLINK";
+			break;
+		case XLOG_SMGR_MARK:
+			id = "MARK";
+			break;
+		case XLOG_SMGR_BUFPERSISTENCE:
+			id = "BUFPERSISTENCE";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 22c8ae9755..617a63e2c5 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -741,6 +741,16 @@ we must panic and abort recovery.  The DBA will have to manually clean up
 then restart recovery.  This is part of the reason for not writing a WAL
 entry until we've successfully done the original action.
 
+
+The Smgr MARK files
+
+
+An smgr mark file is an empty file that is created when a new relation
+storage file is created to signal that the storage file needs to be
+cleaned up at recovery time.  In contrast to the four actions above,
+failure to remove smgr mark files will lead to data loss, in which
+case the server will shut down.
+
 
 Skipping WAL for New RelFileLocator
 

Re: In-placre persistance change of a relation

2022-11-07 Thread Kyotaro Horiguchi
At Fri, 4 Nov 2022 09:32:52 +0900, Ian Lawrence Barwick  
wrote in 
> cfbot reports the patch no longer applies. As CommitFest 2022-11 is
> currently underway, this would be an excellent time to update the patch.

Indeed, thanks!  I'll do that in a few days.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: In-placre persistance change of a relation

2022-11-03 Thread Ian Lawrence Barwick
2022年9月28日(水) 17:21 Kyotaro Horiguchi :
>
> Just rebased.

Hi

cfbot reports the patch no longer applies. As CommitFest 2022-11 is
currently underway, this would be an excellent time to update the patch.

Thanks

Ian Barwick




Re: In-placre persistance change of a relation

2022-09-28 Thread Kyotaro Horiguchi
Just rebased.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 8c0d5bd7b519149059d1b2b86a93ffe509e09b9b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Tue, 19 Jul 2022 13:23:01 +0900
Subject: [PATCH v24 1/2] In-place table persistence change

Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, currently it runs heap rewrite which causes large amount of
file I/O.  This patch makes the command run without heap rewrite.
Addition to that, SET LOGGED while wal_level > minimal emits WAL using
XLOG_FPI instead of massive number of HEAP_INSERT's, which should be
smaller.

Also this allows for the cleanup of files left behind in the crash of
the transaction that created it.
---
 src/backend/access/rmgrdesc/smgrdesc.c|  49 ++
 src/backend/access/transam/README |  10 +
 src/backend/access/transam/xact.c |   7 +
 src/backend/access/transam/xlogrecovery.c |  18 +
 src/backend/backup/basebackup.c   |   9 +-
 src/backend/catalog/storage.c | 561 +-
 src/backend/commands/tablecmds.c  | 267 --
 src/backend/storage/buffer/bufmgr.c   |  85 
 src/backend/storage/file/fd.c |   4 +-
 src/backend/storage/file/reinit.c | 313 
 src/backend/storage/smgr/md.c |  95 +++-
 src/backend/storage/smgr/smgr.c   |  32 ++
 src/backend/storage/sync/sync.c   |  21 +-
 src/bin/pg_rewind/parsexlog.c |  22 +
 src/bin/pg_rewind/pg_rewind.c |   1 -
 src/common/relpath.c  |  47 +-
 src/include/catalog/storage.h |   3 +
 src/include/catalog/storage_xlog.h|  43 +-
 src/include/common/relpath.h  |   9 +-
 src/include/storage/bufmgr.h  |   2 +
 src/include/storage/fd.h  |   1 +
 src/include/storage/md.h  |   8 +-
 src/include/storage/reinit.h  |   8 +-
 src/include/storage/smgr.h|  17 +
 src/tools/pgindent/typedefs.list  |   7 +
 25 files changed, 1471 insertions(+), 168 deletions(-)

diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index e0ee8a078a..2f92c06f70 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,46 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
 		 xlrec->blkno, xlrec->flags);
 		pfree(path);
 	}
+	else if (info == XLOG_SMGR_UNLINK)
+	{
+		xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+		char	   *path = relpathperm(xlrec->rlocator, xlrec->forkNum);
+
+		appendStringInfoString(buf, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_MARK)
+	{
+		xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+		char	   *path = GetRelationPath(xlrec->rlocator.dbOid,
+		   xlrec->rlocator.spcOid,
+		   xlrec->rlocator.relNumber,
+		   InvalidBackendId,
+		   xlrec->forkNum, xlrec->mark);
+		char	   *action = "";
+
+		switch (xlrec->action)
+		{
+			case XLOG_SMGR_MARK_CREATE:
+action = "CREATE";
+break;
+			case XLOG_SMGR_MARK_UNLINK:
+action = "DELETE";
+break;
+		}
+
+		appendStringInfo(buf, "%s %s", action, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_BUFPERSISTENCE)
+	{
+		xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+		char	   *path = relpathperm(xlrec->rlocator, MAIN_FORKNUM);
+
+		appendStringInfoString(buf, path);
+		appendStringInfo(buf, " persistence %d", xlrec->persistence);
+		pfree(path);
+	}
 }
 
 const char *
@@ -55,6 +95,15 @@ smgr_identify(uint8 info)
 		case XLOG_SMGR_TRUNCATE:
 			id = "TRUNCATE";
 			break;
+		case XLOG_SMGR_UNLINK:
+			id = "UNLINK";
+			break;
+		case XLOG_SMGR_MARK:
+			id = "MARK";
+			break;
+		case XLOG_SMGR_BUFPERSISTENCE:
+			id = "BUFPERSISTENCE";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 91c2578f7a..37527e16ca 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -725,6 +725,16 @@ we must panic and abort recovery.  The DBA will have to manually clean up
 then restart recovery.  This is part of the reason for not writing a WAL
 entry until we've successfully done the original action.
 
+
+The Smgr MARK files
+
+
+An smgr mark file is an empty file that is created when a new relation
+storage file is created to signal that the storage file needs to be
+cleaned up at recovery time.  In contrast to the four actions above,
+failure to remove smgr mark files will lead to data loss, in which
+case the server will shut down.
+
 
 Skipping WAL for New RelFileLocator
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 2bb975943c..03e4bcec34 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2213,6 

Re: In-placre persistance change of a relation

2022-07-18 Thread Kyotaro Horiguchi
(Mmm. I haven't noticed an annoying misspelling in the subejct X( )

> Thank you for checking that!  It got a wider attack by b0a55e4329
> (RelFileNumber). The commit message suggests "relfilenode" as files

Then, now I stepped on my own foot. Rebased also on nodefuncs
autogeneration.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From cdd0282eb373ce79b8495b9a1160fb8c5122315e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Tue, 19 Jul 2022 13:23:01 +0900
Subject: [PATCH v23 1/2] In-place table persistence change

Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, currently it runs heap rewrite which causes large amount of
file I/O.  This patch makes the command run without heap rewrite.
Addition to that, SET LOGGED while wal_level > minimal emits WAL using
XLOG_FPI instead of massive number of HEAP_INSERT's, which should be
smaller.

Also this allows for the cleanup of files left behind in the crash of
the transaction that created it.
---
 src/backend/access/rmgrdesc/smgrdesc.c|  49 ++
 src/backend/access/transam/README |  10 +
 src/backend/access/transam/xact.c |   7 +
 src/backend/access/transam/xlogrecovery.c |  18 +
 src/backend/catalog/storage.c | 561 +-
 src/backend/commands/tablecmds.c  | 267 --
 src/backend/replication/basebackup.c  |   9 +-
 src/backend/storage/buffer/bufmgr.c   |  85 
 src/backend/storage/file/fd.c |   4 +-
 src/backend/storage/file/reinit.c | 340 +
 src/backend/storage/smgr/md.c |  95 +++-
 src/backend/storage/smgr/smgr.c   |  32 ++
 src/backend/storage/sync/sync.c   |  21 +-
 src/bin/pg_rewind/parsexlog.c |  22 +
 src/bin/pg_rewind/pg_rewind.c |   1 -
 src/common/relpath.c  |  47 +-
 src/include/catalog/storage.h |   3 +
 src/include/catalog/storage_xlog.h|  43 +-
 src/include/common/relpath.h  |   9 +-
 src/include/storage/bufmgr.h  |   2 +
 src/include/storage/fd.h  |   1 +
 src/include/storage/md.h  |   8 +-
 src/include/storage/reinit.h  |  10 +-
 src/include/storage/smgr.h|  17 +
 src/tools/pgindent/typedefs.list  |   7 +
 25 files changed, 1485 insertions(+), 183 deletions(-)

diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index e0ee8a078a..2f92c06f70 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,46 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
 		 xlrec->blkno, xlrec->flags);
 		pfree(path);
 	}
+	else if (info == XLOG_SMGR_UNLINK)
+	{
+		xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+		char	   *path = relpathperm(xlrec->rlocator, xlrec->forkNum);
+
+		appendStringInfoString(buf, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_MARK)
+	{
+		xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+		char	   *path = GetRelationPath(xlrec->rlocator.dbOid,
+		   xlrec->rlocator.spcOid,
+		   xlrec->rlocator.relNumber,
+		   InvalidBackendId,
+		   xlrec->forkNum, xlrec->mark);
+		char	   *action = "";
+
+		switch (xlrec->action)
+		{
+			case XLOG_SMGR_MARK_CREATE:
+action = "CREATE";
+break;
+			case XLOG_SMGR_MARK_UNLINK:
+action = "DELETE";
+break;
+		}
+
+		appendStringInfo(buf, "%s %s", action, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_BUFPERSISTENCE)
+	{
+		xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+		char	   *path = relpathperm(xlrec->rlocator, MAIN_FORKNUM);
+
+		appendStringInfoString(buf, path);
+		appendStringInfo(buf, " persistence %d", xlrec->persistence);
+		pfree(path);
+	}
 }
 
 const char *
@@ -55,6 +95,15 @@ smgr_identify(uint8 info)
 		case XLOG_SMGR_TRUNCATE:
 			id = "TRUNCATE";
 			break;
+		case XLOG_SMGR_UNLINK:
+			id = "UNLINK";
+			break;
+		case XLOG_SMGR_MARK:
+			id = "MARK";
+			break;
+		case XLOG_SMGR_BUFPERSISTENCE:
+			id = "BUFPERSISTENCE";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 734c39a4d0..f08bd7f42d 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -724,6 +724,16 @@ we must panic and abort recovery.  The DBA will have to manually clean up
 then restart recovery.  This is part of the reason for not writing a WAL
 entry until we've successfully done the original action.
 
+
+The Smgr MARK files
+
+
+An smgr mark file is an empty file that is created when a new relation
+storage file is created to signal that the storage file needs to be
+cleaned up at recovery time.  In contrast to the four actions above,
+failure to remove smgr mark files will lead to data loss, in which
+case the server will shut down.
+
 
 

Re: In-placre persistance change of a relation

2022-07-07 Thread Kyotaro Horiguchi
At Wed, 6 Jul 2022 08:44:18 -0700, Jacob Champion  
wrote in 
> On Thu, Mar 31, 2022 at 2:36 AM Kyotaro Horiguchi
>  wrote:
> >
> > At Thu, 31 Mar 2022 18:33:18 +0900 (JST), Kyotaro Horiguchi 
> >  wrote in
> > > I don't think this can be commited to 15. So I post the fixed version
> > > then move this to the next CF.
> >
> > Then done. Thanks!
> 
> Hello! This patchset will need to be rebased over latest -- looks like
> b74e94dc27f (Rethink PROCSIGNAL_BARRIER_SMGRRELEASE) and 5c279a6d350
> (Custom WAL Resource Managers) are interfering.

Thank you for checking that!  It got a wider attack by b0a55e4329
(RelFileNumber). The commit message suggests "relfilenode" as files
should be replaced with "relation storage/file" so I did that in
ResetUnloggedRelationsInDbspaceDir.

This patch said that:

>* INIT forks are never loaded to shared buffer so no point in
>* dropping buffers for such files.

But actually some *buildempty() functions use ReadBufferExtended() for
INIT_FORK. So that's wrong. So, I did that but... I don't like that.
Or I don't like that some AMs leave buffers for INIT fork after. But I
feel I'm misunderstanding here since I don't understand how the INIT
fork can work as expected after a crash that happens before the next
checkpoint flushes the buffers.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 864ce55a05e67d03119462efa1820905c222e9d5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v22 1/2] In-place table persistence change

Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, currently it runs heap rewrite which causes large amount of
file I/O.  This patch makes the command run without heap rewrite.
Addition to that, SET LOGGED while wal_level > minimal emits WAL using
XLOG_FPI instead of massive number of HEAP_INSERT's, which should be
smaller.

Also this allows for the cleanup of files left behind in the crash of
the transaction that created it.
---
 src/backend/access/rmgrdesc/smgrdesc.c|  49 ++
 src/backend/access/transam/README |  10 +
 src/backend/access/transam/xact.c |   7 +
 src/backend/access/transam/xlogrecovery.c |  18 +
 src/backend/catalog/storage.c | 561 +-
 src/backend/commands/tablecmds.c  | 267 --
 src/backend/replication/basebackup.c  |   9 +-
 src/backend/storage/buffer/bufmgr.c   |  85 
 src/backend/storage/file/fd.c |   4 +-
 src/backend/storage/file/reinit.c | 340 +
 src/backend/storage/smgr/md.c |  95 +++-
 src/backend/storage/smgr/smgr.c   |  32 ++
 src/backend/storage/sync/sync.c   |  21 +-
 src/bin/pg_rewind/parsexlog.c |  22 +
 src/bin/pg_rewind/pg_rewind.c |   1 -
 src/common/relpath.c  |  47 +-
 src/include/catalog/storage.h |   3 +
 src/include/catalog/storage_xlog.h|  43 +-
 src/include/common/relpath.h  |   9 +-
 src/include/storage/bufmgr.h  |   2 +
 src/include/storage/fd.h  |   1 +
 src/include/storage/md.h  |   8 +-
 src/include/storage/reinit.h  |  10 +-
 src/include/storage/smgr.h|  17 +
 src/tools/pgindent/typedefs.list  |   7 +
 25 files changed, 1485 insertions(+), 183 deletions(-)

diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index e0ee8a078a..2f92c06f70 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,46 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
 		 xlrec->blkno, xlrec->flags);
 		pfree(path);
 	}
+	else if (info == XLOG_SMGR_UNLINK)
+	{
+		xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+		char	   *path = relpathperm(xlrec->rlocator, xlrec->forkNum);
+
+		appendStringInfoString(buf, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_MARK)
+	{
+		xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+		char	   *path = GetRelationPath(xlrec->rlocator.dbOid,
+		   xlrec->rlocator.spcOid,
+		   xlrec->rlocator.relNumber,
+		   InvalidBackendId,
+		   xlrec->forkNum, xlrec->mark);
+		char	   *action = "";
+
+		switch (xlrec->action)
+		{
+			case XLOG_SMGR_MARK_CREATE:
+action = "CREATE";
+break;
+			case XLOG_SMGR_MARK_UNLINK:
+action = "DELETE";
+break;
+		}
+
+		appendStringInfo(buf, "%s %s", action, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_BUFPERSISTENCE)
+	{
+		xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+		char	   *path = relpathperm(xlrec->rlocator, MAIN_FORKNUM);
+
+		appendStringInfoString(buf, path);
+		appendStringInfo(buf, " persistence %d", xlrec->persistence);
+		pfree(path);
+	}
 }
 
 const char *
@@ -55,6 +95,15 @@ smgr_identify(uint8 info)
 		case XLOG_SMGR_TRUNCATE:
 			id = "TRUNCATE";
 			break;
+		case 

Re: In-placre persistance change of a relation

2022-07-06 Thread Jacob Champion
On Thu, Mar 31, 2022 at 2:36 AM Kyotaro Horiguchi
 wrote:
>
> At Thu, 31 Mar 2022 18:33:18 +0900 (JST), Kyotaro Horiguchi 
>  wrote in
> > I don't think this can be commited to 15. So I post the fixed version
> > then move this to the next CF.
>
> Then done. Thanks!

Hello! This patchset will need to be rebased over latest -- looks like
b74e94dc27f (Rethink PROCSIGNAL_BARRIER_SMGRRELEASE) and 5c279a6d350
(Custom WAL Resource Managers) are interfering.

Thanks,
--Jacob




Re: In-placre persistance change of a relation

2022-03-31 Thread Kyotaro Horiguchi
At Thu, 31 Mar 2022 18:33:18 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> I don't think this can be commited to 15. So I post the fixed version
> then move this to the next CF.

Then done. Thanks!

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: In-placre persistance change of a relation

2022-03-31 Thread Kyotaro Horiguchi
At Thu, 31 Mar 2022 00:37:07 -0500, Justin Pryzby  wrote 
in 
> On Thu, Mar 31, 2022 at 01:58:45PM +0900, Kyotaro Horiguchi wrote:
> > Thanks! Version 20 is attached.
> 
> The patch failed an all CI tasks, and seems to have caused the macos task to
> hang.
> 
> http://cfbot.cputube.org/kyotaro-horiguchi.html
> 
> Would you send a fixed patch, or remove this thread from the CFBOT ?  Otherwis
e
> cirrrus will try to every day to rerun but take 1hr to time out, which is 
> twice
> as slow as the slowest OS.

That is found to be a thinko that causes mark files left behind in new
database created in the logged version of CREATE DATABASE. It is
easily fixed.

That being said, this failure revealed that pg_checksums or
pg_basebackup dislikes the mark files.  It happens even in a quite low
possibility. This would need further consideration and extra rounds of
reviews.

> I think this patch should be moved to the next CF and set to v16.

I don't think this can be commited to 15. So I post the fixed version
then move this to the next CF.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From c4a2e19cd51c3a1470a916a235ab7c0e72ff498c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v21 1/2] In-place table persistence change

Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, currently it runs heap rewrite which causes large amount of
file I/O.  This patch makes the command run without heap rewrite.
Addition to that, SET LOGGED while wal_level > minimal emits WAL using
XLOG_FPI instead of massive number of HEAP_INSERT's, which should be
smaller.

Also this allows for the cleanup of files left behind in the crash of
the transaction that created it.
---
 src/backend/access/rmgrdesc/smgrdesc.c|  49 ++
 src/backend/access/transam/README |   9 +
 src/backend/access/transam/xact.c |   7 +
 src/backend/access/transam/xlogrecovery.c |  18 +
 src/backend/catalog/storage.c | 548 +-
 src/backend/commands/tablecmds.c  | 266 +--
 src/backend/replication/basebackup.c  |   3 +-
 src/backend/storage/buffer/bufmgr.c   |  85 
 src/backend/storage/file/fd.c |   4 +-
 src/backend/storage/file/reinit.c | 340 ++
 src/backend/storage/smgr/md.c |  94 +++-
 src/backend/storage/smgr/smgr.c   |  32 ++
 src/backend/storage/sync/sync.c   |  21 +-
 src/bin/pg_rewind/parsexlog.c |  22 +
 src/bin/pg_rewind/pg_rewind.c |   1 -
 src/common/relpath.c  |  47 +-
 src/include/catalog/storage.h |   3 +
 src/include/catalog/storage_xlog.h|  42 +-
 src/include/common/relpath.h  |   9 +-
 src/include/storage/bufmgr.h  |   2 +
 src/include/storage/fd.h  |   1 +
 src/include/storage/md.h  |   8 +-
 src/include/storage/reinit.h  |  10 +-
 src/include/storage/smgr.h|  17 +
 src/tools/pgindent/typedefs.list  |   7 +
 25 files changed, 1464 insertions(+), 181 deletions(-)

diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index 7547813254..225ffbafef 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,46 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
 		 xlrec->blkno, xlrec->flags);
 		pfree(path);
 	}
+	else if (info == XLOG_SMGR_UNLINK)
+	{
+		xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+		char	   *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+		appendStringInfoString(buf, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_MARK)
+	{
+		xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+		char	   *path = GetRelationPath(xlrec->rnode.dbNode,
+		   xlrec->rnode.spcNode,
+		   xlrec->rnode.relNode,
+		   InvalidBackendId,
+		   xlrec->forkNum, xlrec->mark);
+		char	   *action = "";
+
+		switch (xlrec->action)
+		{
+			case XLOG_SMGR_MARK_CREATE:
+action = "CREATE";
+break;
+			case XLOG_SMGR_MARK_UNLINK:
+action = "DELETE";
+break;
+		}
+
+		appendStringInfo(buf, "%s %s", action, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_BUFPERSISTENCE)
+	{
+		xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+		char	   *path = relpathperm(xlrec->rnode, MAIN_FORKNUM);
+
+		appendStringInfoString(buf, path);
+		appendStringInfo(buf, " persistence %d", xlrec->persistence);
+		pfree(path);
+	}
 }
 
 const char *
@@ -55,6 +95,15 @@ smgr_identify(uint8 info)
 		case XLOG_SMGR_TRUNCATE:
 			id = "TRUNCATE";
 			break;
+		case XLOG_SMGR_UNLINK:
+			id = "UNLINK";
+			break;
+		case XLOG_SMGR_MARK:
+			id = "MARK";
+			break;
+		case XLOG_SMGR_BUFPERSISTENCE:
+			id = "BUFPERSISTENCE";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 

Re: In-placre persistance change of a relation

2022-03-30 Thread Justin Pryzby
On Thu, Mar 31, 2022 at 01:58:45PM +0900, Kyotaro Horiguchi wrote:
> Thanks! Version 20 is attached.

The patch failed an all CI tasks, and seems to have caused the macos task to
hang.

http://cfbot.cputube.org/kyotaro-horiguchi.html

Would you send a fixed patch, or remove this thread from the CFBOT ?  Otherwise
cirrrus will try to every day to rerun but take 1hr to time out, which is twice
as slow as the slowest OS.

I think this patch should be moved to the next CF and set to v16.

Thanks,
-- 
Justin




Re: In-placre persistance change of a relation

2022-03-30 Thread Kyotaro Horiguchi
Thanks! Version 20 is attached.


At Wed, 30 Mar 2022 08:44:02 -0500, Justin Pryzby  wrote 
in 
> On Tue, Mar 01, 2022 at 02:14:13PM +0900, Kyotaro Horiguchi wrote:
> > Rebased on a recent xlog refactoring.
> 
> It'll come as no surprise that this neds to be rebased again.
> At least a few typos I reported in January aren't fixed.
> Set to "waiting".

Oh, I'm sorry for overlooking it. It somehow didn't show up on my
mailer.

> I started looking at this and reviewed docs and comments again.
> 
> > +typedef struct PendingCleanup
> > +{
> > +   RelFileNode relnode;/* relation that may need to be deleted 
> > */
> > +   int op; /* operation 
> > mask */
> > +   boolbufpersistence; /* buffer persistence to set */
> > +   int unlink_forknum; /* forknum to unlink */
> 
> This can be of data type "ForkNumber"

Right. Fixed.

> > +* We are going to create an init fork. If server crashes before the
> > +* current transaction ends the init fork left alone corrupts data while
> > +* recovery.  The mark file works as the sentinel to identify that
> > +* situation.
> 
> s/while/during/

This was in v17, but dissapeared in v18.

> > +* index-init fork needs further initialization. ambuildempty shoud do
> 
> should (I reported this before)
> 
> > +   if (inxact_created)
> > +   {
> > +   SMgrRelation srel = smgropen(rnode, InvalidBackendId);
> > +
> > +   /*
> > +* INIT forks never be loaded to shared buffer so no point in 
> > dropping
> 
> "are never loaded"

If was fixed in v18.

> > +   elog(DEBUG1, "perform in-place persistnce change");
> 
> persistence (I reported this before)

Sorry. Fixed.

> > +   /*
> > +* While wal_level >= replica, switching to LOGGED requires the
> > +* relation content to be WAL-logged to recover the table.
> > +* We don't emit this fhile wal_level = minimal.
> 
> while (or "if")

There are "While" and "fhile". I changed the latter to "if".

> > +* The relation is persistent and stays remain 
> > persistent. Don't
> > +* drop the buffers for this relation.
> 
> "stays remain" is redundant (I reported this before)

Thanks. I changed it to "stays persistent".

> > +   if (unlink(rm_path) < 0)
> > +   ereport(ERROR,
> > +   (errcode_for_file_access(),
> > +errmsg("could not remove file 
> > \"%s\": %m",
> > +   rm_path)));
> 
> The parens around errcode are unnecessary since last year.
> I suggest to avoid using them here and elsewhere.

It is just moved from elsewhere without editing, but of course I can
do that.  I didn't know about that change of ereport and not found the
corresponding commit, but I found that Tom mentioned that change.

https://www.postgresql.org/message-id/flat/5063.1584641224%40sss.pgh.pa.us#63e611c30800133bbddb48de857668e8
> Now that we can rely on having varargs macros, I think we could
> stop requiring the extra level of parentheses, ie instead of
...
> ereport(ERROR,
> errcode(ERRCODE_DIVISION_BY_ZERO),
> errmsg("division by zero"));
> 
> (The old syntax had better still work, of course.  I'm not advocating
> running around and changing existing calls.)

I changed all ereport calls added by this patch to this style.

> > +* And we may have SMGR_MARK_UNCOMMITTED file.  Remove 
> > it if the
> > +* fork files has been successfully removed. It's ok if 
> > the file
> 
> file

Fixed.

> > + 
> > +  All tables in the current database in a tablespace can be changed by 
> > using
> 
> given tablespace

I did /database in a tablespace/database in the given tablespace/. Is
it right?

> > +  the ALL IN TABLESPACE form, which will lock all 
> > tables
> 
> which will first lock
>
> > +  to be changed first and then change each one.  This form also 
> > supports
> 
> remove "first" here

This is almost a dead copy of the description of SET TABLESPACE. This
change makes the two almost the same description vary slightly in that
wordings.  Anyway I did that as suggested only for the part this patch
adds in this version.

> > +  OWNED BY, which will only change tables owned by 
> > the
> > +  roles specified.  If the NOWAIT option is 
> > specified
> 
> specified roles.
> is specified, (comma)

This is the same as above. I did that but it makes the description
differ from another almost-the-same description.

> > +  then the command will fail if it is unable to acquire all of the 
> > locks
> 
> if it is unable to immediately acquire
>
> > +  required immediately.  The information_schema
> 
> remove immediately

Ditto.

> > +  relations are not considered part of the 

Re: In-placre persistance change of a relation

2022-03-30 Thread Justin Pryzby
On Tue, Mar 01, 2022 at 02:14:13PM +0900, Kyotaro Horiguchi wrote:
> Rebased on a recent xlog refactoring.

It'll come as no surprise that this neds to be rebased again.
At least a few typos I reported in January aren't fixed.
Set to "waiting".




Re: In-placre persistance change of a relation

2022-03-01 Thread Kyotaro Horiguchi
At Tue, 01 Mar 2022 14:14:13 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> - Removed the default case in smgr_desc since it seems to me we don't
>  assume out-of-definition values in xlog records elsewhere.

Stupid.  The complier on the CI environemnt complains for
uninitialized variable even though it (presumably) knows that the all
paths of the switch statement set the variable.  Added default value
to try to silence compiler.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From ec75c49ffd939f6db8e0d840ef043c18845d1b9d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v19 1/2] In-place table persistence change

Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, currently it runs heap rewrite which causes large amount of
file I/O.  This patch makes the command run without heap rewrite.
Addition to that, SET LOGGED while wal_level > minimal emits WAL using
XLOG_FPI instead of massive number of HEAP_INSERT's, which should be
smaller.

Also this allows for the cleanup of files left behind in the crash of
the transaction that created it.
---
 src/backend/access/rmgrdesc/smgrdesc.c|  49 ++
 src/backend/access/transam/README |   9 +
 src/backend/access/transam/xact.c |   7 +
 src/backend/access/transam/xlogrecovery.c |  18 +
 src/backend/catalog/storage.c | 548 +-
 src/backend/commands/tablecmds.c  | 266 +--
 src/backend/replication/basebackup.c  |   3 +-
 src/backend/storage/buffer/bufmgr.c   |  86 
 src/backend/storage/file/fd.c |   4 +-
 src/backend/storage/file/reinit.c | 344 ++
 src/backend/storage/smgr/md.c |  94 +++-
 src/backend/storage/smgr/smgr.c   |  32 ++
 src/backend/storage/sync/sync.c   |  20 +-
 src/bin/pg_rewind/parsexlog.c |  22 +
 src/bin/pg_rewind/pg_rewind.c |   1 -
 src/common/relpath.c  |  47 +-
 src/include/catalog/storage.h |   3 +
 src/include/catalog/storage_xlog.h|  42 +-
 src/include/common/relpath.h  |   9 +-
 src/include/storage/bufmgr.h  |   2 +
 src/include/storage/fd.h  |   1 +
 src/include/storage/md.h  |   8 +-
 src/include/storage/reinit.h  |  10 +-
 src/include/storage/smgr.h|  17 +
 24 files changed, 1459 insertions(+), 183 deletions(-)

diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index 7547813254..225ffbafef 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,46 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
 		 xlrec->blkno, xlrec->flags);
 		pfree(path);
 	}
+	else if (info == XLOG_SMGR_UNLINK)
+	{
+		xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+		char	   *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+		appendStringInfoString(buf, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_MARK)
+	{
+		xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+		char	   *path = GetRelationPath(xlrec->rnode.dbNode,
+		   xlrec->rnode.spcNode,
+		   xlrec->rnode.relNode,
+		   InvalidBackendId,
+		   xlrec->forkNum, xlrec->mark);
+		char	   *action = "";
+
+		switch (xlrec->action)
+		{
+			case XLOG_SMGR_MARK_CREATE:
+action = "CREATE";
+break;
+			case XLOG_SMGR_MARK_UNLINK:
+action = "DELETE";
+break;
+		}
+
+		appendStringInfo(buf, "%s %s", action, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_BUFPERSISTENCE)
+	{
+		xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+		char	   *path = relpathperm(xlrec->rnode, MAIN_FORKNUM);
+
+		appendStringInfoString(buf, path);
+		appendStringInfo(buf, " persistence %d", xlrec->persistence);
+		pfree(path);
+	}
 }
 
 const char *
@@ -55,6 +95,15 @@ smgr_identify(uint8 info)
 		case XLOG_SMGR_TRUNCATE:
 			id = "TRUNCATE";
 			break;
+		case XLOG_SMGR_UNLINK:
+			id = "UNLINK";
+			break;
+		case XLOG_SMGR_MARK:
+			id = "MARK";
+			break;
+		case XLOG_SMGR_BUFPERSISTENCE:
+			id = "BUFPERSISTENCE";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c1..2ecd8c8c7c 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -725,6 +725,15 @@ then restart recovery.  This is part of the reason for not writing a WAL
 entry until we've successfully done the original action.
 
 
+The Smgr MARK files
+
+
+An smgr mark file is created when a new relation file is created to
+mark the relfilenode needs to be cleaned up at recovery time.  In
+contrast to the four actions above, failure to remove smgr mark files
+will lead to data loss, in which case the server will shut down.
+
+
 Skipping WAL for New RelFileNode
 
 

Re: In-placre persistance change of a relation

2022-02-28 Thread Kyotaro Horiguchi
Rebased on a recent xlog refactoring.

No functional changes have been made.

- Removed the default case in smgr_desc since it seems to me we don't
 assume out-of-definition values in xlog records elsewhere.

- Simplified some added to storage.c.

- Fix copy-pasto'ed comments in extractPageInfo().

- The previous version smgrDoPendingCleanups() assumes that init-fork
  are not loaded onto shared buffer but it is wrong
  (SetRelationBuffersPersistence assumes the opposite.).  Thus we need
  to drop buffers before unlink an init fork. But it is already
  guaranteed by logic so I rewrote the comment for for PCOP_UNLINK_FORK.

  > * Unlink the fork file. Currently we use this only for
  > * init forks and we're sure that the init fork is not
  > * loaded on shared buffers.  For RelationDropInitFork
  > * case, the function dropped that buffers. For
  > * RelationCreateInitFork case, PCOP_SET_PERSISTENCE(true)
  > * is set and the buffers have been dropped just before.
  
  This logic has the same critical window as
  DropRelFilenodeBuffers. That is, if file deletion fails after
  successful buffer dropping, theoretically the file content of the
  init fork may be stale. However, AFAICS init-fork is write-once fork
  so I don't think that actually matters.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 420a9d9a0dae3bcfb1396c14997624ad67a3e557 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v18 1/2] In-place table persistence change

Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, currently it runs heap rewrite which causes large amount of
file I/O.  This patch makes the command run without heap rewrite.
Addition to that, SET LOGGED while wal_level > minimal emits WAL using
XLOG_FPI instead of massive number of HEAP_INSERT's, which should be
smaller.

Also this allows for the cleanup of files left behind in the crash of
the transaction that created it.
---
 src/backend/access/rmgrdesc/smgrdesc.c|  49 ++
 src/backend/access/transam/README |   9 +
 src/backend/access/transam/xact.c |   7 +
 src/backend/access/transam/xlogrecovery.c |  18 +
 src/backend/catalog/storage.c | 548 +-
 src/backend/commands/tablecmds.c  | 266 +--
 src/backend/replication/basebackup.c  |   3 +-
 src/backend/storage/buffer/bufmgr.c   |  86 
 src/backend/storage/file/fd.c |   4 +-
 src/backend/storage/file/reinit.c | 344 ++
 src/backend/storage/smgr/md.c |  94 +++-
 src/backend/storage/smgr/smgr.c   |  32 ++
 src/backend/storage/sync/sync.c   |  20 +-
 src/bin/pg_rewind/parsexlog.c |  22 +
 src/common/relpath.c  |  47 +-
 src/include/catalog/storage.h |   3 +
 src/include/catalog/storage_xlog.h|  42 +-
 src/include/common/relpath.h  |   9 +-
 src/include/storage/bufmgr.h  |   2 +
 src/include/storage/fd.h  |   1 +
 src/include/storage/md.h  |   8 +-
 src/include/storage/reinit.h  |  10 +-
 src/include/storage/smgr.h|  17 +
 23 files changed, 1459 insertions(+), 182 deletions(-)

diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index 7547813254..f8908e2c0a 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,46 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
 		 xlrec->blkno, xlrec->flags);
 		pfree(path);
 	}
+	else if (info == XLOG_SMGR_UNLINK)
+	{
+		xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+		char	   *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+		appendStringInfoString(buf, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_MARK)
+	{
+		xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+		char	   *path = GetRelationPath(xlrec->rnode.dbNode,
+		   xlrec->rnode.spcNode,
+		   xlrec->rnode.relNode,
+		   InvalidBackendId,
+		   xlrec->forkNum, xlrec->mark);
+		char	   *action;
+
+		switch (xlrec->action)
+		{
+			case XLOG_SMGR_MARK_CREATE:
+action = "CREATE";
+break;
+			case XLOG_SMGR_MARK_UNLINK:
+action = "DELETE";
+break;
+		}
+
+		appendStringInfo(buf, "%s %s", action, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_BUFPERSISTENCE)
+	{
+		xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+		char	   *path = relpathperm(xlrec->rnode, MAIN_FORKNUM);
+
+		appendStringInfoString(buf, path);
+		appendStringInfo(buf, " persistence %d", xlrec->persistence);
+		pfree(path);
+	}
 }
 
 const char *
@@ -55,6 +95,15 @@ smgr_identify(uint8 info)
 		case XLOG_SMGR_TRUNCATE:
 			id = "TRUNCATE";
 			break;
+		case XLOG_SMGR_UNLINK:
+			id = "UNLINK";
+			break;
+		case XLOG_SMGR_MARK:
+			id = "MARK";
+			break;
+		case XLOG_SMGR_BUFPERSISTENCE:
+			id = "BUFPERSISTENCE";
+			

Re: In-placre persistance change of a relation

2022-01-18 Thread Kyotaro Horiguchi
At Tue, 18 Jan 2022 10:37:53 -0500, Tom Lane  wrote in 
> Julien Rouhaud  writes:
> > The cfbot is failing on all OS with this version of the patch.  Apparently
> > v16-0002 introduces some usage of "testtablespace" client-side variable 
> > that's
> > never defined, e.g.
> 
> That test infrastructure got rearranged very recently, see d6d317dbf.

Thanks to both. It seems that even though I know about the change, I
forgot to make my repo up to date before checking.

The v17 attached changes only the following point (as well as
corresponding "expected" file).

-+CREATE TABLESPACE regress_tablespace LOCATION :'testtablespace';
++CREATE TABLESPACE regress_tablespace LOCATION '';

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From c227842521de00d5da9dffb2f2dd86e8c1c171a8 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v17 1/2] In-place table persistence change

Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, currently it runs heap rewrite which causes large amount of
file I/O.  This patch makes the command run without heap rewrite.
Addition to that, SET LOGGED while wal_level > minimal emits WAL using
XLOG_FPI instead of massive number of HEAP_INSERT's, which should be
smaller.

Also this allows for the cleanup of files left behind in the crash of
the transaction that created it.
---
 src/backend/access/rmgrdesc/smgrdesc.c|  52 ++
 src/backend/access/transam/README |   8 +
 src/backend/access/transam/xact.c |   7 +
 src/backend/access/transam/xlog.c |  17 +
 src/backend/catalog/storage.c | 545 +-
 src/backend/commands/tablecmds.c  | 266 +++--
 src/backend/replication/basebackup.c  |   3 +-
 src/backend/storage/buffer/bufmgr.c   |  88 +++
 src/backend/storage/file/fd.c |   4 +-
 src/backend/storage/file/reinit.c | 344 +++
 src/backend/storage/smgr/md.c |  94 ++-
 src/backend/storage/smgr/smgr.c   |  32 +
 src/backend/storage/sync/sync.c   |  20 +-
 src/bin/pg_rewind/parsexlog.c |  24 +
 src/common/relpath.c  |  47 +-
 src/include/catalog/storage.h |   3 +
 src/include/catalog/storage_xlog.h|  42 +-
 src/include/common/relpath.h  |   9 +-
 src/include/storage/bufmgr.h  |   2 +
 src/include/storage/fd.h  |   1 +
 src/include/storage/md.h  |   8 +-
 src/include/storage/reinit.h  |  10 +-
 src/include/storage/smgr.h|  17 +
 src/test/recovery/t/027_persistence_change.pl | 263 +
 24 files changed, 1724 insertions(+), 182 deletions(-)
 create mode 100644 src/test/recovery/t/027_persistence_change.pl

diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index 7547813254..2c674e5de0 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,49 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
 		 xlrec->blkno, xlrec->flags);
 		pfree(path);
 	}
+	else if (info == XLOG_SMGR_UNLINK)
+	{
+		xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+		char	   *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+		appendStringInfoString(buf, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_MARK)
+	{
+		xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+		char	   *path = GetRelationPath(xlrec->rnode.dbNode,
+		   xlrec->rnode.spcNode,
+		   xlrec->rnode.relNode,
+		   InvalidBackendId,
+		   xlrec->forkNum, xlrec->mark);
+		char	   *action;
+
+		switch (xlrec->action)
+		{
+			case XLOG_SMGR_MARK_CREATE:
+action = "CREATE";
+break;
+			case XLOG_SMGR_MARK_UNLINK:
+action = "DELETE";
+break;
+			default:
+action = "";
+break;
+		}
+
+		appendStringInfo(buf, "%s %s", action, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_BUFPERSISTENCE)
+	{
+		xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+		char	   *path = relpathperm(xlrec->rnode, MAIN_FORKNUM);
+
+		appendStringInfoString(buf, path);
+		appendStringInfo(buf, " persistence %d", xlrec->persistence);
+		pfree(path);
+	}
 }
 
 const char *
@@ -55,6 +98,15 @@ smgr_identify(uint8 info)
 		case XLOG_SMGR_TRUNCATE:
 			id = "TRUNCATE";
 			break;
+		case XLOG_SMGR_UNLINK:
+			id = "UNLINK";
+			break;
+		case XLOG_SMGR_MARK:
+			id = "MARK";
+			break;
+		case XLOG_SMGR_BUFPERSISTENCE:
+			id = "BUFPERSISTENCE";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c1..b344bbe511 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -724,6 +724,14 @@ we must panic and abort recovery.  The DBA will have to manually clean up
 then restart 

Re: In-placre persistance change of a relation

2022-01-18 Thread Tom Lane
Julien Rouhaud  writes:
> The cfbot is failing on all OS with this version of the patch.  Apparently
> v16-0002 introduces some usage of "testtablespace" client-side variable that's
> never defined, e.g.

That test infrastructure got rearranged very recently, see d6d317dbf.

regards, tom lane




Re: In-placre persistance change of a relation

2022-01-17 Thread Julien Rouhaud
Hi,

On Fri, Jan 14, 2022 at 11:43:10AM +0900, Kyotaro Horiguchi wrote:
> I found a bug.
> 
> mdmarkexists() didn't close the tentatively opend fd. This is a silent
> leak on Linux and similars and it causes delete failure on Windows.
> It was the reason of the CI failure.
> 
> 027_persistence_change.pl uses interactive_psql() that doesn't work on
> the Windos VM on the CI.
> 
> In this version the following changes have been made in 0001.
> 
> - Properly close file descriptor in mdmarkexists.
> 
> - Skip some tests when IO::Pty is not available.
>   It might be better to separate that part.
> 
> Looking again the ALTER TABLE ALL IN TABLESPACE SET LOGGED patch, I
> noticed that it doesn't implement OWNED BY part and doesn't have test
> and documenttaion (it was PoC). Added all of them to 0002.

The cfbot is failing on all OS with this version of the patch.  Apparently
v16-0002 introduces some usage of "testtablespace" client-side variable that's
never defined, e.g.
https://api.cirrus-ci.com/v1/artifact/task/4670105480069120/regress_diffs/src/bin/pg_upgrade/tmp_check/regress/regression.diffs:

diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/tablespace.out 
/tmp/cirrus-ci-build/src/bin/pg_upgrade/tmp_check/regress/results/tablespace.out
--- /tmp/cirrus-ci-build/src/test/regress/expected/tablespace.out   
2022-01-18 04:26:38.744707547 +
+++ 
/tmp/cirrus-ci-build/src/bin/pg_upgrade/tmp_check/regress/results/tablespace.out
2022-01-18 04:30:37.557078083 +
@@ -948,76 +948,71 @@
 CREATE SCHEMA testschema;
 GRANT CREATE ON SCHEMA testschema TO regress_tablespace_user1;
 CREATE TABLESPACE regress_tablespace LOCATION :'testtablespace';
+ERROR:  syntax error at or near ":"
+LINE 1: CREATE TABLESPACE regress_tablespace LOCATION :'testtablespa...




Re: In-placre persistance change of a relation

2022-01-13 Thread Kyotaro Horiguchi
I found a bug.

mdmarkexists() didn't close the tentatively opend fd. This is a silent
leak on Linux and similars and it causes delete failure on Windows.
It was the reason of the CI failure.

027_persistence_change.pl uses interactive_psql() that doesn't work on
the Windos VM on the CI.

In this version the following changes have been made in 0001.

- Properly close file descriptor in mdmarkexists.

- Skip some tests when IO::Pty is not available.
  It might be better to separate that part.

Looking again the ALTER TABLE ALL IN TABLESPACE SET LOGGED patch, I
noticed that it doesn't implement OWNED BY part and doesn't have test
and documenttaion (it was PoC). Added all of them to 0002.

At Tue, 11 Jan 2022 09:33:55 +, Jakub Wartak  
wrote in 
> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, passed
> Implements feature:   tested, passed
> Spec compliant:   tested, passed
> Documentation:not tested
> 
> I've retested v15 of the patch with everything that came to my mind. The 
> patch passes all my tests (well, there's this just windows / cfbot issue). 
> Patch looks good to me. I haven't looked in-depth at the code, so patch might 
> still need review.

Thanks for checking.

> FYI, about potential usage of this patch: the most advanced test that I did 
> was continually bouncing wal_level - it works. So chain of :
> 1. wal_level=replica->minimal
> 2. alter table set unlogged and load a lot of data, set logged
> 3. wal_level=minimal->replica
> 4. barman incremental backup # rsync(1) just backups changed files in steps 2 
> and 3 (not whole DB)
> 5. some other (logged) work
> The idea is that when performing mass-alterations to the DB (think nightly 
> ETL/ELT on TB-sized DBs), one could skip backing up most of DB and then just 
> quickly backup only the changed files - during the maintenance window - e.g. 
> thanks to local-rsync barman mode. This is the output of barman show-backups 
> after loading data to unlogged table each such cycle:
> mydb 20220110T100236 - Mon Jan 10 10:05:14 2022 - Size: 144.1 GiB - WAL Size: 
> 16.0 KiB
> mydb 20220110T094905 - Mon Jan 10 09:50:12 2022 - Size: 128.5 GiB - WAL Size: 
> 80.2 KiB
> mydb 20220110T094016 - Mon Jan 10 09:40:17 2022 - Size: 109.1 GiB - WAL Size: 
> 496.3 KiB
> And dedupe ratio of the last one: Backup size: 144.1 GiB. Actual size on 
> disk: 36.1 GiB (-74.96% deduplication ratio).  

Ah, The patch skips duping relation files. This is advantageous that
that not only eliminates the I/O activities the duping causes but also
reduce the size of incremental backup.  I didn't noticed only the
latter advantage.

> The only thing I've found out that bouncing wal_level also forces 
> max_wal_senders=X -> 0 -> X which in turn requires dropping replication slot 
> for pg_receievewal (e.g. barman receive-wal 
> --create-slot/--drop-slot/--reset). I have tested the restore using barman 
> recover afterwards to backup 20220110T094905  and indeed it worked OK using 
> this patch too.

Year, it is irrelevant to this patch but I'm annoyed by the
restriction.  I think it would be okay that max_wal_senders is
forcibly set to 0 while wal_level=minimal..

> The new status of this patch is: Needs review

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From d6bf0bd0d60391b24d5be7942b546acfffa3d7b1 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v16 1/2] In-place table persistence change

Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, currently it runs heap rewrite which causes large amount of
file I/O.  This patch makes the command run without heap rewrite.
Addition to that, SET LOGGED while wal_level > minimal emits WAL using
XLOG_FPI instead of massive number of HEAP_INSERT's, which should be
smaller.

Also this allows for the cleanup of files left behind in the crash of
the transaction that created it.
---
 src/backend/access/rmgrdesc/smgrdesc.c|  52 ++
 src/backend/access/transam/README |   8 +
 src/backend/access/transam/xact.c |   7 +
 src/backend/access/transam/xlog.c |  17 +
 src/backend/catalog/storage.c | 545 +-
 src/backend/commands/tablecmds.c  | 266 +++--
 src/backend/replication/basebackup.c  |   3 +-
 src/backend/storage/buffer/bufmgr.c   |  88 +++
 src/backend/storage/file/fd.c |   4 +-
 src/backend/storage/file/reinit.c | 344 +++
 src/backend/storage/smgr/md.c |  94 ++-
 src/backend/storage/smgr/smgr.c   |  32 +
 src/backend/storage/sync/sync.c   |  20 +-
 src/bin/pg_rewind/parsexlog.c |  24 +
 src/common/relpath.c  |  47 +-
 src/include/catalog/storage.h |   3 +
 src/include/catalog/storage_xlog.h|  42 +-
 

Re: In-placre persistance change of a relation

2022-01-11 Thread Jakub Wartak
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:not tested

I've retested v15 of the patch with everything that came to my mind. The patch 
passes all my tests (well, there's this just windows / cfbot issue). Patch 
looks good to me. I haven't looked in-depth at the code, so patch might still 
need review.

FYI, about potential usage of this patch: the most advanced test that I did was 
continually bouncing wal_level - it works. So chain of :
1. wal_level=replica->minimal
2. alter table set unlogged and load a lot of data, set logged
3. wal_level=minimal->replica
4. barman incremental backup # rsync(1) just backups changed files in steps 2 
and 3 (not whole DB)
5. some other (logged) work
The idea is that when performing mass-alterations to the DB (think nightly 
ETL/ELT on TB-sized DBs), one could skip backing up most of DB and then just 
quickly backup only the changed files - during the maintenance window - e.g. 
thanks to local-rsync barman mode. This is the output of barman show-backups 
after loading data to unlogged table each such cycle:
mydb 20220110T100236 - Mon Jan 10 10:05:14 2022 - Size: 144.1 GiB - WAL Size: 
16.0 KiB
mydb 20220110T094905 - Mon Jan 10 09:50:12 2022 - Size: 128.5 GiB - WAL Size: 
80.2 KiB
mydb 20220110T094016 - Mon Jan 10 09:40:17 2022 - Size: 109.1 GiB - WAL Size: 
496.3 KiB
And dedupe ratio of the last one: Backup size: 144.1 GiB. Actual size on disk: 
36.1 GiB (-74.96% deduplication ratio).  

The only thing I've found out that bouncing wal_level also forces 
max_wal_senders=X -> 0 -> X which in turn requires dropping replication slot 
for pg_receievewal (e.g. barman receive-wal --create-slot/--drop-slot/--reset). 
I have tested the restore using barman recover afterwards to backup 
20220110T094905  and indeed it worked OK using this patch too.

The new status of this patch is: Needs review


Re: In-placre persistance change of a relation

2022-01-07 Thread Kyotaro Horiguchi
At Thu, 06 Jan 2022 16:39:21 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> Fantastic! I'll give it a try. Thanks!

I did that and found that the test stumbled on newlines.
Tests succeeded for other than Windows.

Windows version fails for a real known issue.

[7916][postmaster] LOG:  received immediate shutdown request
[7916][postmaster] LOG:  database system is shut down
[6228][postmaster] LOG:  starting PostgreSQL 15devel, compiled by Visual C++ 
build 1929, 64-bit
[6228][postmaster] LOG:  listening on Unix socket 
"C:/Users/ContainerAdministrator/AppData/Local/Temp/NcMnt2KTsr/.s.PGSQL.58698"
[2948][startup] LOG:  database system was interrupted; last known up at 
2022-01-07 07:12:14 GMT
[2948][startup] LOG:  database system was not properly shut down; automatic 
recovery in progress
[2948][startup] LOG:  redo starts at 0/1484280
[2948][startup] LOG:  invalid record length at 0/14A47B8: wanted 24, got 0
[2948][startup] FATAL:  could not remove file "base/12759/16384.u": Permission 
denied
[6228][postmaster] LOG:  startup process (PID 2948) exited with exit code 1


Mmm.. Someone is still grasping the file after restart?

Anyway, I post the fixed version.  This still fails on Windows..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 48527df0c7d094a8ca7cc8d0c90df02bfd7c2614 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v15 1/2] In-place table persistence change

Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, currently it runs heap rewrite which causes large amount of
file I/O.  This patch makes the command run without heap rewrite.
Addition to that, SET LOGGED while wal_level > minimal emits WAL using
XLOG_FPI instead of massive number of HEAP_INSERT's, which should be
smaller.

Also this allows for the cleanup of files left behind in the crash of
the transaction that created it.
---
 src/backend/access/rmgrdesc/smgrdesc.c|  52 ++
 src/backend/access/transam/README |   8 +
 src/backend/access/transam/xact.c |   7 +
 src/backend/access/transam/xlog.c |  17 +
 src/backend/catalog/storage.c | 545 +-
 src/backend/commands/tablecmds.c  | 266 +++--
 src/backend/replication/basebackup.c  |   3 +-
 src/backend/storage/buffer/bufmgr.c   |  88 +++
 src/backend/storage/file/fd.c |   4 +-
 src/backend/storage/file/reinit.c | 344 +++
 src/backend/storage/smgr/md.c |  93 ++-
 src/backend/storage/smgr/smgr.c   |  32 +
 src/backend/storage/sync/sync.c   |  20 +-
 src/bin/pg_rewind/parsexlog.c |  24 +
 src/common/relpath.c  |  47 +-
 src/include/catalog/storage.h |   3 +
 src/include/catalog/storage_xlog.h|  42 +-
 src/include/common/relpath.h  |   9 +-
 src/include/storage/bufmgr.h  |   2 +
 src/include/storage/fd.h  |   1 +
 src/include/storage/md.h  |   8 +-
 src/include/storage/reinit.h  |  10 +-
 src/include/storage/smgr.h|  17 +
 src/test/recovery/t/027_persistence_change.pl | 247 
 24 files changed, 1707 insertions(+), 182 deletions(-)
 create mode 100644 src/test/recovery/t/027_persistence_change.pl

diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index 773d57..d251f22207 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,49 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
 		 xlrec->blkno, xlrec->flags);
 		pfree(path);
 	}
+	else if (info == XLOG_SMGR_UNLINK)
+	{
+		xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+		char	   *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+		appendStringInfoString(buf, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_MARK)
+	{
+		xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+		char	   *path = GetRelationPath(xlrec->rnode.dbNode,
+		   xlrec->rnode.spcNode,
+		   xlrec->rnode.relNode,
+		   InvalidBackendId,
+		   xlrec->forkNum, xlrec->mark);
+		char	   *action;
+
+		switch (xlrec->action)
+		{
+			case XLOG_SMGR_MARK_CREATE:
+action = "CREATE";
+break;
+			case XLOG_SMGR_MARK_UNLINK:
+action = "DELETE";
+break;
+			default:
+action = "";
+break;
+		}
+
+		appendStringInfo(buf, "%s %s", action, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_BUFPERSISTENCE)
+	{
+		xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+		char	   *path = relpathperm(xlrec->rnode, MAIN_FORKNUM);
+
+		appendStringInfoString(buf, path);
+		appendStringInfo(buf, " persistence %d", xlrec->persistence);
+		pfree(path);
+	}
 }
 
 const char *
@@ -55,6 +98,15 @@ smgr_identify(uint8 info)
 		case XLOG_SMGR_TRUNCATE:
 			id = "TRUNCATE";
 			

Re: In-placre persistance change of a relation

2022-01-05 Thread Kyotaro Horiguchi
At Wed, 05 Jan 2022 20:42:32 -0800, Andres Freund  wrote in 
> Hi, 
> 
> On January 5, 2022 8:30:17 PM PST, Kyotaro Horiguchi 
>  wrote:
> >I'm still not sure the reason the test item failed but I repost the
> >updated version then watch what the CI says.
> 
> Fwiw, you can now test the same way as cfbot does with a lower turnaround 
> time, as explained in src/tools/ci/README

Fantastic! I'll give it a try. Thanks!

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: In-placre persistance change of a relation

2022-01-05 Thread Andres Freund
Hi, 

On January 5, 2022 8:30:17 PM PST, Kyotaro Horiguchi  
wrote:
>At Tue, 4 Jan 2022 16:05:08 -0800, Andres Freund  wrote in 
>> The tap tests seems to fail on all platforms. See
>> https://cirrus-ci.com/build/4911549314760704
>> 
>> E.g. the linux failure is
>> 
>> [16:45:15.569]
>> [16:45:15.569] #   Failed test 'inserted'
>> [16:45:15.569] #   at t/027_persistence_change.pl line 121.
>> [16:45:15.569] # Looks like you failed 1 test of 25.
>> [16:45:15.569] [16:45:15] t/027_persistence_change.pl ..
>> [16:45:15.569] Dubious, test returned 1 (wstat 256, 0x100)
>> [16:45:15.569] Failed 1/25 subtests
>> [16:45:15.569] [16:45:15]
>> [16:45:15.569]
>> [16:45:15.569] Test Summary Report
>> [16:45:15.569] ---
>> [16:45:15.569] t/027_persistence_change.pl(Wstat: 256 Tests: 25 
>> Failed: 1)
>> [16:45:15.569]   Failed test:  18
>> [16:45:15.569]   Non-zero exit status: 1
>> [16:45:15.569] Files=27, Tests=315, 220 wallclock secs ( 0.14 usr  0.03 sys 
>> + 48.94 cusr 17.13 csys = 66.24 CPU)
>> 
>> https://api.cirrus-ci.com/v1/artifact/task/4785083130314752/tap/src/test/recovery/tmp_check/log/regress_log_027_persistence_change
>
>Thank you very much. It still doesn't fail on my devlopment
>environment (CentOS8), but I found a silly bug of the test script.
>I'm still not sure the reason the test item failed but I repost the
>updated version then watch what the CI says.

Fwiw, you can now test the same way as cfbot does with a lower turnaround time, 
as explained in src/tools/ci/README
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: In-placre persistance change of a relation

2022-01-05 Thread Kyotaro Horiguchi
At Tue, 4 Jan 2022 16:05:08 -0800, Andres Freund  wrote in 
> The tap tests seems to fail on all platforms. See
> https://cirrus-ci.com/build/4911549314760704
> 
> E.g. the linux failure is
> 
> [16:45:15.569]
> [16:45:15.569] #   Failed test 'inserted'
> [16:45:15.569] #   at t/027_persistence_change.pl line 121.
> [16:45:15.569] # Looks like you failed 1 test of 25.
> [16:45:15.569] [16:45:15] t/027_persistence_change.pl ..
> [16:45:15.569] Dubious, test returned 1 (wstat 256, 0x100)
> [16:45:15.569] Failed 1/25 subtests
> [16:45:15.569] [16:45:15]
> [16:45:15.569]
> [16:45:15.569] Test Summary Report
> [16:45:15.569] ---
> [16:45:15.569] t/027_persistence_change.pl(Wstat: 256 Tests: 25 
> Failed: 1)
> [16:45:15.569]   Failed test:  18
> [16:45:15.569]   Non-zero exit status: 1
> [16:45:15.569] Files=27, Tests=315, 220 wallclock secs ( 0.14 usr  0.03 sys + 
> 48.94 cusr 17.13 csys = 66.24 CPU)
> 
> https://api.cirrus-ci.com/v1/artifact/task/4785083130314752/tap/src/test/recovery/tmp_check/log/regress_log_027_persistence_change

Thank you very much. It still doesn't fail on my devlopment
environment (CentOS8), but I found a silly bug of the test script.
I'm still not sure the reason the test item failed but I repost the
updated version then watch what the CI says.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From e2deae1bef19827803e0e8f85b1e45e3fcd88505 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v14 1/2] In-place table persistence change

Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, currently it runs heap rewrite which causes large amount of
file I/O.  This patch makes the command run without heap rewrite.
Addition to that, SET LOGGED while wal_level > minimal emits WAL using
XLOG_FPI instead of massive number of HEAP_INSERT's, which should be
smaller.

Also this allows for the cleanup of files left behind in the crash of
the transaction that created it.
---
 src/backend/access/rmgrdesc/smgrdesc.c|  52 ++
 src/backend/access/transam/README |   8 +
 src/backend/access/transam/xact.c |   7 +
 src/backend/access/transam/xlog.c |  17 +
 src/backend/catalog/storage.c | 545 +-
 src/backend/commands/tablecmds.c  | 266 +++--
 src/backend/replication/basebackup.c  |   3 +-
 src/backend/storage/buffer/bufmgr.c   |  88 +++
 src/backend/storage/file/fd.c |   4 +-
 src/backend/storage/file/reinit.c | 344 +++
 src/backend/storage/smgr/md.c |  93 ++-
 src/backend/storage/smgr/smgr.c   |  32 +
 src/backend/storage/sync/sync.c   |  20 +-
 src/bin/pg_rewind/parsexlog.c |  24 +
 src/common/relpath.c  |  47 +-
 src/include/catalog/storage.h |   3 +
 src/include/catalog/storage_xlog.h|  42 +-
 src/include/common/relpath.h  |   9 +-
 src/include/storage/bufmgr.h  |   2 +
 src/include/storage/fd.h  |   1 +
 src/include/storage/md.h  |   8 +-
 src/include/storage/reinit.h  |  10 +-
 src/include/storage/smgr.h|  17 +
 src/test/recovery/t/027_persistence_change.pl | 247 
 24 files changed, 1707 insertions(+), 182 deletions(-)
 create mode 100644 src/test/recovery/t/027_persistence_change.pl

diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index 773d57..d251f22207 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,49 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
 		 xlrec->blkno, xlrec->flags);
 		pfree(path);
 	}
+	else if (info == XLOG_SMGR_UNLINK)
+	{
+		xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+		char	   *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+		appendStringInfoString(buf, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_MARK)
+	{
+		xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+		char	   *path = GetRelationPath(xlrec->rnode.dbNode,
+		   xlrec->rnode.spcNode,
+		   xlrec->rnode.relNode,
+		   InvalidBackendId,
+		   xlrec->forkNum, xlrec->mark);
+		char	   *action;
+
+		switch (xlrec->action)
+		{
+			case XLOG_SMGR_MARK_CREATE:
+action = "CREATE";
+break;
+			case XLOG_SMGR_MARK_UNLINK:
+action = "DELETE";
+break;
+			default:
+action = "";
+break;
+		}
+
+		appendStringInfo(buf, "%s %s", action, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_BUFPERSISTENCE)
+	{
+		xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+		char	   *path = relpathperm(xlrec->rnode, MAIN_FORKNUM);
+
+		appendStringInfoString(buf, path);
+		appendStringInfo(buf, " persistence %d", xlrec->persistence);
+		pfree(path);
+	}
 }
 

Re: In-placre persistance change of a relation

2022-01-04 Thread Andres Freund
On 2021-12-23 15:33:35 +0900, Kyotaro Horiguchi wrote:
> At Thu, 23 Dec 2021 15:01:41 +0900 (JST), Kyotaro Horiguchi 
>  wrote in
> > I added TAP test to excecise the in-place persistence change.
>
> We don't need a base table for every index. TAP test revised.

The tap tests seems to fail on all platforms. See
https://cirrus-ci.com/build/4911549314760704

E.g. the linux failure is

[16:45:15.569]
[16:45:15.569] #   Failed test 'inserted'
[16:45:15.569] #   at t/027_persistence_change.pl line 121.
[16:45:15.569] # Looks like you failed 1 test of 25.
[16:45:15.569] [16:45:15] t/027_persistence_change.pl ..
[16:45:15.569] Dubious, test returned 1 (wstat 256, 0x100)
[16:45:15.569] Failed 1/25 subtests
[16:45:15.569] [16:45:15]
[16:45:15.569]
[16:45:15.569] Test Summary Report
[16:45:15.569] ---
[16:45:15.569] t/027_persistence_change.pl(Wstat: 256 Tests: 25 Failed: 
1)
[16:45:15.569]   Failed test:  18
[16:45:15.569]   Non-zero exit status: 1
[16:45:15.569] Files=27, Tests=315, 220 wallclock secs ( 0.14 usr  0.03 sys + 
48.94 cusr 17.13 csys = 66.24 CPU)

https://api.cirrus-ci.com/v1/artifact/task/4785083130314752/tap/src/test/recovery/tmp_check/log/regress_log_027_persistence_change

Greetings,

Andres Freund




Re: In-placre persistance change of a relation

2021-12-22 Thread Kyotaro Horiguchi
At Thu, 23 Dec 2021 15:01:41 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> I added TAP test to excecise the in-place persistence change.

We don't need a base table for every index. TAP test revised.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 112c077561bb24a0b40995e2d6ada7b33edd6475 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v13 1/2] In-place table persistence change

Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, currently it runs heap rewrite which causes large amount of
file I/O.  This patch makes the command run without heap rewrite.
Addition to that, SET LOGGED while wal_level > minimal emits WAL using
XLOG_FPI instead of massive number of HEAP_INSERT's, which should be
smaller.

Also this allows for the cleanup of files left behind in the crash of
the transaction that created it.
---
 src/backend/access/rmgrdesc/smgrdesc.c|  52 ++
 src/backend/access/transam/README |   8 +
 src/backend/access/transam/xact.c |   7 +
 src/backend/access/transam/xlog.c |  17 +
 src/backend/catalog/storage.c | 545 +-
 src/backend/commands/tablecmds.c  | 266 +++--
 src/backend/replication/basebackup.c  |   3 +-
 src/backend/storage/buffer/bufmgr.c   |  88 +++
 src/backend/storage/file/fd.c |   4 +-
 src/backend/storage/file/reinit.c | 344 +++
 src/backend/storage/smgr/md.c |  93 ++-
 src/backend/storage/smgr/smgr.c   |  32 +
 src/backend/storage/sync/sync.c   |  20 +-
 src/bin/pg_rewind/parsexlog.c |  24 +
 src/common/relpath.c  |  47 +-
 src/include/catalog/storage.h |   3 +
 src/include/catalog/storage_xlog.h|  42 +-
 src/include/common/relpath.h  |   9 +-
 src/include/storage/bufmgr.h  |   2 +
 src/include/storage/fd.h  |   1 +
 src/include/storage/md.h  |   8 +-
 src/include/storage/reinit.h  |  10 +-
 src/include/storage/smgr.h|  17 +
 src/test/recovery/t/027_persistence_change.pl | 244 
 24 files changed, 1704 insertions(+), 182 deletions(-)
 create mode 100644 src/test/recovery/t/027_persistence_change.pl

diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index 773d57..d251f22207 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,49 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
 		 xlrec->blkno, xlrec->flags);
 		pfree(path);
 	}
+	else if (info == XLOG_SMGR_UNLINK)
+	{
+		xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+		char	   *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+		appendStringInfoString(buf, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_MARK)
+	{
+		xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+		char	   *path = GetRelationPath(xlrec->rnode.dbNode,
+		   xlrec->rnode.spcNode,
+		   xlrec->rnode.relNode,
+		   InvalidBackendId,
+		   xlrec->forkNum, xlrec->mark);
+		char	   *action;
+
+		switch (xlrec->action)
+		{
+			case XLOG_SMGR_MARK_CREATE:
+action = "CREATE";
+break;
+			case XLOG_SMGR_MARK_UNLINK:
+action = "DELETE";
+break;
+			default:
+action = "";
+break;
+		}
+
+		appendStringInfo(buf, "%s %s", action, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_BUFPERSISTENCE)
+	{
+		xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+		char	   *path = relpathperm(xlrec->rnode, MAIN_FORKNUM);
+
+		appendStringInfoString(buf, path);
+		appendStringInfo(buf, " persistence %d", xlrec->persistence);
+		pfree(path);
+	}
 }
 
 const char *
@@ -55,6 +98,15 @@ smgr_identify(uint8 info)
 		case XLOG_SMGR_TRUNCATE:
 			id = "TRUNCATE";
 			break;
+		case XLOG_SMGR_UNLINK:
+			id = "UNLINK";
+			break;
+		case XLOG_SMGR_MARK:
+			id = "MARK";
+			break;
+		case XLOG_SMGR_BUFPERSISTENCE:
+			id = "BUFPERSISTENCE";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c1..b344bbe511 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -724,6 +724,14 @@ we must panic and abort recovery.  The DBA will have to manually clean up
 then restart recovery.  This is part of the reason for not writing a WAL
 entry until we've successfully done the original action.
 
+The Smgr MARK files
+
+
+An smgr mark file is created when a new relation file is created to
+mark the relfilenode needs to be cleaned up at recovery time.  In
+contrast to the four actions above, failure to remove smgr mark files
+will lead to data loss, in which case the server will shut down.
+
 
 Skipping WAL for New RelFileNode
 

Re: In-placre persistance change of a relation

2021-12-22 Thread Kyotaro Horiguchi
At Wed, 22 Dec 2021 08:42:14 +, Jakub Wartak  
wrote in 
> I think there's slight omission:
...
> apparently reindex_index() params cannot be NULL - the same happens with 
> switching persistent

Hmm. a3dc926009 has changed the interface. (But the name is also
changed after that.)

-reindex_relation(Oid relid, int flags, int options)
+reindex_relation(Oid relid, int flags, ReindexParams *params)

> I'll also try to give another shot to the patch early next year - as we are 
> starting long Christmas/holiday break here 
> - with verifying WAL for GiST and more advanced setup (more crashes, and 
> standby/archiving/barman to see 
> how it's possible to use wal_level=minimal <-> replica transitions). 

Thanks. I added TAP test to excecise the in-place persistence change.

have a nice holiday, Jakub!

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 19a8bb14dc863981e33ce8a10ecb9b87a4aa3937 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v12 1/2] In-place table persistence change

Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, currently it runs heap rewrite which causes large amount of
file I/O.  This patch makes the command run without heap rewrite.
Addition to that, SET LOGGED while wal_level > minimal emits WAL using
XLOG_FPI instead of massive number of HEAP_INSERT's, which should be
smaller.

Also this allows for the cleanup of files left behind in the crash of
the transaction that created it.
---
 src/backend/access/rmgrdesc/smgrdesc.c|  52 ++
 src/backend/access/transam/README |   8 +
 src/backend/access/transam/xact.c |   7 +
 src/backend/access/transam/xlog.c |  17 +
 src/backend/catalog/storage.c | 545 +-
 src/backend/commands/tablecmds.c  | 266 +++--
 src/backend/replication/basebackup.c  |   3 +-
 src/backend/storage/buffer/bufmgr.c   |  88 +++
 src/backend/storage/file/fd.c |   4 +-
 src/backend/storage/file/reinit.c | 344 +++
 src/backend/storage/smgr/md.c |  93 ++-
 src/backend/storage/smgr/smgr.c   |  32 +
 src/backend/storage/sync/sync.c   |  20 +-
 src/bin/pg_rewind/parsexlog.c |  24 +
 src/common/relpath.c  |  47 +-
 src/include/catalog/storage.h |   3 +
 src/include/catalog/storage_xlog.h|  42 +-
 src/include/common/relpath.h  |   9 +-
 src/include/storage/bufmgr.h  |   2 +
 src/include/storage/fd.h  |   1 +
 src/include/storage/md.h  |   8 +-
 src/include/storage/reinit.h  |  10 +-
 src/include/storage/smgr.h|  17 +
 src/test/recovery/t/027_persistence_change.pl | 268 +
 24 files changed, 1728 insertions(+), 182 deletions(-)
 create mode 100644 src/test/recovery/t/027_persistence_change.pl

diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index 773d57..d251f22207 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,49 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
 		 xlrec->blkno, xlrec->flags);
 		pfree(path);
 	}
+	else if (info == XLOG_SMGR_UNLINK)
+	{
+		xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+		char	   *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+		appendStringInfoString(buf, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_MARK)
+	{
+		xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+		char	   *path = GetRelationPath(xlrec->rnode.dbNode,
+		   xlrec->rnode.spcNode,
+		   xlrec->rnode.relNode,
+		   InvalidBackendId,
+		   xlrec->forkNum, xlrec->mark);
+		char	   *action;
+
+		switch (xlrec->action)
+		{
+			case XLOG_SMGR_MARK_CREATE:
+action = "CREATE";
+break;
+			case XLOG_SMGR_MARK_UNLINK:
+action = "DELETE";
+break;
+			default:
+action = "";
+break;
+		}
+
+		appendStringInfo(buf, "%s %s", action, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_BUFPERSISTENCE)
+	{
+		xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+		char	   *path = relpathperm(xlrec->rnode, MAIN_FORKNUM);
+
+		appendStringInfoString(buf, path);
+		appendStringInfo(buf, " persistence %d", xlrec->persistence);
+		pfree(path);
+	}
 }
 
 const char *
@@ -55,6 +98,15 @@ smgr_identify(uint8 info)
 		case XLOG_SMGR_TRUNCATE:
 			id = "TRUNCATE";
 			break;
+		case XLOG_SMGR_UNLINK:
+			id = "UNLINK";
+			break;
+		case XLOG_SMGR_MARK:
+			id = "MARK";
+			break;
+		case XLOG_SMGR_BUFPERSISTENCE:
+			id = "BUFPERSISTENCE";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c1..b344bbe511 100644
--- a/src/backend/access/transam/README
+++ 

RE: In-placre persistance change of a relation

2021-12-22 Thread Jakub Wartak
Hi Kyotaro,

> At Tue, 21 Dec 2021 13:07:28 +, Jakub Wartak
>  wrote in
> > So what's suspicious is that 122880 -> 0 file size truncation. I've
> > investigated WAL and it seems to contain TRUNCATE records after logged
> FPI images, so when the crash recovery would kick in it probably clears this
> table (while it shouldn't).
> 
> Darn..  It is too silly that I wrongly issued truncate records for the target
> relation of the function (rel) instaed of the relation on which we're 
> currently
> operating at that time (r).
> 
> [..]
> The following fix works.

Cool, I have verified basic stuff that was coming to my mind, now even cfbot is 
happy with v11, You should happy too I hope :)

> I made another change in this version. Previously only btree among all index
> AMs was processed in the in-place manner.  In this version we do that all
> AMs except GiST.  Maybe if gistGetFakeLSN behaved the same way for
> permanent and unlogged indexes, we could skip index rebuild in exchange of
> some extra WAL records emitted while it is unlogged.

I think there's slight omission:

-- unlogged table -> logged with GiST:
DROP TABLE IF EXISTS testcase;
CREATE UNLOGGED TABLE testcase(geom geometry not null);
CREATE INDEX idx_testcase_gist ON testcase USING gist(geom);
INSERT INTO testcase(geom) SELECT ST_Buffer(ST_SetSRID(ST_MakePoint(-1.0, 
2.0),4326), 0.0001);
ALTER TABLE testcase SET LOGGED;

-- crashes with:
(gdb) where
#0  reindex_index (indexId=indexId@entry=65541, 
skip_constraint_checks=skip_constraint_checks@entry=true, 
persistence=persistence@entry=112 'p', params=params@entry=0x0) at index.c:3521
#1  0x0062f494 in RelationChangePersistence (tab=tab@entry=0x1947258, 
persistence=112 'p', lockmode=lockmode@entry=8) at tablecmds.c:5434
#2  0x00642819 in ATRewriteTables (context=0x7ffc19c04520, 
lockmode=, wqueue=0x7ffc19c04388, parsetree=0x1925ec8) at 
tablecmds.c:5644
[..]
#10 0x007f078f in exec_simple_query (query_string=0x1925340 "ALTER 
TABLE testcase SET LOGGED;") at postgres.c:1215

apparently reindex_index() params cannot be NULL - the same happens with 
switching persistent 
table to unlogged one too (with GiST). 

I'll also try to give another shot to the patch early next year - as we are 
starting long Christmas/holiday break here 
- with verifying WAL for GiST and more advanced setup (more crashes, and 
standby/archiving/barman to see 
how it's possible to use wal_level=minimal <-> replica transitions). 

-J.








Re: In-placre persistance change of a relation

2021-12-21 Thread Kyotaro Horiguchi
Hello, Jakub.

At Tue, 21 Dec 2021 13:07:28 +, Jakub Wartak  
wrote in 
> So what's suspicious is that 122880 -> 0 file size truncation. I've 
> investigated WAL and it seems to contain TRUNCATE records
> after logged FPI images, so when the crash recovery would kick in it probably 
> clears this table (while it shouldn't).

Darn..  It is too silly that I wrongly issued truncate records for the
target relation of the function (rel) instaed of the relation on which
we're currently operating at that time (r).

> However if I perform CHECKPOINT just before crash the WAL stream contains 
> just RUNNING_XACTS and CHECKPOINT_ONLINE 
> redo records, this probably prevents truncating. I'm newbie here so please 
> take this theory with grain of salt, it can be 
> something completely different.

It is because the WAL records are inconsistent with the on-disk state.
After a crash before a checkpoint after the SET LOGGED, recovery ends with
recoverying the broken WAL records, but after that the on-disk state
is persisted and the broken WAL records are not replayed.

The following fix works.

--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -5478,7 +5478,7 @@ RelationChangePersistence(AlteredTableInfo *tab, char 
persistence,
xl_smgr_truncate xlrec;
 
xlrec.blkno = 0;
-   xlrec.rnode = rel->rd_node;
+   xlrec.rnode = r->rd_node;
xlrec.flags = SMGR_TRUNCATE_ALL;
 

I made another change in this version. Previously only btree among all
index AMs was processed in the in-place manner.  In this version we do
that all AMs except GiST.  Maybe if gistGetFakeLSN behaved the same
way for permanent and unlogged indexes, we could skip index rebuild in
exchange of some extra WAL records emitted while it is unlogged.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 0cac0fade05322c1aa8b7ec020f8fe1f9e5fb50e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v11 1/2] In-place table persistence change

Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, currently it runs heap rewrite which causes large amount of
file I/O.  This patch makes the command run without heap rewrite.
Addition to that, SET LOGGED while wal_level > minimal emits WAL using
XLOG_FPI instead of massive number of HEAP_INSERT's, which should be
smaller.

Also this allows for the cleanup of files left behind in the crash of
the transaction that created it.
---
 src/backend/access/rmgrdesc/smgrdesc.c |  52 +++
 src/backend/access/transam/README  |   8 +
 src/backend/access/transam/xact.c  |   7 +
 src/backend/access/transam/xlog.c  |  17 +
 src/backend/catalog/storage.c  | 545 -
 src/backend/commands/tablecmds.c   | 265 ++--
 src/backend/replication/basebackup.c   |   3 +-
 src/backend/storage/buffer/bufmgr.c|  88 
 src/backend/storage/file/fd.c  |   4 +-
 src/backend/storage/file/reinit.c  | 344 +++-
 src/backend/storage/smgr/md.c  |  93 -
 src/backend/storage/smgr/smgr.c|  32 ++
 src/backend/storage/sync/sync.c|  20 +-
 src/bin/pg_rewind/parsexlog.c  |  24 ++
 src/common/relpath.c   |  47 ++-
 src/include/catalog/storage.h  |   3 +
 src/include/catalog/storage_xlog.h |  42 +-
 src/include/common/relpath.h   |   9 +-
 src/include/storage/bufmgr.h   |   2 +
 src/include/storage/fd.h   |   1 +
 src/include/storage/md.h   |   8 +-
 src/include/storage/reinit.h   |  10 +-
 src/include/storage/smgr.h |  17 +
 23 files changed, 1459 insertions(+), 182 deletions(-)

diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index 773d57..d251f22207 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,49 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
 		 xlrec->blkno, xlrec->flags);
 		pfree(path);
 	}
+	else if (info == XLOG_SMGR_UNLINK)
+	{
+		xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+		char	   *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+		appendStringInfoString(buf, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_MARK)
+	{
+		xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+		char	   *path = GetRelationPath(xlrec->rnode.dbNode,
+		   xlrec->rnode.spcNode,
+		   xlrec->rnode.relNode,
+		   InvalidBackendId,
+		   xlrec->forkNum, xlrec->mark);
+		char	   *action;
+
+		switch (xlrec->action)
+		{
+			case XLOG_SMGR_MARK_CREATE:
+action = "CREATE";
+break;
+			case XLOG_SMGR_MARK_UNLINK:
+action = "DELETE";
+break;
+			default:
+action = "";
+break;
+		}
+
+		appendStringInfo(buf, "%s %s", action, path);
+		pfree(path);
+	}
+	else if (info == 

RE: In-placre persistance change of a relation

2021-12-21 Thread Jakub Wartak
Hi Kyotaro,

> I took a bit too long detour but the patch gets to pass make-world for me.

Good news, v10 passes all the tests for me (including TAP recover ones). 
There's major problem I think:

drop table t6;
create unlogged table t6 (id bigint, t text);
create sequence s1;
insert into t6 select nextval('s1'), repeat('A', 1000) from generate_series(1, 
100);
alter table t6 set logged;
select pg_sleep(1);
<--optional checkpoint, more on this later.
/usr/pgsql-15/bin/pg_ctl -D /var/lib/pgsql/15/data -l logfile -m immediate stop
/usr/pgsql-15/bin/pg_ctl -D /var/lib/pgsql/15/data -l logfile start
select count(*) from t6; -- shows 0 rows

But If I perform checkpoint before crash, data is there..  apparently the 
missing steps done by checkpointer 
seem to help. If checkpoint is not done, then some peeking reveals that upon 
startup there is truncation (?!):

$ /usr/pgsql-15/bin/pg_ctl -D /var/lib/pgsql/15/data -l logfile -m immediate 
stop
$ find /var/lib/pgsql/15/data/ -name '73741*' -ls
112723206  120 -rw---   1 postgres postgres   122880 Dec 21 12:42 
/var/lib/pgsql/15/data/base/73740/73741
112723202   24 -rw---   1 postgres postgres24576 Dec 21 12:42 
/var/lib/pgsql/15/data/base/73740/73741_fsm
$ /usr/pgsql-15/bin/pg_ctl -D /var/lib/pgsql/15/data -l logfile start
waiting for server to start done
server started
$ find /var/lib/pgsql/15/data/ -name '73741*' -ls
1127232060 -rw---   1 postgres postgres0 Dec 21 12:42 
/var/lib/pgsql/15/data/base/73740/73741
112723202   16 -rw---   1 postgres postgres16384 Dec 21 12:42 
/var/lib/pgsql/15/data/base/73740/73741_fsm

So what's suspicious is that 122880 -> 0 file size truncation. I've 
investigated WAL and it seems to contain TRUNCATE records
after logged FPI images, so when the crash recovery would kick in it probably 
clears this table (while it shouldn't).

However if I perform CHECKPOINT just before crash the WAL stream contains just 
RUNNING_XACTS and CHECKPOINT_ONLINE 
redo records, this probably prevents truncating. I'm newbie here so please take 
this theory with grain of salt, it can be 
something completely different.

-J.





Re: In-placre persistance change of a relation

2021-12-21 Thread Kyotaro Horiguchi
At Tue, 21 Dec 2021 17:13:21 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> Ugh! I completely forgot about TAP tests.. Thanks for the testing and
> sorry for the bugs.
> 
> This is a bit big change so I need a bit of time before posting the
> next version.

I took a bit too long detour but the patch gets to pass make-world for
me.

In this version:

- When relation persistence is changed from logged to unlogged, buffer
 persistence is flipped then an init-fork is created along with a mark
 file for the fork (RelationCreateInitFork). The mark file is removed
 at commit but left alone after a crash before commit. At the next
 startup, ResetUnloggedRelationsInDbspaceDir() removes the init fork
 file if it finds the mark file corresponding to the file.

- When relation persistence is changed from unlogged to logged, buffer
  persistence is flipped then the exisging init-fork is marked to be
  dropped at commit (RelationDropInitFork). Finally the whole content
  is WAL-logged in the page-wise manner (RelationChangePersistence),

- The two operations above are repeatable within a transaction and
 commit makes the last operation persist and rollback make the all
 operations abandoned.

- Storage files are created along with a "mark" file for the
 relfilenode. It behaves the same way to the above except the mark
 files corresponds to the whole relfilenode.

- The at-commit operations this patch adds require to be WAL-logged so
 they don't fit pendingDeletes list, which is executed after commit. I
 added a new pending-work list pendingCleanups that is executed just
 after pendingSyncs.  (new in this version)

 
regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From bc7e14b8af3c72e4ab99c964688d18ef4545f8b9 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v10 1/2] In-place table persistence change

Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, currently it runs heap rewrite which causes large amount of
file I/O.  This patch makes the command run without heap rewrite.
Addition to that, SET LOGGED while wal_level > minimal emits WAL using
XLOG_FPI instead of massive number of HEAP_INSERT's, which should be
smaller.

Also this allows for the cleanup of files left behind in the crash of
the transaction that created it.
---
 src/backend/access/rmgrdesc/smgrdesc.c |  52 +++
 src/backend/access/transam/README  |   8 +
 src/backend/access/transam/xact.c  |   7 +
 src/backend/access/transam/xlog.c  |  17 +
 src/backend/catalog/storage.c  | 545 -
 src/backend/commands/tablecmds.c   | 256 ++--
 src/backend/replication/basebackup.c   |   3 +-
 src/backend/storage/buffer/bufmgr.c|  88 
 src/backend/storage/file/fd.c  |   4 +-
 src/backend/storage/file/reinit.c  | 344 +++-
 src/backend/storage/smgr/md.c  |  93 -
 src/backend/storage/smgr/smgr.c|  32 ++
 src/backend/storage/sync/sync.c|  20 +-
 src/bin/pg_rewind/parsexlog.c  |  24 ++
 src/common/relpath.c   |  47 ++-
 src/include/catalog/storage.h  |   3 +
 src/include/catalog/storage_xlog.h |  42 +-
 src/include/common/relpath.h   |   9 +-
 src/include/storage/bufmgr.h   |   2 +
 src/include/storage/fd.h   |   1 +
 src/include/storage/md.h   |   8 +-
 src/include/storage/reinit.h   |  10 +-
 src/include/storage/smgr.h |  17 +
 23 files changed, 1450 insertions(+), 182 deletions(-)

diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index 773d57..d251f22207 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,49 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
 		 xlrec->blkno, xlrec->flags);
 		pfree(path);
 	}
+	else if (info == XLOG_SMGR_UNLINK)
+	{
+		xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+		char	   *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+		appendStringInfoString(buf, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_MARK)
+	{
+		xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+		char	   *path = GetRelationPath(xlrec->rnode.dbNode,
+		   xlrec->rnode.spcNode,
+		   xlrec->rnode.relNode,
+		   InvalidBackendId,
+		   xlrec->forkNum, xlrec->mark);
+		char	   *action;
+
+		switch (xlrec->action)
+		{
+			case XLOG_SMGR_MARK_CREATE:
+action = "CREATE";
+break;
+			case XLOG_SMGR_MARK_UNLINK:
+action = "DELETE";
+break;
+			default:
+action = "";
+break;
+		}
+
+		appendStringInfo(buf, "%s %s", action, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_BUFPERSISTENCE)
+	{
+		xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+		char	   *path = relpathperm(xlrec->rnode, MAIN_FORKNUM);
+
+		appendStringInfoString(buf, path);
+		appendStringInfo(buf, " persistence %d", 

Re: In-placre persistance change of a relation

2021-12-21 Thread Kyotaro Horiguchi
Ugh! I completely forgot about TAP tests.. Thanks for the testing and
sorry for the bugs.

This is a bit big change so I need a bit of time before posting the
next version.


At Mon, 20 Dec 2021 13:38:35 +, Jakub Wartak  
wrote in 
> 1) check-worlds seems OK but make -C src/test/recovery check shows a couple 
> of failing tests here locally and in 
> https://cirrus-ci.com/task/4699985735319552?logs=test#L807 :
> t/009_twophase.pl  (Wstat: 256 Tests: 24 Failed: 1)
>   Failed test:  21
>   Non-zero exit status: 1

PREPARE TRANSACTION requires uncommited file creation to be
committed. Concretely we need to remove the "mark" files for the
in-transaction created relation file during PREPARE TRANSACTION.

pendingSync is not a parallel mechanism with pendingDeletes so we
cannot move mark deletion to pendingSync.

After all I decided to add a separate list pendingCleanups for pending
non-deletion tasks separately from pendingDeletes and execute it
before insering the commit record.  Not only the above but also all of
the following failures vanished by the change.

> t/014_unlogged_reinit.pl   (Wstat: 512 Tests: 12 Failed: 2)
>   Failed tests:  9-10
>   Non-zero exit status: 2
> t/018_wal_optimize.pl  (Wstat: 7424 Tests: 0 Failed: 0)
>   Non-zero exit status: 29
>   Parse errors: Bad plan.  You planned 38 tests but ran 0.
> t/022_crash_temp_files.pl  (Wstat: 7424 Tests: 6 Failed: 0)
>   Non-zero exit status: 29
>   Parse errors: Bad plan.  You planned 9 tests but ran 6.


> 018 made no sense, I've tried to take a quick look with wal_level=minimal why 
> it is failing , it is mystery to me as the sequence seems to be pretty basic 
> but the outcome is not:

I think this shares the same cause.

> ~> cat repro.sql
> create tablespace tbs1 location '/tbs1';
> CREATE TABLE moved (id int);
> INSERT INTO moved VALUES (1);
> BEGIN;
> ALTER TABLE moved SET TABLESPACE tbs1;
> CREATE TABLE originated (id int);
> INSERT INTO originated VALUES (1);
> CREATE UNIQUE INDEX ON originated(id) TABLESPACE tbs1;
> COMMIT;
..
> ERROR:  could not open file "base/32833/32839": No such file or directory


> z3=# \dt+
>   List of relations
>  Schema |Name| Type  |  Owner   | Persistence |  Size   | Description
> ++---+--+-+-+-
>  public | moved  | table | postgres | permanent   | 0 bytes |
>  public | originated | table | postgres | permanent   | 0 bytes |
> 
> This happens even without placing on tablespace at all {for originated table 
> , but no for moved on}, some major mishap is there (commit should guarantee 
> correctness) or I'm tired and having sloppy fingers.
> 
> 2) minor one testcase, still something is odd.
> 
> drop tablespace tbs1;
> create tablespace tbs1 location '/tbs1';
> CREATE UNLOGGED TABLE t4 (a int) tablespace tbs1;
> CREATE UNLOGGED TABLE t5 (a int) tablespace tbs1;
> CREATE UNLOGGED TABLE t6 (a int) tablespace tbs1;
> CREATE TABLE t7 (a int) tablespace tbs1;
> insert into t7 values (1);
> insert into t5 values (1);
> insert into t6 values (1);
> \dt+
>  List of relations
>  Schema | Name | Type  |  Owner   | Persistence |Size| Description
> +--+---+--+-++-
>  public | t4   | table | postgres | unlogged| 0 bytes|
>  public | t5   | table | postgres | unlogged| 8192 bytes |
>  public | t6   | table | postgres | unlogged| 8192 bytes |
>  public | t7   | table | postgres | permanent   | 8192 bytes |
> (4 rows)
> 
> ALTER TABLE ALL IN TABLESPACE tbs1 set logged; 
> ==> STILL WARNING:  unrecognized node type: 349
> \dt+
>  List of relations
>  Schema | Name | Type  |  Owner   | Persistence |Size| Description
> +--+---+--+-++-
>  public | t4   | table | postgres | permanent   | 0 bytes|
>  public | t5   | table | postgres | permanent   | 8192 bytes |
>  public | t6   | table | postgres | permanent   | 8192 bytes |
>  public | t7   | table | postgres | permanent   | 8192 bytes |
> 
> So it did rewrite however this warning seems to be unfixed. I've tested on 
> e2c52beecdea152ca680a22ef35c6a7da55aa30f.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




RE: In-placre persistance change of a relation

2021-12-20 Thread Jakub Wartak
Hi Kyotaro,

> At Mon, 20 Dec 2021 17:39:27 +0900 (JST), Kyotaro Horiguchi
>  wrote in
> > At Mon, 20 Dec 2021 07:59:29 +, Jakub Wartak
> >  wrote in
> > > BTW fast feedback regarding that ALTER patch  (there were 4 unlogged
> tables):
> > > #  ALTER TABLE ALL IN TABLESPACE tbs1 set logged;
> > > WARNING:  unrecognized node type: 349
> >
> > lol  I met a server crash. Will fix. Thanks!
> 
> That crash vanished after a recompilation for me and I don't see that error.  
> On
> my dev env node# 349 is T_ALterTableSetLoggedAllStmt, which
> 0002 adds.  So perhaps make clean/make all would fix that.

The fastest I could - I've repeated the whole cycle about that one with fresh 
v9 (make clean, configure, make install, fresh initdb) and I've found two 
problems:

1) check-worlds seems OK but make -C src/test/recovery check shows a couple of 
failing tests here locally and in 
https://cirrus-ci.com/task/4699985735319552?logs=test#L807 :
t/009_twophase.pl  (Wstat: 256 Tests: 24 Failed: 1)
  Failed test:  21
  Non-zero exit status: 1
t/014_unlogged_reinit.pl   (Wstat: 512 Tests: 12 Failed: 2)
  Failed tests:  9-10
  Non-zero exit status: 2
t/018_wal_optimize.pl  (Wstat: 7424 Tests: 0 Failed: 0)
  Non-zero exit status: 29
  Parse errors: Bad plan.  You planned 38 tests but ran 0.
t/022_crash_temp_files.pl  (Wstat: 7424 Tests: 6 Failed: 0)
  Non-zero exit status: 29
  Parse errors: Bad plan.  You planned 9 tests but ran 6.

018 made no sense, I've tried to take a quick look with wal_level=minimal why 
it is failing , it is mystery to me as the sequence seems to be pretty basic 
but the outcome is not:
~> cat repro.sql
create tablespace tbs1 location '/tbs1';
CREATE TABLE moved (id int);
INSERT INTO moved VALUES (1);
BEGIN;
ALTER TABLE moved SET TABLESPACE tbs1;
CREATE TABLE originated (id int);
INSERT INTO originated VALUES (1);
CREATE UNIQUE INDEX ON originated(id) TABLESPACE tbs1;
COMMIT;

~> psql -f repro.sql z3; sleep 1; /usr/pgsql-15/bin/pg_ctl -D 
/var/lib/pgsql/15/data -l logfile -m immediate stop
CREATE TABLESPACE
CREATE TABLE
INSERT 0 1
BEGIN
ALTER TABLE
CREATE TABLE
INSERT 0 1
CREATE INDEX
COMMIT
waiting for server to shut down done
server stopped
~> /usr/pgsql-15/bin/pg_ctl -D /var/lib/pgsql/15/data -l logfile start
waiting for server to start done
server started
z3# select * from moved;
ERROR:  could not open file "pg_tblspc/32834/PG_15_202112131/32833/32838": No 
such file or directory
z3=# select * from originated;
ERROR:  could not open file "base/32833/32839": No such file or directory
z3=# \dt+
  List of relations
 Schema |Name| Type  |  Owner   | Persistence |  Size   | Description
++---+--+-+-+-
 public | moved  | table | postgres | permanent   | 0 bytes |
 public | originated | table | postgres | permanent   | 0 bytes |

This happens even without placing on tablespace at all {for originated table , 
but no for moved on}, some major mishap is there (commit should guarantee 
correctness) or I'm tired and having sloppy fingers.

2) minor one testcase, still something is odd.

drop tablespace tbs1;
create tablespace tbs1 location '/tbs1';
CREATE UNLOGGED TABLE t4 (a int) tablespace tbs1;
CREATE UNLOGGED TABLE t5 (a int) tablespace tbs1;
CREATE UNLOGGED TABLE t6 (a int) tablespace tbs1;
CREATE TABLE t7 (a int) tablespace tbs1;
insert into t7 values (1);
insert into t5 values (1);
insert into t6 values (1);
\dt+
 List of relations
 Schema | Name | Type  |  Owner   | Persistence |Size| Description
+--+---+--+-++-
 public | t4   | table | postgres | unlogged| 0 bytes|
 public | t5   | table | postgres | unlogged| 8192 bytes |
 public | t6   | table | postgres | unlogged| 8192 bytes |
 public | t7   | table | postgres | permanent   | 8192 bytes |
(4 rows)

ALTER TABLE ALL IN TABLESPACE tbs1 set logged; 
==> STILL WARNING:  unrecognized node type: 349
\dt+
 List of relations
 Schema | Name | Type  |  Owner   | Persistence |Size| Description
+--+---+--+-++-
 public | t4   | table | postgres | permanent   | 0 bytes|
 public | t5   | table | postgres | permanent   | 8192 bytes |
 public | t6   | table | postgres | permanent   | 8192 bytes |
 public | t7   | table | postgres | permanent   | 8192 bytes |

So it did rewrite however this warning seems to be unfixed. I've tested on 
e2c52beecdea152ca680a22ef35c6a7da55aa30f.

-J.


Re: In-placre persistance change of a relation

2021-12-20 Thread Kyotaro Horiguchi
At Mon, 20 Dec 2021 17:39:27 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Mon, 20 Dec 2021 07:59:29 +, Jakub Wartak  
> wrote in 
> > BTW fast feedback regarding that ALTER patch  (there were 4 unlogged 
> > tables):
> > #  ALTER TABLE ALL IN TABLESPACE tbs1 set logged;
> > WARNING:  unrecognized node type: 349
> 
> lol  I met a server crash. Will fix. Thanks!

That crash vanished after a recompilation for me and I don't see that
error.  On my dev env node# 349 is T_ALterTableSetLoggedAllStmt, which
0002 adds.  So perhaps make clean/make all would fix that.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: In-placre persistance change of a relation

2021-12-20 Thread Kyotaro Horiguchi
At Mon, 20 Dec 2021 07:59:29 +, Jakub Wartak  
wrote in 
> Hi Kyotaro, I'm glad you are still into this
> 
> > I didn't register for some reasons.
> 
> Right now in v8 there's a typo in ./src/backend/catalog/storage.c :
> 
> storage.c: In function 'RelationDropInitFork':
> storage.c:385:44: error: expected statement before ')' token
> pending->unlink_forknum != INIT_FORKNUM)) <-- here, one ) too much

Yeah, I thought that I had removed it. v9 patch I believe is correct.

> > 1. I'm not sure that we want to have the new mark files.
> 
> I can't help with such design decision, but if there are doubts maybe then 
> add checking return codes around:
> a) pg_fsync() and fsync_parent_path() (??) inside mdcreatemark()
> b) mdunlinkmark() inside mdunlinkmark()
> and PANIC if something goes wrong?

The point is it is worth the complexity it adds. Since the mark file
can resolve another existing (but I don't recall in detail) issue and
this patchset actually fixes it, it can be said to have a certain
extent of persuasiveness. But that doesn't change the fact that it's
additional complexity.

> > 2. Aside of possible bugs, I'm not confident that the crash-safety of
> >  this patch is actually water-tight. At least we need tests for some
> >  failure cases.
> >
> > 3. As mentioned in transam/README, failure in removing smgr mark files
> >leads to immediate shut down.  I'm not sure this behavior is acceptable.
> 
> Doesn't it happen for most of the stuff already? There's even data_sync_retry 
> GUC.

Hmm. Yes, actually it is "as water-tight as possible".  I just want
others' eyes on that perspective.  CF could be the entry point of
others but I'm a bit hesitent to add a new entry..

> > 4. Including the reasons above, this is not fully functionally.
> >For example, if we execute the following commands on primary,
> >replica dones't work correctly. (boom!)
> > 
> >=# CREATE UNLOGGED TABLE t (a int);
> >=# ALTER TABLE t SET LOGGED;
> > 
> > 
> > The following fixes are done in the attched v8.
> > 
> > - Rebased. Referring to Jakub and Justin's work, I replaced direct
> >   access to ->rd_smgr with RelationGetSmgr() and removed calls to
> >   RelationOpenSmgr(). I still separate the "ALTER TABLE ALL IN
> >   TABLESPACE SET LOGGED/UNLOGGED" statement part.
> > 
> > - Fixed RelationCreate/DropInitFork's behavior for non-target
> >   relations. (From Jakub's work).
> > 
> > - Fixed wording of some comments.
> > 
> > - As revisited, I found a bug around recovery. If the logged-ness of a
> >   relation gets flipped repeatedly in a transaction, duplicate
> >   pending-delete entries are accumulated during recovery and work in a
> >   wrong way. sgmr_redo now adds up to one entry for a action.
> > 
> > - The issue 4 above is not fixed (yet).
> 
> Thanks again, If you have any list of crush tests ideas maybe I'll have some 
> minutes 
> to try to figure them out. Is there is any goto list of stuff to be checked 
> to add confidence
> to this patch (as per point #2) ?

Just causing a crash (kill -9) after executing problem-prone command
sequence, then seeing recovery works well would sufficient.

For example:

create unlogged table; begin; insert ..; alter table set logged;
.  Recovery works.

"create logged; begin; {alter unlogged; alter logged;} * 1000; alter
logged; commit/abort" doesn't pollute pgdata.


> BTW fast feedback regarding that ALTER patch  (there were 4 unlogged tables):
> #  ALTER TABLE ALL IN TABLESPACE tbs1 set logged;
> WARNING:  unrecognized node type: 349

lol  I met a server crash. Will fix. Thanks!

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




RE: In-placre persistance change of a relation

2021-12-19 Thread Jakub Wartak
Hi Kyotaro, I'm glad you are still into this

> I didn't register for some reasons.

Right now in v8 there's a typo in ./src/backend/catalog/storage.c :

storage.c: In function 'RelationDropInitFork':
storage.c:385:44: error: expected statement before ')' token
pending->unlink_forknum != INIT_FORKNUM)) <-- here, one ) too much

> 1. I'm not sure that we want to have the new mark files.

I can't help with such design decision, but if there are doubts maybe then add 
checking return codes around:
a) pg_fsync() and fsync_parent_path() (??) inside mdcreatemark()
b) mdunlinkmark() inside mdunlinkmark()
and PANIC if something goes wrong?

> 2. Aside of possible bugs, I'm not confident that the crash-safety of
>  this patch is actually water-tight. At least we need tests for some
>  failure cases.
>
> 3. As mentioned in transam/README, failure in removing smgr mark files
>leads to immediate shut down.  I'm not sure this behavior is acceptable.

Doesn't it happen for most of the stuff already? There's even data_sync_retry 
GUC.

> 4. Including the reasons above, this is not fully functionally.
>For example, if we execute the following commands on primary,
>replica dones't work correctly. (boom!)
> 
>=# CREATE UNLOGGED TABLE t (a int);
>=# ALTER TABLE t SET LOGGED;
> 
> 
> The following fixes are done in the attched v8.
> 
> - Rebased. Referring to Jakub and Justin's work, I replaced direct
>   access to ->rd_smgr with RelationGetSmgr() and removed calls to
>   RelationOpenSmgr(). I still separate the "ALTER TABLE ALL IN
>   TABLESPACE SET LOGGED/UNLOGGED" statement part.
> 
> - Fixed RelationCreate/DropInitFork's behavior for non-target
>   relations. (From Jakub's work).
> 
> - Fixed wording of some comments.
> 
> - As revisited, I found a bug around recovery. If the logged-ness of a
>   relation gets flipped repeatedly in a transaction, duplicate
>   pending-delete entries are accumulated during recovery and work in a
>   wrong way. sgmr_redo now adds up to one entry for a action.
> 
> - The issue 4 above is not fixed (yet).

Thanks again, If you have any list of crush tests ideas maybe I'll have some 
minutes 
to try to figure them out. Is there is any goto list of stuff to be checked to 
add confidence
to this patch (as per point #2) ?

BTW fast feedback regarding that ALTER patch  (there were 4 unlogged tables):
#  ALTER TABLE ALL IN TABLESPACE tbs1 set logged;
WARNING:  unrecognized node type: 349

-J.




Re: In-placre persistance change of a relation

2021-12-19 Thread Kyotaro Horiguchi
At Mon, 20 Dec 2021 15:28:23 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> 
> 4. Including the reasons above, this is not fully functionally.
>For example, if we execute the following commands on primary,
>replica dones't work correctly. (boom!)
> 
>=# CREATE UNLOGGED TABLE t (a int);
>=# ALTER TABLE t SET LOGGED;

> - The issue 4 above is not fixed (yet).

Not only for the case, RelationChangePersistence needs to send a
truncate record before FPIs.  If primary crashes amid of the
operation, the table content will be vanish with the persistence
change. That is the correct behavior.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From b28163fd7b3527e69f5b76f252891f800d7ac98c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v9 1/2] In-place table persistence change

Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, currently it runs heap rewrite which causes large amount of
file I/O.  This patch makes the command run without heap rewrite.
Addition to that, SET LOGGED while wal_level > minimal emits WAL using
XLOG_FPI instead of massive number of HEAP_INSERT's, which should be
smaller.

Also this allows for the cleanup of files left behind in the crash of
the transaction that created it.
---
 src/backend/access/rmgrdesc/smgrdesc.c |  52 +++
 src/backend/access/transam/README  |   8 +
 src/backend/access/transam/xlog.c  |  17 +
 src/backend/catalog/storage.c  | 593 +++--
 src/backend/commands/tablecmds.c   | 256 +--
 src/backend/replication/basebackup.c   |   3 +-
 src/backend/storage/buffer/bufmgr.c|  88 
 src/backend/storage/file/fd.c  |   4 +-
 src/backend/storage/file/reinit.c  | 344 ++
 src/backend/storage/smgr/md.c  |  93 +++-
 src/backend/storage/smgr/smgr.c|  32 ++
 src/backend/storage/sync/sync.c|  20 +-
 src/bin/pg_rewind/parsexlog.c  |  24 +
 src/common/relpath.c   |  47 +-
 src/include/catalog/storage.h  |   2 +
 src/include/catalog/storage_xlog.h |  42 +-
 src/include/common/relpath.h   |   9 +-
 src/include/storage/bufmgr.h   |   2 +
 src/include/storage/fd.h   |   1 +
 src/include/storage/md.h   |   8 +-
 src/include/storage/reinit.h   |  10 +-
 src/include/storage/smgr.h |  17 +
 22 files changed, 1465 insertions(+), 207 deletions(-)

diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index 773d57..d251f22207 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,49 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
 		 xlrec->blkno, xlrec->flags);
 		pfree(path);
 	}
+	else if (info == XLOG_SMGR_UNLINK)
+	{
+		xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+		char	   *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+		appendStringInfoString(buf, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_MARK)
+	{
+		xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+		char	   *path = GetRelationPath(xlrec->rnode.dbNode,
+		   xlrec->rnode.spcNode,
+		   xlrec->rnode.relNode,
+		   InvalidBackendId,
+		   xlrec->forkNum, xlrec->mark);
+		char	   *action;
+
+		switch (xlrec->action)
+		{
+			case XLOG_SMGR_MARK_CREATE:
+action = "CREATE";
+break;
+			case XLOG_SMGR_MARK_UNLINK:
+action = "DELETE";
+break;
+			default:
+action = "";
+break;
+		}
+
+		appendStringInfo(buf, "%s %s", action, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_BUFPERSISTENCE)
+	{
+		xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+		char	   *path = relpathperm(xlrec->rnode, MAIN_FORKNUM);
+
+		appendStringInfoString(buf, path);
+		appendStringInfo(buf, " persistence %d", xlrec->persistence);
+		pfree(path);
+	}
 }
 
 const char *
@@ -55,6 +98,15 @@ smgr_identify(uint8 info)
 		case XLOG_SMGR_TRUNCATE:
 			id = "TRUNCATE";
 			break;
+		case XLOG_SMGR_UNLINK:
+			id = "UNLINK";
+			break;
+		case XLOG_SMGR_MARK:
+			id = "MARK";
+			break;
+		case XLOG_SMGR_BUFPERSISTENCE:
+			id = "BUFPERSISTENCE";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c1..b344bbe511 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -724,6 +724,14 @@ we must panic and abort recovery.  The DBA will have to manually clean up
 then restart recovery.  This is part of the reason for not writing a WAL
 entry until we've successfully done the original action.
 
+The Smgr MARK files
+
+
+An smgr mark file is created when a new relation file is created to
+mark the relfilenode needs to be cleaned up at recovery time.  In
+contrast to the four actions above, failure to remove smgr mark files
+will lead to data loss, 

Re: In-placre persistance change of a relation

2021-12-19 Thread Kyotaro Horiguchi
Hello, Jakub.

At Fri, 17 Dec 2021 09:10:30 +, Jakub Wartak  
wrote in 
> the patch didn't apply clean (it's from March; some hunks were failing), so 
> I've fixed it and the combined git format-patch is attached. It did conflict 
> with the following:

Thanks for looking this. Also thanks for Justin for finding the silly
use-after-free bug. (Now I see the regression test fails and I'm not
sure how come I didn't find this before.)

>   b0483263dda - Add support for SET ACCESS METHOD in ALTER TABLE
>   7b565843a94 - Add call to object access hook at the end of table 
> rewrite in ALTER TABLE
>   9ce346eabf3 - Report progress of startup operations that take a long 
> time.
>   f10f0ae420 - Replace RelationOpenSmgr() with RelationGetSmgr().
> 
> I'm especially worried if I didn't screw up something/forgot something 
> related to the last one (rd->rd_smgr changes), but I'm getting "All 210 tests 
> passed".

About the last one, all rel->rd_smgr acesses need to be repalced with
RelationGetSmgr(). On the other hand we can simply remove
RelationOpenSmgr() calls since the target smgrrelation is guaranteed
to be loaded by RelationGetSmgr().

The fix you made for RelationCreate/DropInitFork is correct and
changes you made would work, but I prefer that the code not being too
permissive for unknown (or unexpected) states.

> Basic demonstration of this patch (with wal_level=minimal):
>   create unlogged table t6 (id bigint, t text);
>   -- produces 110GB table, takes ~5mins
>   insert into t6 select nextval('s1'), repeat('A', 1000) from 
> generate_series(1, 1);
>   alter table t6 set logged;
>   on baseline SET LOGGED takes: ~7min10s
>   on patched SET LOGGED takes: 25s
> 
> So basically one can - thanks to this patch - use his application (performing 
> classic INSERTs/UPDATEs/DELETEs, so without the need to rewrite to use COPY) 
> and perform literally batch upload and then just switch the tables to LOGGED. 

This result is significant. That operation finally requires WAL writes
but I was not sure how much gain FPIs (or bulk WAL logging) gives in
comparison to operational WALs.

> Some more intensive testing also looks good, assuming table prepared to put 
> pressure on WAL:
>   create unlogged table t_unlogged (id bigint, t text) partition by hash 
> (id);
>   create unlogged table t_unlogged_h0 partition of t_unlogged  FOR VALUES 
> WITH (modulus 4, remainder 0);
>   [..]
>   create unlogged table t_unlogged_h3 partition of t_unlogged  FOR VALUES 
> WITH (modulus 4, remainder 3);
> 
> Workload would still be pretty heavy on LWLock/BufferContent,WALInsert and 
> Lock/extend .
>   t_logged.sql = insert into t_logged select nextval('s1'), repeat('A', 
> 1000) from generate_series(1, 1000); # according to pg_wal_stats.wal_bytes 
> generates ~1MB of WAL
>   t_unlogged.sql = insert into t_unlogged select nextval('s1'), 
> repeat('A', 1000) from generate_series(1, 1000); # according to 
> pg_wal_stats.wal_bytes generates ~3kB of WAL
> 
> so using: pgbench -f .sql -T 30 -P 1 -c 32 -j 3 t 
> with synchronous_commit =ON(default):
>   with t_logged.sql:   tps = 229 (lat avg = 138ms)
>   with t_unlogged.sql  tps = 283 (lat avg = 112ms) # almost all on 
> LWLock/WALWrite
> with synchronous_commit =OFF:
>   with t_logged.sql:   tps = 413 (lat avg = 77ms)
>   with t_unloged.sql:  tps = 782 (lat avg = 40ms)
> Afterwards switching the unlogged ~16GB partitions takes 5s per partition. 
> 
> As the thread didn't get a lot of traction, I've registered it into current 
> commitfest https://commitfest.postgresql.org/36/3461/ with You as the author 
> and in 'Ready for review' state.
> 
> I think it behaves as almost finished one and apparently after reading all 
> those discussions that go back over 10years+ time span about this feature, 
> and lot of failed effort towards wal_level=noWAL I think it would be nice to 
> finally start getting some of that of it into the core.

Thanks for taking the performance benchmark.

I didn't register for some reasons.

1. I'm not sure that we want to have the new mark files.

2. Aside of possible bugs, I'm not confident that the crash-safety of
  this patch is actually water-tight. At least we need tests for some
  failure cases.

3. As mentioned in transam/README, failure in removing smgr mark files
   leads to immediate shut down.  I'm not sure this behavior is acceptable.

4. Including the reasons above, this is not fully functionally.
   For example, if we execute the following commands on primary,
   replica dones't work correctly. (boom!)

   =# CREATE UNLOGGED TABLE t (a int);
   =# ALTER TABLE t SET LOGGED;


The following fixes are done in the attched v8.

- Rebased. Referring to Jakub and Justin's work, I replaced direct
  access to ->rd_smgr with RelationGetSmgr() and removed calls to
  RelationOpenSmgr(). I still separate the "ALTER TABLE ALL IN
  TABLESPACE SET LOGGED/UNLOGGED" 

Re: In-placre persistance change of a relation

2021-12-17 Thread Justin Pryzby
On Fri, Dec 17, 2021 at 02:33:25PM +, Jakub Wartak wrote:
> >  Justin wrote:
> > On Fri, Dec 17, 2021 at 09:10:30AM +, Jakub Wartak wrote:
> > > As the thread didn't get a lot of traction, I've registered it into 
> > > current
> > commitfest
> > https://eur02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcommitf
> > est.postgresql.org%2F36%2F3461%2Fdata=04%7C01%7CJakub.Wartak%
> > 40tomtom.com%7Cb815e75090d44e20fd0a08d9c15b45cc%7C374f80267b544a
> > 3ab87d328fa26ec10d%7C0%7C0%7C637753420044612362%7CUnknown%7CT
> > WFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXV
> > CI6Mn0%3D%7C3000sdata=0BTQSVDnVPu4YpECKXXlBJT5q3Gfgv099SaC
> > NuBwiW4%3Dreserved=0 with You as the author and in 'Ready for
> > review' state.
> > 
> > The patch is failing:
> [..] 
> > I think you ran "make check", but should run something like this:
> > make check-world -j8 >check-world.log 2>&1 && echo Success
> 
> Hi Justin,
> 
> I've repeated the check-world and it says to me all is ok locally (also with 
> --enable-cassert --enable-debug , at least on Amazon Linux 2) and also 
> installcheck on default params seems to be ok  
> I don't seem to understand why testfarm reports errors for tests like "path, 
> polygon, rowsecurity" e.g. on Linux/graviton2 and FreeBSD while not on 
> MacOS(?) . 
> Could someone point to me where to start looking/fixing?

Since it says this, it looks a lot like a memory error like a use-after-free
 - like in fsync_parent_path():

 CREATE TABLE PATH_TBL (f1 path);
+ERROR:  could not open file <>  Pacific": No such file or directory

I see at least this one is still failing, though:
time make -C src/test/recovery check
>From 676ecf794b2b0e98d8f31e4245f6f455da5e19cb Mon Sep 17 00:00:00 2001
From: Jakub Wartak 
Date: Thu, 16 Dec 2021 12:03:42 +
Subject: [PATCH 1/2] In-place table persistence change with new command ALTER
 TABLE ALL IN TABLESPACE SET LOGGED/UNLOGGED

Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, currently it runs heap rewrite which causes large amount of
file I/O.  This patch makes the command run without heap rewrite.
Addition to that, SET LOGGED while wal_level > minimal emits WAL using
XLOG_FPI instead of massive number of HEAP_INSERT's, which should be
smaller.

Also this allows for the cleanup of files left behind in the crash of
the transaction that created it.

ALTER TABLE ALL IN TABLESPACE SET LOGGED/UNLOGGED: To ease invoking
ALTER TABLE SET LOGGED/UNLOGGED, this command changes relation persistence
of all tables in the specified tablespace.
---
 src/backend/access/rmgrdesc/smgrdesc.c |  52 +++
 src/backend/access/transam/README  |   8 +
 src/backend/access/transam/xlog.c  |  17 +
 src/backend/catalog/storage.c  | 518 +++--
 src/backend/commands/tablecmds.c   | 374 --
 src/backend/nodes/copyfuncs.c  |  16 +
 src/backend/nodes/equalfuncs.c |  15 +
 src/backend/parser/gram.y  |  20 +
 src/backend/replication/basebackup.c   |   3 +-
 src/backend/storage/buffer/bufmgr.c|  88 +
 src/backend/storage/file/fd.c  |   4 +-
 src/backend/storage/file/reinit.c  | 318 +++
 src/backend/storage/smgr/md.c  |  92 -
 src/backend/storage/smgr/smgr.c|  32 ++
 src/backend/storage/sync/sync.c|  20 +-
 src/backend/tcop/utility.c |  11 +
 src/bin/pg_rewind/parsexlog.c  |  24 ++
 src/common/relpath.c   |  47 ++-
 src/include/catalog/storage.h  |   2 +
 src/include/catalog/storage_xlog.h |  42 +-
 src/include/commands/tablecmds.h   |   2 +
 src/include/common/relpath.h   |   9 +-
 src/include/nodes/nodes.h  |   1 +
 src/include/nodes/parsenodes.h |   9 +
 src/include/storage/bufmgr.h   |   2 +
 src/include/storage/fd.h   |   1 +
 src/include/storage/md.h   |   8 +-
 src/include/storage/reinit.h   |  10 +-
 src/include/storage/smgr.h |  17 +
 29 files changed, 1583 insertions(+), 179 deletions(-)

diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index 773d57..d251f22207 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,49 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
 		 xlrec->blkno, xlrec->flags);
 		pfree(path);
 	}
+	else if (info == XLOG_SMGR_UNLINK)
+	{
+		xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+		char	   *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+		appendStringInfoString(buf, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_MARK)
+	{
+		xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+		char	   *path = GetRelationPath(xlrec->rnode.dbNode,
+		   xlrec->rnode.spcNode,
+		   xlrec->rnode.relNode,
+		   InvalidBackendId,
+		   xlrec->forkNum, xlrec->mark);
+		char	   *action;
+
+		switch 

RE: In-placre persistance change of a relation

2021-12-17 Thread Jakub Wartak
>  Justin wrote:
> On Fri, Dec 17, 2021 at 09:10:30AM +, Jakub Wartak wrote:
> > As the thread didn't get a lot of traction, I've registered it into current
> commitfest
> https://eur02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcommitf
> est.postgresql.org%2F36%2F3461%2Fdata=04%7C01%7CJakub.Wartak%
> 40tomtom.com%7Cb815e75090d44e20fd0a08d9c15b45cc%7C374f80267b544a
> 3ab87d328fa26ec10d%7C0%7C0%7C637753420044612362%7CUnknown%7CT
> WFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXV
> CI6Mn0%3D%7C3000sdata=0BTQSVDnVPu4YpECKXXlBJT5q3Gfgv099SaC
> NuBwiW4%3Dreserved=0 with You as the author and in 'Ready for
> review' state.
> 
> The patch is failing:
[..] 
> I think you ran "make check", but should run something like this:
> make check-world -j8 >check-world.log 2>&1 && echo Success

Hi Justin,

I've repeated the check-world and it says to me all is ok locally (also with 
--enable-cassert --enable-debug , at least on Amazon Linux 2) and also 
installcheck on default params seems to be ok  
I don't seem to understand why testfarm reports errors for tests like "path, 
polygon, rowsecurity" e.g. on Linux/graviton2 and FreeBSD while not on MacOS(?) 
. 
Could someone point to me where to start looking/fixing?

-J.




Re: In-placre persistance change of a relation

2021-12-17 Thread Justin Pryzby
On Fri, Dec 17, 2021 at 09:10:30AM +, Jakub Wartak wrote:
> I'm especially worried if I didn't screw up something/forgot something 
> related to the last one (rd->rd_smgr changes), but I'm getting "All 210 tests 
> passed".

> As the thread didn't get a lot of traction, I've registered it into current 
> commitfest https://commitfest.postgresql.org/36/3461/ with You as the author 
> and in 'Ready for review' state. 
> I think it behaves as almost finished one and apparently after reading all 
> those discussions that go back over 10years+ time span about this feature, 
> and lot of failed effort towards wal_level=noWAL I think it would be nice to 
> finally start getting some of that of it into the core.

The patch is failing:
http://cfbot.cputube.org/kyotaro-horiguchi.html
https://api.cirrus-ci.com/v1/artifact/task/5564333871595520/regress_diffs/src/bin/pg_upgrade/tmp_check/regress/regression.diffs

I think you ran "make check", but should run something like this:
make check-world -j8 >check-world.log 2>&1 && echo Success

-- 
Justin




RE: In-placre persistance change of a relation

2021-12-17 Thread Jakub Wartak
> Kyotaro wrote:
> > In this version, I got rid of the "CLEANUP FORK"s, and added a new
> > system "Smgr marks".  The mark files have the name of the
> > corresponding fork file followed by ".u" (which means Uncommitted.).
> > "Uncommited"-marked main fork means the same as the
> CLEANUP2_FORKNUM
> > and uncommitted-marked init fork means the same as the
> CLEANUP_FORKNUM
> > in the previous version.x
> >
> > I noticed that the previous version of the patch still leaves an
> > orphan main fork file after "BEGIN; CREATE TABLE x; ROLLBACK; (crash
> > before checkpoint)" since the "mark" file (or CLEANUP2_FORKNUM) is
> > revmoed at rollback.  In this version the responsibility to remove the
> > mark files is moved to SyncPostCheckpoint, where main fork files are
> > actually removed.
> 
> For the record, I noticed that basebackup could be confused by the mark files
> but I haven't looked that yet.
> 

Good morning Kyotaro,

the patch didn't apply clean (it's from March; some hunks were failing), so 
I've fixed it and the combined git format-patch is attached. It did conflict 
with the following:
b0483263dda - Add support for SET ACCESS METHOD in ALTER TABLE
7b565843a94 - Add call to object access hook at the end of table 
rewrite in ALTER TABLE
9ce346eabf3 - Report progress of startup operations that take a long 
time.
f10f0ae420 - Replace RelationOpenSmgr() with RelationGetSmgr().

I'm especially worried if I didn't screw up something/forgot something related 
to the last one (rd->rd_smgr changes), but I'm getting "All 210 tests passed".

Basic demonstration of this patch (with wal_level=minimal):
create unlogged table t6 (id bigint, t text);
-- produces 110GB table, takes ~5mins
insert into t6 select nextval('s1'), repeat('A', 1000) from 
generate_series(1, 1);
alter table t6 set logged;
on baseline SET LOGGED takes: ~7min10s
on patched SET LOGGED takes: 25s

So basically one can - thanks to this patch - use his application (performing 
classic INSERTs/UPDATEs/DELETEs, so without the need to rewrite to use COPY) 
and perform literally batch upload and then just switch the tables to LOGGED. 

Some more intensive testing also looks good, assuming table prepared to put 
pressure on WAL:
create unlogged table t_unlogged (id bigint, t text) partition by hash 
(id);
create unlogged table t_unlogged_h0 partition of t_unlogged  FOR VALUES 
WITH (modulus 4, remainder 0);
[..]
create unlogged table t_unlogged_h3 partition of t_unlogged  FOR VALUES 
WITH (modulus 4, remainder 3);

Workload would still be pretty heavy on LWLock/BufferContent,WALInsert and 
Lock/extend .
t_logged.sql = insert into t_logged select nextval('s1'), repeat('A', 
1000) from generate_series(1, 1000); # according to pg_wal_stats.wal_bytes 
generates ~1MB of WAL
t_unlogged.sql = insert into t_unlogged select nextval('s1'), 
repeat('A', 1000) from generate_series(1, 1000); # according to 
pg_wal_stats.wal_bytes generates ~3kB of WAL

so using: pgbench -f .sql -T 30 -P 1 -c 32 -j 3 t 
with synchronous_commit =ON(default):
with t_logged.sql:   tps = 229 (lat avg = 138ms)
with t_unlogged.sql  tps = 283 (lat avg = 112ms) # almost all on 
LWLock/WALWrite
with synchronous_commit =OFF:
with t_logged.sql:   tps = 413 (lat avg = 77ms)
with t_unloged.sql:  tps = 782 (lat avg = 40ms)
Afterwards switching the unlogged ~16GB partitions takes 5s per partition. 

As the thread didn't get a lot of traction, I've registered it into current 
commitfest https://commitfest.postgresql.org/36/3461/ with You as the author 
and in 'Ready for review' state. 
I think it behaves as almost finished one and apparently after reading all 
those discussions that go back over 10years+ time span about this feature, and 
lot of failed effort towards wal_level=noWAL I think it would be nice to 
finally start getting some of that of it into the core.

-Jakub Wartak.


v7-0001-In-place-table-persistence-change-with-new-comman.patch
Description:  v7-0001-In-place-table-persistence-change-with-new-comman.patch


Re: In-placre persistance change of a relation

2021-03-24 Thread Kyotaro Horiguchi
At Thu, 25 Mar 2021 14:08:05 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> (I'm not sure when the subject was broken..)
> 
> At Thu, 14 Jan 2021 17:32:17 +0900 (JST), Kyotaro Horiguchi 
>  wrote in 
> > Commit bea449c635 conflicts with this on the change of the definition
> > of DropRelFileNodeBuffers. The change simplified this patch by a bit:p
> 
> In this version, I got rid of the "CLEANUP FORK"s, and added a new
> system "Smgr marks".  The mark files have the name of the
> corresponding fork file followed by ".u" (which means Uncommitted.).
> "Uncommited"-marked main fork means the same as the CLEANUP2_FORKNUM
> and uncommitted-marked init fork means the same as the CLEANUP_FORKNUM
> in the previous version.x
> 
> I noticed that the previous version of the patch still leaves an
> orphan main fork file after "BEGIN; CREATE TABLE x; ROLLBACK; (crash
> before checkpoint)" since the "mark" file (or CLEANUP2_FORKNUM) is
> revmoed at rollback.  In this version the responsibility to remove the
> mark files is moved to SyncPostCheckpoint, where main fork files are
> actually removed.

For the record, I noticed that basebackup could be confused by the
mark files but I haven't looked that yet.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: In-placre persistance change of a relation

2021-03-24 Thread Kyotaro Horiguchi
(I'm not sure when the subject was broken..)

At Thu, 14 Jan 2021 17:32:17 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> Commit bea449c635 conflicts with this on the change of the definition
> of DropRelFileNodeBuffers. The change simplified this patch by a bit:p

In this version, I got rid of the "CLEANUP FORK"s, and added a new
system "Smgr marks".  The mark files have the name of the
corresponding fork file followed by ".u" (which means Uncommitted.).
"Uncommited"-marked main fork means the same as the CLEANUP2_FORKNUM
and uncommitted-marked init fork means the same as the CLEANUP_FORKNUM
in the previous version.x

I noticed that the previous version of the patch still leaves an
orphan main fork file after "BEGIN; CREATE TABLE x; ROLLBACK; (crash
before checkpoint)" since the "mark" file (or CLEANUP2_FORKNUM) is
revmoed at rollback.  In this version the responsibility to remove the
mark files is moved to SyncPostCheckpoint, where main fork files are
actually removed.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 27ea96d84dfc2f3e0d62c4b8f7d20cc30771cf86 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v6 1/2] In-place table persistence change

Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, currently it runs heap rewrite which causes large amount of
file I/O.  This patch makes the command run without heap rewrite.
Addition to that, SET LOGGED while wal_level > minimal emits WAL using
XLOG_FPI instead of massive number of HEAP_INSERT's, which should be
smaller.

Also this allows for the cleanup of files left behind in the crash of
the transaction that created it.
---
 src/backend/access/rmgrdesc/smgrdesc.c |  52 +++
 src/backend/access/transam/README  |   8 +
 src/backend/access/transam/xlog.c  |  17 +
 src/backend/catalog/storage.c  | 520 +++--
 src/backend/commands/tablecmds.c   | 246 ++--
 src/backend/replication/basebackup.c   |   3 +-
 src/backend/storage/buffer/bufmgr.c|  88 +
 src/backend/storage/file/fd.c  |   4 +-
 src/backend/storage/file/reinit.c  | 346 +++-
 src/backend/storage/smgr/md.c  |  92 -
 src/backend/storage/smgr/smgr.c|  32 ++
 src/backend/storage/sync/sync.c|  20 +-
 src/bin/pg_rewind/parsexlog.c  |  24 ++
 src/common/relpath.c   |  47 ++-
 src/include/catalog/storage.h  |   2 +
 src/include/catalog/storage_xlog.h |  42 +-
 src/include/common/relpath.h   |   9 +-
 src/include/storage/bufmgr.h   |   2 +
 src/include/storage/fd.h   |   1 +
 src/include/storage/md.h   |   8 +-
 src/include/storage/reinit.h   |  10 +-
 src/include/storage/smgr.h |  17 +
 22 files changed, 1384 insertions(+), 206 deletions(-)

diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index 773d57..d251f22207 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,49 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
 		 xlrec->blkno, xlrec->flags);
 		pfree(path);
 	}
+	else if (info == XLOG_SMGR_UNLINK)
+	{
+		xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+		char	   *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+		appendStringInfoString(buf, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_MARK)
+	{
+		xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+		char	   *path = GetRelationPath(xlrec->rnode.dbNode,
+		   xlrec->rnode.spcNode,
+		   xlrec->rnode.relNode,
+		   InvalidBackendId,
+		   xlrec->forkNum, xlrec->mark);
+		char	   *action;
+
+		switch (xlrec->action)
+		{
+			case XLOG_SMGR_MARK_CREATE:
+action = "CREATE";
+break;
+			case XLOG_SMGR_MARK_UNLINK:
+action = "DELETE";
+break;
+			default:
+action = "";
+break;
+		}
+
+		appendStringInfo(buf, "%s %s", action, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_BUFPERSISTENCE)
+	{
+		xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+		char	   *path = relpathperm(xlrec->rnode, MAIN_FORKNUM);
+
+		appendStringInfoString(buf, path);
+		appendStringInfo(buf, " persistence %d", xlrec->persistence);
+		pfree(path);
+	}
 }
 
 const char *
@@ -55,6 +98,15 @@ smgr_identify(uint8 info)
 		case XLOG_SMGR_TRUNCATE:
 			id = "TRUNCATE";
 			break;
+		case XLOG_SMGR_UNLINK:
+			id = "UNLINK";
+			break;
+		case XLOG_SMGR_MARK:
+			id = "MARK";
+			break;
+		case XLOG_SMGR_BUFPERSISTENCE:
+			id = "BUFPERSISTENCE";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c1..7cf77e4a02 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -724,6 +724,14 @@ we must panic and abort recovery.  The DBA will have to manually clean up
 then restart recovery.  This is 

Re: In-placre persistance change of a relation

2021-01-14 Thread Kyotaro Horiguchi
At Tue, 12 Jan 2021 18:58:08 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Fri, 08 Jan 2021 17:52:21 +0900 (JST), Kyotaro Horiguchi 
>  wrote in 
> > At Fri, 08 Jan 2021 14:47:05 +0900 (JST), Kyotaro Horiguchi 
> >  wrote in 
> > > This version RelationChangePersistence() is changed not to choose
> > > in-place method for indexes other than btree. It seems to be usable
> > > with all kind of indexes other than Gist, but at the mement it applies
> > > only to btrees.
> > > 
> > > 1: 
> > > https://www.postgresql.org/message-id/ca+tgmozez5rons49c7mepjhjndqmqtvrz_lcqukprwdmrev...@mail.gmail.com
> > 
> > Hmm. This is not wroking correctly. I'll repost after fixint that.
> 
> I think I fixed the misbehavior. ResetUnloggedRelationsInDbspaceDir()
> handles file operations in the wrong order and with the wrong logic.
> It also needed to drop buffers and forget fsync requests.
> 
> I thought that the two cases that this patch is expected to fix
> (orphan relation files and uncommited init files) can share the same
> "cleanup" fork but that is wrong. I had to add one more additional
> fork to differentiate the cases of SET UNLOGGED and of creation of
> UNLOGGED tables...
> 
> The attached is a new version, that seems working correctly but looks
> somewhat messy. I'll continue working.

Commit bea449c635 conflicts with this on the change of the definition
of DropRelFileNodeBuffers. The change simplified this patch by a bit:p

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 5f785f181acdac18952f504ec45ce41f285c05bc Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v5 1/2] In-place table persistence change

Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, currently it runs heap rewrite which causes large amount of
file I/O.  This patch makes the command run without heap rewrite.
Addition to that, SET LOGGED while wal_level > minimal emits WAL using
XLOG_FPI instead of massive number of HEAP_INSERT's, which should be
smaller.

Also this allows for the cleanup of files left behind in the crash of
the transaction that created it.
---
 src/backend/access/rmgrdesc/smgrdesc.c |  23 ++
 src/backend/access/transam/README  |   8 +
 src/backend/access/transam/xlog.c  |  17 +
 src/backend/catalog/storage.c  | 436 +++--
 src/backend/commands/tablecmds.c   | 246 +++---
 src/backend/storage/buffer/bufmgr.c|  88 +
 src/backend/storage/file/reinit.c  | 316 --
 src/backend/storage/smgr/md.c  |  13 +-
 src/backend/storage/smgr/smgr.c|   6 +
 src/common/relpath.c   |   4 +-
 src/include/catalog/storage.h  |   2 +
 src/include/catalog/storage_xlog.h |  22 +-
 src/include/common/relpath.h   |   6 +-
 src/include/storage/bufmgr.h   |   2 +
 src/include/storage/md.h   |   2 +
 src/include/storage/reinit.h   |   3 +-
 src/include/storage/smgr.h |   1 +
 17 files changed, 1028 insertions(+), 167 deletions(-)

diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index 773d57..2c109b8ca4 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,23 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
 		 xlrec->blkno, xlrec->flags);
 		pfree(path);
 	}
+	else if (info == XLOG_SMGR_UNLINK)
+	{
+		xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+		char	   *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+		appendStringInfoString(buf, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_BUFPERSISTENCE)
+	{
+		xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+		char	   *path = relpathperm(xlrec->rnode, MAIN_FORKNUM);
+
+		appendStringInfoString(buf, path);
+		appendStringInfo(buf, " persistence %d", xlrec->persistence);
+		pfree(path);
+	}
 }
 
 const char *
@@ -55,6 +72,12 @@ smgr_identify(uint8 info)
 		case XLOG_SMGR_TRUNCATE:
 			id = "TRUNCATE";
 			break;
+		case XLOG_SMGR_UNLINK:
+			id = "UNLINK";
+			break;
+		case XLOG_SMGR_BUFPERSISTENCE:
+			id = "BUFPERSISTENCE";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c1..547107a771 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -724,6 +724,14 @@ we must panic and abort recovery.  The DBA will have to manually clean up
 then restart recovery.  This is part of the reason for not writing a WAL
 entry until we've successfully done the original action.
 
+The CLEANUP fork file
+
+
+An CLEANUP fork is created when a new relation file is created to mark
+the relfilenode needs to be cleaned up at recovery time.  In contrast
+to 4 above, failure to remove an CLEANUP fork file will lead to data
+loss, in which case the server will shut down.
+
 
 Skipping WAL for 

Re: In-placre persistance change of a relation

2021-01-12 Thread Kyotaro Horiguchi
At Fri, 08 Jan 2021 17:52:21 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Fri, 08 Jan 2021 14:47:05 +0900 (JST), Kyotaro Horiguchi 
>  wrote in 
> > This version RelationChangePersistence() is changed not to choose
> > in-place method for indexes other than btree. It seems to be usable
> > with all kind of indexes other than Gist, but at the mement it applies
> > only to btrees.
> > 
> > 1: 
> > https://www.postgresql.org/message-id/ca+tgmozez5rons49c7mepjhjndqmqtvrz_lcqukprwdmrev...@mail.gmail.com
> 
> Hmm. This is not wroking correctly. I'll repost after fixint that.

I think I fixed the misbehavior. ResetUnloggedRelationsInDbspaceDir()
handles file operations in the wrong order and with the wrong logic.
It also needed to drop buffers and forget fsync requests.

I thought that the two cases that this patch is expected to fix
(orphan relation files and uncommited init files) can share the same
"cleanup" fork but that is wrong. I had to add one more additional
fork to differentiate the cases of SET UNLOGGED and of creation of
UNLOGGED tables...

The attached is a new version, that seems working correctly but looks
somewhat messy. I'll continue working.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 88e9374529cbd8f983f2c82baadea94b475e46dd Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v4 1/2] In-place table persistence change

Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, currently it runs heap rewrite which causes large amount of
file I/O.  This patch makes the command run without heap rewrite.
Addition to that, SET LOGGED while wal_level > minimal emits WAL using
XLOG_FPI instead of massive number of HEAP_INSERT's, which should be
smaller.

Also this allows for the cleanup of files left behind in the crash of
the transaction that created it.
---
 src/backend/access/rmgrdesc/smgrdesc.c |  23 ++
 src/backend/access/transam/README  |   8 +
 src/backend/access/transam/xlog.c  |  17 +
 src/backend/catalog/storage.c  | 436 +++--
 src/backend/commands/tablecmds.c   | 246 +++---
 src/backend/storage/buffer/bufmgr.c|  88 +
 src/backend/storage/file/reinit.c  | 322 --
 src/backend/storage/smgr/md.c  |  13 +-
 src/backend/storage/smgr/smgr.c|   6 +
 src/common/relpath.c   |   4 +-
 src/include/catalog/storage.h  |   2 +
 src/include/catalog/storage_xlog.h |  22 +-
 src/include/common/relpath.h   |   6 +-
 src/include/storage/bufmgr.h   |   2 +
 src/include/storage/md.h   |   2 +
 src/include/storage/reinit.h   |   3 +-
 src/include/storage/smgr.h |   1 +
 17 files changed, 1034 insertions(+), 167 deletions(-)

diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index 773d57..2c109b8ca4 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,23 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
 		 xlrec->blkno, xlrec->flags);
 		pfree(path);
 	}
+	else if (info == XLOG_SMGR_UNLINK)
+	{
+		xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+		char	   *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+		appendStringInfoString(buf, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_BUFPERSISTENCE)
+	{
+		xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+		char	   *path = relpathperm(xlrec->rnode, MAIN_FORKNUM);
+
+		appendStringInfoString(buf, path);
+		appendStringInfo(buf, " persistence %d", xlrec->persistence);
+		pfree(path);
+	}
 }
 
 const char *
@@ -55,6 +72,12 @@ smgr_identify(uint8 info)
 		case XLOG_SMGR_TRUNCATE:
 			id = "TRUNCATE";
 			break;
+		case XLOG_SMGR_UNLINK:
+			id = "UNLINK";
+			break;
+		case XLOG_SMGR_BUFPERSISTENCE:
+			id = "BUFPERSISTENCE";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c1..547107a771 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -724,6 +724,14 @@ we must panic and abort recovery.  The DBA will have to manually clean up
 then restart recovery.  This is part of the reason for not writing a WAL
 entry until we've successfully done the original action.
 
+The CLEANUP fork file
+
+
+An CLEANUP fork is created when a new relation file is created to mark
+the relfilenode needs to be cleaned up at recovery time.  In contrast
+to 4 above, failure to remove an CLEANUP fork file will lead to data
+loss, in which case the server will shut down.
+
 
 Skipping WAL for New RelFileNode
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b18257c198..6dcbcbe387 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -40,6 +40,7 @@

Re: In-placre persistance change of a relation

2021-01-08 Thread Kyotaro Horiguchi
At Fri, 08 Jan 2021 14:47:05 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> This version RelationChangePersistence() is changed not to choose
> in-place method for indexes other than btree. It seems to be usable
> with all kind of indexes other than Gist, but at the mement it applies
> only to btrees.
> 
> 1: 
> https://www.postgresql.org/message-id/ca+tgmozez5rons49c7mepjhjndqmqtvrz_lcqukprwdmrev...@mail.gmail.com

Hmm. This is not wroking correctly. I'll repost after fixint that.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: In-placre persistance change of a relation

2021-01-07 Thread Kyotaro Horiguchi
At Fri, 25 Dec 2020 09:12:52 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> Hello.
> 
> At Thu, 24 Dec 2020 17:02:20 +0900 (JST), Kyotaro Horiguchi 
>  wrote in 
> > The patch is attached to the next message.
> 
> The reason for separating this message is that I modified this so that
> it could solve another issue.
> 
> There's a complain about orphan files after crash. [1]
> 
> 1: https://www.postgresql.org/message-id/16771-cbef7d97ba93f...@postgresql.org
> 
> That is, the case where a relation file is left alone after a server
> crash that happened before the end of the transaction that has created
> a relation.  As I read this, I noticed this feature can solve the
> issue with a small change.
> 
> This version gets changes in RelationCreateStorage and
> smgrDoPendingDeletes.
> 
> Previously inittmp fork is created only along with an init fork. This
> version creates one always when a relation storage file is created. As
> the result ResetUnloggedRelationsInDbspaceDir removes all forks if the
> inttmp fork of a logged relations is found.  Now that pendingDeletes
> can contain multiple entries for the same relation, it has been
> modified not to close the same smgr multiple times.
> 
> - It might be better to split 0001 into two peaces.
> 
> - The function name ResetUnloggedRelationsInDbspaceDir is no longer
>   represents the function correctly.

As pointed by Robert in another thread [1], persisntence of (at least)
GiST index cannot be flipped in-place due to incompatibility of fake
LSNs with real ones.

This version RelationChangePersistence() is changed not to choose
in-place method for indexes other than btree. It seems to be usable
with all kind of indexes other than Gist, but at the mement it applies
only to btrees.

1: 
https://www.postgresql.org/message-id/ca+tgmozez5rons49c7mepjhjndqmqtvrz_lcqukprwdmrev...@mail.gmail.com

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 1d47e7872d1e7ef18007f752e55cec9772373cc9 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v3 1/2] In-place table persistence change

Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, currently it runs heap rewrite which causes large amount of
file I/O.  This patch makes the command run without heap rewrite.
Addition to that, SET LOGGED while wal_level > minimal emits WAL using
XLOG_FPI instead of massive number of HEAP_INSERT's, which should be
smaller.

Also this allows for the cleanup of files left behind in the crash of
the transaction that created it.
---
 src/backend/access/rmgrdesc/smgrdesc.c |  23 ++
 src/backend/access/transam/README  |  10 +
 src/backend/catalog/storage.c  | 420 +++--
 src/backend/commands/tablecmds.c   | 246 ---
 src/backend/storage/buffer/bufmgr.c|  88 ++
 src/backend/storage/file/reinit.c  | 162 ++
 src/backend/storage/smgr/md.c  |   4 +-
 src/backend/storage/smgr/smgr.c|   6 +
 src/common/relpath.c   |   3 +-
 src/include/catalog/storage.h  |   2 +
 src/include/catalog/storage_xlog.h |  22 +-
 src/include/common/relpath.h   |   5 +-
 src/include/storage/bufmgr.h   |   2 +
 src/include/storage/smgr.h |   1 +
 14 files changed, 854 insertions(+), 140 deletions(-)

diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index 773d57..2c109b8ca4 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,23 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
 		 xlrec->blkno, xlrec->flags);
 		pfree(path);
 	}
+	else if (info == XLOG_SMGR_UNLINK)
+	{
+		xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+		char	   *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+		appendStringInfoString(buf, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_BUFPERSISTENCE)
+	{
+		xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+		char	   *path = relpathperm(xlrec->rnode, MAIN_FORKNUM);
+
+		appendStringInfoString(buf, path);
+		appendStringInfo(buf, " persistence %d", xlrec->persistence);
+		pfree(path);
+	}
 }
 
 const char *
@@ -55,6 +72,12 @@ smgr_identify(uint8 info)
 		case XLOG_SMGR_TRUNCATE:
 			id = "TRUNCATE";
 			break;
+		case XLOG_SMGR_UNLINK:
+			id = "UNLINK";
+			break;
+		case XLOG_SMGR_BUFPERSISTENCE:
+			id = "BUFPERSISTENCE";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c1..51616b2458 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -724,6 +724,16 @@ we must panic and abort recovery.  The DBA will have to manually clean up
 then restart recovery.  This is part of the reason for not writing a WAL
 entry until we've successfully done the original action.
 
+The INITTMP fork file
+
+
+An INITTMP 

Re: In-placre persistance change of a relation

2020-12-24 Thread Kyotaro Horiguchi
Hello.

At Thu, 24 Dec 2020 17:02:20 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> The patch is attached to the next message.

The reason for separating this message is that I modified this so that
it could solve another issue.

There's a complain about orphan files after crash. [1]

1: https://www.postgresql.org/message-id/16771-cbef7d97ba93f...@postgresql.org

That is, the case where a relation file is left alone after a server
crash that happened before the end of the transaction that has created
a relation.  As I read this, I noticed this feature can solve the
issue with a small change.

This version gets changes in RelationCreateStorage and
smgrDoPendingDeletes.

Previously inittmp fork is created only along with an init fork. This
version creates one always when a relation storage file is created. As
the result ResetUnloggedRelationsInDbspaceDir removes all forks if the
inttmp fork of a logged relations is found.  Now that pendingDeletes
can contain multiple entries for the same relation, it has been
modified not to close the same smgr multiple times.

- It might be better to split 0001 into two peaces.

- The function name ResetUnloggedRelationsInDbspaceDir is no longer
  represents the function correctly.
  
regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From dbe9ef477df8570b0b0def2b5f089b0001aa2eab Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v2 1/2] In-place table persistence change

Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, currently it runs heap rewrite which causes large amount of
file I/O.  This patch makes the command run without heap rewrite.
Addition to that, SET LOGGED while wal_level > minimal emits WAL using
XLOG_FPI instead of massive number of HEAP_INSERT's, which should be
smaller.

Also this allows for the cleanup of files left behind in the crash of
the transaction that created it.
---
 src/backend/access/rmgrdesc/smgrdesc.c |  23 ++
 src/backend/access/transam/README  |  10 +
 src/backend/catalog/storage.c  | 394 +++--
 src/backend/commands/tablecmds.c   | 213 ++---
 src/backend/storage/buffer/bufmgr.c|  88 ++
 src/backend/storage/file/reinit.c  | 164 +-
 src/backend/storage/smgr/md.c  |   4 +-
 src/backend/storage/smgr/smgr.c|   6 +
 src/common/relpath.c   |   3 +-
 src/include/catalog/storage.h  |   2 +
 src/include/catalog/storage_xlog.h |  22 +-
 src/include/common/relpath.h   |   5 +-
 src/include/storage/bufmgr.h   |   2 +
 src/include/storage/smgr.h |   1 +
 14 files changed, 800 insertions(+), 137 deletions(-)

diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index a7c0cb1bc3..097dacfee6 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,23 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
 		 xlrec->blkno, xlrec->flags);
 		pfree(path);
 	}
+	else if (info == XLOG_SMGR_UNLINK)
+	{
+		xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+		char	   *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+		appendStringInfoString(buf, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_BUFPERSISTENCE)
+	{
+		xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+		char	   *path = relpathperm(xlrec->rnode, MAIN_FORKNUM);
+
+		appendStringInfoString(buf, path);
+		appendStringInfo(buf, " persistence %d", xlrec->persistence);
+		pfree(path);
+	}
 }
 
 const char *
@@ -55,6 +72,12 @@ smgr_identify(uint8 info)
 		case XLOG_SMGR_TRUNCATE:
 			id = "TRUNCATE";
 			break;
+		case XLOG_SMGR_UNLINK:
+			id = "UNLINK";
+			break;
+		case XLOG_SMGR_BUFPERSISTENCE:
+			id = "BUFPERSISTENCE";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c1..51616b2458 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -724,6 +724,16 @@ we must panic and abort recovery.  The DBA will have to manually clean up
 then restart recovery.  This is part of the reason for not writing a WAL
 entry until we've successfully done the original action.
 
+The INITTMP fork file
+
+
+An INITTMP fork is created when new relation file is created to mark
+the relfilenode needs to be cleaned up at recovery time. The file is
+removed at transaction end but is left when the process crashes before
+the transaction ends. In contrast to 4 above, failure to remove an
+INITTMP file will lead to data loss, in which case the server will
+shut down.
+
 
 Skipping WAL for New RelFileNode
 
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index d538f25726..f4dddbad55 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -19,6 +19,7 @@
 
 

Re: In-placre persistance change of a relation

2020-12-24 Thread Kyotaro Horiguchi
Thanks for the comment! Sorry for the late reply.

At Fri, 4 Dec 2020 07:49:22 +, "tsunakawa.ta...@fujitsu.com" 
 wrote in 
> From: Kyotaro Horiguchi 
> > > No, not really.  The issue is more around what happens if we crash
> > > part way through.  At crash recovery time, the system catalogs are not
> > > available, because the database isn't consistent yet and, anyway, the
> > > startup process can't be bound to a database, let alone every database
> > > that might contain unlogged tables.  So the sentinel that's used to
> > > decide whether to flush the contents of a table or index is the
> > > presence or absence of an _init fork, which the startup process
> > > obviously can see just fine.  The _init fork also tells us what to
> > > stick in the relation when we reset it; for a table, we can just reset
> > > to an empty file, but that's not legal for indexes, so the _init fork
> > > contains a pre-initialized empty index that we can just copy over.
> > >
> > > Now, to make an unlogged table logged, you've got to at some stage
> > > remove those _init forks.  But this is not a transactional operation.
> > > If you remove the _init forks and then the transaction rolls back,
> > > you've left the system an inconsistent state.  If you postpone the
> > > removal until commit time, then you have a problem if it fails,
> > 
> > It's true. That are the cause of headache.
> ...
> > The current implement is simple.  It's enough to just discard old or
> > new relfilenode according to the current transaction ends with commit
> > or abort. Tweaking of relfilenode under use leads-in some skews in
> > some places.  I used pendingDelete mechanism a bit complexified way
> > and a violated an abstraction (I think, calling AM-routines from
> > storage.c is not good.) and even introduce a new fork kind only to
> > mark a init fork as "not committed yet".  There might be better way,
> > but I haven't find it.
> 
> I have no alternative idea yet, too.  I agree that we want to avoid them, 
> especially introducing inittmp fork...  Anyway, below are the rest of my 
> review comments for 0001.  I want to review 0002 when we have decided to go 
> with 0001.
> 
> 
> (2)
> XLOG_SMGR_UNLINK seems to necessitate modification of the following comments:
> 
> [src/include/catalog/storage_xlog.h]
> /*
>  * Declarations for smgr-related XLOG records
>  *
>  * Note: we log file creation and truncation here, but logging of deletion
>  * actions is handled by xact.c, because it is part of transaction commit.
>  */

Sure. Rewrote it.

> [src/backend/access/transam/README]
> 3. Deleting a table, which requires an unlink() that could fail.
> 
> Our approach here is to WAL-log the operation first, but to treat failure
> of the actual unlink() call as a warning rather than error condition.
> Again, this can leave an orphan file behind, but that's cheap compared to
> the alternatives.  Since we can't actually do the unlink() until after
> we've committed the DROP TABLE transaction, throwing an error would be out
> of the question anyway.  (It may be worth noting that the WAL entry about
> the file deletion is actually part of the commit record for the dropping
> transaction.)

Mmm. I didn't touched theDROP TABLE (RelationDropStorage) path, but I
added a brief description about INITTMP fork to the file.


The INITTMP fork file


An INITTMP fork is created when new relation file is created to mark
the relfilenode needs to be cleaned up at recovery time. The file is
removed at transaction end but is left when the process crashes before
the transaction ends. In contrast to 4 above, failure to remove an
INITTMP file will lead to data loss, in which case the server will
shut down.


> (3)
> +/* This is bit-map, not ordianal numbers  */
> 
> There seems to be no comments using "bit-map".  "Flags for ..." can be seen 
> here and there.

I revmoed the comment and use (1 << n) notation to show the fact
instead.


> (4)
> Some wrong spellings:
> 
> swithing -> switching
> alredy -> already
> recovery -> recover
> situaton -> situation

Oops! Fixed them.

> (5)
> + table_close(classRel, NoLock);
> +
> +
> +
> +
>  }
> 
> These empty lines can be deleted.

s/can/should/ :p. Fixed.

> 
> (6)
> +/*
> + * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL.
> + */
> +void
> +log_smgrbufpersistence(const RelFileNode *rnode, bool persistence)
> ...
> +  * Make an XLOG entry reporting the file unlink.
> 
> Not unlink but buffer persistence?

Silly copy-pasto. Fixed.

> (7)
> + /*
> +  * index-init fork needs further initialization. ambuildempty shoud do
> +  * WAL-log and file sync by itself but otherwise we do that by myself.
> +  */
> + if (rel->rd_rel->relkind == RELKIND_INDEX)
> + rel->rd_indam->ambuildempty(rel);
> + else
> + {
> + log_smgrcreate(, INIT_FORKNUM);
> + smgrimmedsync(srel, INIT_FORKNUM);
> + }
> +
> + /*
> +  * We 

RE: In-placre persistance change of a relation

2020-12-03 Thread tsunakawa.ta...@fujitsu.com
From: Kyotaro Horiguchi 
> > No, not really.  The issue is more around what happens if we crash
> > part way through.  At crash recovery time, the system catalogs are not
> > available, because the database isn't consistent yet and, anyway, the
> > startup process can't be bound to a database, let alone every database
> > that might contain unlogged tables.  So the sentinel that's used to
> > decide whether to flush the contents of a table or index is the
> > presence or absence of an _init fork, which the startup process
> > obviously can see just fine.  The _init fork also tells us what to
> > stick in the relation when we reset it; for a table, we can just reset
> > to an empty file, but that's not legal for indexes, so the _init fork
> > contains a pre-initialized empty index that we can just copy over.
> >
> > Now, to make an unlogged table logged, you've got to at some stage
> > remove those _init forks.  But this is not a transactional operation.
> > If you remove the _init forks and then the transaction rolls back,
> > you've left the system an inconsistent state.  If you postpone the
> > removal until commit time, then you have a problem if it fails,
> 
> It's true. That are the cause of headache.
...
> The current implement is simple.  It's enough to just discard old or
> new relfilenode according to the current transaction ends with commit
> or abort. Tweaking of relfilenode under use leads-in some skews in
> some places.  I used pendingDelete mechanism a bit complexified way
> and a violated an abstraction (I think, calling AM-routines from
> storage.c is not good.) and even introduce a new fork kind only to
> mark a init fork as "not committed yet".  There might be better way,
> but I haven't find it.

I have no alternative idea yet, too.  I agree that we want to avoid them, 
especially introducing inittmp fork...  Anyway, below are the rest of my review 
comments for 0001.  I want to review 0002 when we have decided to go with 0001.


(2)
XLOG_SMGR_UNLINK seems to necessitate modification of the following comments:

[src/include/catalog/storage_xlog.h]
/*
 * Declarations for smgr-related XLOG records
 *
 * Note: we log file creation and truncation here, but logging of deletion
 * actions is handled by xact.c, because it is part of transaction commit.
 */

[src/backend/access/transam/README]
3. Deleting a table, which requires an unlink() that could fail.

Our approach here is to WAL-log the operation first, but to treat failure
of the actual unlink() call as a warning rather than error condition.
Again, this can leave an orphan file behind, but that's cheap compared to
the alternatives.  Since we can't actually do the unlink() until after
we've committed the DROP TABLE transaction, throwing an error would be out
of the question anyway.  (It may be worth noting that the WAL entry about
the file deletion is actually part of the commit record for the dropping
transaction.)


(3)
+/* This is bit-map, not ordianal numbers  */

There seems to be no comments using "bit-map".  "Flags for ..." can be seen 
here and there.


(4)
Some wrong spellings:

+   /* we flush this buffer when swithing to PERMANENT */

swithing -> switching

+* alredy flushed out by RelationCreate(Drop)InitFork called 
just

alredy -> already

+* relation content to be WAL-logged to recovery the table.

recovery -> recover

+* The inittmp fork works as the sentinel to identify that situaton.

situaton -> situation


(5)
+   table_close(classRel, NoLock);
+
+
+
+
 }

These empty lines can be deleted.


(6)
+/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL.
+ */
+void
+log_smgrbufpersistence(const RelFileNode *rnode, bool persistence)
...
+* Make an XLOG entry reporting the file unlink.

Not unlink but buffer persistence?


(7)
+   /*
+* index-init fork needs further initialization. ambuildempty shoud do
+* WAL-log and file sync by itself but otherwise we do that by myself.
+*/
+   if (rel->rd_rel->relkind == RELKIND_INDEX)
+   rel->rd_indam->ambuildempty(rel);
+   else
+   {
+   log_smgrcreate(, INIT_FORKNUM);
+   smgrimmedsync(srel, INIT_FORKNUM);
+   }
+
+   /*
+* We have created the init fork. If server crashes before the current
+* transaction ends the init fork left alone corrupts data while 
recovery.
+* The inittmp fork works as the sentinel to identify that situaton.
+*/
+   smgrcreate(srel, INITTMP_FORKNUM, false);
+   log_smgrcreate(, INITTMP_FORKNUM);
+   smgrimmedsync(srel, INITTMP_FORKNUM);

If the server crashes between these two processings, only the init fork exists. 
 Is it correct to create the inittmp fork first?


(8)
+   if (inxact_created)
+   {
+   SMgrRelation srel = smgropen(rnode, InvalidBackendId);
+   smgrclose(srel);
+   log_smgrunlink(, 

Re: In-placre persistance change of a relation

2020-11-13 Thread Kyotaro Horiguchi
At Fri, 13 Nov 2020 07:15:41 +, "osumi.takami...@fujitsu.com" 
 wrote in 
> Hello, Tsunakawa-San
> 

Thanks for sharing it!

> > Do you know the reason why data copy was done before?  And, it may be
> > odd for me to ask this, but I think I saw someone referred to the past
> > discussion that eliminating data copy is difficult due to some processing at
> > commit.  I can't find it.
> I can share 2 sources why to eliminate the data copy is difficult in hackers 
> thread.
> 
> Tom's remark and the context to copy relation's data.
> https://www.postgresql.org/message-id/flat/31724.1394163360%40sss.pgh.pa.us#31724.1394163...@sss.pgh.pa.us

https://www.postgresql.org/message-id/CA+Tgmob44LNwwU73N1aJsGQyzQ61SdhKJRC_89wCm0+aLg=x...@mail.gmail.com

> No, not really.  The issue is more around what happens if we crash
> part way through.  At crash recovery time, the system catalogs are not
> available, because the database isn't consistent yet and, anyway, the
> startup process can't be bound to a database, let alone every database
> that might contain unlogged tables.  So the sentinel that's used to
> decide whether to flush the contents of a table or index is the
> presence or absence of an _init fork, which the startup process
> obviously can see just fine.  The _init fork also tells us what to
> stick in the relation when we reset it; for a table, we can just reset
> to an empty file, but that's not legal for indexes, so the _init fork
> contains a pre-initialized empty index that we can just copy over.
> 
> Now, to make an unlogged table logged, you've got to at some stage
> remove those _init forks.  But this is not a transactional operation.
> If you remove the _init forks and then the transaction rolls back,
> you've left the system an inconsistent state.  If you postpone the
> removal until commit time, then you have a problem if it fails,

It's true. That are the cause of headache.

> particularly if it works for the first file but fails for the second.
> And if you crash at any point before you've fsync'd the containing
> directory, you have no idea which files will still be on disk after a
> hard reboot.

This is not an issue in this patch *except* the case where init fork
is failed to removed but the following removal of inittmp fork
succeeds.  Another idea is adding a "not-yet-committed" property to a
fork.  I added a new fork type for easiness of the patch but I could
go that way if that is an issue.

> Amit-San quoted this thread and mentioned that point in another thread.
> https://www.postgresql.org/message-id/CAA4eK1%2BHDqS%2B1fhs5Jf9o4ZujQT%3DXBZ6sU0kOuEh2hqQAC%2Bt%3Dw%40mail.gmail.com

This sounds like a bit differrent discussion. Making part-of-a-table
UNLOGGED looks far difficult to me.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: In-placre persistance change of a relation

2020-11-12 Thread Kyotaro Horiguchi
At Fri, 13 Nov 2020 06:43:13 +, "tsunakawa.ta...@fujitsu.com" 
 wrote in 
> Hi Horiguchi-san,
> 
> 
> Thank you for making a patch so quickly.  I've started looking at it.
> 
> What makes you think this is a PoC?  Documentation and test cases?  If 
> there's something you think that doesn't work or are concerned about, can you 
> share it?

The latest version is heavily revised and is given much comment so it
might have exited from PoC state.  The necessity of documentation is
doubtful since this patch doesn't user-facing behavior other than
speed. Some tests are required especialy about recovery and
replication perspective but I haven't been able to make it. (One of
the tests needs to cause crash while a transaction is running.)

> Do you know the reason why data copy was done before?  And, it may be odd for 
> me to ask this, but I think I saw someone referred to the past discussion 
> that eliminating data copy is difficult due to some processing at commit.  I 
> can't find it.

To imagine that, just because it is simpler considering rollback and
code sharing, and maybe no one have been complained that SET
LOGGED/UNLOGGED looks taking a long time than required/expected.

The current implement is simple.  It's enough to just discard old or
new relfilenode according to the current transaction ends with commit
or abort. Tweaking of relfilenode under use leads-in some skews in
some places.  I used pendingDelete mechanism a bit complexified way
and a violated an abstraction (I think, calling AM-routines from
storage.c is not good.) and even introduce a new fork kind only to
mark a init fork as "not committed yet".  There might be better way,
but I haven't find it.

(The patch scans all shared buffer blocks for each relation).

> (1)
> @@ -168,6 +168,8 @@ extern PGDLLIMPORT int32 *LocalRefCount;
>   */
>  #define BufferGetPage(buffer) ((Page)BufferGetBlock(buffer))
>  
> +struct SmgrRelationData;
> 
> This declaration is already in the file:
> 
> /* forward declared, to avoid having to expose buf_internals.h here */
> struct WritebackContext;
> 
> /* forward declared, to avoid including smgr.h here */
> struct SMgrRelationData;

Hmmm. Nice chatch. And will fix in the next version.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




RE: In-placre persistance change of a relation

2020-11-12 Thread osumi.takami...@fujitsu.com
Hello, Tsunakawa-San


> Do you know the reason why data copy was done before?  And, it may be
> odd for me to ask this, but I think I saw someone referred to the past
> discussion that eliminating data copy is difficult due to some processing at
> commit.  I can't find it.
I can share 2 sources why to eliminate the data copy is difficult in hackers 
thread.

Tom's remark and the context to copy relation's data.
https://www.postgresql.org/message-id/flat/31724.1394163360%40sss.pgh.pa.us#31724.1394163...@sss.pgh.pa.us

Amit-San quoted this thread and mentioned that point in another thread.
https://www.postgresql.org/message-id/CAA4eK1%2BHDqS%2B1fhs5Jf9o4ZujQT%3DXBZ6sU0kOuEh2hqQAC%2Bt%3Dw%40mail.gmail.com

Best,
Takamichi Osumi




RE: In-placre persistance change of a relation

2020-11-12 Thread tsunakawa.ta...@fujitsu.com
Hi Horiguchi-san,


Thank you for making a patch so quickly.  I've started looking at it.

What makes you think this is a PoC?  Documentation and test cases?  If there's 
something you think that doesn't work or are concerned about, can you share it?

Do you know the reason why data copy was done before?  And, it may be odd for 
me to ask this, but I think I saw someone referred to the past discussion that 
eliminating data copy is difficult due to some processing at commit.  I can't 
find it.



(1)
@@ -168,6 +168,8 @@ extern PGDLLIMPORT int32 *LocalRefCount;
  */
 #define BufferGetPage(buffer) ((Page)BufferGetBlock(buffer))
 
+struct SmgrRelationData;

This declaration is already in the file:

/* forward declared, to avoid having to expose buf_internals.h here */
struct WritebackContext;

/* forward declared, to avoid including smgr.h here */
struct SMgrRelationData;


Regards
Takayuki Tsunakawa





Re: In-placre persistance change of a relation

2020-11-12 Thread Kyotaro Horiguchi
Hello.  Before posting the next version, I'd like to explain what this
patch is.

1. The Issue

Bulk data loading is a long-time taking, I/O consuming task.  Many
DBAs want that task is faster, even at the cost of increasing risk of
data-loss.  wal_level=minimal is an answer to such a
request. Data-loading onto a table that is created in the current
transaction omits WAL-logging and synced at commit.

However, the optimization doesn't benefit the case where the
data-loading is performed onto existing tables. There are quite a few
cases where data is loaded into tables that already contains a lot of
data. Those cases don't take benefit of the optimization.

Another possible solution for bulk data-loading is UNLOGGED
tables. But when we switch LOGGED/UNLOGGED of a table, all the table
content is copied to a newly created heap, which is costly.


2. Proposed Solutions.

There are two proposed solutions are discussed on this mailing
list. One is wal_level = none (*1), which omits WAL-logging almost at
all. Another is extending the existing optimization to the ALTER TABLE
SET LOGGED/UNLOGGED cases, which is to be discussed in this new
thread.


3. In-place Persistence Change

So the attached is a PoC patch of the "another" solution.  When we
want to change table persistence in-place, basically we need to do the
following steps.

(the talbe is exclusively locked)

(1) Flip BM_PERMANENT flag of all shared buffer blocks for the heap.

(2) Create or delete the init fork for existing heap.

(3) Flush all buffers of the relation to file system.

(4) Sync heap files.

(5) Make catalog changes.


4. Transactionality

The 1, 2 and 5 above need to be abort-able. 5 is rolled back by
existing infrastructure, and rolling-back of 1 and 2 are achieved by
piggybacking on the pendingDeletes mechanism.


5. Replication

Furthermore, that changes ought to be replicable to standbys.  Catalog
changes are replicated as usual. 

On-the-fly creation of the init fork leads to recovery mess. Even
though it is removed at abort, if the server crashed before
transaction end, the file is left alone and corrupts database in the
next recovery. I sought a way to create the init fork in
smgrPendingDelete but that needs relcache and relcache is not
available at that late of commit. Finally, I introduced the fifth fork
kind "INITTMP"(_itmp) only to signal that the init file is not
committed. I don't like that way but it seems working fine...


6. SQL Command

The second file in the patchset adds a syntax that changes persistence
of all tables in a tablespace.

ALTER TABLE ALL IN TABLESPACE  SET LOGGED/UNLOGGED [ NOWAIT ];


7. Testing

I tried to write TAP test for this, but IPC::Run::harness (or
interactive_psql) doesn't seem to work for me.  I'm not sure what
exactly is happening but pty redirection doesn't work.

  $in = "ls\n"; $out = ""; run ["/usr/bin/bash"], \$in, \$out; print $out;

works but

  $in = "ls\n"; $out = ""; run ["/usr/bin/bash"], 'pty>', 
\$out; print $out;

doesn't respond.


The patch is attached.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 05d1971d0f4f0f42899f5d6857892128487eeb40 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v3 1/2] In-place table persistence change

Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, currently it runs heap rewrite which causes large amount of
file I/O.  This patch makes the command run without heap rewrite.
Addition to that, SET LOGGED while wal_level > minimal emits WAL using
XLOG_FPI instead of massive number of HEAP_INSERT's, which should be
smaller.
---
 src/backend/access/rmgrdesc/smgrdesc.c |  23 ++
 src/backend/catalog/storage.c  | 355 +++--
 src/backend/commands/tablecmds.c   | 217 ---
 src/backend/storage/buffer/bufmgr.c|  88 ++
 src/backend/storage/file/reinit.c  | 206 --
 src/backend/storage/smgr/smgr.c|   6 +
 src/common/relpath.c   |   3 +-
 src/include/catalog/storage.h  |   2 +
 src/include/catalog/storage_xlog.h |  16 ++
 src/include/common/relpath.h   |   5 +-
 src/include/storage/bufmgr.h   |   4 +
 src/include/storage/smgr.h |   1 +
 12 files changed, 784 insertions(+), 142 deletions(-)

diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index a7c0cb1bc3..097dacfee6 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,23 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
 		 xlrec->blkno, xlrec->flags);
 		pfree(path);
 	}
+	else if (info == XLOG_SMGR_UNLINK)
+	{
+		xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+		char	   *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+		appendStringInfoString(buf, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_BUFPERSISTENCE)
+	{
+		xl_smgr_bufpersistence *xlrec = 

Re: In-placre persistance change of a relation

2020-11-11 Thread Kyotaro Horiguchi
At Wed, 11 Nov 2020 14:18:04 -0800, Andres Freund  wrote in 
> Hi,
> 
> I suggest outlining what you are trying to achieve here. Starting a new
> thread and expecting people to dig through another thread to infer what
> you are actually trying to achive isn't great.

Agreed. I'll post that. Thanks.

> FWIW, I'm *extremely* doubtful it's worth adding features that depend on
> a PGC_POSTMASTER wal_level=minimal being used. Which this does, a far as
> I understand.  If somebody added support for dynamically adapting
> wal_level (e.g. wal_level=auto, that increases wal_level to
> replica/logical depending on the presence of replication slots), it'd
> perhaps be different.

Yes, this depends on wal_level=minimal for switching from UNLOGGED to
LOGGED, that's similar to COPY/INSERT-to-intransaction-created-tables
optimization for wal_level=minimal.  And it expands that optimization
to COPY/INSERT-to-existent-tables, which seems worth doing.

Switching to LOGGED needs to emit the initial state to WAL... Hmm.. I
came to think that even in that case skipping table copy reduces I/O
significantly, even though FPI-WAL is emitted.


> On 2020-11-11 17:33:17 +0900, Kyotaro Horiguchi wrote:
> > FWIW this is a revised version of the PoC, which has some known
> > problems.
> > 
> > - Flipping of Buffer persistence is not WAL-logged nor even be able to
> >   be safely roll-backed. (It might be better to drop buffers).
> 
> That's obviously a no-go. I think you might be able to address this if
> you accept that the command cannot be run in a transaction (like
> CONCURRENTLY). Then you can first do the catalog changes, change the
> persistence level, and commit.

Of course.  The next version reverts persistence change at abort.

Thanks!

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: In-placre persistance change of a relation

2020-11-11 Thread Andres Freund
Hi,

I suggest outlining what you are trying to achieve here. Starting a new
thread and expecting people to dig through another thread to infer what
you are actually trying to achive isn't great.

FWIW, I'm *extremely* doubtful it's worth adding features that depend on
a PGC_POSTMASTER wal_level=minimal being used. Which this does, a far as
I understand.  If somebody added support for dynamically adapting
wal_level (e.g. wal_level=auto, that increases wal_level to
replica/logical depending on the presence of replication slots), it'd
perhaps be different.


On 2020-11-11 17:33:17 +0900, Kyotaro Horiguchi wrote:
> FWIW this is a revised version of the PoC, which has some known
> problems.
> 
> - Flipping of Buffer persistence is not WAL-logged nor even be able to
>   be safely roll-backed. (It might be better to drop buffers).

That's obviously a no-go. I think you might be able to address this if
you accept that the command cannot be run in a transaction (like
CONCURRENTLY). Then you can first do the catalog changes, change the
persistence level, and commit.

Greetings,

Andres Freund