Re: Logical Replication vs. 2PC

2021-03-18 Thread Ajin Cherian
On Thu, Mar 18, 2021 at 8:46 PM Amit Kapila  wrote:

>
>
> However, if the user has setup synchronous_standby_names for all the
> subscriptions then we won't be able to proceed because the prepare on
> publisher will wait for all the subscriptions to ack and the
> subscriptions are waiting for the first prepare to finish.
>

But is it a valid use case to have two synchronous standbys that are two
subscriptions on the same server, both with 2PC enabled?
If the purpose of a synchronous standby is durability, to prevent data
loss, then why split your tables across two subscriptions that are on the
same server?
Maybe it could be documented, warning users against having such a setup. Do we
really want to create a solution for an impractical scenario?

regards,
Ajin Cherian
Fujitsu Australia


Re: Parallel Inserts in CREATE TABLE AS

2021-03-18 Thread Bharath Rupireddy
On Wed, Jan 27, 2021 at 1:47 PM Bharath Rupireddy
 wrote:
>
> On Wed, Jan 27, 2021 at 1:25 PM Tang, Haiying
>  wrote:
> > I chose 5 cases which pick a parallel insert plan in CTAS to measure the
> > patched performance. Each case was run 30 times.
> >
> > Most of the test executions become faster with this patch.
> >
> > However, Test NO 4 (create table xxx as table xxx.) shows a performance
> > degradation. I tested various table sizes (2/10/20 million rows); they all
> > show a 6%-10% decline. I think this problem may need some investigation.
> >
> >
> >
> > Below are my test results. 'Test NO' is corresponded to 'Test NO' in 
> > attached test_ctas.sql file.
> >
> > reg%=(patched-master)/master
> >
> > Test NO |  Test Case                      | reg% | patched(ms) | master(ms)
> > --------|---------------------------------|------|-------------|------------
> > 1       |  CTAS select from table         | -9%  | 16709.50477 | 18370.76660
> > 2       |  Append plan                    | -14% | 16542.97807 | 19305.86600
> > 3       |  initial plan under Gather node | -5%  | 13374.27187 | 14120.02633
> > 4       |  CTAS table                     | 10%  | 20835.48800 | 18986.40350
> > 5       |  CTAS select from execute       | -6%  | 16973.73890 | 18008.59789
> >
> >
> >
> > About Test NO 4:
> >
> > In master (HEAD), this test case picks a serial seq scan.
> >
> > The query plan looks like:
> >
> > --
> >
> > Seq Scan on public.tenk1  (cost=0.00..444828.12 rows=1012 width=244)
> >   (actual time=0.005..1675.268 rows=1000 loops=1)
> >   Output: unique1, unique2, two, four, ten, twenty, hundred, thousand,
> >     twothousand, fivethous, tenthous, odd, even, stringu1, stringu2, string4
> > Planning Time: 0.053 ms
> > Execution Time: 20165.023 ms
> >
> >
> >
> > With this patch, it will choose a parallel seq scan and parallel insert.
> >
> > The query plan looks like:
> >
> > --
> >
> > Gather  (cost=1000.00..370828.03 rows=1012 width=244) (actual 
> > time=20428.823..20437.143 rows=0 loops=1)
> >
> >Output: unique1, unique2, two, four, ten, twenty, hundred, thousand, 
> > twothousand, fivethous, tenthous, odd, even, stringu1, stringu2, string4
> >
> >Workers Planned: 4
> >
> >Workers Launched: 4
> >
> > ->  Create test
> >
> >->  Parallel Seq Scan on public.tenk1  (cost=0.00..369828.03 
> > rows=253 width=244) (actual time=0.021..411.094 rows=200 loops=5)
> >
> >  Output: unique1, unique2, two, four, ten, twenty, hundred, 
> > thousand, twothousand, fivethous, tenthous, odd, even, stringu1, stringu2, 
> > string4
> >
> >  Worker 0:  actual time=0.023..390.856 rows=1858407 loops=1
> >
> >  Worker 1:  actual time=0.024..468.587 rows=2264494 loops=1
> >
> >  Worker 2:  actual time=0.023..473.170 rows=2286580 loops=1
> >
> >  Worker 3:  actual time=0.027..373.727 rows=1853216 loops=1
> > Planning Time: 0.053 ms
> > Execution Time: 20437.643 ms
> >
> >
> >
> > test machine spec:
> >
> > CentOS 8.2, 128G RAM, 40 processors, disk SAS
>
> Thanks a lot for the performance tests and test cases. I will analyze
> why the performance is degrading in that one case and respond soon.

I analyzed the performance of parallel inserts in CTAS for different cases
with tuple sizes of 32 bytes, 59 bytes, 241 bytes and 1064 bytes. We gain
if the tuple sizes are smaller. But if the tuple size is larger,
i.e. 1064 bytes, there's a regression with parallel inserts. Upon
further analysis, it turned out that the parallel workers require
frequent addition of extra blocks while concurrently extending
the relation (in RelationAddExtraBlocks), and the majority of the time
is spent flushing those new empty pages/blocks to
disk. I saw no regression when I increased (for testing purposes) the
rate at which the extra blocks are added in RelationAddExtraBlocks to
extraBlocks = Min(1024, lockWaiters * 512); (currently it is
extraBlocks = Min(512, lockWaiters * 20);). Increasing the extra
blocks addition rate is not a practical solution to this problem
though.
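
For illustration only, here is a minimal standalone C sketch comparing the
current extension rate with the testing-only rate mentioned above; the two
formulas are taken from the mail, while the loop and the Min macro are just
scaffolding for the comparison, not code from the patch:

#include <stdio.h>

#define Min(a, b) ((a) < (b) ? (a) : (b))

int
main(void)
{
	/* Compare the two bulk-extension rates discussed above for a range
	 * of relation extension lock waiter counts. */
	for (int lockWaiters = 1; lockWaiters <= 64; lockWaiters *= 2)
	{
		int		current = Min(512, lockWaiters * 20);	/* today's heuristic */
		int		tested = Min(1024, lockWaiters * 512);	/* testing-only rate */

		printf("waiters=%2d  current=%4d blocks  tested=%4d blocks\n",
			   lockWaiters, current, tested);
	}
	return 0;
}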

In an offlist discussion with Robert and Dilip, it came up that using
fallocate to extend the relation may help extend it faster. In this regard,
it looks like the AIO/DIO patch set of Andres [1], which
involves using fallocate() to extend files, will surely be helpful.
Until then, we feel that the parallel inserts in CTAS patch
set should be put on hold, to be revived later.

[1] - 
https://www.postgresql.org/message-id/flat/20210223100344.llw5an2aklengrmn%40alap3.anarazel.de
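
For reference, a minimal standalone sketch of the posix_fallocate() call that
this kind of faster relation extension would build on; the file name and block
counts are made up, and this only illustrates the system call, not the AIO/DIO
patch set:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	const off_t blcksz = 8192;		/* PostgreSQL's default block size */
	const off_t new_blocks = 512;	/* grow by 512 blocks in one call */
	int			rc;
	int			fd = open("demo_segment", O_RDWR | O_CREAT, 0600);

	if (fd < 0)
	{
		perror("open");
		return 1;
	}

	/* Reserve the space without writing zero-filled pages one by one. */
	rc = posix_fallocate(fd, 0, blcksz * new_blocks);
	if (rc != 0)
		fprintf(stderr, "posix_fallocate failed: %d\n", rc);

	close(fd);
	return 0;
}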

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: fdatasync performance problem with large number of DB files

2021-03-18 Thread Thomas Munro
On Fri, Mar 19, 2021 at 3:23 PM Fujii Masao  wrote:
> Thanks for updating the patch! It looks good to me!
> I have one minor comment for the patch.
>
> +   elog(LOG, "could not open %s: %m", path);
> +   return;
> +   }
> +   if (syncfs(fd) < 0)
> +   elog(LOG, "could not sync filesystem for \"%s\": %m", path);
>
> Since these are neither internal errors nor low-level debug messages, 
> ereport() should be used for them rather than elog()? For example,

Fixed.

I'll let this sit until tomorrow to collect any other feedback or
objections, and then push the 0001 patch
(recovery_init_sync_method=syncfs).

On Fri, Mar 19, 2021 at 4:08 PM Fujii Masao  wrote:
> 0002 patch looks good to me. Thanks!
> I have minor comments.

Ok, I made the changes you suggested.  Let's see if anyone else would
like to vote for or against the concept of the 0002 patch
(recovery_init_sync_method=none).
From c19c74cb50255141db4b8f823f2867eadb1de422 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sun, 27 Sep 2020 17:22:10 +1300
Subject: [PATCH v6 1/3] Provide recovery_init_sync_method=syncfs.

On some systems, opening every file in the data directory can be very
slow, as the first step before crash recovery can begin.  Provide an
alternative method, for Linux only, where we call syncfs() on every
possibly different filesystem under the data directory.  This avoids
faulting in potentially many inodes from potentially slow storage.  It
comes with some caveats, so it's not the default.

Reported-by: Michael Brown 
Reviewed-by: Fujii Masao 
Reviewed-by: Paul Guo 
Reviewed-by: Bruce Momjian 
Discussion: https://postgr.es/m/11bc2bb7-ecb5-3ad0-b39f-df632734cd81%40discourse.org
Discussion: https://postgr.es/m/CAEET0ZHGnbXmi8yF3ywsDZvb3m9CbdsGZgfTXscQ6agcbzcZAw%40mail.gmail.com
---
 configure |  2 +-
 configure.ac  |  1 +
 doc/src/sgml/config.sgml  | 30 +
 src/backend/storage/file/fd.c | 65 ++-
 src/backend/utils/misc/guc.c  | 17 +
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 src/include/pg_config.h.in|  3 +
 src/include/storage/fd.h  |  6 ++
 src/tools/msvc/Solution.pm|  1 +
 9 files changed, 124 insertions(+), 2 deletions(-)

diff --git a/configure b/configure
index 3fd4cecbeb..6a2051da9d 100755
--- a/configure
+++ b/configure
@@ -15239,7 +15239,7 @@ fi
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-for ac_func in backtrace_symbols clock_gettime copyfile fdatasync getifaddrs getpeerucred getrlimit kqueue mbstowcs_l memset_s poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink readv setproctitle setproctitle_fast setsid shm_open strchrnul strsignal symlink sync_file_range uselocale wcstombs_l writev
+for ac_func in backtrace_symbols clock_gettime copyfile fdatasync getifaddrs getpeerucred getrlimit kqueue mbstowcs_l memset_s poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink readv setproctitle setproctitle_fast setsid shm_open strchrnul strsignal symlink syncfs sync_file_range uselocale wcstombs_l writev
 do :
   as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
 ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
diff --git a/configure.ac b/configure.ac
index 2f1585adc0..dd819ab2c2 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1681,6 +1681,7 @@ AC_CHECK_FUNCS(m4_normalize([
 	strchrnul
 	strsignal
 	symlink
+	syncfs
 	sync_file_range
 	uselocale
 	wcstombs_l
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 863ac31c6b..32de8235bf 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9721,6 +9721,36 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
   
  
 
+ 
+  recovery_init_sync_method (enum)
+   
+recovery_init_sync_method configuration parameter
+   
+  
+  
+   
+When set to fsync, which is the default,
+PostgreSQL will recursively open and fsync
+all files in the data directory before crash recovery begins.  This
+is intended to make sure that all WAL and data files are durably stored
+on disk before replaying changes.  This applies whenever starting a
+database cluster that did not shut down cleanly, including copies
+created with pg_basebackup.
+   
+   
+On Linux, syncfs may be used instead, to ask the
+operating system to flush the whole file system.  This may be a lot
+faster, because it doesn't need to open each file one by one.  On the
+other hand, it may be slower if the file system is shared by other
+applications that modify a lot of files, since those files will also
+be flushed to disk.  Furthermore, on versions of Linux before 5.8, I/O
+errors encountered while writing data to disk may not be reported to
+PostgreSQL, and relevant error messages may
+appear only in kernel logs.

Re: Type of wait events WalReceiverWaitStart and WalSenderWaitForWAL

2021-03-18 Thread Fujii Masao



On 2021/03/18 18:48, Fujii Masao wrote:

WAIT_EVENT_WAL_RECEIVER_WAIT_START is waiting for the startup
process to kick me.  So it may be either IPC or Activity.  Since
walreceiver hasn't sent anything to startup, it's Activity rather
than IPC.  However, it can be said that the behavior conveys a piece of
information from startup to walreceiver, so it can also be said to be
IPC. (That is the reason why I don't object to IPC.)


IMO this should be IPC because walreceiver is mainly waiting for the
interaction with the startup process, during this wait event. Since you can
live with IPC, probably our consensus is to use IPC?


If this is ok, I'd like to apply the attached patch at first.
This patch changes the type of WAIT_EVENT_WAL_RECEIVER_WAIT_START
from Client to IPC.

BTW, I found that recently WalrcvExit wait event was introduced.
But this name is not consistent with other events. I'm thinking that
it's better to rename it to WalReceiverExit.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
From f5a8c8866fc8c750181109a7b3aae9b90e795668 Mon Sep 17 00:00:00 2001
From: Fujii Masao 
Date: Fri, 19 Mar 2021 13:56:27 +0900
Subject: [PATCH] Change the type of WalReceiverWaitStart wait event from
 Client to IPC.

Previously the type of this wait event was Client. But while this
wait event is being reported, walreceiver process is waiting for
the startup process to set initial data for streaming replication.
It's not waiting for any activity on a socket connected to a user
application or walsender. So this commit changes the type for
WalReceiverWaitStart wait event to IPC.

Author: Fujii Masao
Reviewed-by: Kyotaro Horiguchi
Discussion: 
https://postgr.es/m/cdacc27c-37ff-f1a4-20e2-ce19933ab...@oss.nttdata.com
---
 doc/src/sgml/monitoring.sgml| 10 +-
 src/backend/postmaster/pgstat.c |  6 +++---
 src/include/pgstat.h|  2 +-
 3 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index db4b4e460c..19540206f9 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1171,11 +1171,6 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   
11:34   0:00 postgres: ser
   SSLOpenServer
   Waiting for SSL while attempting connection.
  
- 
-  WalReceiverWaitStart
-  Waiting for startup process to send initial data for streaming
-   replication.
- 
  
   WalSenderWaitForWAL
   Waiting for WAL to be flushed in WAL sender process.
@@ -1771,6 +1766,11 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   
11:34   0:00 postgres: ser
   WalrcvExit
   Waiting for the walreceiver to exit.
  
+ 
+  WalReceiverWaitStart
+  Waiting for startup process to send initial data for streaming
+   replication.
+ 
  
   XactGroupUpdate
   Waiting for the group leader to update transaction status at
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 208a33692f..b7af7c2707 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3970,9 +3970,6 @@ pgstat_get_wait_client(WaitEventClient w)
case WAIT_EVENT_SSL_OPEN_SERVER:
event_name = "SSLOpenServer";
break;
-   case WAIT_EVENT_WAL_RECEIVER_WAIT_START:
-   event_name = "WalReceiverWaitStart";
-   break;
case WAIT_EVENT_WAL_SENDER_WAIT_WAL:
event_name = "WalSenderWaitForWAL";
break;
@@ -4127,6 +4124,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
case WAIT_EVENT_WALRCV_EXIT:
event_name = "WalrcvExit";
break;
+   case WAIT_EVENT_WAL_RECEIVER_WAIT_START:
+   event_name = "WalReceiverWaitStart";
+   break;
case WAIT_EVENT_XACT_GROUP_UPDATE:
event_name = "XactGroupUpdate";
break;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index be43c04802..2c82313550 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -953,7 +953,6 @@ typedef enum
WAIT_EVENT_LIBPQWALRECEIVER_CONNECT,
WAIT_EVENT_LIBPQWALRECEIVER_RECEIVE,
WAIT_EVENT_SSL_OPEN_SERVER,
-   WAIT_EVENT_WAL_RECEIVER_WAIT_START,
WAIT_EVENT_WAL_SENDER_WAIT_WAL,
WAIT_EVENT_WAL_SENDER_WRITE_DATA,
 } WaitEventClient;
@@ -1010,6 +1009,7 @@ typedef enum
WAIT_EVENT_SAFE_SNAPSHOT,
WAIT_EVENT_SYNC_REP,
WAIT_EVENT_WALRCV_EXIT,
+   WAIT_EVENT_WAL_RECEIVER_WAIT_START,
WAIT_EVENT_XACT_GROUP_UPDATE
 } WaitEventIPC;
 
-- 
2.27.0



Re: Using COPY FREEZE in pgbench

2021-03-18 Thread Tatsuo Ishii
> I have looked at the code of PQprotocolVersion. The only case in which
> it returns 0 is when there's no connection. Yes, you are right. Once the
> connection has been successfully established, there's no chance it
> fails. So I agree with you.

Attached v3 patch addresses this.

>> The "g" item in the section describing initialization steps
>> (i.e. option -I). I'd suggest just to replace "COPY" with "COPY
>> FREEZE" in the sentence.
> 
> OK. The section needs to be modified.

This is also addressed in the patch.
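
As a side note to the version checks discussed above, here is a minimal
standalone libpq sketch; the connection string is a placeholder. Note that
PQprotocolVersion() returns 0 only when there is no connection, and
PQserverVersion() reports the server version as a single integer
(e.g. 140000 for 14.0). It builds with the usual
-I$(pg_config --includedir) -lpq flags.

#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

int
main(void)
{
	PGconn	   *con = PQconnectdb("dbname=postgres");	/* placeholder conninfo */

	if (PQstatus(con) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(con));
		PQfinish(con);
		return EXIT_FAILURE;
	}

	printf("protocol version: %d\n", PQprotocolVersion(con));	/* 3 on modern servers */
	printf("server version:   %d\n", PQserverVersion(con));		/* e.g. 140000 for 14.0 */

	/* The pgbench patch only takes the COPY FREEZE path on v14 or later. */
	if (PQserverVersion(con) >= 140000)
		printf("server is new enough for the COPY FREEZE path\n");

	PQfinish(con);
	return EXIT_SUCCESS;
}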

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp
diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 50cf22ba6b..9badafbc1f 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -220,6 +220,9 @@ pgbench  options  d
 data is generated in pgbench client and then
 sent to the server. This uses the client/server bandwidth
 extensively through a COPY.
+pgbench uses FREEZE option with 14 or later
+version of PostgreSQL to speed up
+subsequent VACUUM.
 Using g causes logging to print one message
 every 100,000 rows while generating data for the
 pgbench_accounts table.
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index e69d43b26b..a842b59188 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -3976,6 +3976,7 @@ initGenerateDataClientSide(PGconn *con)
 	PGresult   *res;
 	int			i;
 	int64		k;
+	int			server_version;
 
 	/* used to track elapsed time and estimate of the remaining time */
 	pg_time_usec_t start;
@@ -4022,7 +4023,18 @@ initGenerateDataClientSide(PGconn *con)
 	/*
 	 * accounts is big enough to be worth using COPY and tracking runtime
 	 */
-	res = PQexec(con, "copy pgbench_accounts from stdin");
+
+	/*
+	 * If server version is 14.0 or later, we can take account of freeze
+	 * option of copy.
+	 */
+	server_version = PQserverVersion(con);
+
+	if (server_version >= 14)
+		res = PQexec(con, "copy pgbench_accounts from stdin with (freeze on)");
+	else
+		res = PQexec(con, "copy pgbench_accounts from stdin");
+
 	if (PQresultStatus(res) != PGRES_COPY_IN)
 	{
 		pg_log_fatal("unexpected copy in result: %s", PQerrorMessage(con));


Re: make the stats collector shutdown without writing the statsfiles if the immediate shutdown is requested.

2021-03-18 Thread Fujii Masao




On 2021/03/18 19:16, Masahiro Ikeda wrote:

As you said, the temporary stats files aren't removed if the stats collector
is killed with SIGQUIT.
So, if the user changes the GUC parameter "stats_temp_directory" after an
immediate shutdown, the temporary stats files may never be removed.

But I think this case rarely happens and is unimportant. Actually,
pgstat_write_statsfiles() doesn't check for errors from unlink(), and the
same problem already occurs if the server crashes.
The documentation says the following. I think it's enough.


Thanks for considering this! Understood.



I don't have any strong opinion that this behavior is useless either.

Since the reinitialization phase is not executed when only the stats collector
has crashed (since it didn't touch shared memory), the stats collector can use
the permanent stats file if it exists. But, IIUC, that case is rare.

That case happens through an operational mistake in which an operator sends
the SIGQUIT signal to the stats collector. Please let me know if there are
more important cases.

But if SIGKILL is sent by an operator, the stats can't be rescued now because
the permanent stats files can't be written before exiting. Since the cases in
which the stats can be rescued are rare, I think it's OK to initialize the
stats even if SIGQUIT is sent.


Sounds reasonable.



When only the stats collector exits by SIGQUIT, with the patch
FreeWaitEventSet() is also skipped. Is this ok?


Thanks, I fixed it.


I'm still not sure if FreeWaitEventSet() is actually necessary in that
exit case. Could you tell me why you thought FreeWaitEventSet() is
necessary in that case?

Since it's not good to do many things in a signal handler, even when
FreeWaitEventSet() is really necessary, shouldn't it be called at the
subsequent startup of the stats collector instead of in the handler
at exit? BTW, I found that a bgworker calls FreeWaitEventSet() via
ShutdownLatchSupport() at its startup. But I'm also not sure if this
is really necessary at startup...

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Logical Replication vs. 2PC

2021-03-18 Thread Amit Kapila
On Thu, Mar 18, 2021 at 5:31 PM vignesh C  wrote:
>
> On Thu, Mar 18, 2021 at 3:16 PM Amit Kapila  wrote:
> >
> >
> > In short, on the subscriber, both the apply workers (corresponding to
> > two subscriptions) are getting the same prepare transaction GID,
> > leading to an error on the subscriber and making the publisher wait
> > forever.
> >
> > Thoughts?
>
> I see that the main problem here is the GID clash, as you have
> rightly pointed out. I'm not sure if we are allowed to change the
> GIDs on the subscriber.
> If we are allowed to change the GIDs on the subscriber, the worker can do
> something like this: when the apply worker is applying the prepared
> transaction, try to apply the prepare with the GID as is.
> If there is a "GID already in use" error, the worker can try to catch that
> error and change the GID to a fixed-length hash key of (GID,
> subscription name, node name, timestamp, etc.) to generate a unique hash
> key (modified GID), and prepare the transaction with the generated hash
> key. Store this key and the original GID for later use; this will be
> required during commit prepared or in case of rollback prepared. When
> applying the commit prepared or rollback prepared, replace the GID with
> the hash key that was used during the prepare transaction.
>

I think it will be tricky to distinguish whether the clash is because the user
has already prepared a xact with the same GID on the subscriber or because it
comes from one of the apply workers. In the former case, the user needs to
take action. You need to change both the file format and WAL for this, and
I'm not sure, but generating a hash key for this looks a bit shaky. Now, we
might be able to make it work, but how about if we always append the subid
to the GID for prepare and store the GID and subid separately in WAL (I
think we can store the additional subscriber-id information
conditionally)? Then during recovery, we will use both GID and subid
for prepare, but for decoding, we will only use the GID. This way, for a
cascaded setup, we can always send the GID by reading WAL, and the
downstream subscriber will append its subid to the GID. I know this is
also not that straightforward, but I don't have any better ideas at
the moment.
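
Purely as an illustration of the append-subid idea above (the names and the
"%s_%u" format are assumptions, not the actual design), a standalone sketch:

#include <stdio.h>

#define GIDSIZE 200				/* backend limit on GID length */

static void
compose_local_gid(char *out, const char *remote_gid, unsigned int subid)
{
	/* The subscriber appends its subscription OID to the incoming GID, so
	 * two subscriptions on the same server no longer clash. */
	snprintf(out, GIDSIZE, "%s_%u", remote_gid, subid);
}

int
main(void)
{
	char		gid[GIDSIZE];

	compose_local_gid(gid, "pg_gid_12345", 16390);	/* made-up GID and subid */
	printf("prepared locally as: %s\n", gid);		/* pg_gid_12345_16390 */
	return 0;
}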

> If we are not allowed to change the GID's in the subscriber. This
> thought is in similar lines where in one of the earlier design
> prepared spool files was used. Can we have some mechanism where we can
> identify this scenario and store the failing prepare transaction
> information, so that when the worker is restarted worker can use this
> stored information to identify the failed prepare transaction, once
> worker identifies that it is a failed prepare transaction then all of
> this transaction can be serialized into a file and later when the
> apply worker receives a commit prepared it can get the changes from
> the file and apply this transaction  or discard the file in case of
> rollback prepared.
>

Hmm, this idea will face similar problems as described here [1].

Note: added Petr Jelinek to see if he has any opinion on this matter.

[1] - 
https://www.postgresql.org/message-id/CAA4eK1LVEdPYnjdajYzu3k6KEii1%2BF0jdQ6sWnYugiHcSGZD6Q%40mail.gmail.com


-- 
With Regards,
Amit Kapila.




Re: cleanup temporary files after crash

2021-03-18 Thread Tom Lane
[ reads code... ]
... no, I think the problem is the test is still full of race conditions.

In the first place, waiting till you see the output of a SELECT that's
before the useful query is not enough to guarantee that the useful query
is done, or even started.  That's broken on both sessions.

In the second place, even if the second INSERT has started, you don't know
that it's reached the point of blocking on the tuple conflict yet.
Which in turn means that it might not've filled its tuplestore yet.

In short, this script is designed to ensure that the test query can't
finish too soon, but that proves nothing about whether the test query
has even started.  And since you also haven't really guaranteed that the
intended-to-be-blocking query is done, I don't think that the first
condition really holds either.

regards, tom lane




Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2021-03-18 Thread Julien Rouhaud
On Thu, Mar 18, 2021 at 03:23:49PM -0400, Bruce Momjian wrote:
> On Fri, Mar 19, 2021 at 02:06:56AM +0800, Julien Rouhaud wrote:
> 
> The above text is the part that made me think an extension could display
> a query id even if disabled by the GUC.

With the last version of the patch I sent it was the case.

> Oh, OK.  I can see an extension setting the query id on its own --- we
> can't prevent that from happening.  It is probably enough to tell
> extensions to honor the GUC, since they would want it enabled so it
> displays in pg_stat_activity and log_line_prefix.

Ok.  So no new hook, and we keep using post_parse_analyze_hook as the official
way to have custom queryid implementation, with this new behavior:

> > - if some extension calculates a queryid during post_parse_analyze_hook, we
> >   will always reset it.
> 
> OK, good.

Now that I'm back on the code I remember why I did it this way.  It's
unfortunately not really possible to make things work this way.

pg_stat_statements' post_parse_analyze_hook relies on a queryid already being
computed, as it's where we know where the constants are recorded.  It means:

- we have to call post_parse_analyze_hook *after* doing core queryid
  calculation
- if users want to use a third party module to calculate a queryid, they'll
  have to make sure that the module's post_parse_analyze_hook is called
  *before* pg_stat_statements' one.
- even if they do so, they'll still have to pay the price of core queryid
  calculation

So it would be very hard to configure and would be too expensive.  I think that
we have to choose: either we make compute_query_id only trigger the core
calculation (like it was in the previous patch version), or we introduce a new hook.

> I think compute_query_id works, and is shorter.

WFM.

> OK, good to know.  I can run some tests here if people would like me to.

+1.  A read-only pgbench will be some kind of worst-case scenario that can be
used, I think.




Re: fdatasync performance problem with large number of DB files

2021-03-18 Thread Fujii Masao




On 2021/03/19 11:22, Fujii Masao wrote:



On 2021/03/19 10:37, Thomas Munro wrote:

On Fri, Mar 19, 2021 at 2:16 PM Thomas Munro  wrote:

PS: For illustration/discussion, I've also attached a "none" patch.  I
also couldn't resist rebasing my "wal" mode patch, which I plan to
propose for PG15 because there is not enough time left for this
release.


Erm... I attached the wrong version by mistake.  Here's a better one.


0002 patch looks good to me. Thanks!
I have minor comments.

- * Issue fsync recursively on PGDATA and all its contents, or issue syncfs for
- * all potential filesystem, depending on recovery_init_sync_method setting.
+ * Issue fsync recursively on PGDATA and all its contents, issue syncfs for
+ * all potential filesystem, or do nothing, depending on
+ * recovery_init_sync_method setting.

The comment in SyncDataDirectory() should be updated so that
it mentions the "none" method, as above?

+This is only safe if all buffered data is known to have been flushed
+to disk already, for example by a tool such as
+pg_basebackup.  It is not a good idea to

Isn't it better to add something like "without --no-sync"
to the "pg_basebackup" part? That would prevent users from misunderstanding
that pg_basebackup always ensures this, whatever options are specified.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: cleanup temporary files after crash

2021-03-18 Thread Tomas Vondra
On 3/19/21 3:57 AM, Tom Lane wrote:
> Tomas Vondra  writes:
>> So crake and florican seem to be happy now. Not sure about lapwing yet.
>> But interestingly enough, prion and curculio got unhappy. They worked
>> fine with the older test, but now it fails with the "no such file or
>> directory" message. I wonder what makes them different from the other
>> x86_64 machines ...
> 
> I'm confused about why the new query would use a temp file at all.
> Maybe they aren't using the same plan?
> 

Because generate_series() stashes the results into a tuplestore, and the
test sets work_mem to just 64kB. Maybe I'm missing some detail, though.

The plan is extremely simple - just Function Scan feeding data into an
Insert, not sure what other plan could be used.
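
A back-of-the-envelope check of those numbers (using the ~1040-row spill point
reported elsewhere in this thread; the per-row figure is only approximate):

#include <stdio.h>

int
main(void)
{
	const int	work_mem_bytes = 64 * 1024; /* the test's work_mem = 64kB */
	const int	rows_to_spill = 1040;		/* observed spill point in this thread */

	/* Roughly how much tuplestore memory each buffered int4 row costs. */
	printf("approx. bytes per buffered row: %d\n",
		   work_mem_bytes / rows_to_spill);
	printf("rows now used by the test:      %d\n", 5000);
	return 0;
}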


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: cleanup temporary files after crash

2021-03-18 Thread Tom Lane
Tomas Vondra  writes:
> So crake and florican seem to be happy now. Not sure about lapwing yet.
> But interestingly enough, prion and curculio got unhappy. They worked
> fine with the older test, but now it fails with the "no such file or
> directory" message. I wonder what makes them different from the other
> x86_64 machines ...

I'm confused about why the new query would use a temp file at all.
Maybe they aren't using the same plan?

regards, tom lane




Re: cleanup temporary files after crash

2021-03-18 Thread Tomas Vondra
On 3/19/21 2:21 AM, Tomas Vondra wrote:
> On 3/19/21 1:54 AM, Euler Taveira wrote:
>> On Thu, Mar 18, 2021, at 8:34 PM, Tomas Vondra wrote:
>>> Well, that's better, but it still does not do the trick on the 32-bit
>>> machine - in that case 1000 rows with int4 still fit into work_mem, so
>>> the temp file is not created. Per my experiments about 1040 rows are
>>> needed - so close ;-) So let's make it 2000.
>> My 32-bit laptop needs some repairs so I blindly chose 1k rows.
>>
> 
> Yeah, it's not immediately obvious how many rows are needed. 2000 would
> be enough, but I just bumped it up to 5000 for good measure.
> 
>>> We might as well check that the temp file actually exists, before
>>> killing the backend. Just to be sure.
>> Do you mean with remove_temp_files_after_crash = on? New version attached.
>>
> 
> On second thought, we don't need that check. I realized the second part
> of the test (checking remove_temp_files_after_crash=off) actually does
> verify that we've created the file, so that should be OK.
> 
> I've pushed the previous version with the 5000 rows. Let's see what
> crake and florican think about this ...
> 

So crake and florican seem to be happy now. Not sure about lapwing yet.

But interestingly enough, prion and curculio got unhappy. They worked
fine with the older test, but now it fails with the "no such file or
directory" message. I wonder what makes them different from the other
x86_64 machines ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




RE: libpq debug log

2021-03-18 Thread tsunakawa.ta...@fujitsu.com
Hello Iwata-san,


Thanks to the removal of the static arrays, the code got much smaller.  I'm happy
with this result.


(1)
+  (> for messages from client to server,
+  and < for messages from server to client),

I think this was meant to say "> or <".  So, maybe you should remove "," at the 
end of the first line, and replace "and" with "or".


(2)
+   case 8 :/* GSSENCRequest or SSLRequest */
+   /* These messsage does not reach here. */

messsage does -> messages do


(3)
+   fprintf(f, "\tAuthentication");

Why don't you move this \t in each message type to the end of:

+   fprintf(conn->Pfdebug, "%s\t%s\t%d", timestr, prefix, length);

Plus, don't we say in the manual that fields (timestamp, direction, length,
message type, and message content) are separated by a tab?
Also, the code doesn't seem to separate the message type and content with a tab.
(A standalone sketch of this tab-separated layout follows after comment (5).)


(4)
Where did the processing for unknown message go in pqTraceOutputMessage()?  Add 
default label?


(5)
+   case 16:/* CancelRequest */
+   fprintf(conn->Pfdebug, "%s\t>\t%d\tCancelRequest", 
timestr, length);

I think this should follow pqTraceOutputMessage().  That is, each case label
should only print the message type name, in case other message types are added
in the future.
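
To make comment (3) concrete, here is a standalone sketch of the suggested
tab-separated layout, with made-up field values; it only illustrates the
suggestion, not the patch itself:

#include <stdio.h>

int
main(void)
{
	FILE	   *trace = stdout; /* stand-in for conn->Pfdebug */
	const char *timestr = "2021-03-19 10:00:00.000000";
	const char *prefix = "<";	/* '<' = message from server to client */
	int			length = 54;

	/* The common fields end with a tab, so each case label only has to
	 * print the message type name, and content fields each add their own
	 * tab. */
	fprintf(trace, "%s\t%s\t%d\t", timestr, prefix, length);
	fprintf(trace, "Authentication");
	fprintf(trace, "\t%d\n", 0);	/* made-up message content field */
	return 0;
}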



Regards
Takayuki Tsunakawa






Re: comment fix in postmaster.c

2021-03-18 Thread Fujii Masao




On 2021/03/16 18:27, Fujii Masao wrote:


Thanks for the patch! LGTM. Barring any objection, I will push the patch.


Pushed. Thanks!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: fdatasync performance problem with large number of DB files

2021-03-18 Thread Fujii Masao




On 2021/03/19 10:37, Thomas Munro wrote:

On Fri, Mar 19, 2021 at 2:16 PM Thomas Munro  wrote:

PS: For illustration/discussion, I've also attached a "none" patch.  I
also couldn't resist rebasing my "wal" mode patch, which I plan to
propose for PG15 because there is not enough time left for this
release.


Erm... I attached the wrong version by mistake.  Here's a better one.


Thanks for updating the patch! It looks good to me!
I have one minor comment for the patch.

+   elog(LOG, "could not open %s: %m", path);
+   return;
+   }
+   if (syncfs(fd) < 0)
+   elog(LOG, "could not sync filesystem for \"%s\": %m", path);

Since these are neither internal errors nor low-level debug messages, ereport() 
should be used for them rather than elog()? For example,

ereport(LOG,
(errcode_for_file_access(),
 errmsg("could not open \"%s\": %m", path)))

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: [PATCH] pgbench: improve \sleep meta command

2021-03-18 Thread Fujii Masao




On 2021/03/19 10:02, kuroda.hay...@fujitsu.com wrote:

Dear Fujii-san,

I confirmed your patch and some parse functions, and I agree that
the check condition in evaluateSleep() is correct.
No problems were found.


Thanks for reviewing the patch!
Barring any objection, I will commit this patch.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: New IndexAM API controlling index vacuum strategies

2021-03-18 Thread Peter Geoghegan
On Thu, Mar 18, 2021 at 2:05 PM Robert Haas  wrote:
> On Wed, Mar 17, 2021 at 11:23 PM Peter Geoghegan  wrote:
> > Most anti-wraparound VACUUMs are really not emergencies, though.
>
> That's true, but it's equally true that most of the time it's not
> necessary to wear a seatbelt to avoid personal injury. The difficulty
> is that it's hard to predict on which occasions it is necessary, and
> therefore it is advisable to do it all the time.

Just to be clear: This was pretty much the point I was making here --
although I guess you're making the broader point about autovacuum and
freezing in general.

The fact that we can *continually* reevaluate if an ongoing VACUUM is
at risk of taking too long is entirely the point here. We can in
principle end index vacuuming dynamically, whenever we feel like it
and for whatever reasons occur to us (hopefully these are good reasons
-- the point is that we get to pick and choose). We can afford to be
pretty aggressive about not giving up, while still having the benefit
of doing that when it *proves* necessary. Because: what are the
chances of the emergency mechanism ending index vacuuming being the
wrong thing to do if we only do that when the system clearly and
measurably has no more than about 10% of the possible XID space to go
before the system becomes unavailable for writes?

What could possibly matter more than that?
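
For a rough sense of scale (the 2^31 horizon is standard; the exact cutoff any
patch uses may differ), a standalone sketch of that arithmetic:

#include <stdio.h>

int
main(void)
{
	const unsigned int usable_xids = 1U << 31;	/* ~2.1 billion before wraparound protection */
	const unsigned int ten_percent = usable_xids / 10;	/* ~214 million XIDs left */
	const unsigned int trigger_age = usable_xids - ten_percent;	/* ~1.9 billion consumed */

	printf("usable XIDs:              %u\n", usable_xids);
	printf("10%% of the XID space:     %u\n", ten_percent);
	printf("trigger when age reaches: %u\n", trigger_age);
	return 0;
}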

By making the decision dynamic, the chances of our
threshold/heuristics causing the wrong behavior become negligible --
even though we're making the decision based on a tiny amount of
(current, authoritative) information. The only novel risk I can think
of is that somebody comes to rely on the mechanism saving the day,
over and over again, rather than fixing a fixable problem.

> autovacuum decides
> whether an emergency exists, in the first instance, by comparing
> age(relfrozenxid) to autovacuum_freeze_max_age, but that's problematic
> for at least two reasons. First, what matters is not when the vacuum
> starts, but when the vacuum finishes.

To be fair the vacuum_set_xid_limits() mechanism that you refer to
makes perfect sense. It's just totally insufficient for the reasons
you say.

> A user who has no tables larger
> than 100MB can set autovacuum_freeze_max_age a lot closer to the high
> limit without risk of hitting it than a user who has a 10TB table. The
> time to run vacuum is dependent on both the size of the table and the
> applicable cost delay settings, none of which autovacuum knows
> anything about. It also knows nothing about the XID consumption rate.
> It's relying on the user to set autovacuum_freeze_max_age low enough
> that all the anti-wraparound vacuums will finish before the system
> crashes into a wall.

Literally nobody on earth knows what their XID burn rate is when it
really matters. It might be totally out of control on that one day of
your life when it truly matters (e.g., due to a recent buggy code
deployment, which I've seen up close). That's how emergencies work.

A dynamic approach is not merely preferable. It seems essential. No
top-down plan is going to be smart enough to predict that it'll take a
really long time to get that one super-exclusive lock on relatively
few pages.

> Second, what happens to one table affects what
> happens to other tables. Even if you have perfect knowledge of your
> XID consumption rate and the speed at which vacuum will complete, you
> can't just configure autovacuum_freeze_max_age to allow exactly enough
> time for the vacuum to complete once it hits the threshold, unless you
> have one autovacuum worker per table so that the work for that table
> never has to wait for work on any other tables. And even then, as you
> mention, you have to worry about the possibility that a vacuum was
> already in progress on that table itself. Here again, we rely on the
> user to know empirically how high they can set
> autovacuum_freeze_max_age without cutting it too close.

But the VM is a lot more useful when you effectively eliminate index
vacuuming from the picture. And VACUUM has a pretty good understanding
of how that works. Index vacuuming remains the achilles' heel, and I
think that avoiding it in some cases has tremendous value. It has
outsized importance now because we've significantly ameliorated the
problems in the heap, by having the visibility map. What other factor
can make VACUUM take 10x longer than usual on occasion?

Autovacuum scheduling is essentially a top-down model of the needs of
the system -- and one with a lot of flaws. IMV we can make the model's
simplistic view of reality better by making the reality better (i.e.
simpler, more tolerant of stressors) instead of making the model
better.

> Now, that's not actually a good thing, because most users aren't smart
> enough to do that, and will either leave a gigantic safety margin that
> they don't need, or will leave an inadequate safety margin and take
> the system down. However, it means we need to be very, very careful
> 

Re: fdatasync performance problem with large number of DB files

2021-03-18 Thread Thomas Munro
On Fri, Mar 19, 2021 at 2:16 PM Thomas Munro  wrote:
> PS: For illustration/discussion, I've also attached a "none" patch.  I
> also couldn't resist rebasing my "wal" mode patch, which I plan to
> propose for PG15 because there is not enough time left for this
> release.

Erm... I attached the wrong version by mistake.  Here's a better one.
(Note: I'm not expecting any review of the 0003 patch in this CF, I
just wanted to share the future direction I'd like to consider for
this problem.)
From a5faf47f720684bd7952ecf0ca0c495d9ec14989 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sun, 27 Sep 2020 17:22:10 +1300
Subject: [PATCH v5 1/3] Provide recovery_init_sync_method=syncfs.

On some systems, opening every file in the data directory can be very
slow, as the first step before crash recovery can begin.  Provide an
alternative method, for Linux only, where we call syncfs() on every
possibly different filesystem under the data directory.  This avoids
faulting in inodes for potentially very many inodes.  It comes with some
caveats, so it's not the default.

Reported-by: Michael Brown 
Reviewed-by: Fujii Masao 
Reviewed-by: Paul Guo 
Reviewed-by: Bruce Momjian 
Discussion: https://postgr.es/m/11bc2bb7-ecb5-3ad0-b39f-df632734cd81%40discourse.org
Discussion: https://postgr.es/m/CAEET0ZHGnbXmi8yF3ywsDZvb3m9CbdsGZgfTXscQ6agcbzcZAw%40mail.gmail.com
---
 configure |  2 +-
 configure.ac  |  1 +
 doc/src/sgml/config.sgml  | 30 +
 src/backend/storage/file/fd.c | 61 ++-
 src/backend/utils/misc/guc.c  | 17 ++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 src/include/pg_config.h.in|  3 +
 src/include/storage/fd.h  |  6 ++
 src/tools/msvc/Solution.pm|  1 +
 9 files changed, 120 insertions(+), 2 deletions(-)

diff --git a/configure b/configure
index 3fd4cecbeb..6a2051da9d 100755
--- a/configure
+++ b/configure
@@ -15239,7 +15239,7 @@ fi
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-for ac_func in backtrace_symbols clock_gettime copyfile fdatasync getifaddrs getpeerucred getrlimit kqueue mbstowcs_l memset_s poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink readv setproctitle setproctitle_fast setsid shm_open strchrnul strsignal symlink sync_file_range uselocale wcstombs_l writev
+for ac_func in backtrace_symbols clock_gettime copyfile fdatasync getifaddrs getpeerucred getrlimit kqueue mbstowcs_l memset_s poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink readv setproctitle setproctitle_fast setsid shm_open strchrnul strsignal symlink syncfs sync_file_range uselocale wcstombs_l writev
 do :
   as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
 ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
diff --git a/configure.ac b/configure.ac
index 2f1585adc0..dd819ab2c2 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1681,6 +1681,7 @@ AC_CHECK_FUNCS(m4_normalize([
 	strchrnul
 	strsignal
 	symlink
+	syncfs
 	sync_file_range
 	uselocale
 	wcstombs_l
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 863ac31c6b..32de8235bf 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9721,6 +9721,36 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
   
  
 
+ 
+  recovery_init_sync_method (enum)
+   
+recovery_init_sync_method configuration parameter
+   
+  
+  
+   
+When set to fsync, which is the default,
+PostgreSQL will recursively open and fsync
+all files in the data directory before crash recovery begins.  This
+is intended to make sure that all WAL and data files are durably stored
+on disk before replaying changes.  This applies whenever starting a
+database cluster that did not shut down cleanly, including copies
+created with pg_basebackup.
+   
+   
+On Linux, syncfs may be used instead, to ask the
+operating system to flush the whole file system.  This may be a lot
+faster, because it doesn't need to open each file one by one.  On the
+other hand, it may be slower if the file system is shared by other
+applications that modify a lot of files, since those files will also
+be flushed to disk.  Furthermore, on versions of Linux before 5.8, I/O
+errors encountered while writing data to disk may not be reported to
+PostgreSQL, and relevant error messages may
+appear only in kernel logs.
+   
+  
+ 
+
 
 

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 110ba31517..14a07f09a4 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -72,9 +72,11 @@
 
 #include "postgres.h"
 
+#include 
 #include 
 #include 
 

Re: WIP: WAL prefetch (another approach)

2021-03-18 Thread Tomas Vondra
On 3/18/21 1:54 AM, Thomas Munro wrote:
> On Thu, Mar 18, 2021 at 12:00 PM Tomas Vondra
>  wrote:
>> On 3/17/21 10:43 PM, Stephen Frost wrote:
>>> Guess I'm just not a fan of pushing out a change that will impact
>>> everyone by default, in a possibly negative way (or positive, though
>>> that doesn't seem terribly likely, but who knows), without actually
>>> measuring what that impact will look like in those more common cases.
>>> Showing that it's a great win when you're on ZFS or running with FPWs
>>> disabled is good and the expected best case, but we should be
>>> considering the worst case too when it comes to performance
>>> improvements.
>>>
>>
>> Well, maybe it'll behave differently on systems with ZFS. I don't know,
>> and I have no such machine to test that at the moment. My argument
>> however remains the same - if it happens to be a problem, just don't
>> enable (or disable) the prefetching, and you get the current behavior.
> 
> I see the road map for this feature being to get it working on every
> OS via the AIO patchset, in later work, hopefully not very far in the
> future (in the most portable mode, you get I/O worker processes doing
> pread() or preadv() calls on behalf of recovery).  So I'll be glad to
> get this infrastructure in, even though it's maybe only useful for
> some people in the first release.
> 

+1 to that


>> FWIW I'm not sure there was a discussion or argument about what should
>> be the default setting (enabled or disabled). I'm fine with not enabling
>> this by default, so that people have to enable it explicitly.
>>
>> In a way that'd be consistent with effective_io_concurrency being 1 by
>> default, which almost disables regular prefetching.
> 
> Yeah, I'm not sure but I'd be fine with disabling it by default in the
> initial release.  The current patch set has it enabled, but that's
> mostly for testing, it's not an opinion on how it should ship.
> 

+1 to that too. Better to have it disabled by default than not at all.


> I've attached a rebased patch set with a couple of small changes:
> 
> 1.  I abandoned the patch that proposed
> pg_atomic_unlocked_add_fetch_u{32,64}() and went for a simple function
> local to xlogprefetch.c that just does pg_atomic_write_u64(counter,
> pg_atomic_read_u64(counter) + 1), in response to complaints from
> Andres[1].
> 
> 2.  I fixed a bug in ReadRecentBuffer(), and moved it into its own
> patch for separate review.
> 
> I'm now looking at Horiguchi-san and Heikki's patch[2] to remove
> XLogReader's callbacks, to try to understand how these two patch sets
> are related.  I don't really like the way those callbacks work, and
> I'm afraid I had to make them more complicated.  But I don't yet know
> very much about that other patch set.  More soon.
> 

OK. Do you think we should get both of those patches in, or do we need
to commit them in a particular order? Or what is your concern?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: cleanup temporary files after crash

2021-03-18 Thread Tomas Vondra
On 3/19/21 1:54 AM, Euler Taveira wrote:
> On Thu, Mar 18, 2021, at 8:34 PM, Tomas Vondra wrote:
>> Well, that's better, but it still does not do the trick on the 32-bit
>> machine - in that case 1000 rows with int4 still fit into work_mem, so
>> the temp file is not created. Per my experiments about 1040 rows are
>> needed - so close ;-) So let's make it 2000.
> My 32-bit laptop needs some repairs so I blindly chose 1k rows.
> 

Yeah, it's not immediately obvious how many rows are needed. 2000 would
be enough, but I just bumped it up to 5000 for good measure.

>> We might as well check that the temp file actually exists, before
>> killing the backend. Just to be sure.
> Do you mean with remove_temp_files_after_crash = on? New version attached.
> 

On second thought, we don't need that check. I realized the second part
of the test (checking remove_temp_files_after_crash=off) actually does
verify that we've created the file, so that should be OK.

I've pushed the previous version with the 5000 rows. Let's see what
crake and florican think about this ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: fdatasync performance problem with large number of DB files

2021-03-18 Thread Thomas Munro
On Fri, Mar 19, 2021 at 5:50 AM Fujii Masao  wrote:
> On 2021/03/18 23:03, Bruce Momjian wrote:
> >> Are we sure we want to use the word "crash" here?  I don't remember
> >> seeing it used anywhere else in our user interface.
>
> We have GUC restart_after_crash.
>
> >  I guess it is
> >> "crash recovery".
> >
> > Maybe call it "recovery_sync_method"
> +1. This name sounds good to me. Or recovery_init_sync_method, because that
> sync happens in the initial phase of recovery.

Yeah, I was trying to fit the existing pattern
{restart,remove_temp_files}_after_crash.  But
recovery_init_sync_method also sounds good to me.  I prefer the
version with "init"... without "init", people might get the wrong idea
about what this controls, so let's try that.  Done in the attached
version.

> Another idea from different angle is data_directory_sync_method. If we adopt
> this, we can easily extend this feature for other use cases (other than sync 
> at
> the beginning of recovery) without changing the name.
> I'm not sure if such cases actually exist, though.

I can't imagine what -- it's like using a sledge hammer to crack a nut.

(I am aware of a semi-related idea: use the proposed fsinfo() Linux
system call to read the filesystem-wide error counter at every
checkpoint to see if anything bad happened that Linux forgot to tell
us about through the usual channels.  That's the same internal
mechanism used by syncfs() to report errors, but last I checked it
hadn't been committed yet.  I don't think that shares anything with
this code, though.)

From your earlier email:

On Fri, Mar 19, 2021 at 2:12 AM Fujii Masao  wrote:
> +database cluster that did not shut down cleanly, including copies
> +created with pg_basebackup.
>
> "pg_basebackup" should be "pg_basebackup"?

Fixed.

> +   while ((de = ReadDir(dir, "pg_tblspc")))
>
> The comment of SyncDataDirectory() says "Errors are logged but not
> considered fatal". So ReadDirExtended() with LOG level should be used
> here, instead?

Fixed.

> Isn't it better to call CHECK_FOR_INTERRUPTS() in this loop?

How could this be useful?

> +   fd = open(path, O_RDONLY);
>
> For current use, it's not harmful to use open() and close(). But isn't
> it safer to use OpenTransientFile() and CloseTransientFile(), instead?

Ok, done, for consistency with other code.

> Because do_syncfs() may be used for other purpose in the future.

I hope not :-)

> +   if (syncfs(fd) < 0)
> +   elog(LOG, "syncfs failed for %s: %m", path);
>
> According to the message style guide, this message should be something
> like "could not sync filesystem for \"%s\": %m"?

Fixed.

> I confirmed that no error was reported when crash recovery started with
> syncfs, in old Linux. I should also confirm that no error is reported in that
> case in Linux 5.8+, but I don't have that environment. So I've not tested
> this feature in Linux 5.8+

I have a Linux 5.10 system.  Here's a trace of SyncDataDirectory on a
system that has two tablespaces and has a symlink for pg_wal:

[pid 3861224] lstat("pg_wal", {st_mode=S_IFLNK|0777, st_size=11, ...}) = 0
[pid 3861224] openat(AT_FDCWD, ".", O_RDONLY) = 4
[pid 3861224] syncfs(4) = 0
[pid 3861224] close(4)  = 0
[pid 3861224] openat(AT_FDCWD, "pg_tblspc",
O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 4
[pid 3861224] fstat(4, {st_mode=S_IFDIR|0700, st_size=32, ...}) = 0
[pid 3861224] getdents64(4, 0x55627e18fb60 /* 4 entries */, 32768) = 112
[pid 3861224] openat(AT_FDCWD, "pg_tblspc/16406", O_RDONLY) = 5
[pid 3861224] syncfs(5) = 0
[pid 3861224] close(5)  = 0
[pid 3861224] openat(AT_FDCWD, "pg_tblspc/16407", O_RDONLY) = 5
[pid 3861224] syncfs(5) = 0
[pid 3861224] close(5)  = 0
[pid 3861224] getdents64(4, 0x55627e18fb60 /* 0 entries */, 32768) = 0
[pid 3861224] close(4)  = 0
[pid 3861224] openat(AT_FDCWD, "pg_wal", O_RDONLY) = 4
[pid 3861224] syncfs(4) = 0
[pid 3861224] close(4)  = 0
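
The trace above maps directly onto the call pattern; for reference, a minimal
standalone, Linux-only sketch of it (error handling simplified compared to the
real do_syncfs(), with paths mirroring the trace and the tablespace OID taken
from it):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static void
sync_filesystem_of(const char *path)
{
	int			fd = open(path, O_RDONLY);

	if (fd < 0)
	{
		fprintf(stderr, "could not open \"%s\"\n", path);
		return;
	}
	if (syncfs(fd) < 0)
		fprintf(stderr, "could not sync filesystem for \"%s\"\n", path);
	close(fd);
}

int
main(void)
{
	/* Same targets as in the trace: the data directory itself, each
	 * tablespace under pg_tblspc, and pg_wal if it is a symlink. */
	sync_filesystem_of(".");
	sync_filesystem_of("pg_tblspc/16406");	/* example tablespace OID from the trace */
	sync_filesystem_of("pg_wal");
	return 0;
}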

To see how it looks when syncfs() fails, I added a fake implementation
that fails with EUCLEAN on every second call, and then the output
looks like this:

...
1616111356.663 startup 3879272 LOG:  database system was interrupted;
last known up at 2021-03-19 12:48:33 NZDT
1616111356.663 startup 3879272 LOG:  could not sync filesystem for
"pg_tblspc/16406": Structure needs cleaning
1616111356.663 startup 3879272 LOG:  could not sync filesystem for
"pg_wal": Structure needs cleaning
1616111356.663 startup 3879272 LOG:  database system was not properly
shut down; automatic recovery in progress
...

A more common setup with no tablespaces and pg_wal as a non-symlink looks like:

[pid 3861448] lstat("pg_wal", {st_mode=S_IFDIR|0700, st_size=316, ...}) = 0
[pid 3861448] openat(AT_FDCWD, ".", O_RDONLY) = 4
[pid 3861448] syncfs(4) = 0
[pid 3861448] close(4)  = 0
[pid 3861448] openat(AT_FDCWD, 

Re: libpq compression

2021-03-18 Thread Justin Pryzby
On Thu, Mar 18, 2021 at 07:30:09PM +, Daniil Zakhlystov wrote:
> The new status of this patch is: Ready for Committer

The CF bot is failing, because the last patch sent to the list is from January:
| Latest attachment (libpq-compression-31.patch) at 2021-01-12 14:05:22 from 
Konstantin Knizhnik ...

The most recent messages have all had links to github repos without patches
attached.

Also, the branches I've looked at on the github repos all have messy history,
and need to be squashed into a coherent commit or set of commits.

Would you send a patch to the list with the commits as you propose to merge
them?

Also, I'm not sure, but I think you may find that the zstd configure.ac check
should use pkg-config.  This allowed the CIs to compile these patches.  Without
pkg-config, the macos CI couldn't find (at least) LZ4.
https://commitfest.postgresql.org/32/2813/
https://commitfest.postgresql.org/32/3015/

Also, in those patches, we have separate "not for merge" patches which enable
the compression library by default.  This allows the CIs to exercise the
feature.

-- 
Justin




RE: [PATCH] pgbench: improve \sleep meta command

2021-03-18 Thread kuroda.hay...@fujitsu.com
Dear Fujii-san,

I confirmed your patch and some parse functions, and I agree that
the check condition in evaluateSleep() is correct.
No problems were found.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: cleanup temporary files after crash

2021-03-18 Thread Euler Taveira
On Thu, Mar 18, 2021, at 8:34 PM, Tomas Vondra wrote:
> Well, that's better, but it still does not do the trick on the 32-bit
> machine - in that case 1000 rows with int4 still fit into work_mem, so
> the temp file is not created. Per my experiments about 1040 rows are
> needed - so close ;-) So let's make it 2000.
My 32-bit laptop needs some repairs so I blindly chose 1k rows.

> We might as well check that the temp file actually exists, before
> killing the backend. Just to be sure.
Do you mean with remove_temp_files_after_crash = on? New version attached.


--
Euler Taveira
EDB   https://www.enterprisedb.com/
diff --git a/src/test/recovery/t/022_crash_temp_files.pl b/src/test/recovery/t/022_crash_temp_files.pl
index c37b227770..38e935d641 100644
--- a/src/test/recovery/t/022_crash_temp_files.pl
+++ b/src/test/recovery/t/022_crash_temp_files.pl
@@ -5,9 +5,8 @@ use PostgresNode;
 use TestLib;
 use Test::More;
 use Config;
-use Time::HiRes qw(usleep);
 
-plan tests => 9;
+plan tests => 10;
 
 
 # To avoid hanging while expecting some specific input from a psql
@@ -33,8 +32,7 @@ $node->safe_psql(
 # create table, insert rows
 $node->safe_psql(
 	'postgres',
-	q[CREATE TABLE tab_crash (a text);
-		INSERT INTO tab_crash (a) SELECT gen_random_uuid() FROM generate_series(1, 500);]);
+	q[CREATE TABLE tab_crash (a integer UNIQUE);]);
 
 # Run psql, keeping session alive, so we have an alive backend to kill.
 my ($killme_stdin, $killme_stdout, $killme_stderr) = ('', '', '');
@@ -62,6 +60,32 @@ chomp($pid);
 $killme_stdout = '';
 $killme_stderr = '';
 
+# Open a 2nd session that will block the 1st session. The UNIQUE constraint
+# will prevent the temporary file from the 1st session to be removed.
+my ($killme_stdin2, $killme_stdout2, $killme_stderr2) = ('', '', '');
+my $killme2 = IPC::Run::start(
+	[
+		'psql', '-X', '-qAt', '-v', 'ON_ERROR_STOP=1', '-f', '-', '-d',
+		$node->connstr('postgres')
+	],
+	'<',
+	\$killme_stdin2,
+	'>',
+	\$killme_stdout2,
+	'2>',
+	\$killme_stderr2,
+	$psql_timeout);
+
+# Insert one tuple and leave the transaction open
+$killme_stdin2 .= q[
+BEGIN;
+SELECT $$insert-tuple-to-lock-next-insert$$;
+INSERT INTO tab_crash (a) VALUES(1);
+];
+pump_until($killme2, \$killme_stdout2, qr/insert-tuple-to-lock-next-insert/m);
+$killme_stdout2 = '';
+$killme_stderr2 = '';
+
 # Run the query that generates a temporary file and that will be killed before
 # it finishes. Since the query that generates the temporary file does not
 # return before the connection is killed, use a SELECT before to trigger
@@ -69,15 +93,18 @@ $killme_stderr = '';
 $killme_stdin .= q[
 BEGIN;
 SELECT $$in-progress-before-sigkill$$;
-WITH foo AS (SELECT a FROM tab_crash ORDER BY a) SELECT a, pg_sleep(1) FROM foo;
+INSERT INTO tab_crash (a) SELECT i FROM generate_series(1, 2000) s(i);
 ];
 ok(pump_until($killme, \$killme_stdout, qr/in-progress-before-sigkill/m),
-	'select in-progress-before-sigkill');
+	'insert in-progress-before-sigkill');
 $killme_stdout = '';
 $killme_stderr = '';
 
-# Wait some time so the temporary file is generated by SELECT
-usleep(10_000);
+# Check for the existence of a temporary file
+is($node->safe_psql(
+	'postgres',
+	'SELECT COUNT(1) FROM pg_ls_dir($$base/pgsql_tmp$$)'),
+	qq(1), 'one temporary file');
 
 # Kill with SIGKILL
 my $ret = TestLib::system_log('pg_ctl', 'kill', 'KILL', $pid);
@@ -85,6 +112,7 @@ is($ret, 0, 'killed process with KILL');
 
 # Close psql session
 $killme->finish;
+$killme2->finish;
 
 # Wait till server restarts
 $node->poll_query_until('postgres', 'SELECT 1', '1');
@@ -118,6 +146,20 @@ chomp($pid);
 $killme_stdout = '';
 $killme_stderr = '';
 
+# Restart the 2nd psql session
+($killme_stdin2, $killme_stdout2, $killme_stderr2) = ('', '', '');
+$killme2->run();
+
+# Insert one tuple and leave the transaction open
+$killme_stdin2 .= q[
+BEGIN;
+SELECT $$insert-tuple-to-lock-next-insert$$;
+INSERT INTO tab_crash (a) VALUES(1);
+];
+pump_until($killme2, \$killme_stdout2, qr/insert-tuple-to-lock-next-insert/m);
+$killme_stdout2 = '';
+$killme_stderr2 = '';
+
 # Run the query that generates a temporary file and that will be killed before
 # it finishes. Since the query that generates the temporary file does not
 # return before the connection is killed, use a SELECT before to trigger
@@ -125,22 +167,20 @@ $killme_stderr = '';
 $killme_stdin .= q[
 BEGIN;
 SELECT $$in-progress-before-sigkill$$;
-WITH foo AS (SELECT a FROM tab_crash ORDER BY a) SELECT a, pg_sleep(1) FROM foo;
+INSERT INTO tab_crash (a) SELECT i FROM generate_series(1, 2000) s(i);
 ];
 ok(pump_until($killme, \$killme_stdout, qr/in-progress-before-sigkill/m),
-	'select in-progress-before-sigkill');
+	'insert in-progress-before-sigkill');
 $killme_stdout = '';
 $killme_stderr = '';
 
-# Wait some time so the temporary file is generated by SELECT
-usleep(10_000);
-
 # Kill with SIGKILL
 $ret = TestLib::system_log('pg_ctl', 'kill', 'KILL', $pid);
 is($ret, 0, 'killed process with KILL');
 
 # Close psql 

Re: psql tab completion for \h with IMPORT FOREIGN SCHEMA

2021-03-18 Thread Michael Paquier
On Thu, Mar 18, 2021 at 07:45:36AM +, gkokola...@pm.me wrote:
> It seems helpful. Thank you.

Thanks, applied then.
--
Michael




Re: Maintaining a list of pgindent commits for "git blame" to ignore

2021-03-18 Thread Tom Lane
Peter Geoghegan  writes:
> On Thu, Mar 18, 2021 at 4:40 PM Tom Lane  wrote:
>> Good question.  We don't have a standard about that (whether to
>> do those in separate or the same commits), but we could establish one
>> if it seems helpful.

> I don't think that it matters too much, but it will necessitate
> updating the file multiple times. It might become natural to just do
> everything together in a way that it wasn't before.

Doubt that it matters.  The workflow would have to be "commit and push
the mechanical updates, then edit the tracking file, commit and push
that".  You don't have the commit hash nailed down till you've pushed.
So if we decided to do the mechanical updates in several commits,
not just one, I'd still be inclined to do them all and then edit the
tracking file once.

regards, tom lane




Re: GROUP BY DISTINCT

2021-03-18 Thread Vik Fearing
On 3/19/21 12:52 AM, Tomas Vondra wrote:
> 
> On 3/19/21 12:26 AM, Tom Lane wrote:
>> I wrote:
>>> This is gcc 4.5, but hopefully whatever shuts it up will also work on 4.7.
>>> I'll work on figuring that out.
>>
>> Actually, the problem is pretty obvious after comparing this use
>> of foreach_delete_current() to every other one.  I'm not sure why
>> the compiler warnings are phrased just as they are, but the fix
>> I just pushed does make 4.5 happy.
>>
> 
> Thanks! Yeah, that looks obvious. Funny the older compilers noticed
> this, not the new fancy ones.

+1

I'm glad the buildfarm is so diverse.
-- 
Vik Fearing




Re: Maintaining a list of pgindent commits for "git blame" to ignore

2021-03-18 Thread Peter Geoghegan
On Thu, Mar 18, 2021 at 4:40 PM Tom Lane  wrote:
> Good question.  We don't have a standard about that (whether to
> do those in separate or the same commits), but we could establish one
> if it seems helpful.

I don't think that it matters too much, but it will necessitate
updating the file multiple times. It might become natural to just do
everything together in a way that it wasn't before.

The really big wins come from excluding the enormous pgindent run
commits, especially for the few historic pgindent runs where the rules
changed -- there are no more than a handful of those. They tend to
generate an enormous amount of churn that touches almost everything.
So it probably isn't necessary to worry about smaller things.

--
Peter Geoghegan




Re: GROUP BY DISTINCT

2021-03-18 Thread Tomas Vondra


On 3/19/21 12:26 AM, Tom Lane wrote:
> I wrote:
>> This is gcc 4.5, but hopefully whatever shuts it up will also work on 4.7.
>> I'll work on figuring that out.
> 
> Actually, the problem is pretty obvious after comparing this use
> of foreach_delete_current() to every other one.  I'm not sure why
> the compiler warnings are phrased just as they are, but the fix
> I just pushed does make 4.5 happy.
> 

Thanks! Yeah, that looks obvious. Funny the older compilers noticed
this, not the new fancy ones.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Maintaining a list of pgindent commits for "git blame" to ignore

2021-03-18 Thread Tom Lane
Peter Geoghegan  writes:
> What about reformat-dat-files and perltidy runs? It seems that you
> have sometimes used all three reformatting tools to produce one commit
> -- but not always. ISTM that I should get any of those that I missed.
> And that the pgindent README (which already mentions these other
> tools) ought to be updated to be explicit about the policy applying
> equally to commits that apply either of the two other tools in bulk.

Good question.  We don't have a standard about that (whether to
do those in separate or the same commits), but we could establish one
if it seems helpful.

regards, tom lane




Re: cleanup temporary files after crash

2021-03-18 Thread Tomas Vondra
On 3/18/21 10:44 PM, Euler Taveira wrote:
> On Thu, Mar 18, 2021, at 6:00 PM, Euler Taveira wrote:
>> On Thu, Mar 18, 2021, at 5:51 PM, Tomas Vondra wrote:
>>> OK. Can you prepare a patch with the proposed test approach?
>> I'm on it.
> What do you think about this patch?
> 

Well, that's better, but it still does not do the trick on the 32-bit
machine - in that case 1000 rows with int4 still fit into work_mem, so
the temp file is not created. Per my experiments about 1040 rows are
needed - so close ;-) So let's make it 2000.

We might as well check that the temp file actually exists, before
killing the backend. Just to be sure.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Maintaining a list of pgindent commits for "git blame" to ignore

2021-03-18 Thread Peter Geoghegan
On Thu, Mar 18, 2021 at 4:05 PM Tom Lane  wrote:
> b5d69b7c22ee4c44b8bb99cfa0466ffaf3b5fab9  # Sun Jun 7 16:57:08 2020 -0400
> # pgindent run prior to branching v13.
>
> which is easy to make from "git log" or "git show" output.  (Because
> of the precedent of those tools, I'd rather write the commit hash
> before the rest of the entry.)

WFM.

What about reformat-dat-files and perltidy runs? It seems that you
have sometimes used all three reformatting tools to produce one commit
-- but not always. ISTM that I should get any of those that I missed.
And that the pgindent README (which already mentions these other
tools) ought to be updated to be explicit about the policy applying
equally to commits that apply either of the two other tools in bulk.

-- 
Peter Geoghegan




Re: Maintaining a list of pgindent commits for "git blame" to ignore

2021-03-18 Thread Bruce Momjian
On Thu, Mar 18, 2021 at 07:05:03PM -0400, Tom Lane wrote:
> Other points: the file should be kept in src/tools/pgindent/, and
> it should definitely NOT have a name beginning with ".".

Well, if we want GitHub and others to eventually honor a file, it seems
.git-blame-ignore-revs at the top of the tree is the common location for
this.  Of course, I don't know if they will ever do that, and we can move
it later if needed.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: GROUP BY DISTINCT

2021-03-18 Thread Tom Lane
I wrote:
> This is gcc 4.5, but hopefully whatever shuts it up will also work on 4.7.
> I'll work on figuring that out.

Actually, the problem is pretty obvious after comparing this use
of foreach_delete_current() to every other one.  I'm not sure why
the compiler warnings are phrased just as they are, but the fix
I just pushed does make 4.5 happy.

regards, tom lane




Re: GROUP BY DISTINCT

2021-03-18 Thread Tom Lane
I wrote:
> Hmm ... prairiedog isn't showing the warning, but maybe gaur will.

Bingo:

parse_agg.c: In function 'expand_grouping_sets':
parse_agg.c:1851:5: warning: value computed is not used

This is gcc 4.5, but hopefully whatever shuts it up will also work on 4.7.
I'll work on figuring that out.

regards, tom lane




Re: Maintaining a list of pgindent commits for "git blame" to ignore

2021-03-18 Thread Tom Lane
Peter Geoghegan  writes:
> On Thu, Mar 18, 2021 at 3:39 PM Tom Lane  wrote:
>> I don't object to maintaining such a file; if it makes "git blame"
>> work better, that's a huge win.  However, the file as you have it
>> seems rather unreadable.  I'd much rather have something that includes
>> the commit date and/or first line of commit message.  Is there any
>> flexibility in the format, or does git blame insist it be just like this?

> I ended up doing it that way because I was in a hurry to see how much
> it helped. I can fix it up.
> We could require (but not automatically enforce) that the first line
> of the commit message be included above each hash, as a comment. You
> could also require reverse chronological ordering of commits. That
> would make everything easy to follow.

Given that the file will be added to manually, I think just having an
existing format to follow will be easy enough.  I'd suggest something
like

b5d69b7c22ee4c44b8bb99cfa0466ffaf3b5fab9  # Sun Jun 7 16:57:08 2020 -0400
# pgindent run prior to branching v13.

which is easy to make from "git log" or "git show" output.  (Because
of the precedent of those tools, I'd rather write the commit hash
before the rest of the entry.)

The date is important IMO because otherwise it's quite unclear whether
to add a new entry at the top or the bottom.

Other points: the file should be kept in src/tools/pgindent/, and
it should definitely NOT have a name beginning with ".".

regards, tom lane




Re: Maintaining a list of pgindent commits for "git blame" to ignore

2021-03-18 Thread Peter Geoghegan
On Thu, Mar 18, 2021 at 4:00 PM Bruce Momjian  wrote:
> Probably because later commits might collide with shorter hashes.  When
> you are reporting a hash that only looks _backward_, this is not an
> issue.

Right, but it's extremely unlikely to happen by accident. I was
suggesting that there might be a security issue. I could fairly easily
make my git commit match a prefix intended to uniquely identify your
git commit if I set out to do so.

There are projects that might have to consider that possibility,
though perhaps we're not one of them.

-- 
Peter Geoghegan




Re: Maintaining a list of pgindent commits for "git blame" to ignore

2021-03-18 Thread Bruce Momjian
On Thu, Mar 18, 2021 at 03:46:10PM -0700, Peter Geoghegan wrote:
> It's worth noting that git insists that you provide the full hash of
> commits here. This is not something I remember it insisting upon in
> any other area. There is probably a very good practical reason for
> that.

Probably because later commits might collide with shorter hashes.  When
you are reporting a hash that only looks _backward_, this is not an
issue.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Maintaining a list of pgindent commits for "git blame" to ignore

2021-03-18 Thread Peter Geoghegan
On Thu, Mar 18, 2021 at 3:39 PM Tom Lane  wrote:
> I don't object to maintaining such a file; if it makes "git blame"
> work better, that's a huge win.  However, the file as you have it
> seems rather unreadable.  I'd much rather have something that includes
> the commit date and/or first line of commit message.  Is there any
> flexibility in the format, or does git blame insist it be just like this?

I ended up doing it that way because I was in a hurry to see how much
it helped. I can fix it up.

We could require (but not automatically enforce) that the first line
of the commit message be included above each hash, as a comment. You
could also require reverse chronological ordering of commits. That
would make everything easy to follow.

It's worth noting that git insists that you provide the full hash of
commits here. This is not something I remember it insisting upon in
any other area. There is probably a very good practical reason for
that.

-- 
Peter Geoghegan




Re: Maintaining a list of pgindent commits for "git blame" to ignore

2021-03-18 Thread Tom Lane
Peter Geoghegan  writes:
> Attached is my .git-blame-ignore-revs file, which has pgindent commits
> that I'd like to exclude from git blame. The file is helpful on its
> own. But what we really ought to do is commit the file (perhaps with
> some revisions) and require that it be maintained by the official
> project workflow documented at src/tools/pgindent/README.

I don't object to maintaining such a file; if it makes "git blame"
work better, that's a huge win.  However, the file as you have it
seems rather unreadable.  I'd much rather have something that includes
the commit date and/or first line of commit message.  Is there any
flexibility in the format, or does git blame insist it be just like this?

regards, tom lane




Re: Maintaining a list of pgindent commits for "git blame" to ignore

2021-03-18 Thread Bruce Momjian
On Thu, Mar 18, 2021 at 03:20:37PM -0700, Peter Geoghegan wrote:
> On Thu, Mar 18, 2021 at 3:12 PM Bruce Momjian  wrote:
> > It would be kind of nice if the file can be generated automatically.  I
> > have you checked if 'pgindent' being on the first line of the commit is
> > sufficient?
> 
> I generated the file by looking for commits that:
> 
> 1) Mentioned "pgindent" or "PGINDENT" in the entire commit message.
> 
> 2) Had more than 20 or 30 files changed.
> 
> This left me with fewer than 50 commits that cover over 20 years of
> history since the first pgindent commit. I also added one or two
> others that I somehow missed (maybe you happened to spell it "pg
> indent" that year) through trial and error. The file that I sent to
> the list works really well for me.
> 
> I don't think that it's a good idea to automate this process, because
> we certainly don't want to let incorrect entries slip in. And because
> there just isn't a lot left to automate -- running pgindent on the
> tree is something that happens no more than 2 or 3 times a year. It
> could easily be added to the checklist in the README. It should take
> less than 5 minutes a year.

Sounds like a plan.  We should mention adding to this file somewhere in
our pgindent README.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: GROUP BY DISTINCT

2021-03-18 Thread Tom Lane
Thomas Munro  writes:
> On Fri, Mar 19, 2021 at 10:14 AM Tomas Vondra
>  wrote:
>> Thanks for the info. So it's likely related to older gcc releases. The
>> question is how to tweak the code to get rid of this ...

> It's frustrating to have to do press-ups to fix a problem because a
> zombie Debian 7 system is running with -Werror (though it's always
> possible that it's telling us something interesting...).  Anyway, I
> think someone with a GCC < 4.8 compiler would have to investigate.  I
> was hoping to help, but none of my systems have one in easy-to-install
> format...

Hmm ... prairiedog isn't showing the warning, but maybe gaur will.
I can take a look if nobody else is stepping up.

regards, tom lane




Re: Maintaining a list of pgindent commits for "git blame" to ignore

2021-03-18 Thread Peter Geoghegan
On Thu, Mar 18, 2021 at 3:12 PM Bruce Momjian  wrote:
> It would be kind of nice if the file can be generated automatically.
> Have you checked if 'pgindent' being on the first line of the commit is
> sufficient?

I generated the file by looking for commits that:

1) Mentioned "pgindent" or "PGINDENT" in the entire commit message.

2) Had more than 20 or 30 files changed.

This left me with fewer than 50 commits that cover over 20 years of
history since the first pgindent commit. I also added one or two
others that I somehow missed (maybe you happened to spell it "pg
indent" that year) through trial and error. The file that I sent to
the list works really well for me.

I don't think that it's a good idea to automate this process, because
we certainly don't want to let incorrect entries slip in. And because
there just isn't a lot left to automate -- running pgindent on the
tree is something that happens no more than 2 or 3 times a year. It
could easily be added to the checklist in the README. It should take
less than 5 minutes a year.

-- 
Peter Geoghegan




PG13 fails to startup even though the current transaction is equal to the target transaction

2021-03-18 Thread Sean Jezewski
We've noticed what may be a regression / bug in PG13.

I work at Heroku on the Data team, where we manage a fleet of Postgres
services. This change has broken the UX we offer customers for managing
their PG services. In particular, ‘forks’ and ‘point in time restores’
seem broken for PG13.

I believe it is related to this patch (
https://www.postgresql.org/message-id/993736dd3f1713ec1f63fc3b653839f5%40lako.no
)

For PG12, we expect:

-- We create a new Postgres service from archive and provide a
recovery_target_xid
-- PG replays the archive until the end of the archive is reached, and the
current transaction == recovery_target_xid
-- We measure the current transaction via the query SELECT
pg_catalog.txid_snapshot_xmax(pg_catalog.txid_current_snapshot())
-- Since the current transaction is exactly equal to the target
transaction, we perform the promotion

For PG12, what we get:

-- This process completes smoothly, and the new postgres service is up and
running

For PG13, we expect:

-- We create a new Postgres service from archive and provide a
recovery_target_xid
-- PG replays the archive until the end of the archive is reached, and the
current transaction == recovery_target_xid
-- We measure the current transaction via the query SELECT
pg_catalog.txid_snapshot_xmax(pg_catalog.txid_current_snapshot())
-- Since the current transaction is exactly equal to the target
transaction, we perform the promotion

For PG13, what we get:

-- On promotion we see the postgres process exit with the following log
lines:

Mar 17 14:47:49 ip-10-0-146-54 25a9551c_65ec_4870_99e9_df69151984a0[7]:
[18-1] sql_error_code = 0 LOG: promote trigger file found:
/etc/postgresql/wal-e.d/pull-env/STANDBY_OFF
Mar 17 14:47:49 ip-10-0-146-54 25a9551c_65ec_4870_99e9_df69151984a0[7]:
[19-1] sql_error_code = 0 LOG: redo done at 0/60527E0
Mar 17 14:47:49 ip-10-0-146-54 25a9551c_65ec_4870_99e9_df69151984a0[7]:
[20-1] sql_error_code = 0 LOG: last completed transaction was at log
time 2021-03-17 14:42:44.901448+00
Mar 17 14:47:49 ip-10-0-146-54 25a9551c_65ec_4870_99e9_df69151984a0[7]:
[21-1] sql_error_code = XX000 FATAL: recovery ended before configured
recovery target was reached
Mar 17 14:47:49 ip-10-0-146-54 25a9551c_65ec_4870_99e9_df69151984a0[5]:
[8-1] sql_error_code = 0 LOG: startup process (PID 7) exited with exit
code 1

This happens even though the transaction IDs are identical. It seems like the
end of the archive was reached (in fact the last transaction was), and while
we arrived at the correct transaction id, somehow PG decided it wasn’t done
replaying? Perhaps because the timestamps don’t line up? (Afaict we do not set
recovery_target_time, just recovery_target_xid.)

We have the `recovery_target_inclusive` set to true, which is the default.
It really seems like the intent of that setting is that if the target
equals the current transaction ID, recovery should be marked as complete.
However, we're seeing the opposite: while the current txn id == the target
transaction id, the server exits. This is surprising, and doesn't line up
with our expected behavior.

We have a workaround.

Right before promotion, if we increment the transaction of the leader
database (the original PG service that we're forking from) by running
`SELECT pg_catalog.txid_current()`, wait 120 seconds (double our archive
timeout value to allow for the WAL segment to be written / uploaded /
read), and wait until the current transaction is strictly greater than the
target transaction, then the promotion seems to work fine every time for
PG13. But this seems like an off by one error?
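
A rough sketch of the workaround, using the two functions mentioned above
(the target xid shown here is just a placeholder):

   -- on the leader: bump the current transaction id
   SELECT pg_catalog.txid_current();

   -- on the new service, before promoting: wait until we are strictly past the target
   SELECT pg_catalog.txid_snapshot_xmax(pg_catalog.txid_current_snapshot()) > 12345 AS past_target;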

What do you think? Is this a bug? Is this expected? Is this user error on
our end?

Thanks!

Sean


Re: Maintaining a list of pgindent commits for "git blame" to ignore

2021-03-18 Thread Bruce Momjian
On Thu, Mar 18, 2021 at 03:03:41PM -0700, Peter Geoghegan wrote:
> Attached is my .git-blame-ignore-revs file, which has pgindent commits
> that I'd like to exclude from git blame. The file is helpful on its
> own. But what we really ought to do is commit the file (perhaps with
> some revisions) and require that it be maintained by the official
> project workflow documented at src/tools/pgindent/README.

It would be kind of nice if the file can be generated automatically.
Have you checked if 'pgindent' being on the first line of the commit is
sufficient?

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: a verbose option for autovacuum

2021-03-18 Thread Peter Geoghegan
On Thu, Mar 18, 2021 at 2:08 PM Michael Paquier  wrote:
> Yes, I was waiting for Sawada-san to send an updated version, which he
> just did.

Great. This really seems worth having. I was hoping that somebody else
could pick this one up.

-- 
Peter Geoghegan




Re: Maintaining a list of pgindent commits for "git blame" to ignore

2021-03-18 Thread Peter Geoghegan
On Thu, Mar 18, 2021 at 2:10 PM Bruce Momjian  wrote:
> Actually, Tom Lane runs pgindent usually now.  I do the copyright
> change, but I don't think we would ignore those since the single-line
> change is probably something we would want to blame.

The copyright change commits don't need to be considered here. In
practice they're just not a problem because nobody wants or expects
"git blame" to do anything more than attribute an affected line to a
copyright commit.

> It would certainly be very easy to pull pgindent commits out of git log
> and add them.  I do wish we could cause everyone to honor that file, but
> it seems each user has to configure their repository to honor it.

That doesn't seem like a huge problem. There is no reason why this
shouldn't be easy to use and to maintain going forward. There just
aren't very many commits involved.

Attached is my .git-blame-ignore-revs file, which has pgindent commits
that I'd like to exclude from git blame. The file is helpful on its
own. But what we really ought to do is commit the file (perhaps with
some revisions) and require that it be maintained by the official
project workflow documented at src/tools/pgindent/README.

-- 
Peter Geoghegan


pgindent-git-blame-ignore-revs
Description: Binary data


Re: GROUP BY DISTINCT

2021-03-18 Thread Thomas Munro
On Fri, Mar 19, 2021 at 10:14 AM Tomas Vondra
 wrote:
> >> The only possibility I can think of is some sort of issue in the old-ish
> >> gcc release (4.7.2).
> >
> > Not sure what's going on there, but data points: I tried a 32 bit build
> > here (that's the other special thing about lapwing) and didn't see the
> > warning.  GCC 10.  Also curculio (gcc 4.2) and snapper (gcc 4.7) are
> > also showing this warning, but they don't have -Werror so they don't
> > fail.  sidewinder (gcc 4.8) is not showing the warning.
> >
>
> Thanks for the info. So it's likely related to older gcc releases. The
> question is how to tweak the code to get rid of this ...

It's frustrating to have to do press-ups to fix a problem because a
zombie Debian 7 system is running with -Werror (though it's always
possible that it's telling us something interesting...).  Anyway, I
think someone with a GCC < 4.8 compiler would have to investigate.  I
was hoping to help, but none of my systems have one in easy-to-install
format...




Re: Getting better results from valgrind leak tracking

2021-03-18 Thread Tom Lane
I wrote:
> Andres Freund  writes:
>> There's plenty other hits, but I think I should get back to working on
>> making the shared memory stats patch committable. I really wouldn't want
>> it to slip yet another year.

> +1, so far there's little indication that we're finding any serious leaks
> here.  Might be best to set it all aside till there's more time.

Well, I failed to resist the temptation to keep poking at this issue,
mainly because I felt that it'd be a good idea to make sure we'd gotten
our arms around all of the detectable issues, even if we choose not to
fix them right away.  The attached patch, combined with your preceding
memory context patch, eliminates all but a very small number of the leak
complaints in the core regression tests.

The remaining leak warnings that I see are:

1. WaitForReplicationWorkerAttach leaks the BackgroundWorkerHandle it's
passed.  I'm not sure which function should clean that up, but in any
case it's only 16 bytes one time in one process, so it's hardly exciting.

2. There's lots of leakage in text search dictionary initialization
functions.  This is hard to get around completely, because the API for
those functions has them being called in the dictionary's long-lived
context.  In any case, the leaks are not large (modulo comments below),
and they would get cleaned up if the dictionary cache were thrown away.

3. partition_prune.sql repeatably produces this warning:

==00:00:44:39.764 3813570== 32,768 bytes in 1 blocks are possibly lost in loss 
record 2,084 of 2,096
==00:00:44:39.764 3813570==at 0x4C30F0B: malloc (vg_replace_malloc.c:307)
==00:00:44:39.764 3813570==by 0x94315A: AllocSetAlloc (aset.c:941)
==00:00:44:39.764 3813570==by 0x94B52F: MemoryContextAlloc (mcxt.c:804)
==00:00:44:39.764 3813570==by 0x8023DA: LockAcquireExtended (lock.c:845)
==00:00:44:39.764 3813570==by 0x7FFC7E: LockRelationOid (lmgr.c:116)
==00:00:44:39.764 3813570==by 0x5CB89F: findDependentObjects 
(dependency.c:962)
==00:00:44:39.764 3813570==by 0x5CBA66: findDependentObjects 
(dependency.c:1060)
==00:00:44:39.764 3813570==by 0x5CBA66: findDependentObjects 
(dependency.c:1060)
==00:00:44:39.764 3813570==by 0x5CCB72: performMultipleDeletions 
(dependency.c:409)
==00:00:44:39.764 3813570==by 0x66F574: RemoveRelations (tablecmds.c:1395)
==00:00:44:39.764 3813570==by 0x81C986: ProcessUtilitySlow.isra.8 
(utility.c:1698)
==00:00:44:39.764 3813570==by 0x81BCF2: standard_ProcessUtility 
(utility.c:1034)

which evidently is warning that some LockMethodLocalHash entry is losing
track of its lockOwners array.  But I sure don't see how that could
happen, nor why it'd only happen in this test.  Could this be a false
report?

As you can see from the patch's additions to src/tools/valgrind.supp,
I punted on the issues around pl/pgsql's function-compile-time leaks.
We know that's an issue, but again the leaks are fairly small and
they are confined within function cache entries, so I'm not sure
how hard we should work on that.

(An idea for silencing both this and the dictionary leak warnings is
to install an on-proc-exit callback to drop the cached objects'
contexts.)

Anyway, looking through the patch, I found several non-cosmetic issues:

* You were right to suspect that there was residual leakage in guc.c's
handling of error cases.  ProcessGUCArray and call_string_check_hook
are both guilty of leaking malloc'd strings if an error in a proposed
GUC setting is detected.

* I still haven't tried to wrap my brain around the question of what
RestoreGUCState really needs to be doing.  I have, however, found that
check-world passes just fine with the InitializeOneGUCOption calls
diked out entirely, so that's what's in this patch.

* libpqwalreceiver.c leaks a malloc'd error string when
libpqrcv_check_conninfo detects an erroneous conninfo string.

* spell.c leaks a compiled regular expression if an ispell dictionary
cache entry is dropped.  I fixed this by adding a memory context reset
callback to release the regex.  This is potentially a rather large
leak, if the regex is complex (though typically it wouldn't be).

* BuildEventTriggerCache leaks working storage into
EventTriggerCacheContext.

* Likewise, load_domaintype_info leaks working storage into
a long-lived cache context.

* RelationDestroyRelation leaks rd_statlist; the patch that added
that failed to emulate the rd_indexlist prototype very well.

* As previously noted, RelationInitTableAccessMethod causes leaks.

* I made some effort to remove easily-removable leakage in
lookup_ts_dictionary_cache itself, although given the leaks in
its callees, this isn't terribly exciting.

I am suspicious that there's something still not quite right in the
memory context changes, because I noticed that the sanity_check.sql
test (and no other ones) complained about row_description_context being
leaked.  I realized that the warning was because my compiler optimizes
away the row_description_context static variable 

Re: cleanup temporary files after crash

2021-03-18 Thread Euler Taveira
On Thu, Mar 18, 2021, at 6:00 PM, Euler Taveira wrote:
> On Thu, Mar 18, 2021, at 5:51 PM, Tomas Vondra wrote:
>> OK. Can you prepare a patch with the proposed test approach?
> I'm on it.
What do you think about this patch?


--
Euler Taveira
EDB   https://www.enterprisedb.com/
diff --git a/src/test/recovery/t/022_crash_temp_files.pl b/src/test/recovery/t/022_crash_temp_files.pl
index c37b227770..68e6fc20ad 100644
--- a/src/test/recovery/t/022_crash_temp_files.pl
+++ b/src/test/recovery/t/022_crash_temp_files.pl
@@ -5,7 +5,6 @@ use PostgresNode;
 use TestLib;
 use Test::More;
 use Config;
-use Time::HiRes qw(usleep);
 
 plan tests => 9;
 
@@ -33,8 +32,7 @@ $node->safe_psql(
 # create table, insert rows
 $node->safe_psql(
 	'postgres',
-	q[CREATE TABLE tab_crash (a text);
-		INSERT INTO tab_crash (a) SELECT gen_random_uuid() FROM generate_series(1, 500);]);
+	q[CREATE TABLE tab_crash (a integer UNIQUE);]);
 
 # Run psql, keeping session alive, so we have an alive backend to kill.
 my ($killme_stdin, $killme_stdout, $killme_stderr) = ('', '', '');
@@ -62,6 +60,32 @@ chomp($pid);
 $killme_stdout = '';
 $killme_stderr = '';
 
+# Open a 2nd session that will block the 1st session. The UNIQUE constraint
+# will prevent the temporary file from the 1st session to be removed.
+my ($killme_stdin2, $killme_stdout2, $killme_stderr2) = ('', '', '');
+my $killme2 = IPC::Run::start(
+	[
+		'psql', '-X', '-qAt', '-v', 'ON_ERROR_STOP=1', '-f', '-', '-d',
+		$node->connstr('postgres')
+	],
+	'<',
+	\$killme_stdin2,
+	'>',
+	\$killme_stdout2,
+	'2>',
+	\$killme_stderr2,
+	$psql_timeout);
+
+# Insert one tuple and leave the transaction open
+$killme_stdin2 .= q[
+BEGIN;
+SELECT $$insert-tuple-to-lock-next-insert$$;
+INSERT INTO tab_crash (a) VALUES(1);
+];
+pump_until($killme2, \$killme_stdout2, qr/insert-tuple-to-lock-next-insert/m);
+$killme_stdout2 = '';
+$killme_stderr2 = '';
+
 # Run the query that generates a temporary file and that will be killed before
 # it finishes. Since the query that generates the temporary file does not
 # return before the connection is killed, use a SELECT before to trigger
@@ -69,22 +93,20 @@ $killme_stderr = '';
 $killme_stdin .= q[
 BEGIN;
 SELECT $$in-progress-before-sigkill$$;
-WITH foo AS (SELECT a FROM tab_crash ORDER BY a) SELECT a, pg_sleep(1) FROM foo;
+INSERT INTO tab_crash (a) SELECT i FROM generate_series(1, 1000) s(i);
 ];
 ok(pump_until($killme, \$killme_stdout, qr/in-progress-before-sigkill/m),
-	'select in-progress-before-sigkill');
+	'insert in-progress-before-sigkill');
 $killme_stdout = '';
 $killme_stderr = '';
 
-# Wait some time so the temporary file is generated by SELECT
-usleep(10_000);
-
 # Kill with SIGKILL
 my $ret = TestLib::system_log('pg_ctl', 'kill', 'KILL', $pid);
 is($ret, 0, 'killed process with KILL');
 
 # Close psql session
 $killme->finish;
+$killme2->finish;
 
 # Wait till server restarts
 $node->poll_query_until('postgres', 'SELECT 1', '1');
@@ -118,6 +140,20 @@ chomp($pid);
 $killme_stdout = '';
 $killme_stderr = '';
 
+# Restart the 2nd psql session
+($killme_stdin2, $killme_stdout2, $killme_stderr2) = ('', '', '');
+$killme2->run();
+
+# Insert one tuple and leave the transaction open
+$killme_stdin2 .= q[
+BEGIN;
+SELECT $$insert-tuple-to-lock-next-insert$$;
+INSERT INTO tab_crash (a) VALUES(1);
+];
+pump_until($killme2, \$killme_stdout2, qr/insert-tuple-to-lock-next-insert/m);
+$killme_stdout2 = '';
+$killme_stderr2 = '';
+
 # Run the query that generates a temporary file and that will be killed before
 # it finishes. Since the query that generates the temporary file does not
 # return before the connection is killed, use a SELECT before to trigger
@@ -125,22 +161,20 @@ $killme_stderr = '';
 $killme_stdin .= q[
 BEGIN;
 SELECT $$in-progress-before-sigkill$$;
-WITH foo AS (SELECT a FROM tab_crash ORDER BY a) SELECT a, pg_sleep(1) FROM foo;
+INSERT INTO tab_crash (a) SELECT i FROM generate_series(1, 1000) s(i);
 ];
 ok(pump_until($killme, \$killme_stdout, qr/in-progress-before-sigkill/m),
-	'select in-progress-before-sigkill');
+	'insert in-progress-before-sigkill');
 $killme_stdout = '';
 $killme_stderr = '';
 
-# Wait some time so the temporary file is generated by SELECT
-usleep(10_000);
-
 # Kill with SIGKILL
 $ret = TestLib::system_log('pg_ctl', 'kill', 'KILL', $pid);
 is($ret, 0, 'killed process with KILL');
 
 # Close psql session
 $killme->finish;
+$killme2->finish;
 
 # Wait till server restarts
 $node->poll_query_until('postgres', 'SELECT 1', '1');


Re: non-HOT update not looking at FSM for large tuple update

2021-03-18 Thread Matthias van de Meent
On Wed, 17 Mar 2021 at 21:52, John Naylor  wrote:
>
> On Fri, Mar 12, 2021 at 8:45 AM Matthias van de Meent 
>  wrote:
> >
> > If this case isn't added, the lower else branch will fail to find
> > fitting pages for len > maxPaddedFsmRequest tuples; potentially
> > extending the relation when there is actually still enough space
> > available.
>
> Okay, with that it looks better to go back to using Max().
>
> Also in v4:
>
> - With the separate constant you suggested, I split up the comment into two 
> parts.

> + * The minimum space on a page for it to be considered "empty" regardless
> + * of fillfactor. We base this on MaxHeapTupleSize, minus a small amount
> + * of slack. We allow slack equal to 1/8 the maximum space that could be
> + * taken by line pointers, which is somewhat arbitrary.

> + * We want to allow inserting a large tuple into an empty page even if
> + * that would violate the fillfactor. Otherwise, we would unnecessarily
> + * extend the relation. Instead, ask the FSM for maxPaddedFsmRequest
> + * bytes. This will allow it to return a page that is not quite empty
> + * because of unused line pointers

How about

+* Because pages that have no items left can still have space allocated
+* to item pointers, we consider pages "empty" for FSM requests when they
+* have at most 1/8 of their MaxHeapTuplesPerPage item pointers' space
+* allocated. This is a somewhat arbitrary number, but should prevent
+* most unnecessary relation extensions due to not finding "empty" pages
+* while inserting combinations of large tuples with low fillfactors.

+* When the free space to be requested from the FSM is greater than
+* maxPaddedFsmRequest, this is considered equivalent to requesting
+* insertion on an "empty" page, so instead of failing to find a page
+* with more empty space than an "empty" page and extending the relation,
+* we try to find a page which is considered "empty".

This is slightly more verbose, but I think this clarifies the
reasoning why we need this a bit better. Feel free to reject or adapt
as needed.
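
As an aside, a minimal way to reproduce the scenario these comments describe
(a large tuple with a low fillfactor) could be something like the following;
the table name and sizes are made up, and STORAGE PLAIN is only there to keep
the value from being compressed or toasted so the tuple really stays large:

   CREATE TABLE fsm_demo (a text) WITH (fillfactor = 10);
   ALTER TABLE fsm_demo ALTER COLUMN a SET STORAGE PLAIN;
   -- ~7kB tuple: violates the fillfactor, but still fits on an "empty" page
   INSERT INTO fsm_demo VALUES (repeat('x', 7000));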

> - I've added a regression test to insert.sql similar to your earlier test, 
> but we cannot use vacuum, since in parallel tests there could still be tuples 
> visible to other transactions. It's still possible to test almost-all-free by 
> inserting a small tuple.
> - Draft commit message

Apart from these mainly readability changes in comments, I think this is ready.

> --
> John Naylor
> EDB: http://www.enterprisedb.com




Re: GROUP BY DISTINCT

2021-03-18 Thread Tomas Vondra



On 3/18/21 10:02 PM, Thomas Munro wrote:
> On Fri, Mar 19, 2021 at 8:27 AM Tomas Vondra
>  wrote:
>> Hmmm, this seems to fail on lapwing with this error:
>>
>> parse_agg.c: In function 'expand_grouping_sets':
>> parse_agg.c:1851:23: error: value computed is not used
>> [-Werror=unused-value]
>> cc1: all warnings being treated as errors
>>
>> That line is this:
>>
>> foreach_delete_current(result, cell);
>>
>> and I don't see how any of the values close by could be unused ...
>>
>> The only possibility I can think of is some sort of issue in the old-ish
>> gcc release (4.7.2).
> 
> Not sure what's going on there, but data points: I tried a 32 bit build
> here (that's the other special thing about lapwing) and didn't see the
> warning.  GCC 10.  Also curculio (gcc 4.2) and snapper (gcc 4.7) are
> also showing this warning, but they don't have -Werror so they don't
> fail.  sidewinder (gcc 4.8) is not showing the warning.
> 

Thanks for the info. So it's likely related to older gcc releases. The
question is how to tweak the code to get rid of this ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Maintaining a list of pgindent commits for "git blame" to ignore

2021-03-18 Thread Bruce Momjian
On Thu, Mar 18, 2021 at 01:46:49PM -0700, Peter Geoghegan wrote:
> Recent versions of git are capable of maintaining a list of commits
> for "git blame" to ignore:
> 
> https://www.moxio.com/blog/43/ignoring-bulk-change-commits-with-git-blame
> 
> I tried this out myself, using my own list of pgindent commits. It
> works very well -- much better than what you get when you ask git to
> heuristically ignore commits based on whitespace-only line changes.
> This is not surprising, in a way; I don't actually want to avoid
> whitespace. I just want to ignore pgindent commits.
> 
> Note that there are still a small number of pgindent line changes,
> even with this. That's because sometimes it's unavoidable -- some
> "substantively distinct lines of code" are actually created by
> pgindent. But these all seem to be  lines that are only shown
> because there is legitimately no more appropriate commit to attribute
> the line to. This seems like the ideal behavior to me.
> 
> I propose that we (I suppose I actually mean Bruce) start maintaining

Actually, Tom Lane runs pgindent usually now.  I do the copyright
change, but I don't think we would ignore those since the single-line
change is probably something we would want to blame.

> our own file for this in git. It can be configured to run without any
> extra steps via a once-off "git config blame.ignoreRevsFile
> .git-blame-ignore-revs". It would only need to be updated whenever
> Bruce or Tom runs pgindent.
> 
> It doesn't matter if this misses one or two smaller pgindent runs, it
> seems. Provided the really huge ones are in the file, everything works
> very well.

It would certainly be very easy to pull pgindent commits out of git log
and add them.  I do wish we could cause everyone to honor that file, but
it seems each user has to configure their repository to honor it.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: a verbose option for autovacuum

2021-03-18 Thread Michael Paquier
On Thu, Mar 18, 2021 at 09:38:05AM -0700, Peter Geoghegan wrote:
> Were you going to take care of this, Michael?

Yes, I was waiting for Sawada-san to send an updated version, which he
just did.
--
Michael




Re: New IndexAM API controlling index vacuum strategies

2021-03-18 Thread Robert Haas
On Wed, Mar 17, 2021 at 11:23 PM Peter Geoghegan  wrote:
> Most anti-wraparound VACUUMs are really not emergencies, though.

That's true, but it's equally true that most of the time it's not
necessary to wear a seatbelt to avoid personal injury. The difficulty
is that it's hard to predict on which occasions it is necessary, and
therefore it is advisable to do it all the time. autovacuum decides
whether an emergency exists, in the first instance, by comparing
age(relfrozenxid) to autovacuum_freeze_max_age, but that's problematic
for at least two reasons. First, what matters is not when the vacuum
starts, but when the vacuum finishes. A user who has no tables larger
than 100MB can set autovacuum_freeze_max_age a lot closer to the high
limit without risk of hitting it than a user who has a 10TB table. The
time to run vacuum is dependent on both the size of the table and the
applicable cost delay settings, none of which autovacuum knows
anything about. It also knows nothing about the XID consumption rate.
It's relying on the user to set autovacuum_freeze_max_age low enough
that all the anti-wraparound vacuums will finish before the system
crashes into a wall. Second, what happens to one table affects what
happens to other tables. Even if you have perfect knowledge of your
XID consumption rate and the speed at which vacuum will complete, you
can't just configure autovacuum_freeze_max_age to allow exactly enough
time for the vacuum to complete once it hits the threshold, unless you
have one autovacuum worker per table so that the work for that table
never has to wait for work on any other tables. And even then, as you
mention, you have to worry about the possibility that a vacuum was
already in progress on that table itself. Here again, we rely on the
user to know empirically how high they can set
autovacuum_freeze_max_age without cutting it too close.
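
For illustration, the kind of check users are left to do by hand today is
something along these lines (just an example query, not part of any proposal
here):

   SELECT relname, age(relfrozenxid) AS xid_age
     FROM pg_class
    WHERE relkind = 'r'
    ORDER BY xid_age DESC
    LIMIT 10;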

Now, that's not actually a good thing, because most users aren't smart
enough to do that, and will either leave a gigantic safety margin that
they don't need, or will leave an inadequate safety margin and take
the system down. However, it means we need to be very, very careful
about hard-coded thresholds like 90% of the available XID space. I do
think that there is a case for triggering emergency extra safety
measures when things are looking scary. One that I think would help a
tremendous amount is to start ignoring the vacuum cost delay when
wraparound danger (and maybe even bloat danger) starts to loom.
Perhaps skipping index vacuuming is another such measure, though I
suspect it would help fewer people, because in most of the cases I
see, the system is throttled to use a tiny percentage of its actual
hardware capability. If you're running at 1/5 of the speed of which
the hardware is capable, you can only do better by skipping index
cleanup if that skips more than 80% of page accesses, which could be
true but probably isn't. In reality, I think we probably want both
mechanisms, because they complement each other. If one can save 3X and
the other 4X, the combination is a 12X improvement, which is a big
deal. We might want other things, too.

But ... should the thresholds for triggering these kinds of mechanisms
really be hard-coded with no possibility of being configured in the
field? What if we find out after the release is shipped that the
mechanism works better if you make it kick in sooner, or later, or if
it depends on other things about the system, which I think it almost
certainly does? Thresholds that can't be changed without a recompile
are bad news. That's why we have GUCs.

On another note, I cannot say enough bad things about the function
name two_pass_strategy(). I sincerely hope that you're not planning to
create a function which is a major point of control for VACUUM whose
name gives no hint that it has anything to do with vacuum.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: GROUP BY DISTINCT

2021-03-18 Thread Thomas Munro
On Fri, Mar 19, 2021 at 8:27 AM Tomas Vondra
 wrote:
> Hmmm, this seems to fail on lapwing with this error:
>
> parse_agg.c: In function 'expand_grouping_sets':
> parse_agg.c:1851:23: error: value computed is not used
> [-Werror=unused-value]
> cc1: all warnings being treated as errors
>
> That line is this:
>
> foreach_delete_current(result, cell);
>
> and I don't see how any of the values close by could be unused ...
>
> The only possibility I can think of is some sort of issue in the old-ish
> gcc release (4.7.2).

Not sure what's going on there, but data points: I tried a 32 bit build
here (that's the other special thing about lapwing) and didn't see the
warning.  GCC 10.  Also curculio (gcc 4.2) and snapper (gcc 4.7) are
also showing this warning, but they don't have -Werror so they don't
fail.  sidewinder (gcc 4.8) is not showing the warning.




Re: cleanup temporary files after crash

2021-03-18 Thread Euler Taveira
On Thu, Mar 18, 2021, at 5:51 PM, Tomas Vondra wrote:
> OK. Can you prepare a patch with the proposed test approach?
I'm on it.

> FWIW I can reproduce this on a 32-bit ARM system (rpi4), where 500 rows
> simply does not use a temp file, and with 1000 rows it works fine. On
> the x86_64 the temp file is created even with 500 rows. So there clearly
> is some platform dependency, not sure if it's due to 32/64 bits,
> alignment or something else. In any case, the 500 rows seems to be just
> on the threshold.
> 
> We need to do both - stop using the timing and increase the number of
> rows, to consistently get temp files.
Yeah.


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: cleanup temporary files after crash

2021-03-18 Thread Tomas Vondra



On 3/18/21 9:06 PM, Euler Taveira wrote:
> On Thu, Mar 18, 2021, at 4:20 PM, Tomas Vondra wrote:
>> I think a better way to test this would be to use a tuple lock:
> I predicted such issues with this test. Your suggestion works for me. Maybe
> you should use less rows in the session 2 query.
> 
>> setup:
>>
>>   create table t (a int unique);
>>
>> session 1:
>>
>>   begin;
>>   insert into t values (1);
>>   ... keep open ...
>>
>> session 2:
>>
>>    begin;
>>    set work_mem = '64kB';
>>    insert into t select i from generate_series(1,1) s(i);
>>    ... should block ...
>>
>> Then, once the second session gets waiting on the tuple, kill the
>> backend. We might as well test that there actually is a temp file first,
>> and then test that it disappeared.
> Your suggestion works for me. Maybe you could use less rows in the session 2
> query. I experimented with 1k rows and it generates a temporary file.
> 

OK. Can you prepare a patch with the proposed test approach?

FWIW I can reproduce this on a 32-bit ARM system (rpi4), where 500 rows
simply does not use a temp file, and with 1000 rows it works fine. On
the x86_64 the temp file is created even with 500 rows. So there clearly
is some platform dependency, not sure if it's due to 32/64 bits,
alignment or something else. In any case, the 500 rows seems to be just
on the threshold.

We need to do both - stop using the timing and increase the number of
rows, to consistently get temp files.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Maintaining a list of pgindent commits for "git blame" to ignore

2021-03-18 Thread Peter Geoghegan
Recent versions of git are capable of maintaining a list of commits
for "git blame" to ignore:

https://www.moxio.com/blog/43/ignoring-bulk-change-commits-with-git-blame

I tried this out myself, using my own list of pgindent commits. It
works very well -- much better than what you get when you ask git to
heuristically ignore commits based on whitespace-only line changes.
This is not surprising, in a way; I don't actually want to avoid
whitespace. I just want to ignore pgindent commits.

Note that there are still a small number of pgindent line changes,
even with this. That's because sometimes it's unavoidable -- some
"substantively distinct lines of code" are actually created by
pgindent. But these all seem to be  lines that are only shown
because there is legitimately no more appropriate commit to attribute
the line to. This seems like the ideal behavior to me.

I propose that we (I suppose I actually mean Bruce) start maintaining
our own file for this in git. It can be configured to run without any
extra steps via a once-off "git config blame.ignoreRevsFile
.git-blame-ignore-revs". It would only need to be updated whenever
Bruce or Tom runs pgindent.

It doesn't matter if this misses one or two smaller pgindent runs, it
seems. Provided the really huge ones are in the file, everything works
very well.

-- 
Peter Geoghegan




Re: default result formats setting

2021-03-18 Thread Peter Eisentraut

On 09.03.21 19:04, Tom Lane wrote:

> The implementation feels weird though, mainly in that I don't like
> Peter's choices for where to put the code.  pquery.c is not where
> I would have expected to find the support for this, and I do not
> have any confidence that applying the format conversion while
> filling portal->formats[] is enough to cover all cases.  I'd have
> thought that access/common/printtup.c or somewhere near there
> would be where to do it.


done


> Or we could drop all of that and go back to having it be a list
> of type OIDs, which would remove a *whole lot* of the complexity,
> and I'm not sure that it's materially less friendly.  Applications
> have had to deal with type OIDs in the protocol since forever.


also done

The client driver needs to be able to interpret the OIDs that the 
RowDescription sends back, so it really needs to be able to deal in 
OIDs, and having the option to specify type names won't help it right now.
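
For instance, a driver author could look the relevant OIDs up once with
something like this (illustrative only):

   SELECT oid, typname FROM pg_type WHERE typname IN ('int2', 'int4', 'int8');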



> BTW, I wonder whether we still need to restrict the GUC to not
> be settable from postgresql.conf.  The fact that the client has
> to explicitly pass -1 seems to reduce any security issues quite
> a bit.


There was no security concern, but I don't think it's useful.  The 
driver would specify "send int4 in binary, I know how to handle that". 
There doesn't seem to be a point in specifying that sort of thing globally.


From 3410ac02bc34b60b79c99ad96505dabe4362570f Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Thu, 18 Mar 2021 21:07:42 +0100
Subject: [PATCH v4] Add result_format_auto_binary_types setting

The current way binary results are requested in the extended query
protocol is too cumbersome for some practical uses.  Some clients, for
example the JDBC driver, have built-in support for handling specific
data types in binary.  Such a client would always have to request a
result row description (Describe statement) before sending a Bind
message, in order to be able to pick out the result columns it should
request in binary.  The feedback was that this extra round trip is
often not worth it in terms of performance, and so it is not done and
binary format is not used when it could be.

The solution is to allow a client to register for a session which
types it wants to always get in binary.  This is done by a new GUC
setting.  For example, to get int2, int4, int8 in binary by default,
you could set

SET result_format_auto_binary_types = 21,23,20;

To request result formats based on this setting, send format code
-1 (instead of 0 or 1) in the Bind message.

Discussion: 
https://www.postgresql.org/message-id/flat/40cbb35d-774f-23ed-3079-03f938aac...@2ndquadrant.com
---
 doc/src/sgml/config.sgml  | 40 
 doc/src/sgml/libpq.sgml   | 10 +-
 doc/src/sgml/protocol.sgml|  7 +-
 src/backend/access/common/printtup.c  | 94 +++
 src/backend/utils/misc/guc.c  | 12 +++
 src/include/access/printtup.h |  5 +
 src/test/modules/Makefile |  1 +
 src/test/modules/libpq_extended/.gitignore|  6 ++
 src/test/modules/libpq_extended/Makefile  | 20 
 src/test/modules/libpq_extended/README|  1 +
 .../libpq_extended/t/001_result_format.pl | 33 +++
 11 files changed, 222 insertions(+), 7 deletions(-)
 create mode 100644 src/test/modules/libpq_extended/.gitignore
 create mode 100644 src/test/modules/libpq_extended/Makefile
 create mode 100644 src/test/modules/libpq_extended/README
 create mode 100644 src/test/modules/libpq_extended/t/001_result_format.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8603cf3f94..0fed8abfd2 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8671,6 +8671,46 @@ Statement Behavior
   
  
 
+ 
+  result_format_auto_binary_types 
(string)
+  
+   result_format_auto_binary_types
+   configuration parameter
+  
+  
+  
+   
+This parameter specifies which data types are to be sent from the
+server in binary format for rows returned in the extended query
+protocol when result format code -1 is specified in the Bind message.  It is intended
+to be used by client libraries that prefer to handle specific, but not
+all, data types in binary format.  The typical usage would be that the
+client library sets this value when it starts a connection.  (A client
+library that wants to handle all types in binary
+doesn't need to use this because it can just specify the format code
+for all types at once in the protocol message.)
+   
+
+   
+The value is a comma-separated list of type OIDs.  For example, if you
+want to automatically get values of the types int2,
+int4, and int8 in binary while leaving the
+rest in text, an appropriate setting would be
+21,23,20.  Non-existing type OIDs are ignored.
+

Re: cleanup temporary files after crash

2021-03-18 Thread Euler Taveira
On Thu, Mar 18, 2021, at 4:20 PM, Tomas Vondra wrote:
> I think a better way to test this would be to use a tuple lock:
I predicted such issues with this test. Your suggestion works for me. Maybe
you should use less rows in the session 2 query.

> setup:
> 
>   create table t (a int unique);
> 
> session 1:
> 
>   begin;
>   insert into t values (1);
>   ... keep open ...
> 
> session 2:
> 
>begin;
>set work_mem = '64kB';
>insert into t select i from generate_series(1,1) s(i);
>... should block ...
> 
> Then, once the second session gets waiting on the tuple, kill the
> backend. We might as well test that there actually is a temp file first,
> and then test that it disappeared.
Your suggestion works for me. Maybe you could use less rows in the session 2
query. I experimented with 1k rows and it generates a temporary file.
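
For what it's worth, a quick way to confirm that the temporary file actually
exists before killing the backend (assuming the default base/pgsql_tmp
location) is something like:

   SELECT count(*) FROM pg_ls_dir('base/pgsql_tmp');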


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: [HACKERS] Custom compression methods

2021-03-18 Thread Robert Haas
On Thu, Mar 18, 2021 at 10:22 AM Dilip Kumar  wrote:
> I just realized that in the last patch (0003) I forgot to remove 2
> unused functions, CompressionMethodToId and CompressionIdToMethod.
> Removed in the latest patch.

I spent a little time polishing 0001 and here's what I came up with. I
adjusted some comments, added documentation, fixed up the commit
message, etc.

I still don't quite like the approach in 0002. I feel that the
function should not construct the tuple but modify the caller's arrays
as a side effect. And if we're absolutely committed to the design
where it does that, the comments need to call it out clearly, which
they don't.

Regarding 0003:

I think it might make sense to change the names of the compression and
decompression functions to match the names of the callers more
closely. Like, toast_decompress_datum() calls either
pglz_cmdecompress() or lz4_cmdecompress(). But, why not
pglz_decompress_datum() or lz4_decompress_datum()? The "cm" thing
doesn't really mean anything, and because the varlena is allocated by
that function itself rather than the caller, this can't be used for
anything other than TOAST.

In toast_compress_datum(), if (tmp == NULL) return
PointerGetDatum(NULL) is duplicated. It would be better to move it
after the switch.

Instead of "could not compress data with lz4" I suggest "lz4
compression failed".

In catalogs.sgml, you shouldn't mention InvalidCompressionMethod, but
you should explain what the actual possible values mean. Look at the
way attidentity and attgenerated are documented and do it like that.

In pg_column_compression() it might be a bit more elegant to add a
char *result variable or similar, and have the switch cases just set
it, and then do PG_RETURN_TEXT_P(cstring_to_text(result)) at the
bottom.
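
As a usage sketch of the function being reviewed (table and column names here
are hypothetical), the expected call shape would be something like:

   SELECT a, pg_column_compression(a) FROM tab;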

In getTableAttrs(), if the remoteVersion is new, the column gets a
different alias than if the column is old.

In dumpTableSchema(), the condition tbinfo->attcompression[j] means
exactly the same thing as the condition tbinfo->attcompression[j] != '\0',
so it can't be right to test both. I think that there's some confusion
here about the data type of tbinfo->attcompression[j]. It seems to be
char *. Maybe you intended to test the first character in that second
test, but that's not what this does. But you don't need to test that
anyway because the switch already takes care of it. So I suggest (a)
removing tbinfo->attcompression[j] != '\0' from this if-statement and
(b) adding != NULL to the previous line for clarity. I would also
suggest concluding the switch with a break just for symmetry.

The patch removes 11 references to va_extsize and leaves behind 4.
None of those 4 look like things that should have been left.

The comment which says "When fetching a prefix of a compressed
external datum, account for the rawsize tracking amount of raw data,
which is stored at the beginning as an int32 value)" is no longer 100%
accurate. I suggest changing it to say something like "When fetching a
prefix of a compressed external datum, account for the space required
by va_tcinfo" and leave out the rest.

In describeOneTableDetails, the comment "compresssion info" needs to
be compressed by removing one "s".

It seems a little unfortunate that we need to include
access/toast_compression.h in detoast.h.  It seems like the reason we
need to do that is because otherwise we won't have ToastCompressionId
defined and so we won't be able to prototype toast_get_compression_id.
But I think we should solve that problem by moving that file to
toast_compression.c. (I'm OK if you want to keep the files separate,
or if you want to reverse course and combine them I'm OK with that
too, but the extra header dependency is clearly a sign of a problem
with the split.)

Regarding 0005:

I think ApplyChangesToIndexes() should be renamed to something like
SetIndexStorageProperties(). It's too generic right now.

I think 0004 and 0005 should just be merged into 0003. I can't see
committing them separately. I know I was the one who made you split
the patch up in the first place, but those patches are quite small and
simple now, so it makes more sense to me to combine them.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v39-0001-Invent-HeapTupleGetRawDatum-and-friends.patch
Description: Binary data


Re: libpq compression

2021-03-18 Thread Daniil Zakhlystov
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:tested, passed

Hi,

I've compared the different libpq compression approaches in the streaming 
physical replication scenario.

Test setup
Three hosts: first is used for pg_restore run, second is master, third is the 
standby replica.
In each test run, I've run the pg_restore of the IMDB database 
(https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/2QYZBT)
 
and measured the received traffic on the standby replica.

Also, I've enlarged the ZPQ_BUFFER_SIZE buffer in all versions because the too
small buffer size (8192 bytes) led to more system calls to socket read/write
and poor compression in the chunked-reset scenario.

Scenarios:

chunked
use streaming compression, wrap compressed data into CompressedData messages 
and preserve the compression context between multiple CompressedData messages.
https://github.com/usernamedt/libpq_compression/tree/chunked-compression

chunked-reset
use streaming compression, wrap compressed data into CompressedData messages 
and reset the compression context on each CompressedData message.
https://github.com/usernamedt/libpq_compression/tree/chunked-reset

permanent
use streaming compression, send raw compressed stream without any wrapping
https://github.com/usernamedt/libpq_compression/tree/permanent-w-enlarged-buffer

Tested compression levels
ZSTD, level 1
ZSTD, level 5
ZSTD, level 9

Scenario        Replica rx, mean, MB
uncompressed    6683.6


ZSTD, level 1
Scenario        Replica rx, mean, MB
chunked-reset   2726
chunked         2694
permanent       2694.3

ZSTD, level 5
Scenario        Replica rx, mean, MB
chunked-reset   2234.3
chunked         2123
permanent       2115.3

ZSTD, level 9
Scenario        Replica rx, mean, MB
chunked-reset   2153.6
chunked         1943
permanent       1941.6

Full report with additional data and resource usage graphs is available here
https://docs.google.com/document/d/1a5bj0jhtFMWRKQqwu9ag1PgDF5fLo7Ayrw3Uh53VEbs

Based on these results, I suggest sticking with the chunked compression
approach, which introduces more flexibility and has almost no overhead compared
to permanent compression.
Also, later we may introduce a setting to control whether we reset the
compression context in each message, without breaking backward compatibility.

--
Daniil Zakhlystov

The new status of this patch is: Ready for Committer


Re: GROUP BY DISTINCT

2021-03-18 Thread Tomas Vondra



On 3/18/21 6:25 PM, Tomas Vondra wrote:
> On 3/16/21 3:52 PM, Tomas Vondra wrote:
>>
>>
>> On 3/16/21 9:21 AM, Vik Fearing wrote:
>>> On 3/13/21 12:33 AM, Tomas Vondra wrote:
 Hi Vik,

 The patch seems quite ready, I have just two comments.
>>>
>>> Thanks for taking a look.
>>>
 1) Shouldn't this add another  for DISTINCT, somewhere in the
 documentation? Now the index points just to the SELECT DISTINCT part.
>>>
>>> Good idea; I never think about the index.
>>>
 2) The part in gram.y that wraps/unwraps the boolean flag as an integer,
 in order to stash it in the group lists is rather ugly, IMHO. It forces
 all the places handling the list to be aware of this (there are not
 many, but still ...). And there are no other places doing (bool) intVal
 so it's not like there's a precedent for this.
>>>
>>> There is kind of a precedent for it, I was copying off of TriggerEvents
>>> and func_alias_clause.
>>>
>>
>> I see. I was looking for "(bool) intVal" but you're right TriggerEvents
>> code does something similar.
>>
 I think the clean solution is to make group_clause produce a struct with
 two fields, and just use that. Not sure how invasive that will be
 outside gram.y, though.
>>>
>>> I didn't want to create a whole new parse node for it, but Andrew Gierth
>>> pointed me towards SelectLimit so I did it like that and I agree it is
>>> much cleaner.
>>>
>>
>> I agree, that's much cleaner.
>>
 Also, the all_or_distinct vs. distinct_or_all seems a bit error-prone. I
 wonder if we can come up with some clearer names, describing the context
 of those types.
>>>
>>> I turned this into an enum for ALL/DISTINCT/default and the caller can
>>> choose what it wants to do with default.  I think that's a lot cleaner,
>>> too.  Maybe DISTINCT ON should be changed to fit in that?  I left it
>>> alone for now.
>>>
>>
>> I think DISTINCT ON is a different kind of animal, because that is a
>> list of expressions, not just a simple enum state.
>>
>>> I also snuck in something that all of us overlooked which is outputting
>>> the DISTINCT in ruleutils.c.  I didn't add a test for it but that would
>>> have been an unfortunate bug.
>>>
>>
>> Oh!
>>
>>> New patch attached, rebased on 15639d5e8f.
>>>
>>
>> Thanks. At this point it seems fine to me, no further comments.
>>
> 
> Pushed. Thanks for the patch.
> 

Hmmm, this seems to fail on lapwing with this error:

parse_agg.c: In function 'expand_grouping_sets':
parse_agg.c:1851:23: error: value computed is not used
[-Werror=unused-value]
cc1: all warnings being treated as errors

That line is this:

foreach_delete_current(result, cell);

and I don't see how any of the values close by could be unused ...

The only possibility I can think of is some sort of issue in the old-ish
gcc release (4.7.2).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2021-03-18 Thread Bruce Momjian
On Fri, Mar 19, 2021 at 02:06:56AM +0800, Julien Rouhaud wrote:
> On Thu, Mar 18, 2021 at 09:47:29AM -0400, Bruce Momjian wrote:
> > On Thu, Mar 18, 2021 at 07:29:56AM +0800, Julien Rouhaud wrote:
> > > Not exactly.  Right now a custom queryid can be computed even if
> > > compute_queryid is off, if some extension does that in 
> > > post_parse_analyze_hook.

The above text is the part that made me think an extension could display
a query id even if disabled by the GUC.

> > The docs are going to say that you have to enable compute_queryid to see
> > the query id in pg_stat_activity and log_line_prefix, but if you install
> > an extension, the query id will be visible even if you don't have
> > compute_queryid enabled.  I think you need to only honor the hook if
> > compute_queryid is enabled, and update the pg_stat_statements docs to
> > say you have to enable compute_queryid for pg_stat_statements to work.
> 
> I'm confused, what you described really looks like what I described.
> 
> Let me try to clarify:
> 
> - if compute_queryid is off, a queryid should never be seen no matter how hard
>   an extension tries

Oh, OK.  I can see an extension setting the query id on its own --- we
can't prevent that from happening.  It is probably enough to tell
extensions to honor the GUC, since they would want it enabled so it
displays in pg_stat_activity and log_line_prefix.

> - if compute_queryid is on, the calculation will be done by the core
>   (using pgss JumbleQuery) unless an extension computed one already.  The only
>   way to know what algorithm is used is to check the list of extension loaded.

OK.

> - if some extension calculates a queryid during post_parse_analyze_hook, we
>   will always reset it.

OK, good.

> Is that the approach you want?

Yes, I think so.

> Note that the only way to not honor the hook if the new GUC is disabled is
> to have a new queryid_hook, as we can't stop calling post_parse_analyze_hook 
> if
> the new GUC is off, and we don't want to pay the queryid calculation overhead
> if the admin explicitly said it wasn't needed.

Right, let's just get the extensions to honor the GUC --- we don't need
to block them or anything.

> > Also, should it be compute_queryid or compute_query_id?
> 
> Maybe compute_query_identifier?

I think compute_query_id works, and is shorter.

> > Also, the overhead of computing the query id was reported as 2% --- that
> > seems quite high for what it does.  Do we know why it is so high?
> 
> The 2% was a worst case scenario, for a query with a single join over
> ridiculously small pg_class and pg_attribute, in read only.  The whole 
> workload
> was in shared buffers so the planning and execution is quite fast.  Adding 
> some
> complexity in the query really limited the overhead.
> 
> Note that this was done on an old laptop with quite slow CPU.  Maybe
> someone with a better hardware than a 5/6yo laptop could get some more
> realistic results (I unfortunately don't have anything to try on).

OK, good to know.  I can run some tests here if people would like me to.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: cleanup temporary files after crash

2021-03-18 Thread Tomas Vondra
On 3/18/21 6:51 PM, Tomas Vondra wrote:
> Hmm,
> 
> crake and florican seem to have failed because of this commit, with this
> error in the new TAP test:
> 
> error running SQL: 'psql::1: ERROR:  could not open directory
> "base/pgsql_tmp": No such file or directory'
> while running 'psql -XAtq -d port=64336 host=/tmp/sv1WjSvj3P
> dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'SELECT COUNT(1)
> FROM pg_ls_dir($$base/pgsql_tmp$$)' at
> /home/andrew/bf/root/HEAD/pgsql.build/../pgsql/src/test/perl/PostgresNode.pm
> line 1572.
> 
> So it seems the pgsql_tmp directory does not exist, for some reason.
> Considering the directory should be created for the first temp file,
> that either means the query in the TAP test does not actually create a
> temp file on those machines, or it gets killed too early.
> 
> The 500 rows used by the test seems fairly low, so maybe those machines
> can do the sort entirely in memory?
> 
> The other option is that the sleep in the TAP test is a bit too short,
> but those machines don't seem to be that slow.
> 
> Anyway, TAP test relying on timing like this may not be the best idea,
> so I wonder how else to test this ...
> 

I think a better way to test this would be to use a tuple lock:

setup:

  create table t (a int unique);

session 1:

  begin;
  insert into t values (1);
  ... keep open ...

session 2:

   begin;
   set work_mem = '64kB';
   insert into t select i from generate_series(1,1) s(i);
   ... should block ...

Then, once the second session gets waiting on the tuple, kill the
backend. We might as well test that there actually is a temp file first,
and then test that it disappeared.
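
From a third session, the checks might look roughly like this (sketch only):

  -- confirm the blocked insert actually created a temp file
  select count(*) from pg_ls_dir('base/pgsql_tmp');

  -- find the blocked backend, then kill -9 that pid from the shell to
  -- simulate a crash; after restart the temp file should be gone
  select pid from pg_stat_activity where wait_event_type = 'Lock';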


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Key management with tests

2021-03-18 Thread Bruce Momjian
On Thu, Mar 18, 2021 at 01:46:28PM -0400, Stephen Frost wrote:
> * Alvaro Herrera (alvhe...@alvh.no-ip.org) wrote:
> > This caught my attention because a comment says "encryption does not
> > support WAL-skipped relations", but there's no direct change to the
> > definition of RelFileNodeSkippingWAL() to account for that.  Perhaps I
> > am just overlooking something, since I'm just skimming anyway.
> 
> This is relatively current activity and so it's entirely possible
> comments and perhaps code need further updating in this area, but to
> explain what's going on in a bit more detail- 
> 
> Ultimately, we need to make sure that LSNs aren't re-used.  There's two
> sources of LSNs today: those for relations which are being written into
> the WAL and those for relations which are not (UNLOGGED relations,
> specifically).  The 'minimal' WAL level introduces complications with

Well, the story is a little more complex than that --- we currently have
four LSN uses:

1.  real LSNs for WAL-logged relfilenodes
2.  real LSNs for GiST indexes for non-WAL-logged relfilenodes of permanent
relations
3.  fake LSNs for GiST indexes for relfilenodes of non-permanent relations
4.  zero LSNs for non-GiST non-permanent relations

This patch changes it so #4 gets fake LSNs, and slightly adjusts #2 & #3
so the LSNs are always unique.

> I'm not sure if it's been explicitly done yet but I believe the idea is,
> based on my last discussion with Bruce, at least initially, simply
> disallow encrypted clusters from running with wal_level=minimal to avoid
> this issue.

I adjusted the hint bit code so it potentially could work with wal_level
minimal (just for safety), but the code disallows wal_level minimal, and
is documented as such.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Key management with tests

2021-03-18 Thread Bruce Momjian
On Thu, Mar 18, 2021 at 02:37:43PM -0300, Álvaro Herrera wrote:
> On 2021-Mar-18, Stephen Frost wrote:
> > This is discussed in src/backend/access/transam/README, specifically the
> > section that talks about Skipping WAL for New RelFileNode.  Basically,
> > it's the 'wal_level=minimal' optimization which allows WAL to be
> > skipped.
> 
> Hmm ... that talks about WAL-skipping *changes*, not WAL-skipping
> *relations*.  I thought WAL-skipping meant unlogged relations, but
> I understand now that that's unrelated.  In the transam/README, WAL-skip
> means a change in a transaction in a relfilenode that, if rolled back,
> would disappear; and I'm not sure I understand how the code is handling
> the case that a relation is under that condition.
> 
> This caught my attention because a comment says "encryption does not
> support WAL-skipped relations", but there's no direct change to the
> definition of RelFileNodeSkippingWAL() to account for that.  Perhaps I
> am just overlooking something, since I'm just skimming anyway.

First, thanks for looking at these patches --- I know it isn't easy.

Second, you are right that I equated WAL-skipping relfilenodes with
relations, and this was wrong.  I have updated the attached patch to use
the term WAL-skipping "relfilenodes", and checked the rest of the
patches for any incorrect 'skipping' term, but didn't find any.

If "WAL-skipping relfilenodes" is not clear enough, we should probably
rename RelFileNodeSkippingWAL().

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.



cfe-10-hint_over_cfe-09-test.diff.gz
Description: application/gzip


Re: New IndexAM API controlling index vacuum strategies

2021-03-18 Thread Peter Geoghegan
On Thu, Mar 18, 2021 at 3:32 AM Masahiko Sawada  wrote:
> If we have the constant threshold of 1 billion transactions, a vacuum
> operation might not be an anti-wraparound vacuum and even not be an
> aggressive vacuum, depending on autovacuum_freeze_max_age value. Given
> the purpose of skipping index vacuuming in this case, I think it
> doesn't make sense to have non-aggressive vacuum skip index vacuuming
> since it might not be able to advance relfrozenxid. If we have a
> constant threshold, 2 billion transactions, maximum value of
> autovacuum_freeze_max_age, seems to work.

I like the idea of not making the behavior a special thing that only
happens with a certain variety of VACUUM operation (non-aggressive or
anti-wraparound VACUUMs). Just having a very high threshold should be
enough.

Even if we're not going to be able to advance relfrozenxid, we'll
still finish much earlier and let a new anti-wraparound vacuum take
place that will do that -- and will be able to reuse much of the work
of the original VACUUM. Of course this anti-wraparound vacuum will
also skip index vacuuming from the start (whereas the first VACUUM may
well have done some index vacuuming before deciding to end index
vacuuming to hurry with finishing).

There is a risk in having the limit be too high, though. We need to
give VACUUM time to reach two_pass_strategy() to notice the problem
and act (maybe each call to lazy_vacuum_all_indexes() takes a long
time). Also, while it's possible (and perhaps even likely) that cases
that use this emergency mechanism will be able to end the VACUUM
immediately (because there is enough maintenance_work_mem to make
the first call to two_pass_strategy() also the last call), that won't
always be how it works. Even deciding to stop index vacuuming (and
heap vacuuming) may not be enough to avert disaster if left too late
-- because we may still have to do a lot of table pruning. In cases
where there is not nearly enough maintenance_work_mem we will get
through the table a lot faster once we decide to skip indexes, but
there is some risk that even this will not be fast enough.

How about 1.8 billion XIDs? That's the maximum value of
autovacuum_freeze_max_age (2 billion) minus the default value (200
million). That is high enough that it seems almost impossible for this
emergency mechanism to hurt rather than help. At the same time it is
not so high that there isn't some remaining time to finish off work
which is truly required.

> > This seems like a good idea because we should try to avoid changing
> > back to index vacuuming having decided to skip it once.
>
> Once decided to skip index vacuuming due to too old relfrozenxid
> stuff, the decision never be changed within the same vacuum operation,
> right? Because the relfrozenxid is advanced at the end of vacuum.

I see no reason why it would be fundamentally incorrect to teach
two_pass_strategy() to make new and independent decisions about doing
index vacuuming on each call. I just don't think that that makes any
sense to do so, practically speaking. Why would we even *want* to
decide to not do index vacuuming, and then change our mind about it
again (resume index vacuuming again, for later heap blocks)? That
sounds a bit too much like me!

There is another reason to never go back to index vacuuming: we should
have an ereport() at the point that we decide to not do index
vacuuming (or not do additional index vacuuming) inside
two_pass_strategy(). This should deliver an unpleasant message to the
DBA. The message is (in my own informal language): An emergency
failsafe mechanism made VACUUM skip index vacuuming, just to avoid
likely XID wraparound failure. This is not supposed to happen.
Consider tuning autovacuum settings, especially if you see this
message regularly.

Obviously the reason to delay the decision is that we cannot easily
predict how long any given VACUUM will take (or just to reach
two_pass_strategy()). Nor can we really hope to understand how many
XIDs will be consumed in that time. So rather than trying to
understand all that, we can instead just wait until we have reliable
information. It is true that the risk of waiting until it's too late
to avert disaster exists (which is why 1.8 billion XIDs seems like a
good threshold to me), but there is only so much we can do about that.
We don't need it to be perfect, just much better.

In my experience, anti-wraparound VACUUM scenarios all have an
"accident chain", which is a concept from the world of aviation and
safety-critical systems:

https://en.wikipedia.org/wiki/Chain_of_events_(accident_analysis)

They usually involve some *combination* of Postgres problems,
application code problems, and DBA error. Not any one thing. I've seen
problems with application code that runs DDL at scheduled intervals,
which interacts badly with vacuum -- but only really on the rare
occasions when freezing is required! I've also seen a buggy
hand-written upsert function that artificially burned through XIDs 

Re: Perform COPY FROM encoding conversions in larger chunks

2021-03-18 Thread John Naylor
On Thu, Mar 18, 2021 at 2:05 PM John Naylor 
wrote:
>
> I wrote:
>
> > I went ahead and rebased these.
>
> It looks like FreeBSD doesn't like this for some reason.

On closer examination, make check was "terminated", not that the tests
failed...

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2021-03-18 Thread Julien Rouhaud
On Thu, Mar 18, 2021 at 09:47:29AM -0400, Bruce Momjian wrote:
> On Thu, Mar 18, 2021 at 07:29:56AM +0800, Julien Rouhaud wrote:
> > On Wed, Mar 17, 2021 at 06:32:16PM -0400, Bruce Momjian wrote:
> > > OK, is that what everyone wants?  I think that is what the patch already
> > > does.
> > 
> > Not exactly.  Right now a custom queryid can be computed even if
> > compute_queryid is off, if some extension does that in 
> > post_parse_analyze_hook.
> > 
> > I'm assuming that what Robert was thinking was more like:
> > 
> > if (compute_queryid)
> > {
> > if (queryid_hook)
> > queryId = queryid_hook(...);
> > else
> > queryId = JumbleQuery(...);
> > }
> > else
> > queryId = 0;
> > 
> > And that should be done *after* post_parse_analyse_hook so that it's clear 
> > that
> > this hook is no longer the place to compute queryid.
> > 
> > Is that what should be done?
> 
> No, I don't think so.  I think having extensions change behavior
> controlled by GUCs is a bad interface.
> 
> The docs are going to say that you have to enable compute_queryid to see
> the query id in pg_stat_activity and log_line_prefix, but if you install
> an extension, the query id will be visible even if you don't have
> compute_queryid enabled.  I think you need to only honor the hook if
> compute_queryid is enabled, and update the pg_stat_statements docs to
> say you have to enable compute_queryid for pg_stat_statements to work.

I'm confused, what you described really looks like what I described.

Let me try to clarify:

- if compute_queryid is off, a queryid should never be seen no matter how hard
  an extension tries

- if compute_queryid is on, the calculation will be done by the core
  (using pgss JumbleQuery) unless an extension computed one already.  The only
  way to know what algorithm is used is to check the list of extension loaded.

- if some extension calculates a queryid during post_parse_analyze_hook, we
  will always reset it.

Is that the approach you want?

Note that the only way to not honor the hook if the new GUC is disabled is
to have a new queryid_hook, as we can't stop calling post_parse_analyze_hook if
the new GUC is off, and we don't want to pay the queryid calculation overhead
if the admin explicitly said it wasn't needed.

> Also, should it be compute_queryid or compute_query_id?

Maybe compute_query_identifier?

> Also, the overhead of computing the query id was reported as 2% --- that
> seems quite high for what it does.  Do we know why it is so high?

The 2% was a worst case scenario, for a query with a single join over
ridiculously small pg_class and pg_attribute, in read only.  The whole workload
was in shared buffers so the planning and execution is quite fast.  Adding some
complexity in the query really limited the overhead.

Note that this was done on an old laptop with quite slow CPU.  Maybe
someone with a better hardware than a 5/6yo laptop could get some more
realistic results (I unfortunately don't have anything to try on).




Re: Perform COPY FROM encoding conversions in larger chunks

2021-03-18 Thread John Naylor
I wrote:

> I went ahead and rebased these.

It looks like FreeBSD doesn't like this for some reason.

I also wanted to see if this patch set had any performance effect, with and
without changing how UTF-8 is validated, using the blackhole am from
https://github.com/michaelpq/pg_plugins/tree/master/blackhole_am.

create extension blackhole_am;
create table blackhole_tab (a text) using blackhole_am ;
time ./inst/bin/psql -c "copy blackhole_tab from '/path/to/test-copy.txt'"

where test-copy.txt is made by

for i in {1..100}; do cat UTF-8-Sampler.htm >> test-copy.txt ;  done;

On Linux x86-64, gcc 8.4, I get these numbers (minimum of five runs):

master:
109ms

v6 do encoding in larger chunks:
109ms

v7 utf8 SIMD:
98ms

--
John Naylor
EDB: http://www.enterprisedb.com


Re: cleanup temporary files after crash

2021-03-18 Thread Tomas Vondra
Hmm,

crake and florican seem to have failed because of this commit, with this
error in the new TAP test:

error running SQL: 'psql::1: ERROR:  could not open directory
"base/pgsql_tmp": No such file or directory'
while running 'psql -XAtq -d port=64336 host=/tmp/sv1WjSvj3P
dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'SELECT COUNT(1)
FROM pg_ls_dir($$base/pgsql_tmp$$)' at
/home/andrew/bf/root/HEAD/pgsql.build/../pgsql/src/test/perl/PostgresNode.pm
line 1572.

So it seems the pgsql_tmp directory does not exist, for some reason.
Considering the directory should be created for the first temp file,
that either means the query in the TAP test does not actually create a
temp file on those machines, or it gets killed too early.

The 500 rows used by the test seems fairly low, so maybe those machines
can do the sort entirely in memory?

The other option is that the sleep in the TAP test is a bit too short,
but those machines don't seem to be that slow.

Anyway, TAP test relying on timing like this may not be the best idea,
so I wonder how else to test this ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Key management with tests

2021-03-18 Thread Stephen Frost
Greetings,

* Alvaro Herrera (alvhe...@alvh.no-ip.org) wrote:
> On 2021-Mar-18, Stephen Frost wrote:
> 
> > * Alvaro Herrera (alvhe...@alvh.no-ip.org) wrote:
> > > Patch 10 uses the term "WAL-skip relations".  What does that mean?  Is
> > > it "relations that are not WAL-logged"?  I suppose we already have a
> > > term for this; I'm not sure it's a good idea to invent a different term
> > > that is only used in this new place.
> > 
> > This is discussed in src/backend/access/transam/README, specifically the
> > section that talks about Skipping WAL for New RelFileNode.  Basically,
> > it's the 'wal_level=minimal' optimization which allows WAL to be
> > skipped.
> 
> Hmm ... that talks about WAL-skipping *changes*, not WAL-skipping
> *relations*.  I thought WAL-skipping meant unlogged relations, but
> I understand now that that's unrelated.  In the transam/README, WAL-skip
> means a change in a transaction in a relfilenode that, if rolled back,
> would disappear; and I'm not sure I understand how the code is handling
> the case that a relation is under that condition.
> 
> This caught my attention because a comment says "encryption does not
> support WAL-skipped relations", but there's no direct change to the
> definition of RelFileNodeSkippingWAL() to account for that.  Perhaps I
> am just overlooking something, since I'm just skimming anyway.

This is relatively current activity and so it's entirely possible
comments and perhaps code need further updating in this area, but to
explain what's going on in a bit more detail- 

Ultimately, we need to make sure that LSNs aren't re-used.  There's two
sources of LSNs today: those for relations which are being written into
the WAL and those for relations which are not (UNLOGGED relations,
specifically).  The 'minimal' WAL level introduces complications with
this requirement because tables created (or truncated) inside a
transaction are considered permanent once they're committed, but the
data pages in those relations don't go into the WAL and the LSNs on the
pages of those relations aren't guaranteed to be either unique or even
necessarily set, and if we were to generate LSNs for those it would be
required to be done by actually advancing the WAL LSN, which would
require writing into the WAL and therefore wouldn't be quite the
optimization that's expected.

I'm not sure if it's been explicitly done yet but I believe the idea is,
based on my last discussion with Bruce, at least initially, simply
disallow encrypted clusters from running with wal_level=minimal to avoid
this issue.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Key management with tests

2021-03-18 Thread Alvaro Herrera
On 2021-Mar-18, Stephen Frost wrote:

> * Alvaro Herrera (alvhe...@alvh.no-ip.org) wrote:
> > Patch 10 uses the term "WAL-skip relations".  What does that mean?  Is
> > it "relations that are not WAL-logged"?  I suppose we already have a
> > term for this; I'm not sure it's a good idea to invent a different term
> > that is only used in this new place.
> 
> This is discussed in src/backend/access/transam/README, specifically the
> section that talks about Skipping WAL for New RelFileNode.  Basically,
> it's the 'wal_level=minimal' optimization which allows WAL to be
> skipped.

Hmm ... that talks about WAL-skipping *changes*, not WAL-skipping
*relations*.  I thought WAL-skipping meant unlogged relations, but
I understand now that that's unrelated.  In the transam/README, WAL-skip
means a change in a transaction in a relfilenode that, if rolled back,
would disappear; and I'm not sure I understand how the code is handling
the case that a relation is under that condition.

This caught my attention because a comment says "encryption does not
support WAL-skipped relations", but there's no direct change to the
definition of RelFileNodeSkippingWAL() to account for that.  Perhaps I
am just overlooking something, since I'm just skimming anyway.

-- 
Álvaro Herrera   Valdivia, Chile




Re: GROUP BY DISTINCT

2021-03-18 Thread Tomas Vondra
On 3/16/21 3:52 PM, Tomas Vondra wrote:
> 
> 
> On 3/16/21 9:21 AM, Vik Fearing wrote:
>> On 3/13/21 12:33 AM, Tomas Vondra wrote:
>>> Hi Vik,
>>>
>>> The patch seems quite ready, I have just two comments.
>>
>> Thanks for taking a look.
>>
>>> 1) Shouldn't this add another  for DISTINCT, somewhere in the
>>> documentation? Now the index points just to the SELECT DISTINCT part.
>>
>> Good idea; I never think about the index.
>>
>>> 2) The part in gram.y that wraps/unwraps the boolean flag as an integer,
>>> in order to stash it in the group lists is rather ugly, IMHO. It forces
>>> all the places handling the list to be aware of this (there are not
>>> many, but still ...). And there are no other places doing (bool) intVal
>>> so it's not like there's a precedent for this.
>>
>> There is kind of a precedent for it, I was copying off of TriggerEvents
>> and func_alias_clause.
>>
> 
> I see. I was looking for "(bool) intVal" but you're right TriggerEvents
> code does something similar.
> 
>>> I think the clean solution is to make group_clause produce a struct with
>>> two fields, and just use that. Not sure how invasive that will be
>>> outside gram.y, though.
>>
>> I didn't want to create a whole new parse node for it, but Andrew Gierth
>> pointed me towards SelectLimit so I did it like that and I agree it is
>> much cleaner.
>>
> 
> I agree, that's much cleaner.
> 
>>> Also, the all_or_distinct vs. distinct_or_all seems a bit error-prone. I
>>> wonder if we can come up with some clearer names, describing the context
>>> of those types.
>>
>> I turned this into an enum for ALL/DISTINCT/default and the caller can
>> choose what it wants to do with default.  I think that's a lot cleaner,
>> too.  Maybe DISTINCT ON should be changed to fit in that?  I left it
>> alone for now.
>>
> 
> I think DISTINCT ON is a different kind of animal, because that is a
> list of expressions, not just a simple enum state.
> 
>> I also snuck in something that all of us overlooked which is outputting
>> the DISTINCT in ruleutils.c.  I didn't add a test for it but that would
>> have been an unfortunate bug.
>>
> 
> Oh!
> 
>> New patch attached, rebased on 15639d5e8f.
>>
> 
> Thanks. At this point it seems fine to me, no further comments.
> 

Pushed. Thanks for the patch.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Key management with tests

2021-03-18 Thread Stephen Frost
Greetings,

* Alvaro Herrera (alvhe...@alvh.no-ip.org) wrote:
> Patch 10 uses the term "WAL-skip relations".  What does that mean?  Is
> it "relations that are not WAL-logged"?  I suppose we already have a
> term for this; I'm not sure it's a good idea to invent a different term
> that is only used in this new place.

This is discussed in src/backend/access/transam/README, specifically the
section that talks about Skipping WAL for New RelFileNode.  Basically,
it's the 'wal_level=minimal' optimization which allows WAL to be
skipped.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: Key management with tests

2021-03-18 Thread Alvaro Herrera
Patch 10 uses the term "WAL-skip relations".  What does that mean?  Is
it "relations that are not WAL-logged"?  I suppose we already have a
term for this; I'm not sure it's a good idea to invent a different term
that is only used in this new place.

-- 
Álvaro Herrera39°49'30"S 73°17'W




Re: Key management with tests

2021-03-18 Thread Bruce Momjian
On Thu, Mar 18, 2021 at 11:31:34AM -0400, Stephen Frost wrote:
> Greetings,
> 
> * Bruce Momjian (br...@momjian.us) wrote:
> > On Thu, Mar 11, 2021 at 10:31:28PM -0500, Bruce Momjian wrote:
> > > I have made significant progress on the cluster file encryption feature so
> > > it is time for me to post a new set of patches.
> > 
> > Here is a rebase, to keep the cfbot green.
> 
> Good stuff.

Yes, I was happy I got to a stage where the encryption actually did
something useful.

> > >From 110358c9ce8764f0c41c12dd37dabde57a92cf1f Mon Sep 17 00:00:00 2001
> > From: Bruce Momjian 
> > Date: Mon, 15 Mar 2021 10:20:32 -0400
> > Subject: [PATCH] cfe-11-persistent_over_cfe-10-hint squash commit
> > 
> > ---
> >  src/backend/access/gist/gistutil.c   |  2 +-
> >  src/backend/access/heap/heapam_handler.c |  2 +-
> >  src/backend/catalog/pg_publication.c |  2 +-
> >  src/backend/commands/tablecmds.c | 10 +-
> >  src/backend/optimizer/util/plancat.c |  3 +--
> >  src/backend/utils/cache/relcache.c   |  2 +-
> >  src/include/utils/rel.h  | 10 --
> >  src/include/utils/snapmgr.h  |  3 +--
> >  8 files changed, 19 insertions(+), 15 deletions(-)
> 
> This particular patch (introducing the RelationIsPermanent() macro)
> seems like it'd be a nice thing to commit independently of the rest,
> reducing the size of this patch set..? 

OK, if no one objects I will apply it in the next few days. The macro is
used more in my later patches, which I will not apply now.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: fdatasync performance problem with large number of DB files

2021-03-18 Thread Fujii Masao




On 2021/03/18 23:03, Bruce Momjian wrote:

> Are we sure we want to use the word "crash" here?  I don't remember
> seeing it used anywhere else in our user interface.

We have GUC restart_after_crash.

> I guess it is "crash recovery".
>
> Maybe call it "recovery_sync_method"

+1. This name sounds good to me. Or recovery_init_sync_method, because that
sync happens in the initial phase of recovery.

Another idea from different angle is data_directory_sync_method. If we adopt
this, we can easily extend this feature for other use cases (other than sync at
the beginning of recovery) without changing the name.
I'm not sure if such cases actually exist, though.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Disable WAL logging to speed up data loading

2021-03-18 Thread David Steele

On 1/17/21 12:16 AM, tsunakawa.ta...@fujitsu.com wrote:

From: Osumi, Takamichi/大墨 昂道 

I updated my patch to take in those feedbacks !
Have a look at the latest patch.


Looks good to me (the status remains ready for committer.)


After reading through the thread (but not reading the patch) I am -1 on  
this proposal.


The feature seems ripe for abuse and misunderstanding, and as has been  
noted in the thread, there are a variety of alternatives that can  
provide a similar effect.


It doesn't help that at several points along the way new WAL records  
have been found that still need to be included even when wal_level =  
none. It's not clear to me how we know when we have found them all.


The patch is marked Ready for Committer but as far as I can see there  
are no committers in favor of it and quite a few who are not.


Perhaps it would be better to look at some of the more targeted  
approaches mentioned in the thread and see if any of them can be  
used/improved to achieve the desired result?


Regards,
--
-David
da...@pgmasters.net




Re: cleanup temporary files after crash

2021-03-18 Thread Tomas Vondra



On 3/17/21 2:34 AM, Euler Taveira wrote:
> On Sun, Mar 14, 2021, at 11:01 PM, Thomas Munro wrote:
>> On Wed, Mar 10, 2021 at 1:31 AM Michael Paquier > > wrote:
>> > On Tue, Mar 09, 2021 at 02:28:43AM +0100, Tomas Vondra wrote:
>> > > Let's move this patch forward. Based on the responses, I agree the
>> > > default behavior should be to remove the temp files, and I think we
>> > > should have the GUC (on the off chance that someone wants to preserve
>> > > the temporary files for debugging or whatever other reason).
>> >
>> > Thanks for taking care of this.  I am having some second-thoughts
>> > about changing this behavior by default, still that's much more useful
>> > this way.
>>
>> +1 for having it on by default.
>>
>> I was also just looking at this patch and came here to say LGTM except
>> for two cosmetic things, below.
> Thanks for taking a look at this patch. I'm not sure Tomas is preparing
> a new
> patch that includes the suggested modifications but I decided to do it. This
> new version has the new GUC name (using "remove"). I also replaced "cleanup"
> with "remove" in the all the remain places. As pointed by Thomas, I reworded
> the paragraph that describes the GUC moving the default information to the
> beginning of the sentence. I also added the "literal" as suggested by
> Michael.
> The postgresql.conf.sample was fixed. The tests were slightly modified. I
> reworded some comments and added a hack to avoid breaking the temporary file
> test in slow machines. A usleep() after sending the query provides some time
> for the query to create the temporary file. I used an arbitrarily sleep
> (10ms)
> that seems to be sufficient.
> 

Thanks. Pushed with some minor changes to docs wording.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: invalid data in file backup_label problem on windows

2021-03-18 Thread David Steele

On 1/14/21 10:50 PM, Wang, Shenhao wrote:



Please feel free to, under the section "Bug fixes".  This way, it
won't get lost in the traffic of this list.
--
Michael


Thank you for your advice, added it


I'm not sure how I feel about this patch (not really a Windows person)  
but I do think that you shouldn't modify the backup_label when writing  
it, i.e. you should be writing backup_label in binary mode to avoid this  
issue.


No objections from me if it gets committed but I'm not sure it should be  
classified as a "bug fix" since the backup_label was modified from what  
postgres provided, unless I am misunderstanding.


Regards,
--
-David
da...@pgmasters.net




Re: a verbose option for autovacuum

2021-03-18 Thread Peter Geoghegan
On Thu, Mar 18, 2021 at 5:14 AM Masahiko Sawada  wrote:
> Okay, I've updated the patch accordingly. If we add
> IndexBulkDeleteResult to LVRelStats I think we can remove
> IndexBulkDeleteResult function argument also from some other functions
> such as lazy_parallel_vacuum_indexes() and vacuum_indexes_leader().
> Please review the attached patch.

That seems much clearer.

Were you going to take care of this, Michael?

-- 
Peter Geoghegan




Re: Implement for window functions

2021-03-18 Thread David Steele

On 3/18/21 12:03 PM, Vik Fearing wrote:

On 3/18/21 4:12 PM, David Steele wrote:

Hi Vik,

On 1/11/21 5:00 PM, Tom Lane wrote:

I started to look through this patch...


I see you moved this patch to PG15. If you won't be working on the patch
in this CF


Correct.  I won't be working on this again until I finish my review of
the system versioning patch.


perhaps it would be better to close it as Returned with
Feedback for now and reopen it when you have a new patch?


If that is preferred over moving it to PG15, then no objection.  


It is, because it means it doesn't need to be looked at again until you 
have had time to work on it.



As long
as people don't think I've abandoned it.


This declaration should prevent that.

Regards,
--
-David
da...@pgmasters.net




Re: Implement for window functions

2021-03-18 Thread Vik Fearing
On 3/18/21 4:12 PM, David Steele wrote:
> Hi Vik,
> 
> On 1/11/21 5:00 PM, Tom Lane wrote:
>> I started to look through this patch...
> 
> I see you moved this patch to PG15. If you won't be working on the patch
> in this CF

Correct.  I won't be working on this again until I finish my review of
the system versioning patch.

> perhaps it would be better to close it as Returned with
> Feedback for now and reopen it when you have a new patch?

If that is preferred over moving it to PG15, then no objection.  As long
as people don't think I've abandoned it.
-- 
Vik Fearing




Re: Rename of triggers for partitioned tables

2021-03-18 Thread David Steele

On 1/15/21 5:26 PM, Alvaro Herrera wrote:

On 2020-Nov-27, Arne Roland wrote:


I got too annoyed at building queries for gexec all the time. So I wrote
a patch to fix the issue that renaming a trigger on a partitioned table
doesn't affect the children.


As you say, triggers on children don't necessarily have to have the same
name as on parent; this already happens when the trigger is renamed in
the child table but not on parent.  In that situation the search on the
child will fail, which will cause the whole thing to fail I think.

We now have the column pg_trigger.tgparentid, and I think it would be
better (more reliable) to search for the trigger in the child by the
tgparentid column instead, rather than by name.

Also, I think it would be good to have
  ALTER TRIGGER .. ON ONLY parent RENAME TO ..
to avoid recursing to children.  This seems mostly pointless, but since
we've allowed changing the name of the trigger in children thus far,
then we've implicitly made it supported to have triggers that are named
differently.  (And it's not entirely academic, since the trigger name
determines firing order.)

Alternatively to this last point, we could decide to disallow renaming
of triggers on children (i.e. if trigger has tgparentid set, then
renaming is disallowed).  I don't have a problem with that, but it would
have to be an explicit decision to take.


Arne, thoughts on Álvaro's comments?

Marking this patch as Waiting for Author.

Regards,
--
-David
da...@pgmasters.net




Re: pg_stat_statements and "IN" conditions

2021-03-18 Thread Dmitry Dolgov
> On Thu, Mar 18, 2021 at 09:38:09AM -0400, David Steele wrote:
> On 1/5/21 10:51 AM, Zhihong Yu wrote:
> >
> > +   int         lastExprLenght = 0;
> >
> > Did you mean to name the variable lastExprLenghth ?
> >
> > w.r.t. extracting to helper method, the second and third
> > if (currentExprIdx == pgss_merge_threshold - 1) blocks are similar.
> > It is up to you whether to create the helper method.
> > I am fine with the current formation.
>
> Dmitry, thoughts on this review?

Oh, right. lastExprLenghth is obviously a typo, and as we agreed that
the helper is not strictly necessary I wanted to wait a bit hoping for
more feedback and eventually to post an accumulated patch. Doesn't make
sense to post another version only to fix one typo :)




Re: Reduce the time required for a database recovery from archive.

2021-03-18 Thread David Steele

On 3/18/21 11:37 AM, Andrey Borodin wrote:




On 18 March 2021, at 20:04, David Steele  wrote:
it would be nice to support an interface that simply says to the restore_command, 
"go get 1gb of WAL and write the files here."


+1 to redesigning restore_command and archive_command.


Indeed, archive_command would benefit from the same treatment. The need 
to call archive_command for each WAL segment even when parallel 
processing is going on behind the scenes is a bottleneck.


Larger WAL segments sizes can be used to mitigate this issue but an 
improvement would be welcome.


Regards,
--
-David
da...@pgmasters.net




Re: Reduce the time required for a database recovery from archive.

2021-03-18 Thread Andrey Borodin



> On 18 March 2021, at 20:04, David Steele  wrote:
> it would be nice to support an interface that simply says to the 
> restore_command, "go get 1gb of WAL and write the files here."

+1 to redesigning restore_command and archive_command.

Best regards, Andrey Borodin.





Re: [patch] [doc] Further note required activity aspect of automatic checkpoint and archving

2021-03-18 Thread David Steele

Hi David,

On 1/15/21 2:50 PM, David G. Johnston wrote:


If the above wants to be made more explicit in this change maybe:

"This is mitigated by the fact that archiving, and thus filling, the 
active WAL segment will not happen if that segment is empty; it will 
continue as the active segment."


"archiving, and thus filling" seems awkward to me. Perhaps:

This is mitigated by the fact that WAL segments will not be archived 
until they have been filled with some data, even if the archive_timeout 
period has elapsed.


Consistency is good; and considering it further the skipped wording is 
generally better anyway.


"The automatic checkpoint will be skipped if no new WAL has been written 
since the last recorded checkpoint."

Looks good to me.

Could you produce a new patch so Peter has something complete to look at?

Regards,
--
-David
da...@pgmasters.net




Re: Key management with tests

2021-03-18 Thread Stephen Frost
Greetings,

* Bruce Momjian (br...@momjian.us) wrote:
> On Thu, Mar 11, 2021 at 10:31:28PM -0500, Bruce Momjian wrote:
> > I have made significant progress on the cluster file encryption feature so
> > it is time for me to post a new set of patches.
> 
> Here is a rebase, to keep the cfbot green.

Good stuff.

> >From 110358c9ce8764f0c41c12dd37dabde57a92cf1f Mon Sep 17 00:00:00 2001
> From: Bruce Momjian 
> Date: Mon, 15 Mar 2021 10:20:32 -0400
> Subject: [PATCH] cfe-11-persistent_over_cfe-10-hint squash commit
> 
> ---
>  src/backend/access/gist/gistutil.c   |  2 +-
>  src/backend/access/heap/heapam_handler.c |  2 +-
>  src/backend/catalog/pg_publication.c |  2 +-
>  src/backend/commands/tablecmds.c | 10 +-
>  src/backend/optimizer/util/plancat.c |  3 +--
>  src/backend/utils/cache/relcache.c   |  2 +-
>  src/include/utils/rel.h  | 10 --
>  src/include/utils/snapmgr.h  |  3 +--
>  8 files changed, 19 insertions(+), 15 deletions(-)

This particular patch (introducing the RelationIsPermanent() macro)
seems like it'd be a nice thing to commit independently of the rest,
reducing the size of this patch set..? 

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: should INSERT SELECT use a BulkInsertState?

2021-03-18 Thread Zhihong Yu
Hi,
+   mtstate->ntuples > bulk_insert_ntuples &&
+   bulk_insert_ntuples >= 0)

I wonder why bulk_insert_ntuples == 0 is allowed by the above check. It
seems that a bulk_insert_ntuples value of 0 should mean bulk insertions
are not enabled.

+   }
+   else
+   {

nit: the else should be on the same line as the right brace.

Cheers

On Thu, Mar 18, 2021 at 6:38 AM Ibrar Ahmed  wrote:

>
>
> On Mon, Mar 8, 2021 at 2:18 PM houzj.f...@fujitsu.com <
> houzj.f...@fujitsu.com> wrote:
>
>> > > > I am very interested in this patch, and I plan to do some
>> > > > experiments with the
>> > > patch.
>> > > > Can you please rebase the patch because it seems can not applied to
>> > > > the
>> > > master now.
>> > >
>> > > Thanks for your interest.
>> > >
>> > > I was sitting on a rebased version since the bulk FDW patch will cause
>> > > conflicts, and since this should maybe be built on top of the
>> table-am patch
>> > (2871).
>> > > Have fun :)
>> > Hi,
>> >
>> > When I testing with the patch, I found I can not use "\d tablename".
>> > It reports the following error, it this related to the patch?
>>
>> Sorry, solved by re-initdb.
>>
>> Best regards,
>> houzj
>>
>>
>> One of the patch
> (v10-0001-INSERT-SELECT-to-use-BulkInsertState-and-multi_i.patch) from the
> patchset does not apply.
>
> http://cfbot.cputube.org/patch_32_2553.log
>
> 1 out of 13 hunks FAILED -- saving rejects to file
> src/backend/commands/copyfrom.c.rej
>
> It is a minor change, therefore I fixed that to make cfbot happy. Please
> take a look if that works for you.
>
>
>
>
> --
> Ibrar Ahmed
>


Re: pl/pgsql feature request: shorthand for argument and local variable references

2021-03-18 Thread Pavel Stehule
čt 18. 3. 2021 v 15:59 odesílatel Hannu Krosing  napsal:

> On Thu, Mar 18, 2021 at 3:45 PM Pavel Stehule 
> wrote:
> >
> >
> >
> > čt 18. 3. 2021 v 15:19 odesílatel Hannu Krosing 
> napsal:
> ...
> >> Variation could be
> >>
> >> DECLARE
> >>fnarg ALIAS FOR FUNCTION a_function_with_an_inconveniently_long_name;
> >>
> >> so it is clear that we are aliasing current function name
> >>
> >>
> >> or even make the function name optional, as there can only be one, so
> >>
> >> DECLARE
> >>fnarg ALIAS FOR FUNCTION;
> >
> >
> > Why you think so it is better than just
> >
> > #routine_label fnarg
> >
> > ?
>
> It just adds already a *third* way to (re)name things, in addition to
> << blocklabel >>
> and ALIAS FOR
>
> For some reason it feels to me similar to having completely different
> syntax rules
> for function names, argument names and variable names instead of all
> having to
> follow rules for "identifiers"
>
> For cleanness/orthogonality I would also prefer the blocklables to be in
> DECLARE
> for each block, but this train has already left :)
> Though we probably could add alternative syntax ALIAS FOR BLOCK ?
>

Why? Is it a joke?

You are defining a block label, and you want to redefine some outer label in
the same block? I don't think it is a good idea.



> > Maybe the name of the option can be routine_alias? Is it better for you?
>
> I dont think the name itself is bad, as it is supposed to be used for
> both FUNCTION
> and PROCEDURE renaming
>
> > My objection to DECLARE is freedom of this clause. It can be used
> everywhere.
> > the effect of DECLARE is block limited. So theoretically somebody can
> write
> >
> > BEGIN
> >   ..
> >   DECLARE fnarg ALIAS FOR FUNCTION;
> >   BEGIN
> > ..
> >   END;
> >
> > END;
> >
> > It disallows simple implementation.
>
> We could limit ALIAS FOR FUNCTION to outermost block only, and revisit
> this later
> if there are loud complaints (which I don't think there will be :) )
>

Inside the parser you don't know whether you are in the outermost block only.

I played with it and I see two objections against the DECLARE syntax:

1. The scope is the block, so you cannot use aliases for default arguments
(or maybe you can, but is it correct?)

CREATE OR REPLACE FUNCTION fx(a int, b int)
RETURNS ... AS $$
DECLARE
  x1 int := fx.a;
  fnarg AS ALIAS FOR FUNCTION;
  x2 int := fnarg.a; -- should this alias be active now?
BEGIN

2. "ALIAS FOR FUNCTION" is not well named. In some languages, and in ADA
language too. You can rename an external function just for current scope.
You can find some examples like

DECLARE
  s ALIAS FOR FUNCTION sin;
BEGIN
  s(1); -- returns result of function sin

other example - recursive function

CREATE OR REPLACE FUNCTION some_very_long()
RETURNS int AS $$
DECLARE
  thisfx ALIAS FOR FUNCTION;
BEGIN
  var := thisfx(...);

But we don't support this feature. We are changing just the top scope's
label, so the syntax "ALIAS FOR FUNCTION" is not good. The user can have
false hopes.







>
>
> Cheers
> Hannu
>


Re: Implement for window functions

2021-03-18 Thread David Steele

Hi Vik,

On 1/11/21 5:00 PM, Tom Lane wrote:

I started to look through this patch...


I see you moved this patch to PG15. If you won't be working on the patch 
in this CF perhaps it would be better to close it as Returned with 
Feedback for now and reopen it when you have a new patch?


I'll do that on March 23 unless I hear arguments to the contrary.

Regards,
--
-David
da...@pgmasters.net




Re: Force lookahead in COPY FROM parsing

2021-03-18 Thread John Naylor
The cfbot thinks this patch no longer applies, but it works for me, so
still set to RFC. It looks like it's because the thread to remove the v2
FE/BE protocol was still attached to the commitfest entry. I've deleted
that, so let's see if that helps.

To answer the side question of whether it makes any difference in
performance, I used the blackhole AM [1] to isolate the copy code path as
much as possible. Forcing lookahead seems to not make a noticeable
difference (min of 5 runs):

master:
306ms

force lookahead:
304ms

[1] https://github.com/michaelpq/pg_plugins/tree/master/blackhole_am
-- 
John Naylor
EDB: http://www.enterprisedb.com


Re: Reduce the time required for a database recovery from archive.

2021-03-18 Thread David Steele

Hi Dmitry,

On 1/11/21 2:51 AM, Dmitry Shulga wrote:

Hi Stephen

Based on our last discussion I redesigned the implementation of WAL 
archive recovery speed-up. 


Seems like there should have been a patch attached? In any case the 
current patch no longer applies so marked Waiting on Author.


Personally, I'm not too keen on this patch as implemented. Several 
third-party backup solutions support parallel archive get so it would be 
nice to support an interface that simply says to the restore_command, 
"go get 1gb of WAL and write the files here." This patch still assumes 
that the user has written their own restore command, which is 
third-party by definition, so I can't see how interfacing with 
third-party software is an issue here.


Also, having multiple workers blindly asking for WAL can cause quite a 
bit of traffic and cost because PG knows what WAL it wants but it 
doesn't know what exists. On the other hand, a backup solution can 
cheaply determine what is available to prevent hammering the archive 
with requests for files that don't exist.


Regards,
--
-David
da...@pgmasters.net




Re: pl/pgsql feature request: shorthand for argument and local variable references

2021-03-18 Thread Pavel Stehule
čt 18. 3. 2021 v 15:49 odesílatel Hannu Krosing  napsal:

> I went to archives and read the whole discussion for this from the
> beginning
>
> I did not really understand, why using the ALIAS FOR syntax is
> significantly
> harder to implement then #routine_label
>

just because it can be used in different places - for example in nested
blocks. And then you need to create really new aliases for these objects in
the current namespace. And it is not easy because we cannot iterate in the
ns_items tree from the root.



> The things you mentioned as currently using #OPTION seem to be in a
> different
> category from the aliases and block labels - they are more like pragmas -
> so to
> me it still feels inconsistent to use #routine_label for renaming the
> outer namespace.
>

I think this syntax can carry more concepts than just pragmas.



>
> Also checked code and indeed there is support for #OPTION DUMP and
> #VARIABLE_CONFLICT ERROR
> Is it documented and just hard to find ?
>

Please see my previous mail - there were links there.
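For readers following along, the existing compile options referred to above
are written at the top of a function body like this (#variable_conflict and
#print_strict_params are in the PL/pgSQL documentation; #option dump is an
undocumented debugging aid). The function itself is a throwaway example just
to show where the options go:

CREATE FUNCTION compile_option_demo(x int) RETURNS int
LANGUAGE plpgsql AS $$
#print_strict_params on
#variable_conflict error
BEGIN
  RETURN x + 1;
END;
$$;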

>
> On Thu, Mar 18, 2021 at 3:19 PM Hannu Krosing  wrote:
> >
> > On Thu, Mar 18, 2021 at 5:27 AM Pavel Stehule 
> wrote:
> > >
> > >
> > > There are few main reasons:
> > >
> > > a) compile options are available already - so I don't need invent new
> syntax - #OPTION DUMP, #PRINT_STRICT ON, #VARIABLE_CONFLICT ERROR
> >
> > Are these documented anywhere ?
> >
> > At least a quick search for pl/pgsql + #OPTION DUMP, #PRINT_STRICT ON,
> > #VARIABLE_CONFLICT ERROR returned nothing relevant and also searching
> > in
> >
> > If pl/pgsql actually supports any of these, these should get documented!
> >
> >
> >
> > For me the most logical place for declaring a function name alias
> > would be to allow  ALIAS FOR the function name in DECLARE section.
> >
> > plpgsql_test=# CREATE FUNCTION
> > a_function_with_an_inconveniently_long_name(i int , OUT o int)
> > LANGUAGE plpgsql AS $plpgsql$
> > DECLARE
> >fnarg ALIAS FOR a_function_with_an_inconveniently_long_name;
> > BEGIN
> >fnarg.o = 2 * fnarg.i;
> > END;
> > $plpgsql$;
> >
> > but unfortunately this currently returns
> >
> > ERROR:  42704: variable "a_function_with_an_inconveniently_long_name"
> > does not exist
> > LINE 4:fnarg ALIAS FOR a_function_with_an_inconveniently_long_na...
> >^
> > LOCATION:  plpgsql_yyparse, pl_gram.y:677
> >
> > Variation could be
> >
> > DECLARE
> >fnarg ALIAS FOR FUNCTION a_function_with_an_inconveniently_long_name;
> >
> > so it is clear that we are aliasing current function name
> >
> > or even make the function name optional, as there can only be one, so
> >
> > DECLARE
> >fnarg ALIAS FOR FUNCTION;
> >
> >
> > Cheers
> > Hannu
>


Re: pl/pgsql feature request: shorthand for argument and local variable references

2021-03-18 Thread Hannu Krosing
On Thu, Mar 18, 2021 at 3:45 PM Pavel Stehule  wrote:
>
>
>
> On Thu, Mar 18, 2021 at 3:19 PM Hannu Krosing wrote:
...
>> Variation could be
>>
>> DECLARE
>>fnarg ALIAS FOR FUNCTION a_function_with_an_inconveniently_long_name;
>>
>> so it is clear that we are aliasing current function name
>>
>>
>> or even make the function name optional, as there can only be one, so
>>
>> DECLARE
>>fnarg ALIAS FOR FUNCTION;
>
>
> Why you think so it is better than just
>
> #routine_label fnarg
>
> ?

It just adds yet a *third* way to (re)name things, in addition to
<< blocklabel >>
and ALIAS FOR

For some reason it feels to me similar to having completely different
syntax rules
for function names, argument names and variable names instead of all having to
follow rules for "identifiers"

For cleanness/orthogonality I would also prefer the block labels to be declared
in DECLARE for each block, but that train has already left :)
Though we could probably add an alternative syntax, ALIAS FOR BLOCK?

> Maybe the name of the option can be routine_alias? Is it better for you?

I don't think the name itself is bad, as it is supposed to be used for
both FUNCTION
and PROCEDURE renaming

> My objection to DECLARE is freedom of this clause. It can be used everywhere.
> the effect of DECLARE is block limited. So theoretically somebody can write
>
> BEGIN
>   ..
>   DECLARE fnarg ALIAS FOR FUNCTION;
>   BEGIN
> ..
>   END;
>
> END;
>
> It disallows simple implementation.

We could limit ALIAS FOR FUNCTION to the outermost block only, and revisit
this later
if there are loud complaints (which I don't think there will be :) )



Cheers
Hannu



