Re: Is Recovery actually paused?

2021-01-26 Thread Dilip Kumar
On Wed, Jan 27, 2021 at 12:50 PM Masahiko Sawada  wrote:
>
> On Tue, Jan 26, 2021 at 2:00 AM Robert Haas  wrote:
> >
> > On Sat, Jan 23, 2021 at 6:10 AM Bharath Rupireddy
> >  wrote:
> > > +1 to just show the recovery pause state in the output of
> > > pg_is_wal_replay_paused. But, should the function name
> > > "pg_is_wal_replay_paused" be something like
> > > "pg_get_wal_replay_pause_state" or some other? To me, when "is" exists
> > > in a function, I expect a boolean output. Others may have better
> > > thoughts.
> >
> > Maybe we should leave the existing function pg_is_wal_replay_paused()
> > alone and add a new one with the name you suggest that returns text.
> > That would create less burden for tool authors.
>
> +1
>

Yeah, we can do that, I will send an updated patch soon.
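
For illustration, the resulting pair of functions might look like this
from SQL (the new name is the one suggested above; the exact text values
it returns are an assumption at this point, not committed behavior):

    -- Existing function keeps its boolean contract:
    SELECT pg_is_wal_replay_paused();          -- true / false
    -- Hypothetical new companion returning the pause state as text:
    SELECT pg_get_wal_replay_pause_state();    -- e.g. 'not paused',
                                               -- 'pause requested', 'paused'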

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: simplifying foreign key/RI checks

2021-01-26 Thread Keisuke Kuroda
Hi Amit-san,

Thanks for the answer!

> If you only tested insert/update on the referencing table, I would've
> expected to see nothing in the result of that query, because the patch
> eliminates all use of SPI in that case.  I suspect the CachedPlan*
> memory contexts you are seeing belong to some early activity in the
> session.  So if you try the insert/update in a freshly started
> session, you would see 0 rows in the result of that query.

That's right.
The CREATE TABLE statements for partitions included in the test script
(rep.sql) were using SPI. In a new session, I confirmed that no CachedPlan
is generated when only INSERT is executed.

# only execute INSERT

postgres=# INSERT INTO ps SELECT generate_series(1,4999);
INSERT 0 4999
postgres=#
postgres=# INSERT INTO pr SELECT i, i from generate_series(1,4999)i;
INSERT 0 4999

postgres=# SELECT name, sum(used_bytes) as bytes,
pg_size_pretty(sum(used_bytes)) FROM pg_backend_memory_contexts
WHERE name LIKE 'Cached%' GROUP BY name;

 name | bytes | pg_size_pretty
--+---+
(0 rows) ★ No CachedPlan

> Hmm, the patch tries to solve a general problem that SPI plans are not
> being shared among partitions whereas they should be.   So I don't
> think that it's necessarily specific to DELETE.  Until we have a
> solution like the patch on this thread for DELETE, it seems fine to
> consider the other patch as a stopgap solution.

I see.
So this is a solution to the general problem of SPI plans not being shared
among partitions, not one specific to DELETE.
I agree with you; it looks like a good solution to the current problem.

Best Regards,



-- 
Keisuke Kuroda
NTT Software Innovation Center
keisuke.kuroda.3...@gmail.com




RE: Parallel Inserts in CREATE TABLE AS

2021-01-26 Thread Tang, Haiying
Hi Bharath,



I chose 5 cases that pick a parallel insert plan in CTAS to measure the patched
performance. Each case was run 30 times.

Most of the test executions become faster with this patch.

However, Test No. 4 (create table xxx as table xxx) shows a performance
degradation. I tested various table sizes (2/10/20 million rows); they all show
6%-10% declines. I think this problem needs some investigation.



Below are my test results. 'Test NO' corresponds to 'Test NO' in the attached
test_ctas.sql file.

reg%=(patched-master)/master

Test NO |  Test Case                      | reg% | patched(ms) | master(ms)
--------|---------------------------------|------|-------------|------------
1       |  CTAS select from table         | -9%  | 16709.50477 | 18370.76660
2       |  Append plan                    | -14% | 16542.97807 | 19305.86600
3       |  initial plan under Gather node | -5%  | 13374.27187 | 14120.02633
4       |  CTAS table                     | 10%  | 20835.48800 | 18986.40350
5       |  CTAS select from execute       | -6%  | 16973.73890 | 18008.59789
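
For reference, Test No. 4 boils down to a statement of this shape (the
source table name tenk1 is taken from the plans below; the exact script
is in the attached test_ctas.sql):

    -- Test No. 4: copy an existing table wholesale with CTAS.
    CREATE TABLE ctas_target AS TABLE tenk1;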



About Test NO 4:

In master (HEAD), this test case picks a serial seq scan. The query plan looks like:

--
 Seq Scan on public.tenk1  (cost=0.00..444828.12 rows=10000012 width=244) (actual time=0.005..1675.268 rows=10000000 loops=1)
   Output: unique1, unique2, two, four, ten, twenty, hundred, thousand, twothousand, fivethous, tenthous, odd, even, stringu1, stringu2, string4
 Planning Time: 0.053 ms
 Execution Time: 20165.023 ms



With this patch, it will choose a parallel seq scan and parallel insert. The query plan looks like:

--
 Gather  (cost=1000.00..370828.03 rows=10000012 width=244) (actual time=20428.823..20437.143 rows=0 loops=1)
   Output: unique1, unique2, two, four, ten, twenty, hundred, thousand, twothousand, fivethous, tenthous, odd, even, stringu1, stringu2, string4
   Workers Planned: 4
   Workers Launched: 4
 ->  Create test
   ->  Parallel Seq Scan on public.tenk1  (cost=0.00..369828.03 rows=2500003 width=244) (actual time=0.021..411.094 rows=2000000 loops=5)
         Output: unique1, unique2, two, four, ten, twenty, hundred, thousand, twothousand, fivethous, tenthous, odd, even, stringu1, stringu2, string4
         Worker 0:  actual time=0.023..390.856 rows=1858407 loops=1
         Worker 1:  actual time=0.024..468.587 rows=2264494 loops=1
         Worker 2:  actual time=0.023..473.170 rows=2286580 loops=1
         Worker 3:  actual time=0.027..373.727 rows=1853216 loops=1
 Planning Time: 0.053 ms
 Execution Time: 20437.643 ms



test machine spec:

CentOS 8.2, 128G RAM, 40 processors, disk SAS



Regards,

Tang




test_ctas.sql
Description: test_ctas.sql


Re: Support for NSS as a libpq TLS backend

2021-01-26 Thread Michael Paquier
On Wed, Jan 20, 2021 at 05:07:08PM +, Jacob Champion wrote:
> Lovely. I didn't expect *removing* an extension to effectively *add*
> more, but I'm glad it works now.

My apologies for chiming in.  I was looking at your patch set here,
and while reviewing the strong random and cryptohash parts I have
found a couple of mistakes in the ./configure part.  I think that the
switch from --with-openssl to --with-ssl={openssl} could just be done
independently as a building block for the rest; the first portion
based on NSS could then just add the minimum set in configure.ac.

Please note that the patch set has been using autoconf from Debian, or
something forked from upstream.  There were also missing updates in
several parts of the code base, and a lack of docs for the new
switch.  I have spent time checking the build with --with-openssl (to
make sure that the obsolete grammar remains compatible), with
--with-ssl=openssl, and also without either.
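
For reference, the two configure spellings being checked here look like
this (the new one is taken from the patch below):

    # Obsolete spelling, still accepted for compatibility:
    ./configure --with-openssl
    # New spelling introduced by the patch:
    ./configure --with-ssl=openssl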

Thoughts?
--
Michael
From b3e18564697bdb36a8eba4afe6434294676b528b Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Wed, 27 Jan 2021 16:36:45 +0900
Subject: [PATCH] Introduce --with-ssl={openssl} in configure options

---
 src/include/pg_config.h.in|   2 +-
 src/backend/libpq/Makefile|   2 +-
 src/backend/libpq/hba.c   |   2 +-
 src/common/Makefile   |   2 +-
 src/interfaces/libpq/Makefile |   2 +-
 src/test/Makefile |   2 +-
 src/test/modules/Makefile |   2 +-
 .../modules/ssl_passphrase_callback/Makefile  |   2 +-
 .../ssl_passphrase_callback/t/001_testfunc.pl |   2 +-
 src/test/ssl/Makefile |   2 +-
 src/test/ssl/t/001_ssltests.pl|   2 +-
 src/test/ssl/t/002_scram.pl   |   2 +-
 doc/src/sgml/installation.sgml|   5 +-
 doc/src/sgml/pgcrypto.sgml|   2 +-
 doc/src/sgml/sslinfo.sgml |   2 +-
 contrib/Makefile  |   2 +-
 contrib/pgcrypto/Makefile |   4 +-
 configure | 110 +++---
 configure.ac  |  31 +++--
 src/Makefile.global.in|   2 +-
 src/tools/msvc/Solution.pm|   2 +-
 src/tools/msvc/config_default.pl  |   2 +-
 22 files changed, 113 insertions(+), 73 deletions(-)

diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index f4d9f3b408..55cab4d2bf 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -899,7 +899,7 @@
 /* Define to select named POSIX semaphores. */
 #undef USE_NAMED_POSIX_SEMAPHORES
 
-/* Define to build with OpenSSL support. (--with-openssl) */
+/* Define to build with OpenSSL support. (--with-ssl=openssl) */
 #undef USE_OPENSSL
 
 /* Define to 1 to build with PAM support. (--with-pam) */
diff --git a/src/backend/libpq/Makefile b/src/backend/libpq/Makefile
index efc5ef760a..8d1d16b0fc 100644
--- a/src/backend/libpq/Makefile
+++ b/src/backend/libpq/Makefile
@@ -28,7 +28,7 @@ OBJS = \
 	pqmq.o \
 	pqsignal.o
 
-ifeq ($(with_openssl),yes)
+ifeq ($(with_ssl),openssl)
 OBJS += be-secure-openssl.o
 endif
 
diff --git a/src/backend/libpq/hba.c b/src/backend/libpq/hba.c
index 371dccb852..20bf1461ce 100644
--- a/src/backend/libpq/hba.c
+++ b/src/backend/libpq/hba.c
@@ -1041,7 +1041,7 @@ parse_hba_line(TokenizedLine *tok_line, int elevel)
 			ereport(elevel,
 	(errcode(ERRCODE_CONFIG_FILE_ERROR),
 	 errmsg("hostssl record cannot match because SSL is not supported by this build"),
-	 errhint("Compile with --with-openssl to use SSL connections."),
+	 errhint("Compile with --with-ssl=openssl to use SSL connections."),
 	 errcontext("line %d of configuration file \"%s\"",
 line_num, HbaFileName)));
 			*err_msg = "hostssl record cannot match because SSL is not supported by this build";
diff --git a/src/common/Makefile b/src/common/Makefile
index 1a1d0d3406..5422579a6a 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -80,7 +80,7 @@ OBJS_COMMON = \
 	wait_error.o \
 	wchar.o
 
-ifeq ($(with_openssl),yes)
+ifeq ($(with_ssl),openssl)
 OBJS_COMMON += \
 	protocol_openssl.o \
 	cryptohash_openssl.o
diff --git a/src/interfaces/libpq/Makefile b/src/interfaces/libpq/Makefile
index c4fde3f93d..2b7ace52e4 100644
--- a/src/interfaces/libpq/Makefile
+++ b/src/interfaces/libpq/Makefile
@@ -45,7 +45,7 @@ OBJS = \
 	pqexpbuffer.o \
 	fe-auth.o
 
-ifeq ($(with_openssl),yes)
+ifeq ($(with_ssl),openssl)
 OBJS += \
 	fe-secure-common.o \
 	fe-secure-openssl.o
diff --git a/src/test/Makefile b/src/test/Makefile
index ab1ef9a475..f7859c2fd5 100644
--- a/src/test/Makefile
+++ b/src/test/Makefile
@@ -28,7 +28,7 @@ ifneq (,$(filter ldap,$(PG_TEST_EXTRA)))
 SUBDIRS += ldap
 endif
 endif
-ifeq ($(with_openssl),yes)
+ifeq ($(with_ssl),openssl)
 ifneq (,$(filter ssl,$(PG_TEST_EXTRA)))
 

Re: Transactions involving multiple postgres foreign servers, take 2

2021-01-26 Thread Masahiko Sawada
On Sat, Jan 16, 2021 at 1:39 AM Zhihong Yu  wrote:
>
> Hi,

Thank you for reviewing the patch!

> For v32-0004-Add-PrepareForeignTransaction-API.patch :
>
> + * Whenever a foreign transaction is processed, the corresponding FdwXact
> + * entry is update. To avoid holding the lock during transaction 
> processing
> + * which may take an unpredicatable time the in-memory data of foreign
>
> entry is update -> entry is updated
>
> unpredicatable -> unpredictable

Fixed.
>
> +   int nlefts = 0;
>
> nlefts -> nremaining
>
> +   elog(DEBUG1, "left %u foreign transactions", nlefts);
>
> The message can be phrased as "%u foreign transactions remaining"

Fixed.

>
> +FdwXactResolveFdwXacts(int *fdwxact_idxs, int nfdwxacts)
>
> Fdw and Xact are repeated. Seems one should suffice. How about naming the 
> method FdwXactResolveTransactions() ?
> Similar comment for FdwXactResolveOneFdwXact(FdwXact fdwxact)

Agreed. I changed to ResolveFdwXacts() and ResolveOneFdwXact()
respectively to avoid a long function name.

>
> For get_fdwxact():
>
> +   /* This entry matches the condition */
> +   found = true;
> +   break;
>
> Instead of breaking and returning, you can return within the loop directly.

Fixed.

Those changes are incorporated into the latest version patches[1] I
submitted today.

Regards,

[1] 
https://www.postgresql.org/message-id/CAD21AoBYyA5O%2BFPN4Cs9YWiKjq319BvF5fYmKNsFTZfwTcWjQw%40mail.gmail.com

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Is Recovery actually paused?

2021-01-26 Thread Masahiko Sawada
On Tue, Jan 26, 2021 at 2:00 AM Robert Haas  wrote:
>
> On Sat, Jan 23, 2021 at 6:10 AM Bharath Rupireddy
>  wrote:
> > +1 to just show the recovery pause state in the output of
> > pg_is_wal_replay_paused. But, should the function name
> > "pg_is_wal_replay_paused" be something like
> > "pg_get_wal_replay_pause_state" or some other? To me, when "is" exists
> > in a function, I expect a boolean output. Others may have better
> > thoughts.
>
> Maybe we should leave the existing function pg_is_wal_replay_paused()
> alone and add a new one with the name you suggest that returns text.
> That would create less burden for tool authors.

+1

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




RE: fix typo in reorderbuffer.c

2021-01-26 Thread Hou, Zhijie
> >> Combocid's ==>> Combocids
> >
> > Ping again, just in case it’s missed.
> 
> The comments contain "ComboCIDs", "combocids" and "ComboCids" on top
> of the existing "Combocid's".  Your patch introduces a fourth way of writing
> that.  It may be better to just have one way to govern them all.

Hi Michael,

Thanks for the review.

Actually, "Combocid's" in the patch is the first word of the sentence, so the
first char is uppercase (so maybe it's not a new style).

I agree that it's better to have a common way, but some of the differing
styles of combocid follow variable or function names[1].
We may need to change the variable or function names too.

Personally, I prefer "combocids".
But do you think we could change that in a separate patch, apart from this
typo patch?


[1]:
void
SerializeComboCIDState(Size maxsize, char *start_address)

static int  usedComboCids = 0;  /* number of elements in comboCids */
...

Best regards,
houzj






Re: FailedAssertion in heap_index_delete_tuples at heapam.c:7220

2021-01-26 Thread Peter Geoghegan
On Tue, Jan 26, 2021 at 10:52 PM Jaime Casanova
 wrote:
> ${subject} happened while executing ${attached query} at the regression
> database, using 14dev (commit
> d5a83d79c9f9b660a6a5a77afafe146d3c8c6f46) and produced ${attached
> stack trace}.

I see the bug: gistprunepage() calls
index_compute_xid_horizon_for_tuples() (which ultimately calls the
heapam.c callback for heap_index_delete_tuples()) with an empty array,
which we don't expect. The similar code within _hash_vacuum_one_page()
already only calls index_compute_xid_horizon_for_tuples() when
ndeletable > 0.

The fix is obvious: Bring gistprunepage() in line with
_hash_vacuum_one_page(). I'll go push a fix for that now.
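
A minimal sketch of that guard, using the variable names visible in the
reported stack trace (not necessarily the exact committed diff):

    /* In gistprunepage(): compute the XID horizon only when there is
     * actually something to delete, mirroring _hash_vacuum_one_page(). */
    if (ndeletable > 0)
    {
        if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
            latestRemovedXid =
                index_compute_xid_horizon_for_tuples(rel, heapRel, buffer,
                                                     deletable, ndeletable);
        /* ... then perform the deletion and WAL-log it ... */
    }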

Thanks for the report!
-- 
Peter Geoghegan




Re: simplifying foreign key/RI checks

2021-01-26 Thread Tatsuro Yamada

Hi Amit-san,


+   case TM_Invisible:
+   elog(ERROR, "attempted to lock invisible tuple");
+   break;
+
+   case TM_SelfModified:
+   case TM_BeingModified:
+   case TM_WouldBlock:
+   elog(ERROR, "unexpected table_tuple_lock status: %u", res);
+   break;

+   default:
+   elog(ERROR, "unrecognized table_tuple_lock status: %u", res);

All of these are meant as debugging elog()s for cases that won't
normally occur.  IIUC, the discussion at the linked thread excludes
those from consideration.


Thanks for your explanation.
Ah, I reread the thread, and I now realize that user-visible log messages
are the target to replace. I understand that those elog()s are for cases that
won't normally occur. Sorry for the noise.

Regards,
Tatsuro Yamada





Re: New IndexAM API controlling index vacuum strategies

2021-01-26 Thread Masahiko Sawada
On Fri, Jan 22, 2021 at 2:00 PM Peter Geoghegan  wrote:
>
> On Tue, Jan 19, 2021 at 2:57 PM Peter Geoghegan  wrote:
> > Looks good. I'll give this version a review now. I will do a lot more
> > soon. I need to come up with a good benchmark for this, that I can
> > return to again and again during review as needed.
>
> I performed another benchmark, similar to the last one but with the
> latest version (v2), and over a much longer period. Attached is a
> summary of the whole benchmark, and log_autovacuum output from the
> logs of both the master branch and the patch.

Thank you for performing the benchmark!

>
> This was pgbench scale 2000, 4 indexes on pgbench_accounts, and a
> transaction with one update and two selects. Each run was 4 hours, and
> we alternate between patch and master for each run, and alternate
> between 16 and 32 clients. There were 8 4 hour runs in total, meaning
> the entire set of runs took 8 * 4 hours = 32 hours (not including
> initial load time and a few other small things like that). I used a
> 10k TPS rate limit, so TPS isn't interesting here. Latency is
> interesting -- we see a nice improvement in latency (i.e. a reduction)
> for the patch (see all.summary.out).

What value is fillfactor set to?

>
> The benefits of the patch are clearly visible when I drill down and
> look at the details. Each pgbench_accounts autovacuum VACUUM operation
> can finish faster with the patch because they can often skip at least
> some indexes (usually the PK, sometimes 3 out of 4 indexes total). But
> it's more subtle than some might assume. We're skipping indexes that
> VACUUM actually would have deleted *some* index tuples from, which is
> very good. Bottom-up index deletion is usually lazy, and only
> occasionally very eager, so you still have plenty of "floating
> garbage" index tuples in most pages. And now we see VACUUM behave a
> little more like bottom-up index deletion -- it is lazy when that is
> appropriate (with indexes that really only have floating garbage that
> is spread diffusely throughout the index structure), and eager when
> that is appropriate (with indexes that get much more garbage).

That's very good. I'm happy that this patch efficiently utilizes the
bottom-up index deletion feature.

Looking at the relation size growth, there is almost no difference
between master and patched in spite of skipping some vacuums in the
patched test, which is also good.

>
> The benefit is not really that we're avoiding doing I/O for index
> vacuuming (though that is one of the smaller benefits here). The real
> benefit is that VACUUM is not dirtying pages, since it skips indexes
> when it would be "premature" to vacuum them from an efficiency point
> of view. This is important because we know that Postgres throughput is
> very often limited by page cleaning. Also, the "economics" of this new
> behavior make perfect sense -- obviously it's more efficient to delay
> garbage cleanup until the point when the same page will be modified by
> a backend anyway -- in the case of this benchmark via bottom-up index
> deletion (which deletes all garbage tuples in the leaf page at the
> point that it runs for a subset of pointed-to heap pages -- it's not
> using an oldestXmin cutoff from 30 minutes ago). So whenever we dirty
> a page, we now get more value per additional-page-dirtied.
>
> I believe that controlling the number of pages dirtied by VACUUM is
> usually much more important than reducing the amount of read I/O from
> VACUUM, for reasons I go into on the recent "vacuum_cost_page_miss
> default value and modern hardware" thread. As a further consequence of
> all this, VACUUM can go faster safely and sustainably (since the cost
> limit is not affected so much by vacuum_cost_page_miss), which has its
> own benefits (e.g. oldestXmin cutoff doesn't get so old towards the
> end).
>
> Another closely related huge improvement that we see here is that the
> number of FPIs generated by VACUUM can be significantly reduced. This
> cost is closely related to the cost of dirtying pages, but it's worth
> mentioning separately. You'll see some of that in the log_autovacuum
> log output I attached.

Makes sense.

>
> There is an archive with much more detailed information, including
> dumps from most pg_stat_* views at key intervals. This has way more
> information than anybody is likely to want:
>
> https://drive.google.com/file/d/1OTiErELKRZmYnuJuczO2Tfcm1-cBYITd/view?usp=sharing
>
> I did notice a problem, though. I now think that the criteria for
> skipping an index vacuum in the third patch from the series is too
> conservative, and that this led to an excessive number of index
> vacuums with the patch.

Maybe that's why there are 5 autovacuum runs on pgbench_accounts in
the master branch whereas there are 7 runs in the patched version?

>  This is probably because there was a tiny
> number of page splits in some of the indexes that were not really
> supposed to grow. I believe that this is caused by ANALYZE 

FailedAssertion in heap_index_delete_tuples at heapam.c:7220

2021-01-26 Thread Jaime Casanova
Hi,

${subject} happened while executing ${attached query} at the regression
database, using 14dev (commit
d5a83d79c9f9b660a6a5a77afafe146d3c8c6f46) and produced ${attached
stack trace}.

Sadly just loading the regression database and executing this query is
not enough to reproduce. Not sure what else I can do to help with this
one.

-- 
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
set = {__val = {0, 140734235450160, 2, 6, 6456150, 94891112648656, 
4611686018427388799, 140604537363110, 0, 281470681751457, 0, 0, 0, 0, 0, 0}}
pid = 
tid = 
ret = 
#1  0x7fe10b488535 in __GI_abort () at abort.c:79
save_stage = 1
act = {__sigaction_handler = {sa_handler = 0x3feb0800, 
sa_sigaction = 0x3feb0800}, sa_mask = {__val = {0, 4605253854145256240, 
0, 13781012405103410633, 0, 140604535119861, 2, 
  7220455706289059512, 7018409646405740088, 0, 7003723482445100592, 
0, 3166617362634023936, 140734235450400, 684527088, 140734235451264}}, sa_flags 
= 0, sa_restorer = 0x28cd0df0}
sigs = {__val = {32, 0 }}
#2  0x564d8f7748be in ExceptionalCondition (conditionName=0x564d8f800559 
"delstate->ndeltids > 0", errorType=0x564d8f7fe869 "FailedAssertion", 
fileName=0x564d8f7fe948 "heapam.c", lineNumber=7220)
at assert.c:69
No locals.
#3  0x564d8f176825 in heap_index_delete_tuples (rel=0x7fe1022847c8, 
delstate=0x7fff3e1ca120) at heapam.c:7220
latestRemovedXid = 0
blkno = 4294967295
buf = 0
page = 0x0
maxoff = 0
priorXmax = 2446144864
prefetch_state = {cur_hblkno = 4294967295, next_item = 0, ndeltids = 0, 
deltids = 0x564d91cd3748}
prefetch_distance = 10
SnapshotNonVacuumable = {snapshot_type = SNAPSHOT_NON_VACUUMABLE, xmin 
= 0, xmax = 2446145024, xip = 0x216, xcnt = 0, subxip = 0x, 
subxcnt = 1042063264, suboverflowed = 255, 
  takenDuringRecovery = 127, copied = false, curcid = 1042063408, 
speculativeToken = 32767, vistest = 0x564d8fab1c10 , 
active_count = 2446145040, regd_count = 22093, ph_node = {
first_child = 0x7fff3e1ca070, next_sibling = 0x7fff3e1ca060, 
prev_or_parent = 0x564d91ddda08}, whenTaken = 462, lsn = 18446744071860729372, 
snapXactCompletionCount = 0}
finalndeltids = 0
nblocksaccessed = 0
nblocksfavorable = 0
curtargetfreespace = 0
lastfreespace = 0
actualfreespace = 0
bottomup_final_block = false
#4  0x564d8f195780 in table_index_delete_tuples (rel=0x7fe1022847c8, 
delstate=0x7fff3e1ca120) at ../../../../src/include/access/tableam.h:1240
No locals.
#5  0x564d8f195e6a in index_compute_xid_horizon_for_tuples 
(irel=0x564d924f6428, hrel=0x7fe1022847c8, ibuf=807, itemnos=0x7fff3e1ca190, 
nitems=0) at genam.c:331
delstate = {bottomup = false, bottomupfreespace = 0, ndeltids = 0, 
deltids = 0x564d91cd3748, status = 0x564d91cd3768}
latestRemovedXid = 0
ipage = 0x7fe102fc5f00 ""
itup = 0x7fff3e1ca630
#6  0x564d8f13e568 in gistprunepage (rel=0x564d924f6428, 
page=0x7fe102fc5f00 "", buffer=807, heapRel=0x7fe1022847c8) at gist.c:1669
deletable = {1, 0, 0, 0, 46976, 37580, 22016, 0, 41520, 15900, 32767, 
0, 42544, 15900, 32767, 0, 42752, 15900, 0, 0, 2674, 2890, 32737, 0, 1, 0, 
25580, 15290, 41984, 48994, 5558, 11250, 42752, 15900, 
  32767, 0, 52880, 36628, 22093, 0, 46976, 37580, 22093, 0, 13728, 
37325, 22093, 0, 32512, 765, 32737, 0, 25640, 37455, 22093, 0, 0, 0, 2, 0, 
65535, 255, 0, 0, 41472, 15900, 32767, 0, 14468, 566, 
  32737, 0, 41536, 15900, 0, 256, 14468, 566, 32737, 0, 13736, 37325, 
22093, 0, 25640, 37455, 22093, 0, 0, 0, 0, 0, 0, 512, 32737, 0, 41584, 15900, 
32767, 0, 47989, 36700, 22093, 0, 41712, 15900, 0, 
  256, 14468, 566, 32737, 0, 41632, 15900, 32767, 0, 56219, 36700, 
22093, 0, 0, 0, 0, 0, 14464, 566, 32737, 0, 0, 0, 3, 0, 21856, 174, 0, 8192, 
41680, 15900, 32767, 0, 56472, 36700, 22093, 0, 11968, 
  37108, 22093, 0, 0, 0, 0, 0, 14480, 566, 32737, 0, 14464, 566, 32737, 
0, 41712, 15900, 32767, 0, 54083, 36639, 22093, 0, 40672, 765, 32737, 0, 25640, 
37455, 22093, 0, 0, 0, 0, 0, 0, 512, 32737, 0, 
  41776, 15900, 0, 256, 14468, 566, 32737, 0, 41776, 15900, 32767, 0, 
47349, 36700, 22093, 0...}
ndeletable = 0
offnum = 292
maxoff = 291
latestRemovedXid = 0
#7  0x564d8f139e3e in gistplacetopage (rel=0x564d924f6428, freespace=0, 
giststate=0x564d92ccb780, buffer=807, itup=0x7fff3e1ca6e0, ntup=1, oldoffnum=0, 
newblkno=0x0, leftchildbuf=0, 
splitinfo=0x7fff3e1ca690, markfollowright=true, heapRel=0x7fe1022847c8, 
is_build=false) at gist.c:274
blkno = 4
page = 0x7fe102fc5f00 ""
is_leaf = true
recptr = 94891117474655
i = 32767
is_split = true

Re: simplifying foreign key/RI checks

2021-01-26 Thread Amit Langote
Yamada-san,

On Wed, Jan 27, 2021 at 8:51 AM Tatsuro Yamada
 wrote:
> On 2021/01/25 18:19, Amit Langote wrote:
> > On Mon, Jan 25, 2021 at 9:24 AM Corey Huinker  
> > wrote:
> >> Anybody else want to look this patch over before I mark it Ready For 
> >> Committer?
> >
> > Would be nice to have others look it over.  Thanks.
>
> Thanks for creating the patch!
>
> I tried to review the patch. Here is my comment.

Thanks for the comment.

> * According to this thread [1], it might be better to replace elog() with
>ereport() in the patch.
>
> [1]: 
> https://www.postgresql.org/message-id/flat/92d6f545-5102-65d8-3c87-489f71ea0a37%40enterprisedb.com

Could you please tell which elog() of the following added by the patch
you are concerned about?

+   case TM_Invisible:
+   elog(ERROR, "attempted to lock invisible tuple");
+   break;
+
+   case TM_SelfModified:
+   case TM_BeingModified:
+   case TM_WouldBlock:
+   elog(ERROR, "unexpected table_tuple_lock status: %u", res);
+   break;

+   default:
+   elog(ERROR, "unrecognized table_tuple_lock status: %u", res);

All of these are meant as debugging elog()s for cases that won't
normally occur.  IIUC, the discussion at the linked thread excludes
those from consideration.


--
Amit Langote
EDB: http://www.enterprisedb.com




Re: fix typo in reorderbuffer.c

2021-01-26 Thread Michael Paquier
On Wed, Jan 27, 2021 at 05:57:06AM +, Hou, Zhijie wrote:
>> Combocid's ==>> Combocids
> 
> Ping again, just in case it’s missed.

The comments contain "ComboCIDs", "combocids" and "ComboCids" on
top of the existing "Combocid's".  Your patch introduces a fourth way of
writing that.  It may be better to just have one way to govern them
all.
--
Michael




Re: [bug fix] ALTER TABLE SET LOGGED/UNLOGGED on a partitioned table does nothing silently

2021-01-26 Thread Kyotaro Horiguchi
At Wed, 27 Jan 2021 05:30:29 +, "tsunakawa.ta...@fujitsu.com" 
 wrote in 
> From: Kyotaro Horiguchi 
> > "CREATE TABLE" is not "CREATE LOGGED TABLE". We can assume that as
> > "CREATE  TABLE", where the default logged-ness
> > varies according to context. Or it might have been so since the beginning.
> > Currently we don't have the syntax "CREATE LOGGED TABLE", but we can add
> > that syntax.
> 
> Yes, I thought of that possibility after a while too.  But I felt a bit(?) 
> hesitant to do it considering back-patching.  Also, the idea requires that ALTER 
> TABLE ATTACH PARTITION modify the logged-ness property of the 
> target partition and its subpartitions to match that of the parent partitioned 
> table.  However, your idea is one candidate worth pursuing, including whether 
> and what to back-patch.

I think this and all possible similar changes won't be back-patched at
all.  It is a disaster for any established version to change this kind
of behavior.

> 
> >  We pursue releasing all fixes at once but we might release all fixes other
> > than some items that cannot be fixed for some technical reasons at the time,
> > like REPLICA IDENTITY.
> > 
> > I'm not sure how long we will wait for the time of release, though.
> 
> Anyway, we have to define the ideal behavior for each ALTER action based on 
> some comprehensible principle.  Yeah... this may become a long, tiring 
> journey.  (I anticipate some difficulty and compromise in reaching agreement, 
> as was seen in the past discussion for the fix for ALTER TABLE REPLICA 
> IDENTITY.  Scary)

It seems to me that we have agreed that the ideal behavior is for ALTER
TABLE on a parent to propagate to descendants. An observed objection
is that such behavior is dangerous for operations that require large
amounts of resources, since they could fail after a long time has
elapsed. The revealed difficulty of REPLICA IDENTITY is regarded as a
technical issue that should be overcome.

> FWIW, I wonder why Postgres decided to allow different logical structures of 
> tables, such as DEFAULT values and constraints, between the parent partitioned 
> table and a child partition.  That extra flexibility might stand in the way 
> of consensus.  I think it'd be easy to understand that partitions are simply 
> physically independent containers that have identical logical structure.

I understand that -- once upon a time the "traditional" partitioning
was merely a use case of inheritance tables. Parents and children were
basically independent from each other, apart from inherited attributes.
The nature of table inheritance compels column additions to a parent to
be inherited by children, but other changes need not be propagated. As
time went by, partitioning became the major use case, and new features
added since then were shaped by that recognition. After the appearance
of native partitioning, table inheritance seems to be regarded as a
legacy that we need to maintain.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




RE: fix typo in reorderbuffer.c

2021-01-26 Thread Hou, Zhijie
> I found a possible typo in reorderbuffer.c
> 
>  *   has got a combocid. Combocid's are only valid for the duration
> of a
> 
> Combocid's ==>> Combocids
> 
> Attaching a small patch to correct it.
> 

Ping again, just in case it’s missed.




Re: doc review for v14

2021-01-26 Thread Michael Paquier
On Sat, Jan 23, 2021 at 07:15:40PM +0900, Michael Paquier wrote:
> I have been looking at 0005, the patch dealing with the docs of the
> replication stats, and have some comments.

And attached is a patch to clarify all that.  I am letting that sleep
for a couple of days for now, so please let me know if you have any
comments.
--
Michael
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 9496f76b1f..c602ee4427 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -317,7 +317,7 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   11:34   0:00 postgres: ser
  
   pg_stat_replication_slotspg_stat_replication_slots
   One row per replication slot, showing statistics about
-   replication slot usage.
+   the replication slot's usage.
See 
pg_stat_replication_slots for details.
   
@@ -2604,10 +2604,10 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
 spill_txns bigint


-Number of transactions spilled to disk after the memory used by
-logical decoding of changes from WAL for this slot exceeds
+Number of transactions spilled to disk once the memory used by
+logical decoding to decode changes from WAL has exceeded
 logical_decoding_work_mem. The counter gets
-incremented both for toplevel transactions and subtransactions.
+incremented for both toplevel transactions and subtransactions.
   
  
 
@@ -2616,9 +2616,10 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
 spill_count bigint


-Number of times transactions were spilled to disk while decoding changes
-from WAL for this slot. Transactions may get spilled repeatedly, and
-this counter gets incremented on every such invocation.
+Number of times transactions were spilled to disk while decoding
+changes from WAL for this slot. This counter is incremented each time
+a transaction is spilled, and the same transaction may be spilled
+multiple times.
   
  
 
@@ -2639,11 +2640,12 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
 stream_txns bigint


-Number of in-progress transactions streamed to the decoding output plugin
-after the memory used by logical decoding of changes from WAL for this
-slot exceeds logical_decoding_work_mem. Streaming only
+Number of in-progress transactions streamed to the decoding output
+plugin after the memory used by logical decoding to decode changes
+from WAL for this slot has exceeded
+logical_decoding_work_mem. Streaming only
 works with toplevel transactions (subtransactions can't be streamed
-independently), so the counter does not get incremented for subtransactions.
+independently), so the counter is not incremented for subtransactions.

  
 
@@ -2653,9 +2655,9 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i


 Number of times in-progress transactions were streamed to the decoding
-output plugin while decoding changes from WAL for this slot. Transactions
-may get streamed repeatedly, and this counter gets incremented on every
-such invocation.
+output plugin while decoding changes from WAL for this slot. This
+counter is incremented each time a transaction is streamed, and the
+same transaction may be streamed multiple times.
   
  
 
@@ -5042,7 +5044,7 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
   
 
   
-
+   
 
   pg_stat_reset_replication_slot
 
@@ -5050,11 +5052,9 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
 void


- Resets statistics to zero for a single replication slot, or for all
- replication slots in the cluster.  The argument can be either the name
- of the slot to reset the stats or NULL.  If the argument is NULL, all
- counters shown in the pg_stat_replication_slots
- view for all replication slots are reset.
+Resets statistics of the replication slot defined by the argument. If
+the argument is NULL, resets statistics for all
+the replication slots.


  This function is restricted to superusers by default, but other users




Re: Columns correlation and adaptive query optimization

2021-01-26 Thread Yugo NAGATA
On Mon, 25 Jan 2021 16:27:25 +0300
Konstantin Knizhnik  wrote:

> Hello,
> 
> Thank you for review.
> My answers are inside.

Thank you for updating the patch and answering my questions.

> > (2)
> > If I understand correctly, your proposal consists of the following two 
> > features.
> >
> > 1. Add a feature to auto_explain that creates an extended statistic 
> > automatically
> > if an error on estimated rows number is large.
> >
> > 2. Improve rows number estimation of join results by considering functional
> > dependencies between vars in the join qual if the qual has more than one 
> > clauses,
> > and also functional dependencies between a var in the join qual and vars in 
> > quals
> > of the inner/outer relation.
> >
> > As you said, these two parts are independent of each other, so one feature
> > will work even if we don't assume the other.  I wonder if it would be better
> > to split the patch again, and register them to the commitfest separately.
> 
> I agree with you that these are two almost unrelated changes, although 
> without the clausesel patch the additional statistics cannot improve query planning.

I think extended statistics created by the auto_explain patch can improve query 
planning even without the clausesel patch. For example, suppose the following 
case:

postgres=# create table t ( i int, j int);
CREATE TABLE
postgres=# insert into t select i/10, i/100 from generate_series(1,1000000) i;
INSERT 0 1000000
postgres=# analyze t;
ANALYZE
postgres=# explain analyze select * from t where i = 100 and j = 10;
                                            QUERY PLAN
--------------------------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..19425.00 rows=1 width=8) (actual time=0.254..97.293 rows=10 loops=1)
   Filter: ((i = 100) AND (j = 10))
   Rows Removed by Filter: 999990
 Planning Time: 0.199 ms
 Execution Time: 97.327 ms
(5 rows)

After applying the auto_explain patch (without clausesel patch) and issuing the 
query,
additional statistics were created.

postgres=# select * from t where i = 100 and j = 10;
LOG:  Add statistics t_i_j

Then, after analyze, the row estimation was improved.

postgres=# analyze t;
ANALYZE
postgres=# explain analyze select * from t where i = 100 and j = 10;
                                             QUERY PLAN
---------------------------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..19425.00 rows=10 width=8) (actual time=0.255..95.347 rows=10 loops=1)
   Filter: ((i = 100) AND (j = 10))
   Rows Removed by Filter: 999990
 Planning Time: 0.124 ms
 Execution Time: 95.383 ms
(5 rows)

So, I think the auto_explain patch is useful on its own as a tool
to detect a gap between estimated and real row counts and adjust the plan.
Also, the clausesel patch would be useful without the auto_explain patch
if appropriate statistics are created.
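
For comparison, the statistics object that the patch created
automatically (t_i_j in the log above) corresponds to what one would
create by hand with:

    -- Manual equivalent of the auto-created extended statistics object:
    CREATE STATISTICS t_i_j ON i, j FROM t;
    ANALYZE t;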

> But I already have too many patches at commitfest.
> Maybe it will be enough to split this patch into two?

Although we can discuss both of these patches in this thread, 
I suppose we don't have to commit them together.

> >
> > (3)
> > +   DefineCustomBoolVariable("auto_explain.suggest_only",
> > +"Do not create statistic but just record in WAL suggested create statistics statement.",
> > +NULL,
> > +&auto_explain_suggest_on
> >
> > To imply that this parameter is related to add_statistics_threshold, it seems
> > better to me to use a more related name like add_statistics_suggest_only.
> >
> > Also, additional documentations for new parameters are required.
> 
> Done.

+
+   
+
+ auto_explain.auto_explain.add_statistics_threshold (real)
+ 
+  auto_explain.add_statistics_threshold configuration parameter
+ 
+
+
+ 
+   auto_explain.add_statistics_threshold sets the threshold for
+   actual/estimated #rows ratio triggering creation of multicolumn statistic
+   for the related columns. It can be used for adpative query optimization.
+   If there is large gap between real and estimated number of tuples for the
+   concrete plan node, then multicolumn statistic is created for involved
+   attributes. Zero value (default) disables implicit creation of multicolumn statistic.
+ 
+

I wonder if we need to say that this parameter has no effect unless log_analyze
is enabled, and that statistics are created only when the execution time exceeds 
log_min_duration, if these behaviors are intentional.

In addition, additional statistics are created only if #rows is over-estimated
and not if it is under-estimated. Although it seems good as a criterion for 
creating

Re: [PATCH] remove pg_standby

2021-01-26 Thread Thomas Munro
On Wed, Jan 27, 2021 at 6:06 PM Michael Paquier  wrote:
> On Wed, Jan 27, 2021 at 04:13:24PM +1300, Thomas Munro wrote:
> > I would like to commit this, because "waiting restore commands" have
> > confusing interactions with my proposed prefetching-during-recovery
> > patch[1].  Here's a version that fixes an error when building the docs
> > (there was a stray remaining ), and adds a
> > commit message.  Any objections?
>
> It looks like you are missing two references in your patch set:
> $ git grep pg_standby
> doc/src/sgml/high-availability.sgml: Do not use pg_standby or
> similar tools with the built-in standby mode
> src/backend/access/transam/xlog.c:   * segment. Only recycle normal
> files, pg_standby for example can create

Thanks, fixed.

> The logic assumed in RemoveXlogFile() is actually a bit scary.  I have
> not checked in detail, but could it be possible to clean up more code
> in this area?

I think the check that it's a regular file is a good idea anyway, but
I removed the offending comment.

> > Furthermore, I think we should also remove the section of the manual
> > that describes how to write your own "waiting restore command".
> > Thoughts?
>
> Agreed.  No objections to that.

Thanks!
From 77015331a71e67a4dee5352afc9f5faa05524755 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Mon, 26 Oct 2020 16:37:46 -0500
Subject: [PATCH v4 1/2] Retire pg_standby.

Streaming replication made pg_standby obsolete.  It has been proposed
that we retire it many times.  Now seems like a good time to do it,
because "waiting restore commands" are incompatible with a proposed
recovery prefetching feature.

Discussion: https://postgr.es/m/20201029024412.GP5380%40telsasoft.com
Author: Justin Pryzby 
Reviewed-by: Heikki Linnakangas 
Reviewed-by: Peter Eisentraut 
Reviewed-by: Michael Paquier 
---
 contrib/Makefile   |   1 -
 contrib/pg_standby/.gitignore  |   1 -
 contrib/pg_standby/Makefile|  20 -
 contrib/pg_standby/pg_standby.c| 907 -
 doc/src/sgml/contrib.sgml  |   7 +-
 doc/src/sgml/filelist.sgml |   1 -
 doc/src/sgml/high-availability.sgml|  17 +-
 doc/src/sgml/pgstandby.sgml| 394 ---
 doc/src/sgml/ref/pgarchivecleanup.sgml |   7 -
 src/backend/access/transam/xlog.c  |   2 +-
 src/tools/msvc/Mkvcbuild.pm|   4 +-
 11 files changed, 9 insertions(+), 1352 deletions(-)
 delete mode 100644 contrib/pg_standby/.gitignore
 delete mode 100644 contrib/pg_standby/Makefile
 delete mode 100644 contrib/pg_standby/pg_standby.c
 delete mode 100644 doc/src/sgml/pgstandby.sgml

diff --git a/contrib/Makefile b/contrib/Makefile
index 7a4866e338..cdc041c7db 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -33,7 +33,6 @@ SUBDIRS = \
 		pg_buffercache	\
 		pg_freespacemap \
 		pg_prewarm	\
-		pg_standby	\
 		pg_stat_statements \
 		pg_surgery	\
 		pg_trgm		\
diff --git a/contrib/pg_standby/.gitignore b/contrib/pg_standby/.gitignore
deleted file mode 100644
index a401b085a8..00
--- a/contrib/pg_standby/.gitignore
+++ /dev/null
@@ -1 +0,0 @@
-/pg_standby
diff --git a/contrib/pg_standby/Makefile b/contrib/pg_standby/Makefile
deleted file mode 100644
index 87732bedf1..00
--- a/contrib/pg_standby/Makefile
+++ /dev/null
@@ -1,20 +0,0 @@
-# contrib/pg_standby/Makefile
-
-PGFILEDESC = "pg_standby - supports creation of a warm standby"
-PGAPPICON = win32
-
-PROGRAM = pg_standby
-OBJS = \
-	$(WIN32RES) \
-	pg_standby.o
-
-ifdef USE_PGXS
-PG_CONFIG = pg_config
-PGXS := $(shell $(PG_CONFIG) --pgxs)
-include $(PGXS)
-else
-subdir = contrib/pg_standby
-top_builddir = ../..
-include $(top_builddir)/src/Makefile.global
-include $(top_srcdir)/contrib/contrib-global.mk
-endif
diff --git a/contrib/pg_standby/pg_standby.c b/contrib/pg_standby/pg_standby.c
deleted file mode 100644
index c9f33e4254..00
--- a/contrib/pg_standby/pg_standby.c
+++ /dev/null
@@ -1,907 +0,0 @@
-/*
- * contrib/pg_standby/pg_standby.c
- *
- *
- * pg_standby.c
- *
- * Production-ready example of how to create a Warm Standby
- * database server using continuous archiving as a
- * replication mechanism
- *
- * We separate the parameters for archive and nextWALfile
- * so that we can check the archive exists, even if the
- * WAL file doesn't (yet).
- *
- * This program will be executed once in full for each file
- * requested by the warm standby server.
- *
- * It is designed to cater to a variety of needs, as well
- * providing a customizable section.
- *
- * Original author:		Simon Riggs  si...@2ndquadrant.com
- * Current maintainer:	Simon Riggs
- */
-#include "postgres_fe.h"
-
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-
-#include "access/xlog_internal.h"
-#include "pg_getopt.h"
-
-const char *progname;
-
-int			WalSegSz = -1;
-
-/* Options and defaults */
-int			sleeptime = 5;		/* amount of time to sleep between file checks */
-int			waittime = -1;		/* how long we have been 

RE: [bug fix] ALTER TABLE SET LOGGED/UNLOGGED on a partitioned table does nothing silently

2021-01-26 Thread tsunakawa.ta...@fujitsu.com
From: Kyotaro Horiguchi 
> "CREATE TABLE" is not "CREATE LOGGED TABLE". We can assume that as
> "CREATE  TABLE", where the default logged-ness
> varies according to context. Or it might have been so since the beginning.
> Currently we don't have the syntax "CREATE LOGGED TABLE", but we can add
> that syntax.

Yes, I thought of that possibility after a while too.  But I felt a bit(?) 
hesitant to do it considering back-patching.  Also, the idea requires that ALTER 
TABLE ATTACH PARTITION modify the logged-ness property of the target partition 
and its subpartitions to match that of the parent partitioned table.  However, 
your idea is one candidate worth pursuing, including whether and what to 
back-patch.


>  We pursue releasing all fixes at once but we might release all fixes other
> than some items that cannot be fixed for some technical reasons at the time,
> like REPLICA IDENTITY.
> 
> I'm not sure how long we will wait for the time of release, though.

Anyway, we have to define the ideal behavior for each ALTER action based on 
some comprehensible principle.  Yeah... this may become a long, tiring journey. 
 (I anticipate some difficulty and compromise in reaching agreement, as was 
seen in the past discussion for the fix for ALTER TABLE REPLICA IDENTITY.  
Scary)

FWIW, I wonder why Postgres decided to allow different logical structures of 
tables, such as DEFAULT values and constraints, between the parent partitioned 
table and a child partition.  That extra flexibility might stand in the way of 
consensus.  I think it'd be easy to understand that partitions are simply 
physically independent containers that have identical logical structure.


Regards
Takayuki Tsunakawa





Re: [PATCH] remove pg_standby

2021-01-26 Thread Michael Paquier
On Wed, Jan 27, 2021 at 04:13:24PM +1300, Thomas Munro wrote:
> I would like to commit this, because "waiting restore commands" have
> confusing interactions with my proposed prefetching-during-recovery
> patch[1].  Here's a version that fixes an error when building the docs
> (there was a stray remaining ), and adds a
> commit message.  Any objections?

It looks like you are missing two references in your patch set:
$ git grep pg_standby
doc/src/sgml/high-availability.sgml: Do not use pg_standby or
similar tools with the built-in standby mode
src/backend/access/transam/xlog.c:   * segment. Only recycle normal
files, pg_standby for example can create

The logic assumed in RemoveXlogFile() is actually a bit scary.  I have
not checked in detail, but could it be possible to clean up more code
in this area?

> Furthermore, I think we should also remove the section of the manual
> that describes how to write your own "waiting restore command".
> Thoughts?

Agreed.  No objections to that.
--
Michael




Re: WIP: WAL prefetch (another approach)

2021-01-26 Thread Thomas Munro
On Sat, Dec 5, 2020 at 7:27 AM Stephen Frost  wrote:
> * Thomas Munro (thomas.mu...@gmail.com) wrote:
> > I just noticed this thread proposing to retire pg_standby on that
> > basis:
> >
> > https://www.postgresql.org/message-id/flat/20201029024412.GP5380%40telsasoft.com
> >
> > I'd be happy to see that land, to fix this problem with my plan.  But
> > are there other people writing restore scripts that block that would
> > expect them to work on PG14?
>
> Ok, I think I finally get the concern that you're raising here-
> basically that if a restore command was written to sit around and wait
> for WAL segments to arrive, instead of just returning to PG and saying
> "WAL segment not found", that this would be a problem if we are running
> out ahead of the applying process and asking for WAL.
>
> The thing is- that's an outright broken restore command script in the
> first place.  If PG is in standby mode, we'll ask again if we get an
> error result indicating that the WAL file wasn't found.  The restore
> command documentation is quite clear on this point:
>
> The command will be asked for file names that are not present in the
> archive; it must return nonzero when so asked.
>
> There's no "it can wait around for the next file to show up if it wants
> to" in there- it *must* return nonzero when asked for files that don't
> exist.

Well the manual does actually describe how to write your own version
of pg_standby, referred to as a "waiting restore script":

https://www.postgresql.org/docs/13/log-shipping-alternative.html

I've now poked that other thread threatening to commit the removal of
pg_standby, and while I was there, also to remove the section on how
to write your own (it's possible that I missed some other reference to
the concept elsewhere, I'll need to take another look).

> So, I don't think that we really need to stress over this.  The fact
> that pg_standby offers options to have it wait instead of just returning
> a non-zero error-code and letting the loop that we already do in the
> core code seems like it's really just a legacy thing from before we were
> doing that and probably should have been ripped out long ago...  Even
> more reason to get rid of pg_standby tho, imv, we haven't been properly
> adjusting it when we've been making changes to the core code, it seems.

So far I haven't heard from anyone who thinks we should keep this old
facility (as useful as it was back then when it was the only way), so
I hope we can now quietly drop it.  It's not strictly an obstacle to
this recovery prefetching work, but it'd interact confusingly in hard
to describe ways, and it seems strange to perpetuate something that
many were already proposing to drop due to obsolescence.  Thanks for
the comments/sanity check.
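
For reference, the restore_command contract quoted above is trivially
met by a non-waiting command; a minimal sketch with a placeholder path:

    # In postgresql.conf; cp exits nonzero when the requested segment
    # does not exist, which is exactly what the contract requires:
    restore_command = 'cp /mnt/server/archivedir/%f "%p"'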




Re: logical replication worker accesses catalogs in error context callback

2021-01-26 Thread Bharath Rupireddy
On Wed, Jan 27, 2021 at 7:48 AM Zhihong Yu  wrote:
> For 0002, a few comments on the description:
>
> bq. Avoid accessing system catalogues inside conversion_error_callback
>
> catalogues -> catalog
>
> bq. error context callback, because the the entire transaction might
>
> There is redundant 'the'
>
> bq. Since get_attname() and get_rel_name() could search the sys cache by
> store the required information
>
> store -> storing

Thanks for pointing out the issues in the commit message. I corrected
them. Attaching the v4 patch set; please consider it for further review.


With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
From 88767e3176cb6758ac3dd0a0da91f3747f84a4ed Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Tue, 26 Jan 2021 13:49:18 +0530
Subject: [PATCH v4] Avoid Catalogue Accesses In slot_store_error_callback

Avoid accessing system catalogues inside the slot_store_error_callback
error context callback, because the entire transaction might
have been broken at this point.

Since logicalrep_typmap_gettypname() could search the sys cache by
calling format_type_be(), store both local and remote type names
in SlotErrCallbackArg so that we can just set the names in the error
callback without a sys cache lookup.

Author: Masahiko Sawada
---
 src/backend/replication/logical/worker.c | 48 +---
 1 file changed, 27 insertions(+), 21 deletions(-)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index eb7db89cef..f7c4fe9c4d 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -132,8 +132,9 @@ static dlist_head lsn_mapping = DLIST_STATIC_INIT(lsn_mapping);
 typedef struct SlotErrCallbackArg
 {
 	LogicalRepRelMapEntry *rel;
-	int			local_attnum;
 	int			remote_attnum;
+	char		*local_typename;
+	char		*remote_typename;
 } SlotErrCallbackArg;
 
 /*
@@ -430,30 +431,19 @@ static void
 slot_store_error_callback(void *arg)
 {
 	SlotErrCallbackArg *errarg = (SlotErrCallbackArg *) arg;
-	LogicalRepRelMapEntry *rel;
-	char	   *remotetypname;
-	Oid			remotetypoid,
-localtypoid;
+	LogicalRepRelMapEntry *rel = errarg->rel;
 
 	/* Nothing to do if remote attribute number is not set */
 	if (errarg->remote_attnum < 0)
 		return;
 
-	rel = errarg->rel;
-	remotetypoid = rel->remoterel.atttyps[errarg->remote_attnum];
-
-	/* Fetch remote type name from the LogicalRepTypMap cache */
-	remotetypname = logicalrep_typmap_gettypname(remotetypoid);
-
-	/* Fetch local type OID from the local sys cache */
-	localtypoid = get_atttype(rel->localreloid, errarg->local_attnum + 1);
+	Assert(errarg->local_typename && errarg->remote_typename);
 
 	errcontext("processing remote data for replication target relation \"%s.%s\" column \"%s\", "
 			   "remote type %s, local type %s",
 			   rel->remoterel.nspname, rel->remoterel.relname,
 			   rel->remoterel.attnames[errarg->remote_attnum],
-			   remotetypname,
-			   format_type_be(localtypoid));
+			   errarg->remote_typename, errarg->local_typename);
 }
 
 /*
@@ -474,8 +464,9 @@ slot_store_data(TupleTableSlot *slot, LogicalRepRelMapEntry *rel,
 
 	/* Push callback + info on the error context stack */
 	errarg.rel = rel;
-	errarg.local_attnum = -1;
 	errarg.remote_attnum = -1;
+	errarg.local_typename = NULL;
+	errarg.remote_typename = NULL;
 	errcallback.callback = slot_store_error_callback;
	errcallback.arg = (void *) &errarg;
 	errcallback.previous = error_context_stack;
@@ -491,11 +482,17 @@ slot_store_data(TupleTableSlot *slot, LogicalRepRelMapEntry *rel,
 		if (!att->attisdropped && remoteattnum >= 0)
 		{
			StringInfo	colvalue = &tupleData->colvalues[remoteattnum];
+			Oid remotetypid = rel->remoterel.atttyps[remoteattnum];
 
 			Assert(remoteattnum < tupleData->ncols);
 
-			errarg.local_attnum = i;
+			/*
+			 * Set the attribute number and both local and remote type names
+			 * from the local sys cache and LogicalRepTypMap respectively.
+			 */
 			errarg.remote_attnum = remoteattnum;
+			errarg.local_typename = format_type_be(att->atttypid);
+			errarg.remote_typename = logicalrep_typmap_gettypname(remotetypid);
 
 			if (tupleData->colstatus[remoteattnum] == LOGICALREP_COLUMN_TEXT)
 			{
@@ -543,8 +540,9 @@ slot_store_data(TupleTableSlot *slot, LogicalRepRelMapEntry *rel,
 slot->tts_isnull[i] = true;
 			}
 
-			errarg.local_attnum = -1;
 			errarg.remote_attnum = -1;
+			errarg.local_typename = NULL;
+			errarg.remote_typename = NULL;
 		}
 		else
 		{
@@ -600,8 +598,9 @@ slot_modify_data(TupleTableSlot *slot, TupleTableSlot *srcslot,
 
 	/* For error reporting, push callback + info on the error context stack */
 	errarg.rel = rel;
-	errarg.local_attnum = -1;
 	errarg.remote_attnum = -1;
+	errarg.local_typename = NULL;
+	errarg.remote_typename = NULL;
 	errcallback.callback = slot_store_error_callback;
	errcallback.arg = (void *) &errarg;
 	errcallback.previous = error_context_stack;
@@ -622,9 +621,15 @@ slot_modify_data(TupleTableSlot *slot, TupleTableSlot 

Re: Single transaction in the tablesync worker?

2021-01-26 Thread Amit Kapila
On Sat, Jan 23, 2021 at 5:56 PM Amit Kapila  wrote:
>
> On Sat, Jan 23, 2021 at 4:55 AM Peter Smith  wrote:
> >
> > PSA the v18 patch for the Tablesync Solution1.
>
> 7. Have you tested with the new patch the scenario where we crash
> after FINISHEDCOPY and before SYNCDONE, is it able to pick up the
> replication using the new temporary slot? Here, we need to test the
> case where during the catchup phase we have received few commits and
> then the tablesync worker is crashed/errored out? Basically, check if
> the replication is continued from the same point?
>

I have tested this and it didn't work, see the below example.

Publisher-side

CREATE TABLE mytbl1(id SERIAL PRIMARY KEY, somedata int, text varchar(120));

BEGIN;
INSERT INTO mytbl1(somedata, text) VALUES (1, 1);
INSERT INTO mytbl1(somedata, text) VALUES (1, 2);
COMMIT;

CREATE PUBLICATION mypublication FOR TABLE mytbl1;

Subscriber-side

- Have a while(1) loop in LogicalRepSyncTableStart so that tablesync
worker stops.

CREATE TABLE mytbl1(id SERIAL PRIMARY KEY, somedata int, text varchar(120));


CREATE SUBSCRIPTION mysub
 CONNECTION 'host=localhost port=5432 dbname=postgres'
PUBLICATION mypublication;

During debug, stop after we mark FINISHEDCOPY state.

Publisher-side

INSERT INTO mytbl1(somedata, text) VALUES (1, 3);
INSERT INTO mytbl1(somedata, text) VALUES (1, 4);


Subscriber-side

- Have a breakpoint in apply_dispatch
- continue in debugger;
- After we replay the first commit (which will be for values (1,3)), note
down the origin position in apply_handle_commit_internal and somehow
error out. I have forced the debugger to jump to the last line in
apply_dispatch where the error is raised.
- After the error, again the tablesync worker is restarted and it
starts from the position noted in the previous step
- It exits without replaying the WAL for (1,4)

So, on the subscriber side, you will see 3 records; the fourth is missing.
Now, if you insert more records on the publisher, it will still replay
those, but the fourth one remains missing.

The temporary slots didn't seem to work because we create the new
temporary slot again after the crash and ask it to start decoding from
the point we noted in origin_lsn. The publisher didn't hold the
required WAL, as our slot was temporary, so it started sending from some
later point. We retain WAL based on the slots' restart_lsn positions and
wal_keep_size. For our case, the positions of the slots matter,
and as we have created temporary slots, there is no way for the
publisher to save that WAL.

In this particular case, even if the WAL had been there, we only
pass the start_decoding_at position and don't pass restart_lsn, so it
picked an arbitrary location (the current insert position in the WAL)
which is ahead of the start_decoding_at point, so it never sent the
required fourth record. Now, I don't think it will work even if we
somehow sent the correct restart_lsn, because, as I wrote earlier, there
is no guarantee that the earlier WAL would have been saved.

At this point, I can't think of any way to fix this problem except for
going back to the previous approach of permanent slots, but let me know
if you have any ideas to salvage this approach.
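
For what it's worth, going back to permanent slots would roughly mean
flipping the "temporary" flag at slot creation in the tablesync worker, as
in this sketch (names are recalled from tablesync.c and are assumptions,
not a finished patch):

	/*
	 * Sketch: create the tablesync slot as permanent so the publisher
	 * keeps WAL from the slot's restart_lsn even if the worker crashes
	 * between FINISHEDCOPY and SYNCDONE.  The slot must then be dropped
	 * explicitly once the sync is done (or during cleanup after an error).
	 */
	walrcv_create_slot(wrconn, slotname,
					   false,	/* temporary = false */
					   CRS_USE_SNAPSHOT, &origin_startpos);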

-- 
With Regards,
Amit Kapila.




Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2021-01-26 Thread Michael Paquier
On Wed, Jan 27, 2021 at 01:00:50AM +0300, Alexey Kondratov wrote:
> CheckRelationTableSpaceMove() does not feel like a right place for invoking
> post alter hooks. It is intended only to check for tablespace change
> possibility. Anyway, ATExecSetTableSpace() and
> ATExecSetTableSpaceNoStorage() already do that in the no-op case.
>
> I have removed this InvokeObjectPostAlterHook() from your 0001 and made 0002
> to work on top of it. I think that now it should look closer to what you
> described above.

Yeah, I was hesitating a bit about this part, as those new routines
would not be used by ALTER-related commands in the next steps.  Your
patch landed midway in between, though, by adding the hook to
SetRelationTableSpace but not to CheckRelationTableSpaceMove().  For
now, I have kept the hook out of those new routines because using an
ALTER hook for a utility command is inconsistent.  Perhaps we'd want a
separate hook type dedicated to utility commands in objectaccess.c.

I have double-checked the code, and applied it after a few tweaks.

> In the new 0002 I moved ACL check to the upper level, i.e. ExecReindex(),
> and removed expensive text generation in test. Not touched yet some of your
> previously raised concerns. Also, you made SetRelationTableSpace() to accept
> Relation instead of Oid, so now we have to open/close indexes in the
> ReindexPartitions(), I am not sure that I use proper locking there, but it
> works.

Passing down Relation to the new routines makes the most sense to me
because we force the callers to think about the level of locking
that's required when doing any tablespace moves.

+   Relation iRel = index_open(partoid, ShareLock);
+
+   if (CheckRelationTableSpaceMove(iRel, params->tablespaceOid))
+   SetRelationTableSpace(iRel,
+ params->tablespaceOid,
+ InvalidOid);
Speaking of which, this breaks the locking assumptions of
SetRelationTableSpace().  I feel that we should think harder about
this part for partitioned indexes and tables because this looks rather
unsafe in terms of locking assumptions with partition trees.  If we
cannot come up with a safe solution, I would be fine with disallowing
TABLESPACE in this case, as a first step.  Not all problems have to be
solved at once, and even without this part the feature is still
useful.

+   /* It's not a shared catalog, so refuse to move it to shared tablespace */
+   if (params->tablespaceOid == GLOBALTABLESPACE_OID)
+   ereport(ERROR,
+   (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+errmsg("cannot move non-shared relation to tablespace \"%s\"",
+get_tablespace_name(params->tablespaceOid;
Why is that needed if CheckRelationTableSpaceMove() is used?

bits32  options;/* bitmask of REINDEXOPT_* */
+   Oid  tablespaceOid; /* tablespace to rebuild index */
} ReindexParams;
Mentioned upthread, but here I think that the comment should say that
InvalidOid means keeping the existing tablespace.
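
That is, something like this in the struct definition (comment wording
mine):

typedef struct ReindexParams
{
	bits32		options;		/* bitmask of REINDEXOPT_* */
	Oid			tablespaceOid;	/* tablespace to move indexes to, or
								 * InvalidOid to keep the existing one */
} ReindexParams;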
--
Michael


signature.asc
Description: PGP signature


Re: [PATCH] remove pg_standby

2021-01-26 Thread Thomas Munro
On Wed, Nov 25, 2020 at 10:04 PM Peter Eisentraut
 wrote:
> On 2020-11-21 20:41, Justin Pryzby wrote:
> > Oops.  I guess I'd write something like this.  If we just remove it, then
> > there'd no place to add a new server application, and "client applications"
> > would be the only subsection.
>
> I have committed the typo fix.  I don't have a well-formed opinion yet
> about whether all the reservations about removing pg_standby have been
> addressed.

I would like to commit this, because "waiting restore commands" have
confusing interactions with my proposed prefetching-during-recovery
patch[1].  Here's a version that fixes an error when building the docs
(there was a stray remaining tag), and adds a
commit message.  Any objections?

Furthermore, I think we should also remove the section of the manual
that describes how to write your own "waiting restore command".
Thoughts?

[1] 
https://www.postgresql.org/message-id/flat/CA%2BhUKGKFeYPL9K%2BSRixcsx1%2B6HsHhqK%2BPOZyrnnZjw1jERpGcQ%40mail.gmail.com
From 54b12290e304bf1d4c995836addb7a53e8e7b74b Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Mon, 26 Oct 2020 16:37:46 -0500
Subject: [PATCH v3] Retire pg_standby.

Streaming replication made pg_standby obsolete.  It has been proposed
that we retire it many times.  Now seems like a good time to do it,
because "waiting restore commands" are incompatible with a proposed WAL
prefetching feature.

Discussion: https://postgr.es/m/20201029024412.GP5380%40telsasoft.com
Author: Justin Pryzby 
Reviewed-by: Heikki Linnakangas 
Reviewed-by: Peter Eisentraut 
---
 contrib/Makefile   |   1 -
 contrib/pg_standby/.gitignore  |   1 -
 contrib/pg_standby/Makefile|  20 -
 contrib/pg_standby/pg_standby.c| 907 -
 doc/src/sgml/contrib.sgml  |   7 +-
 doc/src/sgml/filelist.sgml |   1 -
 doc/src/sgml/high-availability.sgml|  14 +-
 doc/src/sgml/pgstandby.sgml| 394 ---
 doc/src/sgml/ref/pgarchivecleanup.sgml |   7 -
 src/tools/msvc/Mkvcbuild.pm|   4 +-
 10 files changed, 7 insertions(+), 1349 deletions(-)
 delete mode 100644 contrib/pg_standby/.gitignore
 delete mode 100644 contrib/pg_standby/Makefile
 delete mode 100644 contrib/pg_standby/pg_standby.c
 delete mode 100644 doc/src/sgml/pgstandby.sgml

diff --git a/contrib/Makefile b/contrib/Makefile
index 7a4866e338..cdc041c7db 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -33,7 +33,6 @@ SUBDIRS = \
 		pg_buffercache	\
 		pg_freespacemap \
 		pg_prewarm	\
-		pg_standby	\
 		pg_stat_statements \
 		pg_surgery	\
 		pg_trgm		\
diff --git a/contrib/pg_standby/.gitignore b/contrib/pg_standby/.gitignore
deleted file mode 100644
index a401b085a8..00
--- a/contrib/pg_standby/.gitignore
+++ /dev/null
@@ -1 +0,0 @@
-/pg_standby
diff --git a/contrib/pg_standby/Makefile b/contrib/pg_standby/Makefile
deleted file mode 100644
index 87732bedf1..00
--- a/contrib/pg_standby/Makefile
+++ /dev/null
@@ -1,20 +0,0 @@
-# contrib/pg_standby/Makefile
-
-PGFILEDESC = "pg_standby - supports creation of a warm standby"
-PGAPPICON = win32
-
-PROGRAM = pg_standby
-OBJS = \
-	$(WIN32RES) \
-	pg_standby.o
-
-ifdef USE_PGXS
-PG_CONFIG = pg_config
-PGXS := $(shell $(PG_CONFIG) --pgxs)
-include $(PGXS)
-else
-subdir = contrib/pg_standby
-top_builddir = ../..
-include $(top_builddir)/src/Makefile.global
-include $(top_srcdir)/contrib/contrib-global.mk
-endif
diff --git a/contrib/pg_standby/pg_standby.c b/contrib/pg_standby/pg_standby.c
deleted file mode 100644
index c9f33e4254..00
--- a/contrib/pg_standby/pg_standby.c
+++ /dev/null
@@ -1,907 +0,0 @@
-/*
- * contrib/pg_standby/pg_standby.c
- *
- *
- * pg_standby.c
- *
- * Production-ready example of how to create a Warm Standby
- * database server using continuous archiving as a
- * replication mechanism
- *
- * We separate the parameters for archive and nextWALfile
- * so that we can check the archive exists, even if the
- * WAL file doesn't (yet).
- *
- * This program will be executed once in full for each file
- * requested by the warm standby server.
- *
- * It is designed to cater to a variety of needs, as well
- * providing a customizable section.
- *
- * Original author:		Simon Riggs  si...@2ndquadrant.com
- * Current maintainer:	Simon Riggs
- */
-#include "postgres_fe.h"
-
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-
-#include "access/xlog_internal.h"
-#include "pg_getopt.h"
-
-const char *progname;
-
-int			WalSegSz = -1;
-
-/* Options and defaults */
-int			sleeptime = 5;		/* amount of time to sleep between file checks */
-int			waittime = -1;		/* how long we have been waiting, -1 no wait
- * yet */
-int			maxwaittime = 0;	/* how long are we prepared to wait for? */
-int			keepfiles = 0;		/* number of WAL files to keep, 0 keep all */
-int			maxretries = 3;		/* number of retries on restore command */
-bool		debug = false;		/* are we debugging? */
-bool		need_cleanup = 

RE: Parallel INSERT (INTO ... SELECT ...)

2021-01-26 Thread Hou, Zhijie
Hi,

When testing the patch with the following kind of SQL:

---
Insert into part_table select 1;
Insert into part_table select generate_series(1,1,1);
Insert into part_table select * from testfunc();
---

We usually use these SQL statements to initialize a table or for testing
purposes.

Personally I think we do not need to do the parallel safety check for these
cases, because there seems to be no chance for the SELECT part to be run in
parallel.

I thought we aimed to skip the safety check unless parallelism is possible,
so I was wondering whether we can avoid the check in these cases.

I did a quick check on the code. An immature idea is to check whether the
query contains an RTE_RELATION; if it does not, we skip the safety check.

I am not sure whether it is worth doing that. Any thoughts?
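
For what it's worth, a minimal sketch of that idea follows; the function
name is made up, and a real version would likely also need to recurse into
subquery RTEs:

/* Sketch: return true if the query references any plain relation. */
static bool
query_has_relation_rte(Query *query)
{
	ListCell   *lc;

	foreach(lc, query->rtable)
	{
		RangeTblEntry *rte = (RangeTblEntry *) lfirst(lc);

		if (rte->rtekind == RTE_RELATION)
			return true;
	}
	return false;
}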

Best regards,
Houzj





Re: [HACKERS] GSoC 2017: Foreign Key Arrays

2021-01-26 Thread Joel Jacobson
On Tue, Jan 26, 2021, at 12:59, Mark Rofail wrote:
> Please don't hesitate to give your feedback.

The error message for insert or update violations looks fine:

UPDATE catalog_clone.pg_extension SET extconfig = ARRAY[12345] WHERE oid = 10;
ERROR:  insert or update on table "pg_extension" violates foreign key 
constraint "pg_extension_extconfig_fkey"
DETAIL:  Key (EACH ELEMENT OF extconfig)=(12345) is not present in table 
"pg_class".

But when trying to delete a still referenced row,
the column mentioned in the "EACH ELEMENT OF" sentence
is not the array column, but the column in the referenced table:

DELETE FROM catalog_clone.pg_class WHERE oid = 10;
ERROR:  update or delete on table "pg_class" violates foreign key constraint 
"pg_extension_extconfig_fkey" on table "pg_extension"
DETAIL:  Key (EACH ELEMENT OF oid)=(10) is still referenced from table 
"pg_extension".

I think either the "EACH ELEMENT OF" part of the latter error message should be 
dropped,
or the column name for the array column should be used.

/Joel

Re: Add SQL function for SHA1

2021-01-26 Thread Sehrope Sarkuni
On Tue, Jan 26, 2021 at 8:53 PM Michael Paquier  wrote:

> On Tue, Jan 26, 2021 at 10:38:43AM +0100, Daniel Gustafsson wrote:
> > Agreed, and pgcrypto already allows for using sha1.
> >
> > It seems like any legitimate need for sha1 could be better served by an
> > extension rather than supplying it in-core.
>
> Both of you telling the same thing is enough for me to discard this
> new stuff.  I'd like to refactor the code anyway as that's a nice
> cleanup, and this would have the advantage to make people look at
> cryptohashfuncs.c if introducing a new type.  After sleeping about it,
> I think that I would just make MD5 and SHA1 issue an elog(ERROR) if
> the internal routine is taken in those cases, like in the attached.
>

The refactor patch looks good. It builds and passes make check.

Thanks for the enum explanation too.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/


Re: archive status ".ready" files may be created too early

2021-01-26 Thread Kyotaro Horiguchi
At Tue, 26 Jan 2021 19:13:57 +, "Bossart, Nathan"  
wrote in 
> On 12/17/20, 9:15 PM, "Kyotaro Horiguchi"  wrote:
> > At Thu, 17 Dec 2020 22:20:35 +, "Bossart, Nathan"  
> > wrote in
> >> On 12/15/20, 2:33 AM, "Kyotaro Horiguchi"  wrote:
> >> > You're right in that regard. There's a window where partial record is
> >> > written when write location passes F0 after insertion location passes
> >> > F1. However, remembering all spanning records seems overkilling to me.
> >>
> >> I'm curious why you feel that recording all cross-segment records is
> >> overkill.  IMO it seems far simpler to just do that rather than try to
> >
> > Sorry, my words are not enough. Remembering all spanning records in
> > *shared memory* seems to be overkilling.  Much more if it is stored in
> > shared hash table.  Even though it rarely the case, it can fail hard
> > way when reaching the limit. If we could do well by remembering just
> > two locations, we wouldn't need to worry about such a limitation.
> 
> I don't think it will fail if we reach max_size for the hash table.
> The comment above ShmemInitHash() has this note:
> 
> * max_size is the estimated maximum number of hashtable entries.  This is
> * not a hard limit, but the access efficiency will degrade if it is
> * exceeded substantially (since it's used to compute directory size and
> * the hash table buckets will get overfull).

That description means that a shared hash has a directory with a fixed
size, so there may be synonyms, which causes degradation. Even though
buckets are preallocated with the specified number, since the minimum
directory size is 256, at least 256 buckets are allocated in the long
run. The minimum on-the-fly allocation size is 32. I haven't calculated
it further precisely, but I'm worried about the amount of spare shared
memory the hash can allocate.

> > Another concern about the concrete patch:
> >
> > NotifySegmentsReadyForArchive() searches the shared hashacquiaing a
> > LWLock every time XLogWrite is called while segment archive is being
> > held off.  I don't think it is acceptable and I think it could be a
> > problem when many backends are competing on WAL.
> 
> This is a fair point.  I did some benchmarking with a few hundred
> connections all doing writes, and I was not able to discern any
> noticeable performance impact.  My guess is that contention on this
> new lock is unlikely because callers of XLogWrite() must already hold
> WALWriteLock.  Otherwise, I believe we only acquire ArchNotifyLock no
> more than once per segment to record new record boundaries.

Thanks. I agree that the reader-reader contention is not a problem due
to the existing serialization by WALWriteLock. Adding an entry happens
only at a segment boundary, so the ArchNotifyLock doesn't seem to be a
problem.

However, the function prolongs the WALWriteLock section. Couldn't we
somehow move the call to NotifySegmentsReadyForArchive in XLogWrite
out of the WALWriteLock section?
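
Since XLogWrite() runs with WALWriteLock already held by its callers, that
would mean doing the notification in the callers once the lock is released,
roughly like this (a sketch only; NotifySegmentsReadyForArchive is the
function from the patch under discussion):

	/* sketch: in a caller such as XLogFlush(), not in XLogWrite() itself */
	LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
	XLogWrite(WriteRqst, false);
	LWLockRelease(WALWriteLock);

	/* look for newly archivable segments only after the lock is released */
	NotifySegmentsReadyForArchive(LogwrtResult.Flush);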

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: logical replication worker accesses catalogs in error context callback

2021-01-26 Thread Zhihong Yu
For 0002, a few comments on the description:

bq. Avoid accessing system catalogues inside conversion_error_callback

catalogues -> catalog

bq. error context callback, because the the entire transaction might

There is redundant 'the'

bq. Since get_attname() and get_rel_name() could search the sys cache by
store the required information

store -> storing

Cheers

On Tue, Jan 26, 2021 at 5:17 PM Bharath Rupireddy <
bharath.rupireddyforpostg...@gmail.com> wrote:

> On Tue, Jan 26, 2021 at 11:29 AM Masahiko Sawada 
> wrote:
> > > IMHO, adding an assertion in SearchCatCacheInternal(which is a most
> > > generic code part within the server) with a few error context global
> > > variables may not be always safe. Because we might miss using the
> > > error context variables properly. Instead, we could add a comment in
> > > ErrorContextCallback structure saying something like, "it's not
> > > recommended to access any system catalogues within an error context
> > > callback when the callback is expected to be called while processing
> > > an error, because the transaction might have been broken in that
> > > case." And let the future callback developers take care of it.
> > >
> > > Thoughts?
> >
> > That sounds good to me. But I still also see the value to add an
> > assertion in SearchCatCacheInternal(). If we had it, we could find
> > these two bugs earlier.
> >
> > Anyway, this seems to be unrelated to this bug fixing so we can start
> > a new thread for that.
>
> +1 to start a new thread for that.
>
> > > As I said earlier [1], currently only two(there could be more) error
> > > context callbacks access the sys catalogues, one is in
> > > slot_store_error_callback which will be fixed with your patch. Another
> > > is in conversion_error_callback, we can also fix this by storing the
> > > relname, attname and other required info in ConversionLocation,
> > > something like the attached. Please have a look.
> > >
> > > Thoughts?
> >
> > Thank you for the patch!
> >
> > Here are some comments:
>
> Thanks for the review comments.
>
> > +static void set_error_callback_info(ConversionLocation *errpos,
> > +  Relation rel, int cur_attno,
> > +  ForeignScanState *fsstate);
> >
> > I'm concerned a bit that the function name is generic. How about
> > set_conversion_error_callback_info() or something to make the name
> > clear?
>
> Done.
>
> > ---
> > +static void
> > +conversion_error_callback(void *arg)
> > +{
> > +   ConversionLocation *errpos = (ConversionLocation *) arg;
> > +
> > +   if (errpos->show_generic_message)
> > +   errcontext("processing expression at position %d in
> > select list",
> > +   errpos->cur_attno);
> > +
> >
> > I think we can set this generic message to the error context when
> > errpos->relname is NULL instead of using show_generic_message.
>
> Right. Done.
>
> > ---
> > +   /*
> > +* Set error context callback info, so that we could avoid
> accessing
> > +* the system catalogues while processing the error in
> > +* conversion_error_callback. System catalogue accesses are not
> safe in
> > +* error context callbacks because the transaction might have
> been
> > +* broken by then.
> > +*/
> > +   set_error_callback_info(&errpos, rel, i, fsstate);
> >
> > Looking at other code, we use "catalog" rather than "catalogue" in
> > most places. Is it better to use "catalog" for consistency? FYI, the
> > "catalogue" is used at only three places in the server code:
>
> Changed it to "catalog".
>
> > FYI I've added those bug fixes to the next Commitfest -
> https://commitfest.postgresql.org/32/2955/
>
> Thanks. I'm attaching 2 patches to make it easy for reviewing and also
> they will get a chance to be tested by cf bot.
>
> 0001 - for avoiding system catalog access in slot_store_error_callback.
> 0002 - for avoiding system catalog access in conversion_error_callback
>
> Please review it further.
>
> With Regards,
> Bharath Rupireddy.
> EnterpriseDB: http://www.enterprisedb.com
>


Re: Add SQL function for SHA1

2021-01-26 Thread Michael Paquier
On Tue, Jan 26, 2021 at 10:38:43AM +0100, Daniel Gustafsson wrote:
> Agreed, and pgcrypto already allows for using sha1.
> 
> It seems like any legitimate need for sha1 could be better served by an
> extension rather than supplying it in-core.

Both of you saying the same thing is enough for me to discard this
new stuff.  I'd like to refactor the code anyway as that's a nice
cleanup, and this would have the advantage of making people look at
cryptohashfuncs.c when introducing a new type.  After sleeping on it,
I think that I would just make MD5 and SHA1 issue an elog(ERROR) if
the internal routine is reached in those cases, as in the attached.

If there are any comments or objections to the refactoring piece,
please let me know.
--
Michael
diff --git a/src/backend/utils/adt/cryptohashfuncs.c b/src/backend/utils/adt/cryptohashfuncs.c
index d99485f4c6..4bb684a08b 100644
--- a/src/backend/utils/adt/cryptohashfuncs.c
+++ b/src/backend/utils/adt/cryptohashfuncs.c
@@ -68,6 +68,60 @@ md5_bytea(PG_FUNCTION_ARGS)
 	PG_RETURN_TEXT_P(cstring_to_text(hexsum));
 }
 
+/*
+ * Internal routine to compute a cryptohash with the given bytea input.
+ */
+static inline bytea *
+cryptohash_internal(pg_cryptohash_type type, bytea *input)
+{
+	const uint8 *data;
+	const char  *typestr = NULL;
+	int			digest_len = 0;
+	size_t		len;
+	pg_cryptohash_ctx *ctx;
+	bytea	   *result;
+
+	switch (type)
+	{
+		case PG_SHA224:
+			typestr = "SHA224";
+			digest_len = PG_SHA224_DIGEST_LENGTH;
+			break;
+		case PG_SHA256:
+			typestr = "SHA256";
+			digest_len = PG_SHA256_DIGEST_LENGTH;
+			break;
+		case PG_SHA384:
+			typestr = "SHA384";
+			digest_len = PG_SHA384_DIGEST_LENGTH;
+			break;
+		case PG_SHA512:
+			typestr = "SHA512";
+			digest_len = PG_SHA512_DIGEST_LENGTH;
+			break;
+		case PG_MD5:
+		case PG_SHA1:
+			elog(ERROR, "unsupported digest type %d", type);
+			break;
+	}
+
+	result = palloc0(digest_len + VARHDRSZ);
+	len = VARSIZE_ANY_EXHDR(input);
+	data = (unsigned char *) VARDATA_ANY(input);
+
+	ctx = pg_cryptohash_create(type);
+	if (pg_cryptohash_init(ctx) < 0)
+		elog(ERROR, "could not initialize %s context", typestr);
+	if (pg_cryptohash_update(ctx, data, len) < 0)
+		elog(ERROR, "could not update %s context", typestr);
+	if (pg_cryptohash_final(ctx, (unsigned char *) VARDATA(result)) < 0)
+		elog(ERROR, "could not finalize %s context", typestr);
+	pg_cryptohash_free(ctx);
+
+	SET_VARSIZE(result, digest_len + VARHDRSZ);
+
+	return result;
+}
 
 /*
  * SHA-2 variants
@@ -76,28 +130,7 @@ md5_bytea(PG_FUNCTION_ARGS)
 Datum
 sha224_bytea(PG_FUNCTION_ARGS)
 {
-	bytea	   *in = PG_GETARG_BYTEA_PP(0);
-	const uint8 *data;
-	size_t		len;
-	pg_cryptohash_ctx *ctx;
-	unsigned char buf[PG_SHA224_DIGEST_LENGTH];
-	bytea	   *result;
-
-	len = VARSIZE_ANY_EXHDR(in);
-	data = (unsigned char *) VARDATA_ANY(in);
-
-	ctx = pg_cryptohash_create(PG_SHA224);
-	if (pg_cryptohash_init(ctx) < 0)
-		elog(ERROR, "could not initialize %s context", "SHA224");
-	if (pg_cryptohash_update(ctx, data, len) < 0)
-		elog(ERROR, "could not update %s context", "SHA224");
-	if (pg_cryptohash_final(ctx, buf) < 0)
-		elog(ERROR, "could not finalize %s context", "SHA224");
-	pg_cryptohash_free(ctx);
-
-	result = palloc(sizeof(buf) + VARHDRSZ);
-	SET_VARSIZE(result, sizeof(buf) + VARHDRSZ);
-	memcpy(VARDATA(result), buf, sizeof(buf));
+	bytea   *result = cryptohash_internal(PG_SHA224, PG_GETARG_BYTEA_PP(0));
 
 	PG_RETURN_BYTEA_P(result);
 }
@@ -105,28 +138,7 @@ sha224_bytea(PG_FUNCTION_ARGS)
 Datum
 sha256_bytea(PG_FUNCTION_ARGS)
 {
-	bytea	   *in = PG_GETARG_BYTEA_PP(0);
-	const uint8 *data;
-	size_t		len;
-	pg_cryptohash_ctx *ctx;
-	unsigned char buf[PG_SHA256_DIGEST_LENGTH];
-	bytea	   *result;
-
-	len = VARSIZE_ANY_EXHDR(in);
-	data = (unsigned char *) VARDATA_ANY(in);
-
-	ctx = pg_cryptohash_create(PG_SHA256);
-	if (pg_cryptohash_init(ctx) < 0)
-		elog(ERROR, "could not initialize %s context", "SHA256");
-	if (pg_cryptohash_update(ctx, data, len) < 0)
-		elog(ERROR, "could not update %s context", "SHA256");
-	if (pg_cryptohash_final(ctx, buf) < 0)
-		elog(ERROR, "could not finalize %s context", "SHA256");
-	pg_cryptohash_free(ctx);
-
-	result = palloc(sizeof(buf) + VARHDRSZ);
-	SET_VARSIZE(result, sizeof(buf) + VARHDRSZ);
-	memcpy(VARDATA(result), buf, sizeof(buf));
+	bytea   *result = cryptohash_internal(PG_SHA256, PG_GETARG_BYTEA_PP(0));
 
 	PG_RETURN_BYTEA_P(result);
 }
@@ -134,28 +146,7 @@ sha256_bytea(PG_FUNCTION_ARGS)
 Datum
 sha384_bytea(PG_FUNCTION_ARGS)
 {
-	bytea	   *in = PG_GETARG_BYTEA_PP(0);
-	const uint8 *data;
-	size_t		len;
-	pg_cryptohash_ctx *ctx;
-	unsigned char buf[PG_SHA384_DIGEST_LENGTH];
-	bytea	   *result;
-
-	len = VARSIZE_ANY_EXHDR(in);
-	data = (unsigned char *) VARDATA_ANY(in);
-
-	ctx = pg_cryptohash_create(PG_SHA384);
-	if (pg_cryptohash_init(ctx) < 0)
-		elog(ERROR, "could not initialize %s context", "SHA384");
-	if (pg_cryptohash_update(ctx, data, len) < 0)
-		elog(ERROR, "could not update %s context", "SHA384");
-	if 

Re: wiki:PostgreSQL_14_Open_Items

2021-01-26 Thread Michael Paquier
On Tue, Jan 26, 2021 at 03:55:15PM -0600, Justin Pryzby wrote:
> I've created a new page, and added some unresolved items that I've been 
> keeping
> in my head.
> 
> https://wiki.postgresql.org/wiki/PostgreSQL_14_Open_Items

Thanks, Justin!
--
Michael


signature.asc
Description: PGP signature


Re: Transactions involving multiple postgres foreign servers, take 2

2021-01-26 Thread Fujii Masao




On 2021/01/18 14:54, Masahiko Sawada wrote:

On Fri, Jan 15, 2021 at 7:45 AM Zhihong Yu  wrote:


For v32-0002-postgres_fdw-supports-commit-and-rollback-APIs.patch :

+   entry->changing_xact_state = true;
...
+   entry->changing_xact_state = abort_cleanup_failure;

I don't see return statement in between the two assignments. I wonder why 
entry->changing_xact_state is set to true, and later being assigned again.


Because postgresRollbackForeignTransaction() can get called again in
case where an error occurred during aborting and cleanup the
transaction. For example, if an error occurred when executing ABORT
TRANSACTION (pgfdw_get_cleanup_result() could emit an ERROR),
postgresRollbackForeignTransaction() will get called again while
entry->changing_xact_state is still true. Then the entry will be
caught by the following condition and cleaned up:

 /*
  * If connection is before starting transaction or is already 
unsalvageable,
  * do only the cleanup and don't touch it further.
  */
 if (entry->changing_xact_state)
 {
 pgfdw_cleanup_after_transaction(entry);
 return;
 }



For v32-0007-Introduce-foreign-transaction-launcher-and-resol.patch :

bq. This commits introduces to new background processes: foreign

commits introduces to new -> commit introduces two new


Fixed.



+FdwXactExistsXid(TransactionId xid)

Since Xid is the parameter to this method, I think the Xid suffix can be 
dropped from the method name.


But there is already a function named FdwXactExists()?

bool
FdwXactExists(Oid dbid, Oid serverid, Oid userid)

As far as I read other code, we already have functions that have
the same functionality but take different arguments. For instance,
SearchSysCacheExists() and SearchSysCacheExistsAttName(). So I think
we can leave it as it is, but would it be better to have something like
FdwXactCheckExistence() and FdwXactCheckExistenceByXid()?



+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group

Please correct year in the next patch set.


Fixed.



+FdwXactLauncherRequestToLaunch(void)

Since the launcher's job is to 'launch', I think the Launcher can be omitted 
from the method name.


Agreed. How about FdwXactRequestToLaunchResolver()?



+/* Report shared memory space needed by FdwXactRsoverShmemInit */
+Size
+FdwXactRslvShmemSize(void)

Are both Rsover and Rslv referring to resolver ? It would be better to use 
whole word which reduces confusion.
Plus, FdwXactRsoverShmemInit should be FdwXactRslvShmemInit (or 
FdwXactResolveShmemInit)


Agreed. I realized that these functions are the launcher's function,
not resolver's. So I'd change to FdwXactLauncherShmemSize() and
FdwXactLauncherShmemInit() respectively.



+fdwxact_launch_resolver(Oid dbid)

The above method is not in camel case. It would be better if method names are 
consistent (in casing).


Fixed.



+errmsg("out of foreign transaction resolver slots"),
+errhint("You might need to increase 
max_foreign_transaction_resolvers.")));

It would be nice to include the value of max_foreign_xact_resolvers


I agree it would be nice, but looking at other code, we don't include
the value in this kind of message.



For fdwxact_resolver_onexit():

+   LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+   fdwxact->locking_backend = InvalidBackendId;
+   LWLockRelease(FdwXactLock);

There is no call to method inside the for loop which may take time. I wonder if 
the lock can be obtained prior to the for loop and released coming out of the 
for loop.


Agreed.



+FXRslvLoop(void)

Please use Resolver instead of Rslv


Fixed.



+   FdwXactResolveFdwXacts(held_fdwxacts, nheld);

Fdw and Xact are repeated twice each in the method name. Probably the method 
name can be made shorter.


Fixed.


You fixed some issues. But maybe you forgot to attach the latest patches?

I'm reading the 0001 and 0002 patches to pick up the changes for postgres_fdw 
that are worth applying independently of the 2PC feature. If there are such 
changes, IMO we can apply them in advance, which would make the patches simpler.

+   if (PQresultStatus(res) != PGRES_COMMAND_OK)
+   ereport(ERROR, (errmsg("could not commit transaction on server 
%s",
+  
frstate->server->servername)));

You changed the code this way because you want to include the server name in 
the error message? I agree that it's helpful to also report the server name 
that caused an error. OTOH, since this change gets rid of the call to 
pgfdw_report_error() for the returned PGresult, the reported error message 
contains less information. If this understanding is right, I don't think that 
this change is an improvement.

Instead, if the server name should be included in the error message, shouldn't 
pgfdw_report_error() be changed so that it also reports the server name? 
If we do that, the server name is reported not only when COMMIT fails but also 
when 

Re: logical replication worker accesses catalogs in error context callback

2021-01-26 Thread Bharath Rupireddy
On Tue, Jan 26, 2021 at 11:29 AM Masahiko Sawada  wrote:
> > IMHO, adding an assertion in SearchCatCacheInternal(which is a most
> > generic code part within the server) with a few error context global
> > variables may not be always safe. Because we might miss using the
> > error context variables properly. Instead, we could add a comment in
> > ErrorContextCallback structure saying something like, "it's not
> > recommended to access any system catalogues within an error context
> > callback when the callback is expected to be called while processing
> > an error, because the transaction might have been broken in that
> > case." And let the future callback developers take care of it.
> >
> > Thoughts?
>
> That sounds good to me. But I still also see the value to add an
> assertion in SearchCatCacheInternal(). If we had it, we could find
> these two bugs earlier.
>
> Anyway, this seems to be unrelated to this bug fixing so we can start
> a new thread for that.

+1 to start a new thread for that.

> > As I said earlier [1], currently only two(there could be more) error
> > context callbacks access the sys catalogues, one is in
> > slot_store_error_callback which will be fixed with your patch. Another
> > is in conversion_error_callback, we can also fix this by storing the
> > relname, attname and other required info in ConversionLocation,
> > something like the attached. Please have a look.
> >
> > Thoughts?
>
> Thank you for the patch!
>
> Here are some comments:

Thanks for the review comments.

> +static void set_error_callback_info(ConversionLocation *errpos,
> +  Relation rel, int cur_attno,
> +  ForeignScanState *fsstate);
>
> I'm concerned a bit that the function name is generic. How about
> set_conversion_error_callback_info() or something to make the name
> clear?

Done.

> ---
> +static void
> +conversion_error_callback(void *arg)
> +{
> +   ConversionLocation *errpos = (ConversionLocation *) arg;
> +
> +   if (errpos->show_generic_message)
> +   errcontext("processing expression at position %d in
> select list",
> +   errpos->cur_attno);
> +
>
> I think we can set this generic message to the error context when
> errpos->relname is NULL instead of using show_generic_message.

Right. Done.

> ---
> +   /*
> +* Set error context callback info, so that we could avoid accessing
> +* the system catalogues while processing the error in
> +* conversion_error_callback. System catalogue accesses are not safe 
> in
> +* error context callbacks because the transaction might have been
> +* broken by then.
> +*/
> +   set_error_callback_info(&errpos, rel, i, fsstate);
>
> Looking at other code, we use "catalog" rather than "catalogue" in
> most places. Is it better to use "catalog" for consistency? FYI, the
> "catalogue" is used at only three places in the server code:

Changed it to "catalog".

> FYI I've added those bug fixes to the next Commitfest - 
> https://commitfest.postgresql.org/32/2955/

Thanks. I'm attaching 2 patches to make it easy for reviewing and also
they will get a chance to be tested by cf bot.

0001 - for avoiding system catalog access in slot_store_error_callback.
0002 - for avoiding system catalog access in conversion_error_callback
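
For readers following along, 0002 roughly takes this shape (field names and
messages here are from memory and may differ from the actual patch):

typedef struct ConversionLocation
{
	int			cur_attno;	/* attribute number being processed */
	const char *relname;	/* relation name, saved before conversion */
	const char *attname;	/* attribute name, saved before conversion */
} ConversionLocation;

static void
conversion_error_callback(void *arg)
{
	ConversionLocation *errpos = (ConversionLocation *) arg;

	/* use only the stashed strings; no catalog lookups here */
	if (errpos->relname)
		errcontext("column \"%s\" of foreign table \"%s\"",
				   errpos->attname, errpos->relname);
	else
		errcontext("processing expression at position %d in select list",
				   errpos->cur_attno);
}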

Please review it further.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


v3-0001-Avoid-Catalogue-Accesses-In-slot_store_error_call.patch
Description: Binary data


v3-0002-Avoid-Catalogue-Accesses-In-conversion_error_call.patch
Description: Binary data


Re: Protect syscache from bloating with negative cache entries

2021-01-26 Thread Kyotaro Horiguchi
At Thu, 14 Jan 2021 17:32:27 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> The commit 4656e3d668 (debug_invalidate_system_caches_always)
> conflicted with this patch. Rebased.

At Wed, 27 Jan 2021 10:07:47 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> (I found a bug in a benchmark-aid function
> (CatalogCacheFlushCatalog2), I repost an updated version soon.)

I noticed that a catcachebench-aid function,
CatalogCacheFlushCatalog2(), wrongly allocates the bucket array in the
current memory context, which leads to a crash.

This is a fixed and rebased version.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 5f318170b9c1e0caa1033862261800f06135e5bd Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 18 Nov 2020 16:54:31 +0900
Subject: [PATCH v7 1/3] CatCache expiration feature

---
 src/backend/access/transam/xact.c  |  3 ++
 src/backend/utils/cache/catcache.c | 87 +-
 src/backend/utils/misc/guc.c   | 12 +
 src/include/utils/catcache.h   | 19 +++
 4 files changed, 120 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index a2068e3fd4..86888d2409 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1086,6 +1086,9 @@ static void
 AtStart_Cache(void)
 {
 	AcceptInvalidationMessages();
+
+	if (xactStartTimestamp != 0)
+		SetCatCacheClock(xactStartTimestamp);
 }
 
 /*
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index fa2b49c676..644d92dd9a 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -38,6 +38,7 @@
 #include "utils/rel.h"
 #include "utils/resowner_private.h"
 #include "utils/syscache.h"
+#include "utils/timestamp.h"
 
 
  /* #define CACHEDEBUG */	/* turns DEBUG elogs on */
@@ -60,9 +61,19 @@
 #define CACHE_elog(...)
 #endif
 
+/*
+ * GUC variable defining the minimum age, in seconds, of entries that will be
+ * considered for eviction.  -1 disables the feature.
+ */
+int catalog_cache_prune_min_age = -1;
+uint64	prune_min_age_us;
+
 /* Cache management header --- pointer is NULL until created */
 static CatCacheHeader *CacheHdr = NULL;
 
+/* Clock for the last accessed time of a catcache entry. */
+uint64	catcacheclock = 0;
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
 			   int nkeys,
 			   Datum v1, Datum v2,
@@ -74,6 +85,7 @@ static pg_noinline HeapTuple SearchCatCacheMiss(CatCache *cache,
 Index hashIndex,
 Datum v1, Datum v2,
 Datum v3, Datum v4);
+static bool CatCacheCleanupOldEntries(CatCache *cp);
 
 static uint32 CatalogCacheComputeHashValue(CatCache *cache, int nkeys,
 		   Datum v1, Datum v2, Datum v3, Datum v4);
@@ -99,6 +111,15 @@ static void CatCacheFreeKeys(TupleDesc tupdesc, int nkeys, int *attnos,
 static void CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos,
 			 Datum *srckeys, Datum *dstkeys);
 
+/* GUC assign function */
+void
+assign_catalog_cache_prune_min_age(int newval, void *extra)
+{
+	if (newval < 0)
+		prune_min_age_us = UINT64_MAX;
+	else
+		prune_min_age_us = ((uint64) newval) * USECS_PER_SEC;
+}
 
 /*
  *	internal support functions
@@ -1264,6 +1285,9 @@ SearchCatCacheInternal(CatCache *cache,
 		 */
		dlist_move_head(bucket, &ct->cache_elem);
 
+		/* Record the last access timestamp */
+		ct->lastaccess = catcacheclock;
+
 		/*
 		 * If it's a positive entry, bump its refcount and return it. If it's
 		 * negative, we can report failure to the caller.
@@ -1425,6 +1449,61 @@ SearchCatCacheMiss(CatCache *cache,
	return &ct->tuple;
 }
 
+/*
+ * CatCacheCleanupOldEntries - Remove infrequently-used entries
+ *
+ * Catcache entries happen to be left unused for a long time for several
+ * reasons. Remove such entries to prevent the catcache from bloating. It is
+ * based on an algorithm similar to buffer eviction. Entries that are accessed
+ * several times in a certain period live longer than those that have had less
+ * access in the same duration.
+ */
+static bool
+CatCacheCleanupOldEntries(CatCache *cp)
+{
+	int		nremoved = 0;
+	int		i;
+	long	oldest_ts = catcacheclock;
+	uint64	prune_threshold = catcacheclock - prune_min_age_us;
+
+	/* Scan over the whole hash to find entries to remove */
+	for (i = 0 ; i < cp->cc_nbuckets ; i++)
+	{
+		dlist_mutable_iter	iter;
+
		dlist_foreach_modify(iter, &cp->cc_bucket[i])
+		{
+			CatCTup*ct = dlist_container(CatCTup, cache_elem, iter.cur);
+
+			/* Don't remove referenced entries */
+			if (ct->refcount == 0 &&
+(ct->c_list == NULL || ct->c_list->refcount == 0))
+			{
+if (ct->lastaccess < prune_threshold)
+{
+	CatCacheRemoveCTup(cp, ct);
+	nremoved++;
+
+	/* don't let the removed entry update oldest_ts */
+	continue;
+}
+			}
+
+			/* update the oldest timestamp if the entry remains alive */
+			if (ct->lastaccess < oldest_ts)
+oldest_ts = ct->lastaccess;
+		}
+	}
+
+	

Re: Protect syscache from bloating with negative cache entries

2021-01-26 Thread Kyotaro Horiguchi
At Tue, 26 Jan 2021 11:43:21 +0200, Heikki Linnakangas  wrote 
in 
> Hi,
> 
> On 19/11/2020 07:25, Kyotaro Horiguchi wrote:
> > Performance measurement on the attached showed better result about
> > searching but maybe worse for cache entry creation.  Each time number
> > is the mean of 10 runs.
> > # Cacache (negative) entry creation
> > :  time(ms) (% to master)
> > master :  3965.61(100.0)
> > patched-off:  4040.93(101.9)
> > patched-on :  4032.22(101.7)
> > # Searching negative cache entries
> > master :  8173.46(100.0)
> > patched-off:  7983.43( 97.7)
> > patched-on :  8049.88( 98.5)
> > # Creation, searching and expiration
> > master :  6393.23(100.0)
> > patched-off:  6527.94(102.1)
> > patched-on : 15880.01(248.4)
> > That is, catcache searching gets faster by 2-3% but creation gets
> > slower by about 2%. If I moved the condition of 2 further up to
> > CatalogCacheCreateEntry(), that degradation reduced to 0.6%.
> > # Cacache (negative) entry creation
> > master  :  3967.45   (100.0)
> > patched-off :  3990.43   (100.6)
> > patched-on  :  4108.96   (103.6)
> > # Searching negative cache entries
> > master  :  8106.53   (100.0)
> > patched-off :  8036.61   ( 99.1)
> > patched-on  :  8058.18   ( 99.4)
> > # Creation, searching and expiration
> > master  :  6395.00   (100.0)
> > patched-off :  6416.57   (100.3)
> > patched-on  : 15830.91   (247.6)
> 
> Can you share the exact script or steps to reproduce these numbers? I
> presume these are from the catcachebench extension, but I can't figure
> out which scenario above corresponds to which catcachebench
> test. Also, catcachebench seems to depend on a bunch of tables being
> created in schema called "test"; what tables did you use for the above
> numbers?

gen_tbl.pl to generate the tables, then run2.sh to run the
benchmark. sumlog.pl to summarize the result of run2.sh.

$ ./gen_tbl.pl | psql postgres
$ ./run2.sh | tee rawresult.txt | ./sumlog.pl

(I found a bug in a benchmark-aid function
(CatalogCacheFlushCatalog2), I repost an updated version soon.)

A simple explanation follows, since the scripts are kind of crappy..

run2.sh:
  LOOPS: # of execution of catcachebench() in a single run
  USES : Take the average of this number of fastest executions in a
 single run.
  BINROOT  : Common parent directory of target binaries.
  DATADIR  : Data directory. (shared by all binaries)
  PREC : FP format for time and percentage in a result.
  TESTS: comma-separated numbers given to catcachebench.
  
 The "run" function spec

   run "binary-label"
 where A, B and C are the value for catalog_cache_prune_min_age. ""
 means no setting (used for master binary). Currently only C is in
 effect but all the three should be non-empty string to make it
 effective.

 The result output is:

   test |   version   |  n  |r | stddev  
  --+-+-+--+-
  1 | patched-off | 1/3 | 14211.96 |  261.19

   test   : # of catcachebench(#)
   version: binary label given to the run function
   n  : USES / LOOPS
   r  : result average time of catcachebench() in milliseconds
   stddev : stddev of

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

#! /usr/bin/perl
$collist = "";
foreach $i (0..1000) {
$collist .= sprintf(", c%05d int", $i);
}
$collist = substr($collist, 2);

printf "drop schema if exists test cascade;\n";
printf "create schema test;\n";
foreach $i (0..2999) {
printf "create table test.t%04d ($collist);\n", $i;
}
#!/bin/bash
LOOPS=3
USES=1
TESTS=1,2,3
BINROOT=/home/horiguti/bin
DATADIR=/home/horiguti/data/data_work
PREC="numeric(10,2)"

/usr/bin/killall postgres
/usr/bin/sleep 3

run() {
local BINARY=$1
local PGCTL=$2/bin/pg_ctl
local PGSQL=$2/bin/postgres
local PSQL=$2/bin/psql

if [ "$3" != "" ]; then
  local SETTING1="set catalog_cache_prune_min_age to \"$3\";"
  local SETTING2="set catalog_cache_prune_min_age to \"$4\";"
  local SETTING3="set catalog_cache_prune_min_age to \"$5\";"
fi

#   ($PGSQL -D $DATADIR 2>&1 > /dev/null)&
($PGSQL -D $DATADIR 2>&1 > /dev/null | /usr/bin/sed -e 's/^/# /')&
/usr/bin/sleep 3
${PSQL} postgres <&1 > /dev/null | /usr/bin/sed -e 's/^/# 
/'

#   oreport > $BINARY_perf.txt
}

for i in $(seq 0 2); do
run "patched-off" $BINROOT/pgsql_catexp "-1" "-1" "-1"
run "patched-on" $BINROOT/pgsql_catexp "0" "0" "0"
run "master" $BINROOT/pgsql_master_o2 "" "" ""
done

#! /usr/bin/perl

while () {
#   if 
(/^\s+([0-9])\s*\|\s*(\w+)\s*\|\s*(\S+)\s*\|\s*([\d.]+)\s*\|\s*(\w+)\s*$/) {
if 
(/^\s+([0-9])\s*\|\s*(\S+)\s*\|\s*(\S+)\s*\|\s*([\d.]+)\s*\|\s*([\d.]+)\s*$/) {
$test = $1;
$bin = $2;
$time = $4;
if (defined $sum{$test}{$bin}) {
$sum{$test}{$bin} += 

Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2021-01-26 Thread Bharath Rupireddy
On Tue, Jan 26, 2021 at 8:38 AM Bharath Rupireddy
 wrote:
> I will post "keep_connections" GUC and "keep_connection" server level
> option patches later.

Attaching v19 patch set for "keep_connections" GUC and
"keep_connection" server level option. Please review them further.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


v19-0002-postgres_fdw-add-keep_connections-GUC-to-not-cac.patch
Description: Binary data


v19-0003-postgres_fdw-server-level-option-keep_connection.patch
Description: Binary data


Re: Single transaction in the tablesync worker?

2021-01-26 Thread Peter Smith
On Mon, Jan 25, 2021 at 4:48 PM Amit Kapila  wrote:
>
> On Mon, Jan 25, 2021 at 8:03 AM Peter Smith  wrote:
> >
> > Hi Amit.
> >
> > PSA the v19 patch for the Tablesync Solution1.
> >
>
> I see one race condition in this patch where we try to drop the origin
> via apply process and DropSubscription. I think it can lead to the
> error "cache lookup failed for replication origin with oid %u". The
> same problem can happen via exposed API pg_replication_origin_drop but
> probably because this is not used concurrently so nobody faced this
> issue. I think for the matter of this patch we can try to suppress
> such an error either via try..catch, or by adding missing_ok argument
> to replorigin_drop API, or we can just add to comments that such a
> race exists. Additionally, we should try to start a new thread for the
> existence of this problem in pg_replication_origin_drop. What do you
> think?

OK. A new thread [ps0127] for this problem was started.

---
[ps0127] = 
https://www.postgresql.org/message-id/CAHut%2BPuW8DWV5fskkMWWMqzt-x7RPcNQOtJQBp6SdwyRghCk7A%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: simplifying foreign key/RI checks

2021-01-26 Thread Tatsuro Yamada

Hi Amit-san,

On 2021/01/25 18:19, Amit Langote wrote:

On Mon, Jan 25, 2021 at 9:24 AM Corey Huinker  wrote:

On Sun, Jan 24, 2021 at 6:51 AM Amit Langote  wrote:

Here's v5.


v5 patches apply to master.
Suggested If/then optimization is implemented.
Suggested case merging is implemented.
Passes make check and make check-world yet again.
Just to confirm, we don't free the RI_CompareHashEntry because it points to an 
entry in a hash table which is TopMemoryContext aka lifetime of the session, 
correct?


Right.


Anybody else want to look this patch over before I mark it Ready For Committer?


Would be nice to have others look it over.  Thanks.



Thanks for creating the patch!

I tried to review the patch. Here is my comment.

* According to this thread [1], it might be better to replace elog() with
  ereport() in the patch.

[1]: 
https://www.postgresql.org/message-id/flat/92d6f545-5102-65d8-3c87-489f71ea0a37%40enterprisedb.com
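
For example, something along these lines (a generic illustration, not a
line from the patch):

/* before: elog() is meant for internal, not-expected-to-happen errors */
elog(ERROR, "referenced row not found");

/* after: user-facing errors should go through ereport() with an errcode */
ereport(ERROR,
		(errcode(ERRCODE_FOREIGN_KEY_VIOLATION),
		 errmsg("referenced row not found")));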

Thanks,
Tatsuro Yamada





pg_replication_origin_drop API potential race condition

2021-01-26 Thread Peter Smith
Hi Hackers.

As discovered elsewhere [ak0125], there is a potential race condition
in the pg_replication_origin_drop API.
The current code of pg_replication_origin_drop looks like:

roident = replorigin_by_name(name, false);
Assert(OidIsValid(roident));

replorigin_drop(roident, true);


Users cannot deliberately drop a non-existent origin
(replorigin_by_name passes missing_ok = false) but there is still a
small window where concurrent processes may be able to call
replorigin_drop for the same valid roident.

Locking within replorigin_drop guards against concurrent drops, so the
1st execution will succeed, but the 2nd execution would then raise an
internal cache error: elog(ERROR, "cache lookup failed for replication
origin with oid %u", roident);

Some ideas to fix this include:
1. Do nothing except write a comment about this in the code. The
internal ERROR is not ideal for a user API, but no great harm is
done.
2. Change the behavior of replorigin_drop to be like
replorigin_drop_IF_EXISTS, so the 2nd execution of this race would
silently do nothing when it finds the roident is already gone.
3. Same as 2, but make the NOP behavior more explicit by introducing a
new "missing_ok" parameter for replorigin_drop.

Thoughts?
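
A rough sketch of idea 3, to show one possible shape for the API change
(the signature and flow are assumptions, not a finished patch):

void
replorigin_drop(RepOriginId roident, bool nowait, bool missing_ok)
{
	/* ... lock and look up the origin by roident ... */

	if (!found)
	{
		if (missing_ok)
			return;		/* a concurrent drop already won the race */

		elog(ERROR, "cache lookup failed for replication origin with oid %u",
			 roident);
	}

	/* ... proceed with the actual drop ... */
}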


[ak0125] 
https://www.postgresql.org/message-id/CAA4eK1%2ByeLwBCkTvTdPM-hSk1fr6jT8KJc362CN8zrGztq_JqQ%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: WIP: BRIN multi-range indexes

2021-01-26 Thread Tomas Vondra




On 1/26/21 7:52 PM, John Naylor wrote:
On Fri, Jan 22, 2021 at 10:59 PM Tomas Vondra wrote:

 >
 >
 > On 1/23/21 12:27 AM, John Naylor wrote:

 > > Still, it would be great if multi-minmax can be a drop in 
replacement. I

 > > know there was a sticking point of a distance function not being
 > > available on all types, but I wonder if that can be remedied or worked
 > > around somehow.
 > >
 >
 > Hmm. I think Alvaro also mentioned he'd like to use this as a drop-in
 > replacement for minmax (essentially, using these opclasses as the
 > default ones, with the option to switch back to plain minmax). I'm not
 > convinced we should do that - though. Imagine you have minmax indexes in
 > your existing DB, it's working perfectly fine, and then we come and just
 > silently change that during dump/restore. Is there some past example
 > when we did something similar and it turned it to be OK?

I was assuming pg_dump can be taught to insert explicit opclasses for 
minmax indexes, so that upgrade would not cause surprises. If that's 
true, only new indexes would have the different default opclass.




Maybe, I suppose we could do that. But I have always found such changes 
happening silently in the background a bit suspicious, because it may be 
quite confusing. I certainly wouldn't expect such a difference between 
creating a new index and an index created by dump/restore. Have we made 
such a change in the past? That might be a precedent, but I don't recall 
any example ...



 > As for the distance functions, I'm pretty sure there are data types
 > without "natural" distance - like most strings, for example. We could
 > probably invent something, but the question is how much we can rely on
 > it working well enough in practice.
 >
 > Of course, is minmax even the right index type for such data types?
 > Strings are usually "labels" and not queried using range queries,
 > although sometimes people encode stuff as strings (but then it's very
 > unlikely we'll define the distance definition well). So maybe for those
 > types a hash / bloom would be a better fit anyway.

Right.

 > But I do have an idea - maybe we can do without distances, in those
 > cases. Essentially, the primary issue of minmax indexes are outliers, so
 > what if we simply sort the values, keep one range in the middle and as
 > many single points on each tail?

That's an interesting idea. I think it would be a nice bonus to try to 
do something along these lines. On the other hand, I'm not the one 
volunteering to do the work, and the patch is useful as is.




IMO it's a fairly small amount of code, so I'll take a stab at it in the 
next version of the patch.
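
For illustration, a toy version of that tail-trimming idea on plain 
integers (not patch code):

#include <stdio.h>

/*
 * Toy sketch: keep up to k exact points on each tail and collapse the
 * middle of a sorted array into a single [min, max] range, so extreme
 * outliers no longer inflate the range -- no distance function needed.
 */
static void
summarize(const int *sorted, int n, int k)
{
	int			i;

	for (i = 0; i < k && i < n; i++)
		printf("point: %d\n", sorted[i]);
	if (n > 2 * k)
		printf("range: [%d, %d]\n", sorted[k], sorted[n - k - 1]);
	for (i = (n > 2 * k ? n - k : k); i < n; i++)
		printf("point: %d\n", sorted[i]);
}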



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: mkid reference

2021-01-26 Thread Bruce Momjian
On Tue, Jan 26, 2021 at 10:19:44PM +0100, Magnus Hagander wrote:
> On Mon, Jan 25, 2021 at 4:40 PM Bruce Momjian  wrote:
> >
> > On Sun, Jan 24, 2021 at 02:20:58PM +0100, Magnus Hagander wrote:
> > > While at it, what point is "codelines" adding?
> >
> > That is the script I use to generate code line counts when comparing
> > releases.  I thought it should be in the tree so others can reproduce my
> > numbers.
> 
> Not that it particularly matters to keep it, but wouldn't something
> like cloc give a much better number?

Yes, we could, but we didn't really have any criteria on exactly what to
count, so I just counted physical lines.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Online checksums patch - once again

2021-01-26 Thread Heikki Linnakangas

On 22/01/2021 14:21, Heikki Linnakangas wrote:

On 22/01/2021 13:55, Heikki Linnakangas wrote:

I read through the latest patch,
v31-0001-Support-checksum-enable-disable-in-a-running-clu.patch. Some
comments below:


One more thing:

In SetRelationNumChecks(), you should use SearchSysCacheCopy1() to get a
modifiable copy of the tuple. Otherwise you modify the tuple in the
relcache as a side effect. Maybe that's harmless in this case, as the
'relhaschecksums' value in the relcache isn't used for anything, but
let's be tidy.


Sorry, I meant SetRelHasChecksums. There is no SetRelationNumChecks 
function, I don't know where I got that from.
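
The usual pattern looks roughly like this (a sketch of the idiom, not the 
patch itself; relhaschecksums is the column the patch adds):

	Relation	relrel;
	HeapTuple	tup;
	Form_pg_class relform;

	relrel = table_open(RelationRelationId, RowExclusiveLock);

	/* get a palloc'd copy we may scribble on, not the relcache's tuple */
	tup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid));
	if (!HeapTupleIsValid(tup))
		elog(ERROR, "cache lookup failed for relation %u", relid);

	relform = (Form_pg_class) GETSTRUCT(tup);
	relform->relhaschecksums = true;
	CatalogTupleUpdate(relrel, &tup->t_self, tup);

	heap_freetuple(tup);
	table_close(relrel, RowExclusiveLock);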


- Heikki




Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2021-01-26 Thread Alexey Kondratov

On 2021-01-26 09:58, Michael Paquier wrote:

On Mon, Jan 25, 2021 at 11:11:38PM +0300, Alexey Kondratov wrote:
I updated the comment with CCI info, did a pgindent run, and renamed the 
new function to SetRelationTableSpace(). A new patch is attached.

[...]

Yeah, all these checks were complicated from the beginning. I will try 
to find a better place tomorrow or put more info into the comments at 
least.


I was reviewing that, and I think that we can do a better
consolidation on several points that will also help the features
discussed on this thread for VACUUM, CLUSTER and REINDEX.

If you look closely, ATExecSetTableSpace() uses the same logic as the
code modified here to check if a relation can be moved to a new
tablespace, with extra checks for mapped relations,
GLOBALTABLESPACE_OID or if attempting to manipulate a temp relation
from another session.  There are two differences though:
- Custom actions are taken between the phase where we check if a
relation can be moved to a new tablespace, and the update of
pg_class.
- ATExecSetTableSpace() needs to be able to set a given relation
relfilenode on top of reltablespace, the newly-created one.

So I think that the heart of the problem is made of two things here:
- We should have one common routine for the existing code paths and
the new code paths able to check if a tablespace move can be done or
not.  The case of a cluster, reindex or vacuum on a list of relations
extracted from pg_class would still require a different handling
as incorrect relations have to be skipped, but the case of individual
relations can reuse the refactoring pieces done here
(see CheckRelationTableSpaceMove() in the attached).
- We need to have a second routine able to update reltablespace and
optionally relfilenode for a given relation's pg_class entry, once the
caller has made sure that CheckRelationTableSpaceMove() validates a
tablespace move.



I think that I got your idea. One comment:

+bool
+CheckRelationTableSpaceMove(Relation rel, Oid newTableSpaceId)
+{
+   Oid oldTableSpaceId;
+   Oid reloid = RelationGetRelid(rel);
+
+   /*
+* No work if no change in tablespace.  Note that MyDatabaseTableSpace
+* is stored as 0.
+*/
+   oldTableSpaceId = rel->rd_rel->reltablespace;
+   if (newTableSpaceId == oldTableSpaceId ||
+   (newTableSpaceId == MyDatabaseTableSpace && oldTableSpaceId == 
0))
+   {
+   InvokeObjectPostAlterHook(RelationRelationId, reloid, 0);
+   return false;
+   }

CheckRelationTableSpaceMove() does not feel like the right place for 
invoking post-alter hooks. It is intended only to check whether a 
tablespace change is possible. Anyway, ATExecSetTableSpace() and 
ATExecSetTableSpaceNoStorage() already do that in the no-op case.



Please note that was a bug in your previous patch 0002: shared
dependencies need to be registered if reltablespace is updated of
course, but also iff the relation has no physical storage.  So
changeDependencyOnTablespace() requires a check based on
RELKIND_HAS_STORAGE(), or REINDEX would have registered shared
dependencies even for relations with storage, something we don't
want per the recent work done by Alvaro in ebfe2db.



Yes, thanks.

I have removed this InvokeObjectPostAlterHook() from your 0001 and made 
0002 work on top of it. I think that it should now look closer to what 
you described above.


In the new 0002 I moved the ACL check to the upper level, i.e. 
ExecReindex(), and removed the expensive text generation in the test. I 
have not yet addressed some of your previously raised concerns. Also, 
you made SetRelationTableSpace() accept a Relation instead of an Oid, so 
now we have to open/close indexes in ReindexPartitions(). I am not sure 
that I use proper locking there, but it works.



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres CompanyFrom 96a37399a9cf9ae08d62e28496e73b36087e5a19 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Tue, 26 Jan 2021 15:53:06 +0900
Subject: [PATCH v7 1/2] Refactor code to detect and process tablespace moves

---
 src/backend/commands/tablecmds.c | 218 +--
 src/include/commands/tablecmds.h |   4 +
 2 files changed, 127 insertions(+), 95 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 8687e9a97c..c08eedf995 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -3037,6 +3037,116 @@ SetRelationHasSubclass(Oid relationId, bool relhassubclass)
 	table_close(relationRelation, RowExclusiveLock);
 }
 
+/*
+ * CheckRelationTableSpaceMove
+ *		Check if relation can be moved to new tablespace.
+ *
+ * NOTE: caller must be holding an appropriate lock on the relation.
+ * ShareUpdateExclusiveLock is sufficient to prevent concurrent schema
+ * changes.
+ *
+ * Returns true if the relation can be moved to the new tablespace;
+ * false 

wiki:PostgreSQL_14_Open_Items

2021-01-26 Thread Justin Pryzby
I've created a new page, and added some unresolved items that I've been keeping
in my head.

https://wiki.postgresql.org/wiki/PostgreSQL_14_Open_Items




Re: strange error reporting

2021-01-26 Thread Tom Lane
Alvaro Herrera  writes:
> pgbench has one occurrence of the old pattern in master, in line 6043.
> However, since doConnect() returns NULL when it gets CONNECTION_BAD,
> that seems dead code.  This patch kills it.

Oh ... I missed that because it wasn't adjacent to the PQconnectdbParams
call :-(.  You're right, that's dead code and we should just delete it.

regards, tom lane




Re: 13dev failed assert: comparetup_index_btree(): ItemPointer values should never be equal

2021-01-26 Thread Tom Lane
Justin Pryzby  writes:
> I have nothing new to add, but wanted to point out this is still an issue.
> This is on the v13 Opened Items list - for lack of anywhere else to put them, 
> I
> also added two other, unresolved issues.

It's probably time to make a v14 open items page, and move anything
you want to treat as a live issue to there.

regards, tom lane




Re: mkid reference

2021-01-26 Thread Magnus Hagander
On Mon, Jan 25, 2021 at 4:40 PM Bruce Momjian  wrote:
>
> On Sun, Jan 24, 2021 at 02:20:58PM +0100, Magnus Hagander wrote:
> > While at it, what point is "codelines" adding?
>
> That is the script I use to generate code line counts when comparing
> releases.  I thought it should be in the tree so others can reproduce my
> numbers.

Not that it particularly matters to keep it, but wouldn't something
like cloc give a much better number?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: mkid reference

2021-01-26 Thread Magnus Hagander
On Tue, Jan 26, 2021 at 6:58 PM Bruce Momjian  wrote:
>
> On Tue, Jan 26, 2021 at 05:03:30PM +0100, Magnus Hagander wrote:
> > On Mon, Jan 25, 2021 at 4:38 PM Bruce Momjian  wrote:
> > >
> > > On Fri, Jan 22, 2021 at 01:07:36PM -0500, Tom Lane wrote:
> > > > > There's also src/tools/make_mkid which use this mkid tool.  +1 for 
> > > > > removing.
> > > > > If anything, it seems better replaced by extended documentation on 
> > > > > the existing
> > > > > wiki article [0] on how to use "git format-patch".
> > > >
> > > > I found man pages for mkid online --- it's apparently a ctags-like
> > > > code indexing tool, not something for patches.  So maybe Bruce still
> > > > uses it, or maybe not.  But as long as we've also got make_ctags and
> > >
> > > Yes, I do still use it, so I thought having a script to generate its
> > > index files might be helpful to someone.
> >
> > Where do you actually get it? The old docs (now removed) suggested
> > getting it off ftp.postgresql.org...
>
> Not sure why it was on our ftp site, since it is a GNU download:
>
> https://www.gnu.org/software/idutils/

Ah, good. Then at least we now have it in the list archives for
reference if somebody else searches for it :)

And no, it wasn't actually on our ftp server. But it might have been
at some point far far in the past...

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: shared tempfile was not removed on statement_timeout

2021-01-26 Thread Thomas Munro
On Wed, Jan 27, 2021 at 12:22 AM Kyotaro Horiguchi
 wrote:
> At Tue, 26 Jan 2021 11:00:56 +0200, Heikki Linnakangas  
> wrote in
> > Don't we potentially have the same problem with all on_dsm_detach
> > callbacks? Looking at the other on_dsm_detach callbacks, I don't see
> > any CHECK_FOR_INTERRUPT() calls in them, but it seems fragile to
> > assume that.
> >
> > I'd suggest adding HOLD/RESUME_INTERRUPTS() to dsm_detach(). At least
> > around the removal of the callback from the list and calling the
> > callback. Maybe even over the whole function.
>
> Yes, I first came up with HOLD/RESUME_QUERY_INTERRUPTS() to the same
> location.

+1, this seems like a good idea.  This is a little bit like the code
near the comments "Don't joggle the elbow of proc_exit".
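For concreteness, a minimal sketch of that idea against the shape of
dsm_detach()'s callback loop in dsm.c; only the HOLD/RESUME calls are new,
the rest is abridged from the existing function:

/*
 * Run on-detach callbacks with interrupts held off, so that a
 * CHECK_FOR_INTERRUPTS() reached inside a callback cannot throw and
 * abandon the remaining cleanup.
 */
HOLD_INTERRUPTS();
while (!slist_is_empty(&seg->on_detach))
{
	slist_node *node = slist_pop_head_node(&seg->on_detach);
	dsm_segment_detach_callback *cb =
		slist_container(dsm_segment_detach_callback, node, node);
	on_dsm_detach_callback function = cb->function;
	Datum		arg = cb->arg;

	pfree(cb);
	function(seg, arg);
}
RESUME_INTERRUPTS();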




Re: 13dev failed assert: comparetup_index_btree(): ItemPointer values should never be equal

2021-01-26 Thread Justin Pryzby
I have nothing new to add, but wanted to point out this is still an issue.

This is on the v13 Opened Items list - for lack of anywhere else to put them, I
also added two other, unresolved issues.

https://wiki.postgresql.org/index.php?title=PostgreSQL_13_Open_Items&type=revision&diff=35624&oldid=35352

On Tue, Aug 04, 2020 at 06:00:34PM -0400, Tom Lane wrote:
> Peter Geoghegan  writes:
> > On Tue, Aug 4, 2020 at 1:31 PM Tom Lane  wrote:
> >> while the INSERT_IN_PROGRESS case has none.  Simple symmetry
> >> would suggest that the INSERT_IN_PROGRESS case should be
> >> 
> >>    if (checking_uniqueness ||
> >>        HeapTupleIsHeapOnly(heapTuple))
> >>        // wait
> 
> > I had exactly the same intuition.
> 
> >> but I'm not 100% convinced that that's right.
> 
> > Why doubt that explanation?
> 
> First, it's not clear that this is an exact inverse, because
> HeapTupleIsHotUpdated does more than check the HOT_UPDATED flag.
> Second, I think there remains some doubt as to whether the
> DELETE_IN_PROGRESS case is right either.  If we were forcing
> a wait for *every* in-doubt HOT-chain element, not only non-last
> ones (or non-first ones for the INSERT case, if we use the above
> fix) then it'd be quite clear that we're safe.  But if we want
> to keep the optimization then I think maybe closer analysis is
> warranted.




Re: Key management with tests

2021-01-26 Thread Robert Haas
On Tue, Jan 26, 2021 at 11:15 AM Bruce Momjian  wrote:
> This version fixes OpenSSL detection and improves docs for initdb
> interactions.

Hi,

I'm wondering whether you've considered storing all the keys in one
file instead of a file per key. The reason I ask is that it seems to
me that the key rotation procedure would be a lot simpler if it were
all in one file. You could just create a temporary file and atomically
rename it over the existing file. If you see a temporary file you're
always free to remove it. This is a lot simpler than what you have
right now. The "repair" concept pretty much goes away completely,
which seems nice. Granted I don't know exactly how to store multiple
keys in one file, but I bet there's some way to do it.
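As a sketch of that single-file rotation, with hypothetical file names and
the existing backend file APIs (durable_rename() fsyncs the file and its
directory, and a leftover temporary file can simply be removed at startup):

static void
write_key_file(const char *dir, const char *keys, int len)
{
	char		tmppath[MAXPGPATH];
	char		path[MAXPGPATH];
	int			fd;

	snprintf(tmppath, sizeof(tmppath), "%s/keys.tmp", dir);
	snprintf(path, sizeof(path), "%s/keys", dir);

	/* Write the complete new key set to a temporary file first. */
	fd = OpenTransientFile(tmppath, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
	if (fd < 0 || write(fd, keys, len) != len || pg_fsync(fd) != 0)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not write key file \"%s\": %m", tmppath)));
	CloseTransientFile(fd);

	/* Atomically replace the old file with the new one. */
	durable_rename(tmppath, path, ERROR);
}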

The way in which you are posting these patches is quite unlike what
most people do when posting patches to this list. You seem to have
generated a bunch of patches using 'git format-patch' but then
concatenated them all together in a single file. It would be helpful
if you could do this more like the way that is now standard on this
list. Not only that, but the patches don't have meaningful commit
messages in them, and don't seem to be meaningfully split for easy
review. They just say things like 'crypto squash commit'. Compare this
to for example what I did on the "cleaning up a few CLOG-related
things" thread where the commits appear in a logical sequence, and
each one has a meaningful commit message. Or here's an example from
someone else --
http://postgr.es/m/be72abfa-e62e-eb81-4e70-1b57fe6dc...@amazon.com --
and note the inclusion of authorship information in the commit
messages, so that the source of the code can be easily understood.

The README in src/backend/crypto does not explain how the scripts in
that directory are intended to be used. If I want to use AWS Secrets
Manager with this feature, I can see that I should use
ckey_aws.sh.sample as a basis for that integration, but I don't know
what I should do with the script because the README says nothing about
it. I am frankly pretty doubtful about the idea of shipping a bunch of
/bin/sh scripts as a best practice; for one thing, that's totally
unusable on Windows, and it also means that this is dependent on
/bin/sh existing and having the behavior we expect and on all the
other executables in these scripts as well. But, at the very least,
there needs to be a clearer explanation of how the scripts are
intended to be used, which parts people are supposed to modify, what
arguments they're going to get called with, and things like that.

The comments in cipher.c and cipher_openssl.c could be improved to
explain that they are alternatives to each other. Perhaps the former
could be renamed to something like cipher_failure.c or cipher_noimpl.c
for clarity.

I believe that a StaticAssertStmt could be used to check the length of
the encryption_methods[] array, so that if someone changes
NUM_ENCRYPTION_METHODS without updating the array, compilation fails.
See UserAuthName[] for an example of how to do this.
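A sketch of what that would look like here, modeled on UserAuthName[] (the
array contents below are placeholders, not the patch's actual method names):

static const char *const encryption_methods[] = {
	"off",
	"aes-128",
	"aes-256",
};

StaticAssertDecl(lengthof(encryption_methods) == NUM_ENCRYPTION_METHODS,
				 "encryption_methods[] must match NUM_ENCRYPTION_METHODS");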

You seem to have omitted to update the documentation with the names of
the new wait events that you added.

In process_postgres_switches(), when there's a multi-line comment
followed by a single line of actual code, I prefer to include braces
around the whole thing. There might be some disagreement on what is
best here, though.

What are the consequences of the placement of the code in
PostgresMain() for processes other than user backends and walsenders?
I think that the way you have it, background workers would not have
access to keys, nor auxiliary processes like the checkpointer ... at
least in the EXEC_BACKEND case. In the non-EXEC_BACKEND case you have
the postmaster doing it, so then I'm not sure why it has to be redone
for every backend. Won't they just inherit the data from the
postmaster? Has this code been meaningfully tested on Windows? How do
we know that it works? Maybe we need to think about adding some
asserts that guarantee that any process that attempts to access a
buffer has the key manager initialized; I bet such assertions would
fail at least on Windows as the code looks now.

I don't think it makes sense to think about committing this to v14. I
believe it only makes sense if we have a TDE patch that is relatively
close to committable that can be used with it. I also don't think that
this patch is in good enough shape to commit yet in terms of where
it's at in terms of quality; I think it needs more review first,
hopefully including review from people who can comment intelligently
specifically on the cryptography aspects of it. However, the
challenges don't seem insurmountable. There's also still some question
in my mind about whether the design choices here (one KEK, 2 DEKs, one
for data and one for WAL) have enough consensus. I don't have a
considered opinion on that, partly because I'm not quite sure what the
reasonable alternatives are, but it seems that other people had some
questions about 

Re: strange error reporting

2021-01-26 Thread Alvaro Herrera
On 2021-Jan-20, Robert Haas wrote:

> On Wed, Jan 20, 2021 at 1:54 PM Tom Lane  wrote:
> > Alvaro Herrera  writes:

> > > Well, the patch seems small enough, and I don't think it'll be in any
> > > way helpful to omit that detail.
> >
> > I'm +1 for applying and back-patching that.  I still think we might
> > want to just drop the phrase altogether in HEAD, but we wouldn't do
> > that in the back branches, and the message is surely misleading as-is.
> 
> Sure, that makes sense.

OK, I pushed it.  Thanks,

pgbench has one occurrence of the old pattern in master, in line 6043.
However, since doConnect() returns NULL when it gets CONNECTION_BAD,
that seems dead code.  This patch kills it.

-- 
Álvaro Herrera         39°49'30"S 73°17'W
"I can see support will not be a problem.  10 out of 10."(Simon Wittber)
  (http://archives.postgresql.org/pgsql-general/2004-12/msg00159.php)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 1be1ad3d6d..a4a3f40048 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -6038,17 +6038,10 @@ main(int argc, char **argv)
 	/* opening connection... */
 	con = doConnect();
 	if (con == NULL)
 		exit(1);
 
-	if (PQstatus(con) == CONNECTION_BAD)
-	{
-		pg_log_fatal("connection to database \"%s\" failed: %s",
-	 dbName, PQerrorMessage(con));
-		exit(1);
-	}
-
 	if (internal_script_used)
 		GetTableInfo(con, scale_given);
 
 	/*
 	 * :scale variables normally get -s or database scale, but don't override


Re: making update/delete of inheritance trees scale better

2021-01-26 Thread Robert Haas
On Fri, Oct 30, 2020 at 6:26 PM Heikki Linnakangas  wrote:
> Yeah, you need to access the old tuple to update its t_ctid, but
> accessing it twice is still more expensive than accessing it once. Maybe
> you could optimize it somewhat by keeping the buffer pinned or
> something. Or push the responsibility down to the table AM, passing the
> AM only the modified columns, and let the AM figure out how to deal with
> the columns that were not modified, hoping that it can do something smart.

Just as a point of possible interest, back when I was working on
zheap, I sort of wanted to take this in the opposite direction. In
effect, a zheap tuple has system columns that don't exist for a heap
tuple, and you can't do an update or delete without knowing what the
values for those columns are, so zheap had to just refetch the tuple,
but that sucked in comparisons with the existing heap, which didn't
have to do the refetch. At the time, I thought maybe the right idea
would be to extend things so that a table AM could specify an
arbitrary set of system columns that needed to be bubbled up to the
point where the update or delete happens, but that seemed really
complicated to implement and I never tried. Here it seems like we're
thinking of going the other way, and just always doing the refetch.
That is of course fine for zheap comparative benchmarks: instead of
making zheap faster, we just make the heap slower!

Well, sort of. I didn't think about the benefits of the refetch
approach when the tuples are wide. That does cast a somewhat different
light on things. I suppose we could have both methods and choose the
one that seems likely to be faster in particular cases, but that seems
like way too much machinery. Maybe there's some way to further
optimize accessing the same tuple multiple times in rapid succession
to claw back some of the lost performance in the slow cases, but I
don't have a specific idea.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: archive status ".ready" files may be created too early

2021-01-26 Thread Bossart, Nathan
On 1/2/21, 8:55 AM, "Andrey Borodin"  wrote:
> Recently we had somewhat related incident. Do I understand correctly that 
> this incident is related to the bug discussed in this thread?

I'm not sure that we can rule it out, but the log pattern I've
typically seen for this is "invalid contrecord length."  The issue is
that we're marking segments as ready for archive when the segment is
fully written versus when its WAL records are fully written (since its
WAL records may cross into the next segment).  The fact that you're
seeing zeroes at the end of your archived segments leads me to think
it is unlikely that you are experiencing this bug.

Nathan



Re: archive status ".ready" files may be created too early

2021-01-26 Thread Bossart, Nathan
On 12/17/20, 9:15 PM, "Kyotaro Horiguchi"  wrote:
> At Thu, 17 Dec 2020 22:20:35 +, "Bossart, Nathan"  
> wrote in
>> On 12/15/20, 2:33 AM, "Kyotaro Horiguchi"  wrote:
>> > You're right in that regard. There's a window where a partial record is
>> > written when the write location passes F0 after the insertion location
>> > passes F1. However, remembering all spanning records seems like overkill to me.
>>
>> I'm curious why you feel that recording all cross-segment records is
>> overkill.  IMO it seems far simpler to just do that rather than try to
>
> Sorry, my words are not enough. Remembering all spanning records in
> *shared memory* seems like overkill, even more so if it is stored in a
> shared hash table.  Even though it is rarely the case, it can fail in a
> hard way when reaching the limit. If we could do well by remembering just
> two locations, we wouldn't need to worry about such a limitation.

I don't think it will fail if we reach max_size for the hash table.
The comment above ShmemInitHash() has this note:

* max_size is the estimated maximum number of hashtable entries.  This is
* not a hard limit, but the access efficiency will degrade if it is
* exceeded substantially (since it's used to compute directory size and
* the hash table buckets will get overfull).
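To illustrate the shape of such a table (the entry layout and all names here
are hypothetical; only the ShmemInitHash() API itself is real):

typedef struct RecordBoundaryEntry
{
	XLogSegNo	seg;			/* hash key: segment a record crosses into */
	XLogRecPtr	endOfRecord;	/* where the spanning record really ends */
} RecordBoundaryEntry;

static HTAB *RecordBoundaryTable;

void
RecordBoundaryShmemInit(void)
{
	HASHCTL		info;

	MemSet(&info, 0, sizeof(info));
	info.keysize = sizeof(XLogSegNo);
	info.entrysize = sizeof(RecordBoundaryEntry);

	/* The sizing hint is soft: exceeding it degrades lookup speed only. */
	RecordBoundaryTable = ShmemInitHash("Archive Record Boundaries",
										64, 64, &info,
										HASH_ELEM | HASH_BLOBS);
}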

> Another concern about the concrete patch:
>
> NotifySegmentsReadyForArchive() searches the shared hash, acquiring an
> LWLock, every time XLogWrite is called while segment archiving is being
> held off.  I don't think that is acceptable, and I think it could be a
> problem when many backends are competing on WAL.

This is a fair point.  I did some benchmarking with a few hundred
connections all doing writes, and I was not able to discern any
noticeable performance impact.  My guess is that contention on this
new lock is unlikely because callers of XLogWrite() must already hold
WALWriteLock.  Otherwise, I believe we only acquire ArchNotifyLock no
more than once per segment to record new record boundaries.

Nathan



Re: WIP: BRIN multi-range indexes

2021-01-26 Thread John Naylor
On Fri, Jan 22, 2021 at 10:59 PM Tomas Vondra 
wrote:
>
>
> On 1/23/21 12:27 AM, John Naylor wrote:

> > Still, it would be great if multi-minmax can be a drop in replacement. I
> > know there was a sticking point of a distance function not being
> > available on all types, but I wonder if that can be remedied or worked
> > around somehow.
> >
>
> Hmm. I think Alvaro also mentioned he'd like to use this as a drop-in
> replacement for minmax (essentially, using these opclasses as the
> default ones, with the option to switch back to plain minmax). I'm not
> convinced we should do that, though. Imagine you have minmax indexes in
> your existing DB, it's working perfectly fine, and then we come and just
> silently change that during dump/restore. Is there some past example
> where we did something similar and it turned out to be OK?

I was assuming pg_dump can be taught to insert explicit opclasses for
minmax indexes, so that upgrade would not cause surprises. If that's true,
only new indexes would have the different default opclass.

> As for the distance functions, I'm pretty sure there are data types
> without "natural" distance - like most strings, for example. We could
> probably invent something, but the question is how much we can rely on
> it working well enough in practice.
>
> Of course, is minmax even the right index type for such data types?
> Strings are usually "labels" and not queried using range queries,
> although sometimes people encode stuff as strings (but then it's very
> unlikely we'll define the distance definition well). So maybe for those
> types a hash / bloom would be a better fit anyway.

Right.

> But I do have an idea - maybe we can do without distances, in those
> cases. Essentially, the primary issue of minmax indexes are outliers, so
> what if we simply sort the values, keep one range in the middle and as
> many single points on each tail?

That's an interesting idea. I think it would be a nice bonus to try to do
something along these lines. On the other hand, I'm not the one
volunteering to do the work, and the patch is useful as is.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Allow matching whole DN from a client certificate

2021-01-26 Thread Jacob Champion
On Tue, 2021-01-26 at 13:49 +0100, Daniel Gustafsson wrote:
> Reading more on this it seems we would essentially have to go with RFC 4514, 
> as
> it's the closest to a standard which seems to exist.  Having done a lot
> of research on this topic, do you have a gut feeling on direction?

Yeah, if we're going to use string matching then I agree with the use
of RFC 4514. I think a string-based approach is going to work for only
the most simple cases, though. Anything more complicated than plain
ASCII and a fixed base DN is going to become pretty unmanageable with a
regex; you'd need a filter system of some sort that understands DN
structure. The documentation should be clear on the limitations.

> The OpenSSL X509_NAME_cmp function use RFC 5280 section 7.1 and RFC 4517
> section 4.2.15 (which in turn reference RFC 4514 for the DN string format).
> libnss has CERT_AsciiToName which is referencing RFCs 1485, 1779 and 2253 in
> lib/certdb/alg1485.c.  Comparing the two would be interesting.

Yeah. I'll poke around a bit.

> Right, multi-value RDN's are defined as sets so C=US,A=B+C=D is equivalent to
> C=US,C=D+A=B according to RFC 5280.

Other potential differences, by my understanding of that RFC, include
the quoting of the "escaped" character class (e.g. a comma can be
escaped as either "\," or "\2C"), the case of hex characters ("\FF" vs
"\ff"), and the option to not quote unprintable control characters
(e.g. ASCII DEL, 0x7F, can be included verbatim or quoted as "\7F").

--Jacob
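For reference, a sketch of rendering the subject DN through OpenSSL's
RFC 2253-compatible escaping; this is an assumption about how a backend
might drive the API, not code from the patch:

static char *
render_peer_dn(X509 *peer_cert)
{
	BIO		   *membio = BIO_new(BIO_s_mem());
	char	   *data;
	long		len;
	char	   *result;

	/* 0 = no indentation; XN_FLAG_RFC2253 selects the RFC-style escaping */
	X509_NAME_print_ex(membio, X509_get_subject_name(peer_cert),
					   0, XN_FLAG_RFC2253);
	len = BIO_get_mem_data(membio, &data);
	result = pnstrdup(data, len);	/* the BIO buffer is not NUL-terminated */
	BIO_free(membio);
	return result;
}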


Re: WIP: BRIN multi-range indexes

2021-01-26 Thread John Naylor
Hi Tomas,
I took another look through the Bloom opclass portion (0004) with
sorted_mode omitted, and it looks good to me code-wise. I think this part
is close to commit-ready. I also did one more proofreading pass for
minor details.

+ rows per block). The default values is -0.1, and

+ greater than 0.0 and smaller than 1.0. The default values is 0.01,

s/values/value/

+ * bloom filter, we can easily and cheaply test wheter it contains values

s/wheter/whether/

+ * XXX We assume the bloom filters have the same parameters fow now. In the

s/fow/for/

+ * or null if it does not exists.

s/exists/exist/

+ * We do expect the bloom filter to eventually switch to hashing mode,
+ * and it's bound to be almost perfectly random, so not compressible.

Leftover from when it started out in sorted mode.

+ if ((m/8) > BLCKSZ)

It seems we need something more safe, to account for page header and tuple
header at least. As the comment before says, the filter will eventually not
be compressible. I remember we can't be exact here, with the possibility of
multiple columns, but we can leave a little slack space.
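For instance, something along these lines, assuming BrinMaxItemSize from
brin_pageops.h is the right per-tuple ceiling (the slack constant is
arbitrary headroom for the other summarized columns):

#define BLOOM_EXTRA_SLACK	128

if ((m / 8) > BrinMaxItemSize - BLOOM_EXTRA_SLACK)
	ereport(ERROR,
			(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
			 errmsg("bloom filter of %zu bytes does not fit on a BRIN page",
					(Size) (m / 8))));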

-- 
John Naylor
EDB: http://www.enterprisedb.com


Re: Deleting older versions in unique indexes to avoid page splits

2021-01-26 Thread Peter Geoghegan
On Mon, Jan 25, 2021 at 10:48 PM Amit Kapila  wrote:
> I need to spend more time on benchmarking to study the behavior and I
> think without that it would be difficult to make a conclusion in this
> regard. So, let's not consider any action on this front till I spend
> more time to find the details.

It is true that I committed the patch without thorough review, which
was less than ideal. I welcome additional review from you now.

I will say one more thing about it for now: Start with a workload, not
with the code. Without bottom-up deletion (e.g. when using Postgres
13), even with a simple though extreme workload that will experience version
churn in indexes after a while, it still takes quite a few minutes for
the first page to split (when the table is at least a few GB in size
to begin with). When I was testing the patch I would notice that it
could take 10 or 15 minutes for the deletion mechanism to kick in for
the first time -- the patch really didn't do anything at all until
perhaps 15 minutes into the benchmark, despite helping *enormously* by
the 60 minute mark. And this is with significant skew, so presumably
the first page that would split (in the absence of the bottom-up
deletion feature) was approximately the page with the most skew --
most individual pages might have taken 30 minutes or more to split
without the intervention of bottom-up deletion.

Relatively rare events (in this case would-be page splits) can have
very significant long term consequences for the sustainability of a
workload, so relatively simple targeted interventions can make all the
difference. The idea behind bottom-up deletion is to allow the
workload to figure out the best way of fixing its bloat problems
*naturally*. The heuristics must be simple precisely because workloads
are so varied and complicated. We must be willing to pay small fixed
costs for negative feedback  -- it has to be okay for the mechanism to
occasionally fail in order to learn what works. I freely admit that I
don't understand all workloads. But I don't think anybody can. This
holistic/organic approach has a lot of advantages, especially given
the general uncertainty about workload characteristics. Your suspicion
of the simple nature of the heuristics actually makes a lot of sense
to me. I do get it.

-- 
Peter Geoghegan




Re: Error on failed COMMIT

2021-01-26 Thread Dave Cramer
On Tue, 26 Jan 2021 at 12:46, Laurenz Albe  wrote:

> On Tue, 2021-01-26 at 12:25 -0500, Dave Cramer wrote:
> > > After thinking some more about it, I think that COMMIT AND CHAIN would
> have
> > > to change behavior: if COMMIT throws an error (because the transaction
> was
> > > aborted), no new transaction should be started.  Everything else seems
> fishy:
> > > the statement fails, but still starts a new transaction?
> > >
> > > I guess that's also at fault for the unexpected result status that
> > > Masahiko complained about in the other message.
> >
> >
> > I haven't had a look at the result status in libpq. For JDBC we don't
> see that.
> > We throw an exception when we get this error report. This is very
> consistent as the commit fails and we throw an exception
> >
> > > So I think we should not introduce USER_ERROR at all.  It is too much
> > > of a kluge: fail, but not really...
> >
> > What we do now is actually worse as we do not get an error report and we
> silently change commit to rollback.
> > How is this better ?
>
> I see your point from the view of the JDBC driver.
>
> It just feels hacky - somewhat similar to what you say
> above: don't go through the normal transaction rollback steps,
> but issue an error message.
>
> At least we should fake it well...
>

OK, let me look into how we deal with COMMIT and CHAIN.

I can see some real issues with this as Vik pointed out.

Dave


Re: mkid reference

2021-01-26 Thread Bruce Momjian
On Tue, Jan 26, 2021 at 05:03:30PM +0100, Magnus Hagander wrote:
> On Mon, Jan 25, 2021 at 4:38 PM Bruce Momjian  wrote:
> >
> > On Fri, Jan 22, 2021 at 01:07:36PM -0500, Tom Lane wrote:
> > > > There's also src/tools/make_mkid which use this mkid tool.  +1 for 
> > > > removing.
> > > > If anything, it seems better replaced by extended documentation on the 
> > > > existing
> > > > wiki article [0] on how to use "git format-patch".
> > >
> > > I found man pages for mkid online --- it's apparently a ctags-like
> > > code indexing tool, not something for patches.  So maybe Bruce still
> > > uses it, or maybe not.  But as long as we've also got make_ctags and
> >
> > Yes, I do still use it, so I thought having a script to generate its
> > index files might be helpful to someone.
> 
> Where do you actually get it? The old docs (now removed) suggested
> getting it off ftp.postgresql.org...

Not sure why it was on our ftp site, since it is a GNU download:

https://www.gnu.org/software/idutils/

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2021-01-26 Thread Robert Haas
On Thu, Jan 21, 2021 at 3:08 PM Matthias van de Meent
 wrote:
> Re-thinking this, and after some research:
>
> Is the behaviour of "any process that invalidates TIDs in this table
> (that could be in an index on this table) always holds a lock that
> conflicts with CIC/RiC on that table" a requirement of tableams, or is
> it an implementation-detail?
>
> If it is a requirement, then this patch is a +1 for me (and that
> requirement should be documented in such case), otherwise a -1 while
> there is no mechanism in place to remove concurrently-invalidated TIDs
> from CIC-ed/RiC-ed indexes.

I don't believe that the table AM machinery presumes that every kind
of table AM necessarily has to support VACUUM; or at least, I don't
think the original version presumed that. However, if you want it to
work some other way, there's not really any infrastructure for that,
either. Like, if you want to run your own background workers to clean
up your table, you could, but you'd have to arrange that yourself. If
you want a SWEEP command or an OPTIMIZE TABLE command instead of
VACUUM, you'd have to hack the grammar. Likewise, I don't know that
there's any guarantee that CLUSTER would work on a table of any old AM
either, or that it would do anything useful.

But having said that, if you have a table AM that *does* support
VACUUM and CLUSTER, I think it would have to treat TIDs pretty much
just as we do today. Almost by definition, they have to live with the
way the rest of the system works. TIDs have to be 6 bytes, and lock
levels can't be changed, because there's nothing in table AM that
would let you tinker with that stuff. The table AM can opt out of that
machinery altogether in favor of just throwing an error, but it can't
make it work differently unless somebody extends the API. It's
possible that this might suck a lot for some AMs. For instance, if
your AM is an in-memory btree, you might want CLUSTER to work in
place, rather than by copying the whole relation file, but I doubt
it's possible to make that work using the APIs as they exist. Maybe
someday we'll have better ones, but right now we don't.

As far as the specific question asked here, I don't really understand
how it could ever work any differently. If you want to invalidate some
TIDs in such a way that they can be reused by unrelated tuples - in
the case of the heap this would be by making the line pointers
LP_UNUSED - that has to be coordinated with the indexing machinery. I
can imagine a table AM that never reuses TIDs - especially if we
eventually have TIDs > 6 bytes - but I can't imagine a table AM that
reuses TIDs that might still be present in the index without telling
the index. And that seems to be what is being proposed here: if CIC or
RiC could be putting TIDs into indexes while at the same time some of
those very same TIDs were being made available for reuse, that would
be chaos, and right now we have no mechanism other than the lock level
to prevent that. We could invent one, maybe, but it doesn't exist
today, and zheap never tried to invent anything like that.

I agree that, if we do this, the comment should make clear that the
CIC/RiC lock has to conflict with VACUUM to avoid indexing tuples that
VACUUM is busy marking LP_UNUSED, and that this is precisely because
other backends will ignore the CIC/RiC snapshot while determining
whether a tuple is live. I don't think this is the case.
Hypothetically speaking, suppose we rejected this patch. Then, suppose
someone proposed to replace ShareUpdateExclusiveLock with two new lock
levels VacuumTuplesLock and IndexTheRelationLock each of which
conflicts with itself and all other lock levels with which
ShareUpdateExclusiveLock conflicts, but the two don't conflict with
each other. AFAIK, that patch would be acceptable today and
unacceptable after this change. While I'm not arguing that as a reason
to reject this patch, it's not impossible that somebody could: they'd
just need to argue that the separated lock levels would have greater
value than this patch. I don't think I believe that, but it's not a
totally crazy suggestion. My main point though is that this
meaningfully changing some assumptions about how the system works, and
we should be sure to be clear about that.

I looked at the patch but don't really understand it; the code in this
area seems to have changed a lot since I last looked at it. So, I'm
just commenting mostly on the specific question that was asked, and a
little bit on the theory of the patch, but without expressing an
opinion on the implementation.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Error on failed COMMIT

2021-01-26 Thread Laurenz Albe
On Tue, 2021-01-26 at 12:25 -0500, Dave Cramer wrote:
> > After thinking some more about it, I think that COMMIT AND CHAIN would have
> > to change behavior: if COMMIT throws an error (because the transaction was
> > aborted), no new transaction should be started.  Everything else seems 
> > fishy:
> > the statement fails, but still starts a new transaction?
> > 
> > I guess that's also at fault for the unexpected result status that
> > Masahiko complained about in the other message.
> 
>  
> I haven't had a look at the result status in libpq. For JDBC we don't see 
> that. 
> We throw an exception when we get this error report. This is very consistent 
> as the commit fails and we throw an exception
> 
> > So I think we should not introduce USER_ERROR at all.  It is too much
> > of a kluge: fail, but not really...
> 
> What we do now is actually worse as we do not get an error report and we 
> silently change commit to rollback.
> How is this better ?

I see your point from the view of the JDBC driver.

It just feels hacky - somewhat similar to what you say
above: don't go through the normal transaction rollback steps,
but issue an error message.

At least we should fake it well...

Yours,
Laurenz Albe





Re: Error on failed COMMIT

2021-01-26 Thread Vik Fearing
On 1/26/21 6:34 PM, Vik Fearing wrote:
> On 1/26/21 6:20 PM, Laurenz Albe wrote:
>> After thinking some more about it, I think that COMMIT AND CHAIN would have
>> to change behavior: if COMMIT throws an error (because the transaction was
>> aborted), no new transaction should be started.  Everything else seems fishy:
>> the statement fails, but still starts a new transaction?
> 
> The standard is not clear (to me) on what exactly should happen here.
> It says that if a <commit statement> is not successful then a <rollback statement> is implied, but I don't see it say anything about whether the
> AND CHAIN should be propagated too.
> 
> My vote is that COMMIT AND CHAIN should become ROLLBACK AND NO CHAIN.

Hmm.  On the other hand, that means if the client isn't paying
attention, it'll start executing commands outside of a transaction which
will autocommit.  It might be better for a new transaction to be chained
and hopefully also fail because previous bits are missing.

I will hastily change my vote to "unsure".
-- 
Vik Fearing




Re: Error on failed COMMIT

2021-01-26 Thread Vik Fearing
On 1/26/21 6:20 PM, Laurenz Albe wrote:
> After thinking some more about it, I think that COMMIT AND CHAIN would have
> to change behavior: if COMMIT throws an error (because the transaction was
> aborted), no new transaction should be started.  Everything else seems fishy:
> the statement fails, but still starts a new transaction?

The standard is not clear (to me) on what exactly should happen here.
It says that if a <commit statement> is not successful then a <rollback statement> is implied, but I don't see it say anything about whether the
AND CHAIN should be propagated too.

My vote is that COMMIT AND CHAIN should become ROLLBACK AND NO CHAIN.
-- 
Vik Fearing




Re: Error on failed COMMIT

2021-01-26 Thread Dave Cramer
On Tue, 26 Jan 2021 at 12:20, Laurenz Albe  wrote:

> On Tue, 2021-01-26 at 11:09 -0500, Dave Cramer wrote:
> > On Tue, 26 Jan 2021 at 05:05, Laurenz Albe 
> wrote:
> >
> > > I wonder about the introduction of the new USER_ERROR level:
> > >
> > >  #define WARNING_CLIENT_ONLY 20  /* Warnings to be sent to client as usual, but
> > >  * never to the server log. */
> > > -#define ERROR  21  /* user error - abort transaction; return to
> > > +#define USER_ERROR 21
> > > +#define ERROR  22  /* user error - abort transaction; return to
> > >  * known state */
> > >  /* Save ERROR value in PGERROR so it can be restored when Win32 includes
> > >   * modify it.  We have to use a constant rather than ERROR because macros
> > >   * are expanded only when referenced outside macros.
> > >   */
> > >  #ifdef WIN32
> > > -#define PGERROR 21
> > > +#define PGERROR 22
> > >  #endif
> > > -#define FATAL  22  /* fatal error - abort process */
> > > -#define PANIC  23  /* take down the other backends with me */
> > > +#define FATAL  23  /* fatal error - abort process */
> > > +#define PANIC  24  /* take down the other backends with me */
> > >
> > > I see that without that, COMMIT AND CHAIN does not behave correctly,
> > > since the respective regression tests fail.
> > >
> > > But I don't understand why.  I think that this needs some more comments to
> > > make this clear.
> >
> > First off thanks for reviewing.
> >
> > The problem is that ereport does not return for any level equal to or above ERROR.
> > This code required it to return so that it could continue processing
>
> Oh, I see.
>
> After thinking some more about it, I think that COMMIT AND CHAIN would have
> to change behavior: if COMMIT throws an error (because the transaction was
> aborted), no new transaction should be started.  Everything else seems
> fishy:
> the statement fails, but still starts a new transaction?
>
> I guess that's also at fault for the unexpected result status that
> Masahiko complained about in the other message.
>

I haven't had a look at the result status in libpq. For JDBC we don't see
that.
We throw an exception when we get this error report. This is very
consistent as the commit fails and we throw an exception


> So I think we should not introduce USER_ERROR at all.  It is too much
> of a kluge: fail, but not really...
>

What we do now is actually worse as we do not get an error report and we
silently change commit to rollback.
How is this better ?

Dave


>


Re: Error on failed COMMIT

2021-01-26 Thread Laurenz Albe
On Tue, 2021-01-26 at 11:09 -0500, Dave Cramer wrote:
> On Tue, 26 Jan 2021 at 05:05, Laurenz Albe  wrote:
>
> > I wonder about the introduction of the new USER_ERROR level:
> > 
> >  #define WARNING_CLIENT_ONLY 20  /* Warnings to be sent to client as usual, but
> >  * never to the server log. */
> > -#define ERROR  21  /* user error - abort transaction; return to
> > +#define USER_ERROR 21
> > +#define ERROR  22  /* user error - abort transaction; return to
> >  * known state */
> >  /* Save ERROR value in PGERROR so it can be restored when Win32 includes
> >   * modify it.  We have to use a constant rather than ERROR because macros
> >   * are expanded only when referenced outside macros.
> >   */
> >  #ifdef WIN32
> > -#define PGERROR 21
> > +#define PGERROR 22
> >  #endif
> > -#define FATAL  22  /* fatal error - abort process */
> > -#define PANIC  23  /* take down the other backends with me */
> > +#define FATAL  23  /* fatal error - abort process */
> > +#define PANIC  24  /* take down the other backends with me */
> > 
> > I see that without that, COMMIT AND CHAIN does not behave correctly,
> > since the respective regression tests fail.
> > 
> > But I don't understand why.  I think that this needs some more comments to
> > make this clear.
> 
> First off thanks for reviewing.
> 
> The problem is that ereport does not return for any level equal to or above 
> ERROR.
>  This code required it to return so that it could continue processing

Oh, I see.

After thinking some more about it, I think that COMMIT AND CHAIN would have
to change behavior: if COMMIT throws an error (because the transaction was
aborted), no new transaction should be started.  Everything else seems fishy:
the statement fails, but still starts a new transaction?

I guess that's also at fault for the unexpected result status that
Masahiko complained about in the other message.

So I think we should not introduce USER_ERROR at all.  It is too much
of a kluge: fail, but not really...

I guess that is one example for the incompatibilities that Tom worried
about upthread.  I am beginning to see his point better now.

Yours,
Laurenz Albe





Re: FETCH FIRST clause PERCENT option

2021-01-26 Thread Surafel Temesgen
On Mon, Jan 25, 2021 at 2:39 PM Kyotaro Horiguchi 
wrote:

> Sorry for the dealy. I started to look this.
>
>
Hi kyotaro,
Thanks for looking into this, but did we agree to proceed
with this approach? I fear that it will be a waste of effort if
Andrew comes up with the patch for his approach.
Andrew Gierth: Are you working on your approach?

regards
Surafel


Re: Error on failed COMMIT

2021-01-26 Thread Dave Cramer
On Tue, 26 Jan 2021 at 05:05, Laurenz Albe  wrote:

> On Mon, 2021-01-25 at 11:29 -0500, Dave Cramer wrote:
> > Rebased against head
> >
> > Here's my summary of the long thread above.
> >
> > This change is in keeping with the SQL spec.
> >
> > There is an argument (Tom) that says that this will annoy more people
> than it will please.
> >  I presume this is due to the fact that libpq behaviour will change.
> >
> > As the author of the JDBC driver, and I believe I speak for other driver
> (NPGSQL for one)
> >  authors as well that have implemented the protocol I would argue that
> the current behaviour
> >  is more annoying.
> >
> > We currently have to keep state and determine if COMMIT actually failed
> or it ROLLED BACK.
> >  There are a number of async drivers that would also benefit from not
> having to keep state
> >  in the session.
>
> I think this change makes sense, but I think everybody agrees that it does
> as it
> makes PostgreSQL more standard compliant.
>
> About the fear that it will break user's applications:
>
> I think that the breakage will be minimal.  All that will change is that
> COMMIT of
> an aborted transaction raises an error.
>
> Applications that catch an error in a transaction and roll back will not
> be affected.  What will be affected are applications that do *not* check
> for
> errors in statements in a transaction, but check for errors in the COMMIT.
> I think that doesn't happen often.
>
> I agree that some people will be hurt, but I don't think it will be a
> major problem.
>
> The patch applies and passes regression tests.
>
> I wonder about the introduction of the new USER_ERROR level:
>
>  #define WARNING_CLIENT_ONLY 20  /* Warnings to be sent to client as usual, but
>  * never to the server log. */
> -#define ERROR  21  /* user error - abort transaction; return to
> +#define USER_ERROR 21
> +#define ERROR  22  /* user error - abort transaction; return to
>  * known state */
>  /* Save ERROR value in PGERROR so it can be restored when Win32 includes
>   * modify it.  We have to use a constant rather than ERROR because macros
>   * are expanded only when referenced outside macros.
>   */
>  #ifdef WIN32
> -#define PGERROR 21
> +#define PGERROR 22
>  #endif
> -#define FATAL  22  /* fatal error - abort process */
> -#define PANIC  23  /* take down the other backends with me */
> +#define FATAL  23  /* fatal error - abort process */
> +#define PANIC  24  /* take down the other backends with me */
>
> I see that without that, COMMIT AND CHAIN does not behave correctly,
> since the respective regression tests fail.
>
> But I don't understand why.  I think that this needs some more comments to
> make this clear.
>
> First off thanks for reviewing.

The problem is that ereport does not return for any level equal to or above
ERROR. This code required it to return so that it could continue processing.
So after re-acquainting myself with the code, I propose the following. (We
could use "TRANSACTION_ERROR" instead of "USER_ERROR".)
I'd like to comment more, but I do not believe that elog.h is the place.
Suggestions?


index 3c0e57621f..df79a2d6db 100644
--- a/src/include/utils/elog.h
+++ b/src/include/utils/elog.h
@@ -42,17 +42,19 @@
  * WARNING is for unexpected messages. */
 #define WARNING_CLIENT_ONLY 20  /* Warnings to be sent to client as usual, but
 * never to the server log. */
-#define ERROR  21  /* user error - abort transaction; return to
+#define USER_ERROR 21  /* similar to ERROR, except we don't want to
+   * exit the current context. */
+#define ERROR  22  /* user error - abort transaction; return to
 * known state */
 /* Save ERROR value in PGERROR so it can be restored when Win32 includes
  * modify it.  We have to use a constant rather than ERROR because macros
  * are expanded only when referenced outside macros.
  */
 #ifdef WIN32
-#define PGERROR 21
+#define PGERROR 22
 #endif
-#define FATAL  22  /* fatal error - abort process */
-#define PANIC  23  /* take down the other backends with me */
+#define FATAL  23  /* fatal error - abort process */
+#define PANIC  24  /* take down the other backends with me */





> Is this new message level something we need to allow setting for
> "client_min_messages" and "log_min_messages"?
>

Good question. I had not given that any thought.


Dave Cramer
www.postgres.rocks
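To make the intended control flow concrete: because the proposed level is
numerically below ERROR, ereport() returns to the caller instead of
longjmp'ing out. A hypothetical call site (error code and wording are
guesses, not taken from the patch):

/* COMMIT found the transaction already aborted: report it, but keep going. */
ereport(USER_ERROR,
		(errcode(ERRCODE_INVALID_TRANSACTION_TERMINATION),
		 errmsg("cannot commit: transaction was aborted and has been rolled back")));

/* execution continues here, e.g. to start the chained transaction */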

>
>


Re: mkid reference

2021-01-26 Thread Magnus Hagander
On Mon, Jan 25, 2021 at 4:38 PM Bruce Momjian  wrote:
>
> On Fri, Jan 22, 2021 at 01:07:36PM -0500, Tom Lane wrote:
> > > There's also src/tools/make_mkid which use this mkid tool.  +1 for 
> > > removing.
> > > If anything, it seems better replaced by extended documentation on the 
> > > existing
> > > wiki article [0] on how to use "git format-patch".
> >
> > I found man pages for mkid online --- it's apparently a ctags-like
> > code indexing tool, not something for patches.  So maybe Bruce still
> > uses it, or maybe not.  But as long as we've also got make_ctags and
>
> Yes, I do still use it, so I thought having a script to generate its
> index files might be helpful to someone.

Where do you actually get it? The old docs (now removed) suggested
getting it off ftp.postgresql.org...

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: About to add WAL write/fsync statistics to pg_stat_wal view

2021-01-26 Thread David G. Johnston
On Mon, Jan 25, 2021 at 11:56 PM Masahiro Ikeda 
wrote:

>
> > (wal_write)
> > The number of times WAL buffers were written out to disk via XLogWrite
> >
>
> Thanks.
>
> I thought it's better to omit "The" and "XLogWrite" because other views'
> description
> omits "The" and there is no description of "XlogWrite" in the documents.
> What do you think?
>
>
The documentation for WAL does get into the public API level of detail and
doing so here makes what this measures crystal clear.  The potential
absence of sufficient detail elsewhere should be corrected instead of
making this description more vague.  Specifically, probably XLogWrite
should be added to the WAL overview as part of this update and probably
even have the descriptive section of the documentation note that the number
of times that said function is executed is exposed as a counter in the wal
statistics table - thus closing the loop.
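On the counter side, the placement being discussed might look like this
inside XLogWrite() (the field and view names follow the patch thread, so
treat them as assumptions; surrounding code is abridged):

static void
XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
{
	/* ... decide which buffer pages need writing ... */

	/* One bump per call; the stats collector aggregates this into
	 * pg_stat_wal.wal_write. */
	WalStats.m_wal_write++;

	/* ... pg_pwrite() loop and wal_sync_method handling ... */
}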

David J.


Re: Error on failed COMMIT

2021-01-26 Thread Dave Cramer
On Tue, 26 Jan 2021 at 06:59, Masahiko Sawada  wrote:

> On Tue, Jan 26, 2021 at 7:06 PM Laurenz Albe 
> wrote:
> >
> > On Mon, 2021-01-25 at 11:29 -0500, Dave Cramer wrote:
> > > Rebased against head
> > >
> > > Here's my summary of the long thread above.
> > >
> > > This change is in keeping with the SQL spec.
> > >
> > > There is an argument (Tom) that says that this will annoy more people
> than it will please.
> > >  I presume this is due to the fact that libpq behaviour will change.
> > >
> > > As the author of the JDBC driver, and I believe I speak for other
> driver (NPGSQL for one)
> > >  authors as well that have implemented the protocol I would argue that
> the current behaviour
> > >  is more annoying.
> > >
> > > We currently have to keep state and determine if COMMIT actually
> failed or it ROLLED BACK.
> > >  There are a number of async drivers that would also benefit from not
> having to keep state
> > >  in the session.
> >
> > I think this change makes sense, but I think everybody agrees that it
> does as it
> > makes PostgreSQL more standard compliant.
> >
> > About the fear that it will break user's applications:
> >
> > I think that the breakage will be minimal.  All that will change is that
> COMMIT of
> > an aborted transaction raises an error.
> >
> > Applications that catch an error in a transaction and roll back will not
> > be affected.  What will be affected are applications that do *not* check
> for
> > errors in statements in a transaction, but check for errors in the
> COMMIT.
> > I think that doesn't happen often.
> >
> > I agree that some people will be hurt, but I don't think it will be a
> major problem.
> >
> > The patch applies and passes regression tests.
> >
> > I wonder about the introduction of the new USER_ERROR level:
> >
> >  #define WARNING_CLIENT_ONLY 20  /* Warnings to be sent to client as usual, but
> >  * never to the server log. */
> > -#define ERROR  21  /* user error - abort transaction; return to
> > +#define USER_ERROR 21
> > +#define ERROR  22  /* user error - abort transaction; return to
> >  * known state */
> >  /* Save ERROR value in PGERROR so it can be restored when Win32 includes
> >   * modify it.  We have to use a constant rather than ERROR because macros
> >   * are expanded only when referenced outside macros.
> >   */
> >  #ifdef WIN32
> > -#define PGERROR 21
> > +#define PGERROR 22
> >  #endif
> > -#define FATAL  22  /* fatal error - abort process */
> > -#define PANIC  23  /* take down the other backends with me */
> > +#define FATAL  23  /* fatal error - abort process */
> > +#define PANIC  24  /* take down the other backends with me */
> >
> > I see that without that, COMMIT AND CHAIN does not behave correctly,
> > since the respective regression tests fail.
> >
> > But I don't understand why.  I think that this needs some more comments
> to
> > make this clear.
>
> While testing the patch I realized that the client gets an
> acknowledgment of COMMIT command completed successfully from
> PostgreSQL server (i.e., PQgetResult() returns PGRES_COMMAND_OK) even
> if the server raises a USER_ERROR level error. I think the command
> should be failed. Because otherwise, the drivers need to throw an
> exception by re-interpreting the results even in a case where the
> command is completed successfully.
>
> Regards,
>

Interesting. Thanks for looking at this. I'm curious what we return now,
when we silently roll back instead.

Dave

>
> --
> Masahiko Sawada
> EDB:  https://www.enterprisedb.com/
>


Re: WIP: System Versioned Temporal Table

2021-01-26 Thread Simon Riggs
On Tue, Jan 26, 2021 at 12:51 PM Vik Fearing  wrote:
>
> On 1/26/21 1:16 PM, Simon Riggs wrote:
> > On Tue, Jan 26, 2021 at 11:33 AM Vik Fearing  
> > wrote:
> >>
> >> On 1/11/21 3:02 PM, Simon Riggs wrote:
> >>> * UPDATE foo SET start_timestamp = DEFAULT should fail but currently 
> >>> doesn't
> >>
> >> I'm still in the weeds of reviewing this patch, but why should this
> >> fail?  It should not fail.
> >
> > It should not be possible for the user to change the start or end
> > timestamp of a system_time time range, by definition.
>
> Correct, but setting it to DEFAULT is not changing it.
>
> See also SQL:2016 11.5 <default clause> General Rule 3.a.

Thanks for pointing this out. Identity columns don't currently work that way.

-- 
Simon Riggs        http://www.EnterpriseDB.com/




Re: WIP: System Versioned Temporal Table

2021-01-26 Thread Vik Fearing
On 1/26/21 1:16 PM, Simon Riggs wrote:
> On Tue, Jan 26, 2021 at 11:33 AM Vik Fearing  wrote:
>>
>> On 1/11/21 3:02 PM, Simon Riggs wrote:
>>> * UPDATE foo SET start_timestamp = DEFAULT should fail but currently doesn't
>>
>> I'm still in the weeds of reviewing this patch, but why should this
>> fail?  It should not fail.
> 
> It should not be possible for the user to change the start or end
> timestamp of a system_time time range, by definition.

Correct, but setting it to DEFAULT is not changing it.

See also SQL:2016 11.5 <default clause> General Rule 3.a.
-- 
Vik Fearing




Re: Allow matching whole DN from a client certificate

2021-01-26 Thread Daniel Gustafsson
> On 20 Jan 2021, at 20:07, Jacob Champion  wrote:
> 
> On Mon, 2021-01-18 at 11:23 +0100, Daniel Gustafsson wrote:
>> +   /* use commas instead of slashes */
>> +   X509_NAME_print_ex(bio, x509name, 0, XN_FLAG_SEP_COMMA_PLUS);
>> I don't have strong opinions on whether we shold use slashes or commas, but I
>> think it needs to be documented that commas are required since slashes is the
>> more common way to print a DN.
> 
> There's also a XN_FLAG_RFC2253 flag, which claims to print in an RFC-
> compatible escape system. (It includes XN_FLAG_SEP_COMMA_PLUS.) I think
> NSS uses a similar RFC-based escape scheme, but I haven't checked to
> see whether it's fully compatible.
> 
> I think you'll want to be careful to specify the format as much as
> possible, both to make sure that other backend TLS implementations can
> actually use the same escaping system and to ensure that user regexes
> don't suddenly start matching different things at some point in the
> future.

Reading more on this it seems we would essentially have to go with RFC 4514, as
it's the closest to a standard which seems to exist.  Having done a lot
of research on this topic, do you have a gut feeling on direction?

The OpenSSL X509_NAME_cmp function use RFC 5280 section 7.1 and RFC 4517
section 4.2.15 (which in turn reference RFC 4514 for the DN string format).
libnss has CERT_AsciiToName which is referencing RFCs 1485, 1779 and 2253 in
lib/certdb/alg1485.c.  Comparing the two would be interesting.

> Even when using RFC 2253 (now 4514) escaping, there are things left to the 
> implementation, such as the order of AVAs inside multivalued RDNs. The RFC 
> warns not to consider these representations to be canonical forms, so it's 
> possible that switching (or upgrading) the TLS library in use could force 
> users to change their regexes at some point in the future.

Right, multi-value RDN's are defined as sets so C=US,A=B+C=D is equivalent to
C=US,C=D+A=B according to RFC 5280.

--
Daniel Gustafsson   https://vmware.com/



Re: Extensions not dumped when --schema is used

2021-01-26 Thread Guillaume Lelarge
Le mar. 26 janv. 2021 à 13:41, Guillaume Lelarge  a
écrit :

> Le mar. 26 janv. 2021 à 05:10, Julien Rouhaud  a
> écrit :
>
>> On Mon, Jan 25, 2021 at 9:34 PM Guillaume Lelarge
>>  wrote:
>> >
>> > "Anytime soon" was a long long time ago, and I eventually completely
>> forgot this, sorry. As nobody worked on it yet, I took a shot at it. See
>> attached patch.
>>
>> Great!
>>
>> I didn't review it thoroughly yet, but after a quick look it sounds
>> sensible.  I'd prefer to see some tests added, and it looks like a
>> test for plpgsql could be added quite easily.
>>
>>
> I tried that all afternoon yesterday, but failed to do so. My had still
> hurts, but I'll try again though it may take some time.
>
>
s/My had/My head/ ..

> I don't know if I should add this right away in the commit fest app. If
>> yes, I guess it should go on the next commit fest (2021-03), right?
>>
>> Correct, and please add it on the commit fest!
>>
>
> Done, see https://commitfest.postgresql.org/32/2956/.
>
>

-- 
Guillaume.


Re: Extensions not dumped when --schema is used

2021-01-26 Thread Guillaume Lelarge
Le mar. 26 janv. 2021 à 05:10, Julien Rouhaud  a écrit :

> On Mon, Jan 25, 2021 at 9:34 PM Guillaume Lelarge
>  wrote:
> >
> > "Anytime soon" was a long long time ago, and I eventually completely
> forgot this, sorry. As nobody worked on it yet, I took a shot at it. See
> attached patch.
>
> Great!
>
> I didn't review it thoroughly yet, but after a quick look it sounds
> sensible.  I'd prefer to see some tests added, and it looks like a
> test for plpgsql could be added quite easily.
>
>
I tried that all afternoon yesterday, but failed to do so. My had still
hurts, but I'll try again though it may take some time.

> I don't know if I should add this right away in the commit fest app. If
> yes, I guess it should go on the next commit fest (2021-03), right?
>
> Correct, and please add it on the commit fest!
>

Done, see https://commitfest.postgresql.org/32/2956/.


-- 
Guillaume.


Re: WIP: System Versioned Temporal Table

2021-01-26 Thread Simon Riggs
On Tue, Jan 26, 2021 at 11:33 AM Vik Fearing  wrote:
>
> On 1/11/21 3:02 PM, Simon Riggs wrote:
> > * UPDATE foo SET start_timestamp = DEFAULT should fail but currently doesn't
>
> I'm still in the weeds of reviewing this patch, but why should this
> fail?  It should not fail.

It should not be possible for the user to change the start or end
timestamp of a system_time time range, by definition.

-- 
Simon Riggs        http://www.EnterpriseDB.com/




Re: [HACKERS] GSoC 2017: Foreign Key Arrays

2021-01-26 Thread Mark Rofail
Hello Joel,

Thank you for your kind words and happy that you benefited from this patch.
We simply assert that the update/delete action used is supported; currently
only "NO ACTION" and "RESTRICT" are. This can be extended in future patches
without rework, just extra logic.
Please don't hesitate to give your feedback.

I would love some help with some performance comparisons if you are up to
it, between Many-to-Many, Foreign Key Arrays and Gin Indexed Foreign Key
Arrays.

/Mark

On Tue, Jan 26, 2021 at 1:51 PM Joel Jacobson  wrote:

> Hi Mark,
>
> On Mon, Jan 25, 2021, at 09:14, Joel Jacobson wrote:
>
> I'll continue testing over the next couple of days and report if I find
> any more oddities.
>
>
> I've continued testing by trying to make full use of this feature in the
> catalog-diff-tool I'm working on.
>
> Today I became aware of a SQL feature that I must confess I've never used
> before,
> that turned out to be useful in my tool, as I then wouldn't need to follow
> FKs
> to do updates manually, but let the database do them automatically.
>
> I'm talking about "ON UPDATE CASCADE", which I see is not supported in the
> patch.
>
> In my tool, I could simplify the code for normal FKs by using ON UPDATE
> CASCADE,
> but will continue to need my hand-written traverse-FKs-recursively code
> for Foreign Key Arrays.
>
> I lived a long SQL life without ever needing ON UPDATE CASCADE,
> so I would definitively vote for Foreign Key Arrays to be added even
> without support for this.
>
> Would you say the patch is written in a way which would allow adding
> support for ON UPDATE CASCADE
> later on mostly by adding code, or would it require a major rewrite?
>
> I hesitated if I should share this with you, since I'm really grateful for
> this feature even without ON UPDATE CASCADE,
> but since this was discovered when testing a real-life scenario and not some
> hypothetical example,
> I felt it should be noted that I stumbled upon this during testing.
>
> Again, thank you so much for working on this.
>
> /Joel
>


Re: Error on failed COMMIT

2021-01-26 Thread Masahiko Sawada
On Tue, Jan 26, 2021 at 7:06 PM Laurenz Albe  wrote:
>
> On Mon, 2021-01-25 at 11:29 -0500, Dave Cramer wrote:
> > Rebased against head
> >
> > Here's my summary of the long thread above.
> >
> > This change is in keeping with the SQL spec.
> >
> > There is an argument (Tom) that says that this will annoy more people than 
> > it will please.
> >  I presume this is due to the fact that libpq behaviour will change.
> >
> > As the author of the JDBC driver, and I believe I speak for other driver 
> > (NPGSQL for one)
> >  authors as well that have implemented the protocol I would argue that the 
> > current behaviour
> >  is more annoying.
> >
> > We currently have to keep state and determine if COMMIT actually failed or 
> > it ROLLED BACK.
> >  There are a number of async drivers that would also benefit from not 
> > having to keep state
> >  in the session.
>
> I think this change makes sense, and I think everybody agrees that it does,
> as it makes PostgreSQL more standards-compliant.
>
> About the fear that it will break user's applications:
>
> I think that the breakage will be minimal.  All that will change is that 
> COMMIT of
> an aborted transaction raises an error.
>
> Applications that catch an error in a transaction and roll back will not
> be affected.  What will be affected are applications that do *not* check for
> errors in statements in a transaction, but check for errors in the COMMIT.
> I think that doesn't happen often.
>
> I agree that some people will be hurt, but I don't think it will be a major 
> problem.
>
> The patch applies and passes regression tests.
>
> I wonder about the introduction of the new USER_ERROR level:
>
>  #define WARNING_CLIENT_ONLY	20	/* Warnings to be sent to client as usual, but
>  								 * never to the server log. */
> -#define ERROR		21			/* user error - abort transaction; return to
> +#define USER_ERROR	21
> +#define ERROR		22			/* user error - abort transaction; return to
>  								 * known state */
>  /* Save ERROR value in PGERROR so it can be restored when Win32 includes
>   * modify it.  We have to use a constant rather than ERROR because macros
>   * are expanded only when referenced outside macros.
>   */
>  #ifdef WIN32
> -#define PGERROR		21
> +#define PGERROR		22
>  #endif
> -#define FATAL		22			/* fatal error - abort process */
> -#define PANIC		23			/* take down the other backends with me */
> +#define FATAL		23			/* fatal error - abort process */
> +#define PANIC		24			/* take down the other backends with me */
>
> I see that without that, COMMIT AND CHAIN does not behave correctly,
> since the respective regression tests fail.
>
> But I don't understand why.  I think that this needs some more comments to
> make this clear.

While testing the patch I realized that the client gets an
acknowledgment that the COMMIT command completed successfully from the
PostgreSQL server (i.e., PQgetResult() returns PGRES_COMMAND_OK) even
if the server raises a USER_ERROR-level error. I think the command
should fail; otherwise, the drivers need to throw an exception by
re-interpreting the results even in a case where the command is
reported as completed successfully.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [HACKERS] GSoC 2017: Foreign Key Arrays

2021-01-26 Thread Joel Jacobson
Hi Mark,

On Mon, Jan 25, 2021, at 09:14, Joel Jacobson wrote:
> I'll continue testing over the next couple of days and report if I find any 
> more oddities.

I've continued testing by trying to make full use of this feature in the 
catalog-diff-tool I'm working on.

Today I became aware of a SQL feature that I must confess I've never used 
before,
that turned out to be useful in my tool, as I then wouldn't need to follow FKs
to do updates manually, but let the database do them automatically.

I'm talking about "ON UPDATE CASCADE", which I see is not supported in the
patch.

In my tool, I could simplify the code for normal FKs by using ON UPDATE CASCADE,
but will continue to need my hand-written traverse-FKs-recursively code for 
Foreign Key Arrays.

I lived a long SQL life without ever needing ON UPDATE CASCADE,
so I would definitively vote for Foreign Key Arrays to be added even without 
support for this.

Would you say the patch is written in a way which would allow adding support 
for ON UPDATE CASCADE
later on mostly by adding code, or would it require a major rewrite?

I hesitated if I should share this with you, since I'm really grateful for this 
feature even without ON UPDATE CASCADE,
but since this was discovered when testing a real-life scenario and not some
hypothetical example,
I felt it should be noted that I stumbled upon this during testing.

Again, thank you so much for working on this.

/Joel

Re: WIP: System Versioned Temporal Table

2021-01-26 Thread Vik Fearing
On 1/11/21 3:02 PM, Simon Riggs wrote:
> * UPDATE foo SET start_timestamp = DEFAULT should fail but currently doesn't

I'm still in the weeds of reviewing this patch, but why should this
fail?  It should not fail.
-- 
Vik Fearing




Fix ALTER SUBSCRIPTION ... SET PUBLICATION documentation

2021-01-26 Thread japin

Hi,

When I read the documentation of ALTER SUBSCRIPTION ... SET PUBLICATION ...
WITH (...), it says that the only supported "set_publication_option" is
"refresh" [1]. However, we can also supply the "copy_data" option, and the
code is:

case ALTER_SUBSCRIPTION_PUBLICATION:
	{
		bool		copy_data;
		bool		refresh;

		parse_subscription_options(stmt->options,
								   NULL,		/* no "connect" */
								   NULL, NULL,	/* no "enabled" */
								   NULL,		/* no "create_slot" */
								   NULL, NULL,	/* no "slot_name" */
								   &copy_data,
								   NULL,		/* no "synchronous_commit" */
								   &refresh,
								   NULL, NULL,	/* no "binary" */
								   NULL, NULL); /* no "streaming" */
		values[Anum_pg_subscription_subpublications - 1] =
			publicationListToArray(stmt->publication);
		replaces[Anum_pg_subscription_subpublications - 1] = true;

		update_tuple = true;

		/* Refresh if user asked us to. */
		if (refresh)
		{
			if (!sub->enabled)
				ereport(ERROR,
						(errcode(ERRCODE_SYNTAX_ERROR),
						 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
						 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));

			/* Make sure refresh sees the new list of publications. */
			sub->publications = stmt->publication;

			AlterSubscription_refresh(sub, copy_data);
		}

		break;
	}

Should we fix the documentation or the code?  I'd be inclined to fix the
documentation.
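
For reference, this is the kind of command that the code accepts today but the
documentation does not mention (the subscription and publication names are
hypothetical):

    ALTER SUBSCRIPTION mysub
        SET PUBLICATION mypub
        WITH (refresh = true, copy_data = false);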

[1] - https://www.postgresql.org/docs/devel/sql-altersubscription.html

-- 
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 97c427e0f4..418e9a8ce1 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -101,6 +101,17 @@ ALTER SUBSCRIPTION name RENAME TO <
  
 

+   <varlistentry>
+    <term><literal>copy_data</literal> (<type>boolean</type>)</term>
+    <listitem>
+     <para>
+      Specifies whether the existing data in the publications that are
+      being subscribed to should be copied once the replication starts.
+      The default is <literal>true</literal>.  (Previously subscribed
+      tables are not copied.)
+     </para>
+    </listitem>
+   </varlistentry>
   
 
   Additionally, refresh options as described


Re: shared tempfile was not removed on statement_timeout

2021-01-26 Thread Kyotaro Horiguchi
At Tue, 26 Jan 2021 11:00:56 +0200, Heikki Linnakangas  wrote in
> On 26/01/2021 06:46, Kyotaro Horiguchi wrote:
> > Looking at the comment of SharedFileSetOnDetach:
> > | * everything in them.  We can't raise an error on failures, because this
> > | * runs in error cleanup paths.
> > I feel that a function that shouldn't error-out also shouldn't be
> > cancellable. If that's the case, we can omit the CHECK_FOR_INTERRUPTS() in
> > walkdir() when elevel is smaller than ERROR.
> > =
> > diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
> > index b58502837a..593c23553e 100644
> > --- a/src/backend/storage/file/fd.c
> > +++ b/src/backend/storage/file/fd.c
> > @@ -3374,7 +3374,9 @@ walkdir(const char *path,
> >  	{
> >  		char		subpath[MAXPGPATH * 2];
> >  
> > -		CHECK_FOR_INTERRUPTS();
> > +		/* omit interrupts while we shouldn't error-out */
> > +		if (elevel >= ERROR)
> > +			CHECK_FOR_INTERRUPTS();
> >  
> >  		if (strcmp(de->d_name, ".") == 0 ||
> >  			strcmp(de->d_name, "..") == 0)
> > =
> 
> Don't we potentially have the same problem with all on_dsm_detach
> callbacks? Looking at the other on_dsm_detach callbacks, I don't see
> any CHECK_FOR_INTERRUPT() calls in them, but it seems fragile to
> assume that.
> 
> I'd suggest adding HOLD/RESUME_INTERRUPTS() to dsm_detach(). At least
> around the removal of the callback from the list and calling the
> callback. Maybe even over the whole function.

Yes, my first idea was also to add HOLD/RESUME_QUERY_INTERRUPTS() at the
same location.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Support ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION ... syntax

2021-01-26 Thread japin

On Tue, 26 Jan 2021 at 13:46, japin  wrote:
> Hi, Bharath
>
> Thanks for your reviewing.
>
> On Tue, 26 Jan 2021 at 12:55, Bharath Rupireddy 
>  wrote:
>> On Tue, Jan 26, 2021 at 9:25 AM japin  wrote:
>>> > I think it's more convenient. Any thoughts?
>>>
>>> Sorry, I forgot to attach the patch.
>>
>> As I mentioned earlier in [1], +1 from my end to have the new syntax
>> for adding/dropping publications from subscriptions i.e. ALTER
>> SUBSCRIPTION ... ADD/DROP PUBLICATION. But I'm really not sure why
>> that syntax was not added earlier. Are we missing something here?
>>
>
> Yeah, we should figure out why this syntax was not supported earlier.  It
> seems ALTER SUBSCRIPTION was introduced in 665d1fad99e; however, the commit
> does not contain any discussion URL.
>
>> I would like to hear opinions from other hackers.
>>
>> [1] - 
>> https://www.postgresql.org/message-id/CALj2ACVGDNZDQk3wfv%3D3zYTg%3DqKUaEa5s1f%2B9_PYxN0QyAUdCw%40mail.gmail.com
>>
>> Some quick comments on the patch, although I have not taken a deeper look at 
>> it:
>>
>> 1. IMO, it will be good if the patch is divided into say coding, test
>> cases and documentation
>
> Agreed.  I will refactor it in the next patch.
>

Split the v1 patch into coding, test cases, documentation and tab-completion.

>> 2. Looks like AlterSubscription() is already having ~200 LOC, why
>> can't we have separate functions for each ALTER_SUBSCRIPTION_ case
>> or at least for the new code that's getting added for this patch?
>
> I'm not sure it is necessary to provide a separate function for each
> ALTER_SUBSCRIPTION_XXX, so I just followed the current style.
>
>> 3. The new code added for ALTER_SUBSCRIPTION_ADD_PUBLICATION and
>> ALTER_SUBSCRIPTION_DROP_PUBLICATION looks almost same maybe with
>> little difference, so why can't we have single function
>> (alter_subscription_add_or_drop_publication or
>> hanlde_add_or_drop_publication or some better name?) and pass in a
>> flag to differentiate the code that differs for both commands.
>
> Agreed.

At present, I created a new function merge_subpublications() to merge the
original publications with the added/dropped publications.  Thoughts?
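
For clarity, the commands the patch adds look like this (the subscription and
publication names are hypothetical):

    ALTER SUBSCRIPTION mysub ADD PUBLICATION mypub2 WITH (refresh = false);
    ALTER SUBSCRIPTION mysub DROP PUBLICATION mypub1 WITH (refresh = false);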

-- 
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.

From 5af6cc67938b31b39fa1998a10a9c7f1bdd8fb0e Mon Sep 17 00:00:00 2001
From: Japin Li 
Date: Tue, 26 Jan 2021 15:43:52 +0800
Subject: [PATCH v2 1/4] Add ALTER SUBSCRIPTION...ADD/DROP PUBLICATION...
 syntax

---
 src/backend/commands/subscriptioncmds.c | 119 
 src/backend/parser/gram.y   |  20 
 src/include/nodes/parsenodes.h  |   2 +
 3 files changed, 141 insertions(+)

diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 082f7855b8..07167c9a8b 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -46,6 +46,8 @@
 #include "utils/syscache.h"
 
 static List *fetch_table_list(WalReceiverConn *wrconn, List *publications);
+static List *merge_subpublications(HeapTuple tuple, TupleDesc tupledesc,
+								   List *publications, bool addpub);
 
 /*
  * Common option parsing function for CREATE and ALTER SUBSCRIPTION commands.
@@ -857,6 +859,51 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 break;
 			}
 
+		case ALTER_SUBSCRIPTION_ADD_PUBLICATION:
+		case ALTER_SUBSCRIPTION_DROP_PUBLICATION:
+			{
+				bool		copy_data;
+				bool		refresh;
+				List	   *publications = NIL;
+
+				publications = merge_subpublications(tup, RelationGetDescr(rel),
+													 stmt->publication,
+													 stmt->kind == ALTER_SUBSCRIPTION_ADD_PUBLICATION);
+
+				parse_subscription_options(stmt->options,
+										   NULL,		/* no "connect" */
+										   NULL, NULL,	/* no "enabled" */
+										   NULL,		/* no "create_slot" */
+										   NULL, NULL,	/* no "slot_name" */
+										   &copy_data,
+										   NULL,		/* no "synchronous_commit" */
+										   &refresh,
+										   NULL, NULL,	/* no "binary" */
+										   NULL, NULL); /* no "streaming" */
+				values[Anum_pg_subscription_subpublications - 1] =
+					publicationListToArray(publications);
+				replaces[Anum_pg_subscription_subpublications - 1] = true;
+
+				update_tuple = true;
+
+				/* Refresh if user asked us to. */
+				if (refresh)
+				{
+					if (!sub->enabled)
+						ereport(ERROR,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
+								 errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
+
+					/* Make sure refresh sees the new list of publications. */
+					sub->publications = publications;
+
+					AlterSubscription_refresh(sub, copy_data);
+				}
+
+				break;
+			}
+
 		case ALTER_SUBSCRIPTION_REFRESH:
 			{
 bool		copy_data;
@@ -1278,3 +1325,75 @@ fetch_table_list(WalReceiverConn *wrconn, List *publications)
 
 	return tablelist;
 }
+
+/*
+ * merge current subpublications and user specified by add/drop 

RE: Parallel INSERT (INTO ... SELECT ...)

2021-01-26 Thread Hou, Zhijie
> I think we can allow parallel insert in this case, because the column value
> is determined according to the DEFAULT definition of the target table
> specified in the INSERT statement.  This is described here:
> 
> https://www.postgresql.org/docs/devel/sql-createtable.html
> 
> "Defaults may be specified separately for each partition. But note that
> a partition's default value is not applied when inserting a tuple through
> a partitioned table."
> 
> So the parallel-unsafe function should not be called.

Thanks for the explanation.
I think you are right, I did miss it.
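
To illustrate the point from the quoted documentation, here is a minimal
sketch (the parallel-unsafe function and all table names are hypothetical):

    CREATE FUNCTION unsafe_default() RETURNS int
        LANGUAGE sql PARALLEL UNSAFE AS 'SELECT 42';

    CREATE TABLE parent (a int, b int DEFAULT 0) PARTITION BY RANGE (a);
    CREATE TABLE part1 PARTITION OF parent (b DEFAULT unsafe_default())
        FOR VALUES FROM (1) TO (100);

    -- Inserting through the partitioned table uses the parent's DEFAULT (0),
    -- so the parallel-unsafe partition default is never invoked:
    INSERT INTO parent (a) VALUES (1);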

Best regards,
houzj




Re: automatic analyze: readahead - add "IO read time" log message

2021-01-26 Thread Heikki Linnakangas

On 13/01/2021 23:17, Stephen Frost wrote:

Would be great to get a review / comments from others as to if there's
any concerns.  I'll admit that it seems reasonably straight-forward to
me, but hey, I wrote most of it, so that's not really a fair
assessment... ;)


Look good overall. A few minor comments:

The patch consists of two parts: adding stats to the log for auto-analyze,
and implementing prefetching. They seem like independent features; consider
splitting them into two patches.


It's a bit weird that you get more stats in the log for 
autovacuum/autoanalyze than you get with VACUUM/ANALYZE VERBOSE. Not 
really this patch's fault though.


This conflicts with the patch at 
https://commitfest.postgresql.org/31/2907/, to refactor the table AM 
analyze API. That's OK, it's straightforward to resolve regardless of 
which patch is committed first.



/* Outer loop over blocks to sample */
while (BlockSampler_HasMore(&bs))
{
#ifdef USE_PREFETCH
	BlockNumber prefetch_targblock = InvalidBlockNumber;
#endif
	BlockNumber targblock = BlockSampler_Next(&bs);

#ifdef USE_PREFETCH

	/*
	 * Make sure that every time the main BlockSampler is moved forward
	 * that our prefetch BlockSampler also gets moved forward, so that we
	 * always stay out ahead.
	 */
	if (BlockSampler_HasMore(&prefetch_bs))
		prefetch_targblock = BlockSampler_Next(&prefetch_bs);
#endif

	vacuum_delay_point();

	if (!table_scan_analyze_next_block(scan, targblock, vac_strategy))
		continue;

#ifdef USE_PREFETCH

	/*
	 * When pre-fetching, after we get a block, tell the kernel about the
	 * next one we will want, if there's any left.
	 */
	if (effective_io_concurrency && prefetch_targblock != InvalidBlockNumber)
		PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_targblock);
#endif

	while (table_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
	{
		...
	}

	pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_DONE,
								 ++blksdone);
}


If effective_io_concurrency == 0, this calls
BlockSampler_Next(&prefetch_bs) anyway, which is a waste of cycles.


If table_scan_analyze_next_block() returns false, we skip the
PrefetchBuffer() call. That seems wrong.


Is there any potential harm from calling PrefetchBuffer() on a page that
table_scan_analyze_next_block() later deems unsuitable for sampling
and returns false? That's theoretical at the moment, because
heapam_scan_analyze_next_block() always returns true. (The tableam 
ANALYZE API refactor patch will make this moot, as it moves this logic 
into the tableam's implementation, so the implementation can do whatever 
make sense for the particular AM.)


- Heikki




Re: Error on failed COMMIT

2021-01-26 Thread Laurenz Albe
On Mon, 2021-01-25 at 11:29 -0500, Dave Cramer wrote:
> Rebased against head 
> 
> Here's my summary of the long thread above.
> 
> This change is in keeping with the SQL spec.
> 
> There is an argument (Tom) that says that this will annoy more people than it 
> will please.
>  I presume this is due to the fact that libpq behaviour will change.
> 
> As the author of the JDBC driver, and I believe I speak for other driver 
> (NPGSQL for one)
>  authors as well that have implemented the protocol I would argue that the 
> current behaviour
>  is more annoying.
> 
> We currently have to keep state and determine if COMMIT actually failed or it 
> ROLLED BACK.
>  There are a number of async drivers that would also benefit from not having 
> to keep state
>  in the session.

I think this change makes sense, and I think everybody agrees that it does, as
it makes PostgreSQL more standards-compliant.

About the fear that it will break user's applications:

I think that the breakage will be minimal.  All that will change is that
COMMIT of an aborted transaction raises an error.

Applications that catch an error in a transaction and roll back will not
be affected.  What will be affected are applications that do *not* check for
errors in statements in a transaction, but check for errors in the COMMIT.
I think that doesn't happen often.
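
A minimal psql sketch of the behavior being changed, as I understand the
patch:

    BEGIN;
    SELECT 1/0;   -- ERROR: division by zero; the transaction is now aborted
    COMMIT;       -- today: silently rolls back and reports ROLLBACK
                  -- with the patch: raises an error, per the SQL standard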

I agree that some people will be hurt, but I don't think it will be a major 
problem.

The patch applies and passes regression tests.

I wonder about the introduction of the new USER_ERROR level:

 #define WARNING_CLIENT_ONLY	20	/* Warnings to be sent to client as usual, but
 								 * never to the server log. */
-#define ERROR		21			/* user error - abort transaction; return to
+#define USER_ERROR	21
+#define ERROR		22			/* user error - abort transaction; return to
 								 * known state */
 /* Save ERROR value in PGERROR so it can be restored when Win32 includes
  * modify it.  We have to use a constant rather than ERROR because macros
  * are expanded only when referenced outside macros.
  */
 #ifdef WIN32
-#define PGERROR		21
+#define PGERROR		22
 #endif
-#define FATAL		22			/* fatal error - abort process */
-#define PANIC		23			/* take down the other backends with me */
+#define FATAL		23			/* fatal error - abort process */
+#define PANIC		24			/* take down the other backends with me */

I see that without that, COMMIT AND CHAIN does not behave correctly,
since the respective regression tests fail.

But I don't understand why.  I think that this needs some more comments to
make this clear.

Is this new message level something we need to allow setting for
"client_min_messages" and "log_min_messages"?

Yours,
Laurenz Albe





Re: [PoC] Non-volatile WAL buffer

2021-01-26 Thread Takashi Menjo
 Dear everyone, Tomas,

First of all, the "v4" patchset for non-volatile WAL buffer attached to the
previous mail is actually v5... Please read "v4" as "v5."

Then, to Tomas:
Thank you for the crash report you gave on Nov 27, 2020, regarding the msync
patchset. I applied the latest msync patchset v3 attached to the previous
mail onto master 411ae64 (on Jan 18, 2021) and then tested it; I got no error
when running pgbench -i -s 500. Please try it if necessary.

Best regards,
Takashi


On Tue, Jan 26, 2021 at 17:52, Takashi Menjo  wrote:

> Dear everyone,
>
> Sorry but I forgot to attach my patchsets... Please see the files attached
> to this mail. Please also note that they contain some fixes.
>
> Best regards,
> Takashi
>
>
> On Tue, Jan 26, 2021 at 17:46, Takashi Menjo  wrote:
>
>> Dear everyone,
>>
>> I'm sorry for the late reply. I rebased my two patchsets onto the latest
>> master 411ae64. The patchset prefixed with v4 is for non-volatile WAL
>> buffer; the other prefixed with v3 is for msync.
>>
>> I will reply to your valuable feedback one by one within a few days.
>> Please wait for a moment.
>>
>> Best regards,
>> Takashi
>>
>>
>> 01/25/2021(Mon) 11:56 Masahiko Sawada :
>>
>>> On Fri, Jan 22, 2021 at 11:32 AM Tomas Vondra
>>>  wrote:
>>> >
>>> >
>>> >
>>> > On 1/21/21 3:17 AM, Masahiko Sawada wrote:
>>> > > On Thu, Jan 7, 2021 at 2:16 AM Tomas Vondra
>>> > >  wrote:
>>> > >>
>>> > >> Hi,
>>> > >>
>>> > >> I think I've managed to get the 0002 patch [1] rebased to master and
>>> > >> working (with help from Masahiko Sawada). It's not clear to me how
>>> it
>>> > >> could have worked as submitted - my theory is that an incomplete
>>> patch
>>> > >> was submitted by mistake, or something like that.
>>> > >>
>>> > >> Unfortunately, the benchmark results were kinda disappointing. For a
>>> > >> pgbench on scale 500 (fits into shared buffers), an average of three
>>> > >> 5-minute runs looks like this:
>>> > >>
>>> > >> branch                  1      16      32      64      96
>>> > >> ----------------------------------------------------------
>>> > >> master               7291   87704  165310  150437  224186
>>> > >> ntt                  7912  106095  213206  212410  237819
>>> > >> simple-no-buffers    7654   96544  115416   95828  103065
>>> > >>
>>> > >> NTT refers to the patch from September 10, pre-allocating a large
>>> WAL
>>> > >> file on PMEM, and simple-no-buffers is the simpler patch simply
>>> removing
>>> > >> the WAL buffers and writing directly to a mmap-ed WAL segment on
>>> PMEM.
>>> > >>
>>> > >> Note: The patch is just replacing the old implementation with mmap.
>>> > >> That's good enough for experiments like this, but we probably want
>>> to
>>> > >> keep the old one for setups without PMEM. But it's good enough for
>>> > >> testing, benchmarking etc.
>>> > >>
>>> > >> Unfortunately, the results for this simple approach are pretty bad.
>>> Not
>>> > >> only compared to the "ntt" patch, but even to master. I'm not
>>> entirely
>>> > >> sure what's the root cause, but I have a couple hypotheses:
>>> > >>
>>> > >> 1) bug in the patch - That's clearly a possibility, although I've
>>> > >> tried to eliminate this possibility.
>>> > >>
>>> > >> 2) PMEM is slower than DRAM - From what I know, PMEM is much faster
>>> than
>>> > >> NVMe storage, but still much slower than DRAM (both in terms of
>>> latency
>>> > >> and bandwidth, see [2] for some data). It's not terrible, but the
>>> > >> latency is maybe 2-3x higher - not a huge difference, but may
>>> matter for
>>> > >> WAL buffers?
>>> > >>
>>> > >> 3) PMEM does not handle parallel writes well - If you look at [2],
>>> > >> Figure 4(b), you'll see that the throughput actually *drops" as the
>>> > >> number of threads increase. That's pretty strange / annoying,
>>> because
>>> > >> that's how we write into WAL buffers - each thread writes it's own
>>> data,
>>> > >> so parallelism is not something we can get rid of.
>>> > >>
>>> > >> I've added some simple profiling, to measure number of calls / time
>>> for
>>> > >> each operation (use -DXLOG_DEBUG_STATS to enable). It accumulates
>>> data
>>> > >> for each backend, and logs the counts every 1M ops.
>>> > >>
>>> > >> Typical stats from a concurrent run looks like this:
>>> > >>
>>> > >> xlog stats cnt 4300
>>> > >>map cnt 100 time 5448333 unmap cnt 100 time 3730963
>>> > >>memcpy cnt 985964 time 1550442272 len 15150499
>>> > >>memset cnt 0 time 0 len 0
>>> > >>persist cnt 13836 time 10369617 len 16292182
>>> > >>
>>> > >> The times are in nanoseconds, so this says the backend did 100
>>> mmap and
>>> > >> unmap calls, taking ~10ms in total. There were ~14k pmem_persist
>>> calls,
>>> > >> taking 10ms in total. And the most time (~1.5s) was used by
>>> pmem_memcpy
>>> > >> copying about 15MB of data. That's quite a lot :-(
>>> > >
>>> > > It might also be interesting if we can see how much time spent on
>>> each
>>> > > logging function, 

Re: Protect syscache from bloating with negative cache entries

2021-01-26 Thread Heikki Linnakangas

Hi,

On 19/11/2020 07:25, Kyotaro Horiguchi wrote:

Performance measurement on the attached showed better result about
searching but maybe worse for cache entry creation.  Each time number
is the mean of 10 runs.

# Catcache (negative) entry creation
            :  time(ms)  (% to master)
master      :  3965.61   (100.0)
patched-off :  4040.93   (101.9)
patched-on  :  4032.22   (101.7)

# Searching negative cache entries
master      :  8173.46   (100.0)
patched-off :  7983.43   ( 97.7)
patched-on  :  8049.88   ( 98.5)

# Creation, searching and expiration
master      :  6393.23   (100.0)
patched-off :  6527.94   (102.1)
patched-on  : 15880.01   (248.4)


That is, catcache searching gets faster by 2-3% but creation gets
slower by about 2%. If I moved the condition of 2 further up to
CatalogCacheCreateEntry(), that degradation dropped to 0.6%.

# Catcache (negative) entry creation
master  :  3967.45   (100.0)
patched-off :  3990.43   (100.6)
patched-on  :  4108.96   (103.6)

# Searching negative cache entries
master  :  8106.53   (100.0)
patched-off :  8036.61   ( 99.1)
patched-on  :  8058.18   ( 99.4)

# Creation, searching and expiration
master  :  6395.00   (100.0)
patched-off :  6416.57   (100.3)
patched-on  : 15830.91   (247.6)


Can you share the exact script or steps to reproduce these numbers? I 
presume these are from the catcachebench extension, but I can't figure 
out which scenario above corresponds to which catcachebench test. Also, 
catcachebench seems to depend on a bunch of tables being created in 
schema called "test"; what tables did you use for the above numbers?


- Heikki




Re: Add SQL function for SHA1

2021-01-26 Thread Daniel Gustafsson
> On 26 Jan 2021, at 04:28, Noah Misch  wrote:
> 
> On Mon, Jan 25, 2021 at 10:12:28PM +0900, Michael Paquier wrote:
>> SHA-1 is now an option available for cryptohashes, and like the
>> existing set of functions of SHA-2, I don't really see a reason why we
>> should not have a SQL function for SHA1.
> 
> NIST deprecated SHA1 over ten years ago.  It's too late to be adding this.

Agreed, and pgcrypto already allows for using sha1.

It seems like any legitimate need for sha1 could be better served by an
extension rather than supplying it in-core.
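
For example, a quick sketch of what is already possible with pgcrypto
(assuming the extension is available):

    CREATE EXTENSION IF NOT EXISTS pgcrypto;
    SELECT encode(digest('hello', 'sha1'), 'hex');
    -- aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d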

--
Daniel Gustafsson   https://vmware.com/




Re: FETCH FIRST clause PERCENT option

2021-01-26 Thread Jaime Casanova
On Mon, Jan 25, 2021 at 6:39 AM Kyotaro Horiguchi
 wrote:
>
> Sorry for the dealy. I started to look this.
>
> At Fri, 25 Sep 2020 12:25:24 +0300, Surafel Temesgen  
> wrote in
> > Hi Michael
> > On Thu, Sep 24, 2020 at 6:58 AM Michael Paquier  wrote:
> >
> > > On Mon, Aug 10, 2020 at 01:23:44PM +0300, Surafel Temesgen wrote:
> > > > I also Implement PERCENT WITH TIES option. patch is attached
> > > > i don't start a new tread because the patches share common code
> > >
> > > This fails to apply per the CF bot.  Could you send a rebase?
>
> This still applies on the master HEAD.
>
> percent-incremental-v11.patch
>
> The existing nodeLimit passes the slot of the subnode to the
> caller, but this patch changes that behavior. You added a new function
> to tuplestore.c so as not to store a minimal tuple into the slot passed
> from the subnode, but we should refrain from scribbling on that slot.
> Instead, the PERCENT path of the limit node should use its own
> ResultTupleSlot for the purpose. See nodeSort for a concrete example.
>
>
> +   LIMIT_OPTION_PER_WITH_TIES, /* FETCH FIRST... PERCENT WITH TIES */
>
> That name is a bit hard to read. We should spell it with complete
> words.
>
> case LIMIT_INWINDOW:
> ...
> +   if (IsPercentOption(node->limitOption) && 
> node->backwardPosition
> ...
> +   if (IsPercentOption(node->limitOption) && 
> node->reachEnd)
> ...
> +   if (IsPercentOption(node->limitOption))
>
> I think we can use a separate lstate state for each condition above,
> since IsPercentOption() gives a constant result throughout execution.
> For example, LIMIT_PERCENT_TUPLESLOT_NOT_FILLED and
> LIMIT_PERCENT_TUPLESLOT_FILLED and some derived states similar to the
> non-percent path.  I *feel* that makes the code simpler.
>
> What do you think about this?
>
> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center
>
>

Hi,

I was testing this and found that it doesn't work well with
subselects; this query makes the server crash:

"""
select * from pg_type where exists (select 1 from pg_authid fetch
first 10 percent rows only);
"""

postgres was compiled with these options:

"""
CFLAGS="-ggdb -O0 -g3 -fno-omit-frame-pointer" ./configure
--prefix=/opt/var/pgdg/14dev/percent --enable-debug --enable-cassert
--with-pgport=54314 --enable-depend
"""

Attached is the stack trace. Judging from the failed assertion
("limit->consttype == INT8OID" in simplify_EXISTS_query()), the planner
seems to assume that the LIMIT constant is always an int8, which no longer
holds for the PERCENT variant.

-- 
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
set = {__val = {0, 140724437079104, 2, 6, 6458975, 94647092334544, 
4611686018427388799, 139701007526566, 0, 281470681751456, 0, 0, 0, 0, 0, 0}}
pid = 
tid = 
ret = 
#1  0x7f0eacb55535 in __GI_abort () at abort.c:79
save_stage = 1
act = {__sigaction_handler = {sa_handler = 0x0, sa_sigaction = 0x0}, 
sa_mask = {__val = {0, 0, 0, 0, 0, 139701005283317, 2, 3760846768312216920, 
7018409443653412194, 94647092334544, 
  7003717752975484000, 0, 3976935554753230848, 140724437079344, 0, 
140724437080208}}, sa_flags = -1101967408, sa_restorer = 0x0}
sigs = {__val = {32, 0 }}
#2  0x5614beb8d44e in ExceptionalCondition (conditionName=0x5614bed22e7b 
"limit->consttype == INT8OID", errorType=0x5614bed22b76 "FailedAssertion", 
fileName=0x5614bed22c44 "subselect.c", lineNumber=1594)
at assert.c:69
No locals.
#3  0x5614be8df87d in simplify_EXISTS_query (root=0x5614bfecf4b8, 
query=0x5614bfe038e0) at subselect.c:1594
node = 0x5614bfed0190
limit = 0x5614bfed0190
#4  0x5614be8df49b in convert_EXISTS_sublink_to_join (root=0x5614bfecf4b8, 
sublink=0x5614bfe036e0, under_not=false, available_rels=0x5614bfecf1d8) at 
subselect.c:1420
result = 0x5614bfecf940
parse = 0x5614bfe039f8
subselect = 0x5614bfe038e0
whereClause = 0x5614bfecf8b8
rtoffset = 22036
varno = 22036
clause_varnos = 0x18
upper_varnos = 0x20
#5  0x5614be8e3d8d in pull_up_sublinks_qual_recurse (root=0x5614bfecf4b8, 
node=0x5614bfe036e0, jtlink1=0x7ffcf6155a48, available_rels1=0x5614bfecf1d8, 
jtlink2=0x0, available_rels2=0x0) at prepjointree.c:458
sublink = 0x5614bfe036e0
j = 0x0
child_rels = 0x5614bfecf940
#6  0x5614be8e38cc in pull_up_sublinks_jointree_recurse 
(root=0x5614bfecf4b8, jtnode=0x5614bfecf3d0, relids=0x7ffcf6155ab0) at 
prepjointree.c:275
frelids = 0x5614bfecf1d8
newf = 0x5614bfecf928
jtlink = 0x5614bfecf928
f = 0x5614bfecf3d0
newfromlist = 0x5614bfecf8d0
l = 0x0
__func__ = "pull_up_sublinks_jointree_recurse"
#7  0x5614be8e36ce in pull_up_sublinks (root=0x5614bfecf4b8) at 
prepjointree.c:214
jtnode = 0x5614bfecf6e8
relids = 0x0
#8  0x5614be8c9e6a in subquery_planner (glob=0x5614bfe2acc0, 

Re: shared tempfile was not removed on statement_timeout

2021-01-26 Thread Heikki Linnakangas

On 26/01/2021 06:46, Kyotaro Horiguchi wrote:

Looking at the comment of SharedFileSetOnDetach:

| * everything in them.  We can't raise an error on failures, because this runs
| * in error cleanup paths.

I feel that a function that shouldn't error-out also shouldn't be
cancellable. If that's the case, we can omit the CHECK_FOR_INTERRUPTS() in
walkdir() when elevel is smaller than ERROR.

=
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index b58502837a..593c23553e 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -3374,7 +3374,9 @@ walkdir(const char *path,
 	{
 		char		subpath[MAXPGPATH * 2];
 
-		CHECK_FOR_INTERRUPTS();
+		/* omit interrupts while we shouldn't error-out */
+		if (elevel >= ERROR)
+			CHECK_FOR_INTERRUPTS();
 
 		if (strcmp(de->d_name, ".") == 0 ||
 			strcmp(de->d_name, "..") == 0)
=


Don't we potentially have the same problem with all on_dsm_detach 
callbacks? Looking at the other on_dsm_detach callbacks, I don't see any 
CHECK_FOR_INTERRUPT() calls in them, but it seems fragile to assume that.


I'd suggest adding HOLD/RESUME_INTERRUPTS() to dsm_detach(). At least 
around the removal of the callback from the list and calling the 
callback. Maybe even over the whole function.


- Heikki




Re: [PoC] Non-volatile WAL buffer

2021-01-26 Thread Takashi Menjo
Dear everyone,

I'm sorry for the late reply. I rebased my two patchsets onto the latest
master 411ae64. The patchset prefixed with v4 is for non-volatile WAL
buffer; the other prefixed with v3 is for msync.

I will reply to your valuable feedback one by one within a few days. Please
wait for a moment.

Best regards,
Takashi


01/25/2021(Mon) 11:56 Masahiko Sawada :

> On Fri, Jan 22, 2021 at 11:32 AM Tomas Vondra
>  wrote:
> >
> >
> >
> > On 1/21/21 3:17 AM, Masahiko Sawada wrote:
> > > On Thu, Jan 7, 2021 at 2:16 AM Tomas Vondra
> > >  wrote:
> > >>
> > >> Hi,
> > >>
> > >> I think I've managed to get the 0002 patch [1] rebased to master and
> > >> working (with help from Masahiko Sawada). It's not clear to me how it
> > >> could have worked as submitted - my theory is that an incomplete patch
> > >> was submitted by mistake, or something like that.
> > >>
> > >> Unfortunately, the benchmark results were kinda disappointing. For a
> > >> pgbench on scale 500 (fits into shared buffers), an average of three
> > >> 5-minute runs looks like this:
> > >>
> > >> branch                  1      16      32      64      96
> > >> ----------------------------------------------------------
> > >> master               7291   87704  165310  150437  224186
> > >> ntt                  7912  106095  213206  212410  237819
> > >> simple-no-buffers    7654   96544  115416   95828  103065
> > >>
> > >> NTT refers to the patch from September 10, pre-allocating a large WAL
> > >> file on PMEM, and simple-no-buffers is the simpler patch simply
> removing
> > >> the WAL buffers and writing directly to a mmap-ed WAL segment on PMEM.
> > >>
> > >> Note: The patch is just replacing the old implementation with mmap.
> > >> That's good enough for experiments like this, but we probably want to
> > >> keep the old one for setups without PMEM. But it's good enough for
> > >> testing, benchmarking etc.
> > >>
> > >> Unfortunately, the results for this simple approach are pretty bad.
> Not
> > >> only compared to the "ntt" patch, but even to master. I'm not entirely
> > >> sure what's the root cause, but I have a couple hypotheses:
> > >>
> > >> 1) bug in the patch - That's clearly a possibility, although I've
> > >> tried to eliminate this possibility.
> > >>
> > >> 2) PMEM is slower than DRAM - From what I know, PMEM is much faster
> than
> > >> NVMe storage, but still much slower than DRAM (both in terms of
> latency
> > >> and bandwidth, see [2] for some data). It's not terrible, but the
> > >> latency is maybe 2-3x higher - not a huge difference, but may matter
> for
> > >> WAL buffers?
> > >>
> > >> 3) PMEM does not handle parallel writes well - If you look at [2],
> > >> Figure 4(b), you'll see that the throughput actually *drops" as the
> > >> number of threads increase. That's pretty strange / annoying, because
> > >> that's how we write into WAL buffers - each thread writes it's own
> data,
> > >> so parallelism is not something we can get rid of.
> > >>
> > >> I've added some simple profiling, to measure number of calls / time
> for
> > >> each operation (use -DXLOG_DEBUG_STATS to enable). It accumulates data
> > >> for each backend, and logs the counts every 1M ops.
> > >>
> > >> Typical stats from a concurrent run looks like this:
> > >>
> > >> xlog stats cnt 4300
> > >>map cnt 100 time 5448333 unmap cnt 100 time 3730963
> > >>memcpy cnt 985964 time 1550442272 len 15150499
> > >>memset cnt 0 time 0 len 0
> > >>persist cnt 13836 time 10369617 len 16292182
> > >>
> > >> The times are in nanoseconds, so this says the backend did 100  mmap
> and
> > >> unmap calls, taking ~10ms in total. There were ~14k pmem_persist
> calls,
> > >> taking 10ms in total. And the most time (~1.5s) was used by
> pmem_memcpy
> > >> copying about 15MB of data. That's quite a lot :-(
> > >
> > > It might also be interesting if we can see how much time spent on each
> > > logging function, such as XLogInsert(), XLogWrite(), and XLogFlush().
> > >
> >
> > Yeah, we could extend it to that; that's a fairly mechanical thing. But
> > maybe that could be visible in a regular perf profile. Also, I suppose
> > most of the time will be used by the pmem calls, shown in the stats.
> >
> > >>
> > >> My conclusion from this is that eliminating WAL buffers and writing
> WAL
> > >> directly to PMEM (by memcpy to mmap-ed WAL segments) is probably not
> the
> > >> right approach.
> > >>
> > >> I suppose we should keep WAL buffers, and then just write the data to
> > >> mmap-ed WAL segments on PMEM. Which I think is what the NTT patch
> does,
> > >> except that it allocates one huge file on PMEM and writes to that
> > >> (instead of the traditional WAL segments).
> > >>
> > >> So I decided to try how it'd work with writing to regular WAL
> segments,
> > >> mmap-ed ad hoc. The pmem-with-wal-buffers-master.patch patch does
> that,
> > >> and the results look a bit nicer:
> > 

Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2021-01-26 Thread Bharath Rupireddy
On Tue, Jan 26, 2021 at 1:55 PM Fujii Masao  wrote:
> Thanks for the patch! I also created that patch, confirmed that the test
> successfully passed with -DENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS,
> and pushed the patch.

Thanks a lot!

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2021-01-26 Thread Fujii Masao




On 2021/01/26 17:07, Bharath Rupireddy wrote:

On Tue, Jan 26, 2021 at 1:27 PM Fujii Masao  wrote:

Yes, so I pushed that change to stabilize the regression test.
Let's keep checking how the results of buildfarm members are changed.


Sorry, I'm unfamiliar with checking the system status on the build
farm website - https://buildfarm.postgresql.org/cgi-bin/show_failures.pl.
I'm trying to figure that out.


+WARNING:  roles created by regression test cases should have names starting with 
"regress_"
   CREATE ROLE multi_conn_user2 SUPERUSER;
+WARNING:  roles created by regression test cases should have names starting with 
"regress_"

Hmm... another failure happened.


My bad. I should have caught that earlier. I will take care in future.

Attaching a patch to fix it.


Thanks for the patch! I also created that patch, confirmed that the test
successfully passed with -DENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS,
and pushed the patch.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2021-01-26 Thread Bharath Rupireddy
On Tue, Jan 26, 2021 at 1:27 PM Fujii Masao  wrote:
> > Yes, so I pushed that change to stabilize the regression test.
> > Let's keep checking how the results of buildfarm members are changed.

Sorry, I'm unfamiliar with checking the system status on the build
farm website - https://buildfarm.postgresql.org/cgi-bin/show_failures.pl.
I'm trying to figure that out.

> +WARNING:  roles created by regression test cases should have names starting 
> with "regress_"
>   CREATE ROLE multi_conn_user2 SUPERUSER;
> +WARNING:  roles created by regression test cases should have names starting 
> with "regress_"
>
> Hmm... another failure happened.

My bad. I should have caught that earlier. I will take care in future.

Attaching a patch to fix it.
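
Presumably the fix is simply to rename the roles so they follow the enforced
convention, e.g. (sketch):

    CREATE ROLE regress_multi_conn_user2 SUPERUSER;  -- instead of multi_conn_user2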

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


v1-0001-Stabilize-test-case-for-postgres_fdw_disconnect_a.patch
Description: Binary data