Re: Logical Replication - improve error message while adding tables to the publication in check_publication_add_relation

2021-03-09 Thread Jeevan Ladhe
On Wed, Mar 10, 2021 at 10:44 AM Bharath Rupireddy <
bharath.rupireddyforpostg...@gmail.com> wrote:

> Hi,
>
> While providing thoughts on [1], I observed that the error messages
> that are emitted while adding foreign, temporary and unlogged tables
> can be improved a bit from the existing [2] to [3].
>

+1 for improving the error messages here.


> Attaching a small patch. Thoughts?
>

I had a look at the patch and it looks good to me. However, now that the
specific kind of table is named in the error message itself, the error detail
seems to give redundant information, but others might have different thoughts.

The patch itself looks good otherwise. Also, make check and the postgres_fdw
tests are passing.

Regards,
Jeevan Ladhe


Re: a misbehavior of partition row movement (?)

2021-03-09 Thread Masahiko Sawada
On Fri, Feb 26, 2021 at 4:30 PM Amit Langote  wrote:
>
> Hi Rahila,
>
> On Wed, Feb 24, 2021 at 3:07 PM Rahila Syed  wrote:
> >> > I think the documentation update is missing from the patches.
> >>
> >> Hmm, I don't think we document the behavior that is improved by the v3
> >> patches as a limitation of any existing feature, neither of foreign
> >> keys referencing partitioned tables nor of the update row movement
> >> feature.  So maybe there's nothing in the existing documentation that
> >> is to be updated.
> >>
> >> However, the patch does add a new error message for a case that the
> >> patch doesn't handle, so maybe we could document that as a limitation.
> >> Not sure if in the Notes section of the UPDATE reference page which
> >> has some notes on row movement or somewhere else.  Do you have
> >> suggestions?
> >>
> > You are right, I could not find any direct explanation of the impact of
> > row movement during UPDATE on a referencing table in the PostgreSQL docs.
> >
> > The two documents that come close are either:
>
> Thanks for looking those up.
>
> > 1. https://www.postgresql.org/docs/13/trigger-definition.html .
> > The para starting with "If an UPDATE on a partitioned table causes a row to 
> > move to another partition"
> > However, this does not describe the behaviour of  internal triggers which 
> > is the focus of this patch.
>
> The paragraph does talk about a very related topic, but, like you, I
> am not very excited about adding a line here about what we're doing
> with internal triggers.
>
> > 2. Another one like you mentioned,  
> > https://www.postgresql.org/docs/11/sql-update.html
> > This has explanation for row movement behaviour for partitioned table but 
> > does not explain
> > any impact of such behaviour on a referencing table.
> > I think it is worth adding some explanation in this document, i.e.,
> > explaining the impact on referencing tables here, as it already describes
> > the behaviour of UPDATE on a partitioned table.
>
> ISTM the description of the case that will now be prevented seems too
> obscure to make into a documentation line, but I tried.  Please check.
>

I looked at the 0001 patch and here are random comments. Please ignore
a comment if it is already discussed.

---
@@ -9077,7 +9102,8 @@ addFkRecurseReferenced(List **wqueue, Constraint
*fkconstraint, Relation rel,
   partIndexId, constrOid, numfks,
   mapped_pkattnum, fkattnum,
   pfeqoperators, ppeqoperators, ffeqoperators,
-  old_check_ok);
+  old_check_ok,
+  deleteTriggerOid, updateTriggerOid);

/* Done -- clean up (but keep the lock) */
table_close(partRel, NoLock);
@@ -9126,8 +9152,12 @@ addFkRecurseReferencing(List **wqueue,
Constraint *fkconstraint, Relation rel,
Relation pkrel, Oid indexOid, Oid parentConstr,
int numfks, int16 *pkattnum, int16 *fkattnum,
Oid *pfeqoperators, Oid *ppeqoperators, Oid
*ffeqoperators,
-   bool old_check_ok, LOCKMODE lockmode)
+   bool old_check_ok, LOCKMODE lockmode,
+   Oid parentInsTrigger, Oid parentUpdTrigger)
 {

We need to update the function comments as well.

---
I think it's better to add comments for the newly added functions such as
GetForeignKeyActionTriggers() and GetForeignKeyCheckTriggers().
Those functions have no comments at all.

BTW, two of the four newly added functions, AttachForeignKeyCheckTriggers()
and DetachForeignKeyCheckTriggers(), have only one caller each. Can we paste
each function's body at its call site instead?

---
/*
 * If the referenced table is a plain relation, create the action triggers
 * that enforce the constraint.
 */
-   if (pkrel->rd_rel->relkind == RELKIND_RELATION)
-   {
-   createForeignKeyActionTriggers(rel, RelationGetRelid(pkrel),
-  fkconstraint,
-  constrOid, indexOid);
-   }
+   createForeignKeyActionTriggers(rel, RelationGetRelid(pkrel),
+  fkconstraint,
+  constrOid, indexOid,
+  parentDelTrigger, parentUpdTrigger,
+  &deleteTriggerOid, &updateTriggerOid);

The comment needs to be updated.

---
 /*
  * If the referencing relation is a plain table, add the check triggers to
  * it and, if necessary, schedule it to be checked in Phase 3.
  *
  * If the relation is partitioned, drill down to do it to its partitions.
  */
+createForeignKeyCheckTriggers(RelationGetRelid(rel),
+  RelationGetRelid(pkrel),
+  fkconstraint,
+  parentConstr,
+  

Re: Freeze the inserted tuples during CTAS?

2021-03-09 Thread Paul Guo
> On Mar 3, 2021, at 1:35 PM, Masahiko Sawada  wrote:
>> On Sun, Feb 21, 2021 at 4:46 PM Paul Guo  wrote:
>> Attached is the v2 version that fixes a test failure due to plan change 
>> (bitmap index scan -> index only scan).

> I think this is a good idea.

> BTW, how much does this patch affect the CTAS performance? I expect
> it's negligible but If there is much performance degradation due to
> populating visibility map, it might be better to provide a way to
> disable it.

Yes,  this is a good suggestion. I did a quick test yesterday.

Configuration: shared_buffers = 1280M and the test system memory is 7G.

Test queries:
  checkpoint;
  \timing
  create table t1 (a, b, c, d) as select i,i,i,i from 
generate_series(1,2000) i;
  \timing
  select pg_size_pretty(pg_relation_size('t1'));

Here are the running time:

HEAD   : Time: 10299.268 ms (00:10.299)  + 1537.876 ms (00:01.538)  
   
Patch  : Time: 12257.044 ms (00:12.257)  + 14.247 ms
 

The table size is 800+ MB, so the table should fit entirely in shared buffers.
I was surprised to see that the patch increases the CTAS time by 19.x%, and
also that it is no better than "CTAS + VACUUM" on HEAD. In theory the
visibility map buffer changes should not have that much impact. I looked at
the related code again (heap_insert()). I believe the overhead could decrease
along with some of the discussed CTAS optimizations (multi-insert, raw-insert,
etc.).

I tested COPY also. COPY FREEZE does not involve much more overhead than plain
COPY according to the experiment results below. COPY uses multi-insert; other
than that, there seems to be no relevant difference from CTAS when writing a
new table.

COPY TO + VACUUM
Time: 8826.995 ms (00:08.827) + 1599.260 ms (00:01.599)
COPY TO FREEZE + VACUUM
Time: 8836.107 ms (00:08.836) + 13.581 ms

So maybe we should think about doing freeze in CTAS after the CTAS performance
optimizations land?

By the way, REFRESH MATERIALIZED VIEW does freeze by default, and a matview is
quite similar to CTAS. I tested it also and the conclusion is similar to that
of CTAS. I'm not sure why FREEZE was enabled there, though; maybe I missed
something?



Re: Occasional tablespace.sql failures in check-world -jnn

2021-03-09 Thread Michael Paquier
On Mon, Mar 08, 2021 at 11:53:57AM +0100, Peter Eisentraut wrote:
> On 09.12.20 08:55, Michael Paquier wrote:
>> ...  Because we may still introduce this problem again if some new
>> stuff uses src/test/pg_regress in a way similar to pg_upgrade,
>> triggering again tablespace-setup.  Something like the attached may be
>> enough, though I have not spent much time checking the surroundings,
>> Windows included.
> 
> This patch looks alright to me.

So, I have spent more time checking the surroundings of this patch,
and finally applied it.  Thanks for the review, Peter.
--
Michael




Re: shared-memory based stats collector

2021-03-09 Thread Fujii Masao




On 2021/03/10 12:10, Kyotaro Horiguchi wrote:

At Tue, 9 Mar 2021 23:24:10 +0900, Fujii Masao  
wrote in



On 2021/03/09 16:51, Kyotaro Horiguchi wrote:

At Sat, 6 Mar 2021 00:32:07 +0900, Fujii Masao 
wrote in

I don't think that we should treat a non-zero exit status as a crash,
as before. Otherwise, when archive_command fails on a signal,
the archiver emits a FATAL error, which leads to a server restart.

Sounds reasonable. Now the archiver is treated the same way as the wal
receiver.  Specifically, exit(1) doesn't cause a server restart.


Thanks!

-   if (PgArchStartupAllowed())
-   PgArchPID = pgarch_start();

In the latest patch, why did you remove the code to restart a new archiver
in reaper()? When the archiver dies, I think a new archiver should be restarted
like the current reaper() does. Otherwise, the restart of the archiver can be
delayed until the next cycle of ServerLoop, which may take time.


Agreed. I moved the code back to its original place and added the crash
handling code. I also added a phrase to the comment.

+* Was it the archiver?  If exit status is zero (normal) or one 
(FATAL
+* exit), we assume everything is all right just like normal 
backends
+* and just try to restart a new one so that we immediately 
retry
  
+* archiving of remaining files. (If fail, we'll try again in 
future




"of" of "archiving of remaining" should be replaced with "the", or removed?


Just for the record: previously, LogChildExit() was called and the following
LOG message was output when the archiver reported a FATAL error. OTOH, the
patch prevents that, and the following LOG message is not output at a FATAL
exit of the archiver. But I don't think that message is required in that case,
because a FATAL message indicating much the same thing is already output.
Therefore, I'm OK with the patch.

LOG:  archiver process (PID 46418) exited with exit code 1



I read v50_003 patch.

When the archiver dies, shouldn't ProcGlobal->archiverLatch be reset to NULL,
like walreceiver does in WalRcvDie()?


Unlike walwriter and checkpointer, the archiver (like walreceiver) may die
while the server is running. Leaving the latch pointer alone may lead to
nudging the wrong process if another one takes over the same procarray slot.
Added pgarch_die() to do that.


Thanks!

+   if (IsUnderPostmaster && ProcGlobal->archiverLatch)
+   SetLatch(ProcGlobal->archiverLatch);

The latch can be reset to NULL in pgarch_die() between the if-condition and
SetLatch(), which would be problematic. Probably we should protect the access
to the latch with a spinlock, like we do for walreceiver's latch?
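
For illustration, the walreceiver-style pattern applied to the archiver latch
might look roughly like this (the "archiverMutex" field is only a placeholder
for this sketch; it doesn't exist today):

Latch  *latch;

/*
 * Fetch the latch pointer while holding the spinlock, so that pgarch_die()
 * cannot change it underneath us; it is set to NULL when the archiver exits.
 */
SpinLockAcquire(&ProcGlobal->archiverMutex);	/* placeholder mutex */
latch = ProcGlobal->archiverLatch;
SpinLockRelease(&ProcGlobal->archiverMutex);

if (latch)
	SetLatch(latch);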




(I moved the archiverLatch to just after checkpointerLatch in this version.)


In pgarch.c, #include "postmaster/fork_process.h" seems no longer necessary.


Right. That's not due to this patch; postmaster.h, dsm.h and pg_shmem.h
are not used. (fd.h would not otherwise be necessary, but pgarch.c uses
AllocateDir().)


+   if (strcmp(argv[1], "--forkarch") == 0)
+   {

Why is this necessary? I was thinking that "--forkboot" handles archiver
in SubPostmasterMain().


Yeah, the corresponding code is removed in the same patch at the same
time.

The attached is v51 patchset.


Thanks a lot!

Regards,


--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: [PATCH] Identify LWLocks in tracepoints

2021-03-09 Thread Craig Ringer
On Wed, 3 Mar 2021 at 20:50, David Steele  wrote:

> On 1/22/21 6:02 AM, Peter Eisentraut wrote:
>
> This patch set no longer applies:
> http://cfbot.cputube.org/patch_32_2927.log.
>
> Can we get a rebase? Also marked Waiting on Author.
>

Rebased as requested.

I'm still interested in whether Andres will be able to do anything about
identifying LWLocks in a cross-backend manner. But this work doesn't really
depend on that; it'd benefit from it, but would be easily adapted to it
later if needed.
From 36c7ddcbca2dbbcb2967f01cb92aa1f61620c838 Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Thu, 19 Nov 2020 17:38:45 +0800
Subject: [PATCH 1/4] Pass the target LWLock* and tranche ID to LWLock
 tracepoints

Previously the TRACE_POSTGRESQL_LWLOCK_ tracepoints only received a
pointer to the LWLock tranche name. This made it impossible to identify
individual locks.

Passing the lock pointer itself isn't perfect. If the lock is allocated inside
a DSM segment then it might be mapped at a different address in different
backends. It's safe to compare lock pointers between backends (assuming
!EXEC_BACKEND) if they're in the individual lock tranches or an
extension-requested named tranche, but not necessarily for tranches in
BuiltinTrancheIds or tranches >= LWTRANCHE_FIRST_USER_DEFINED that were
directly assigned with LWLockNewTrancheId(). Still, it's better than nothing;
the pointer is stable within a backend, and usually between backends.
---
 src/backend/storage/lmgr/lwlock.c | 35 +++
 src/backend/utils/probes.d| 18 +---
 2 files changed, 32 insertions(+), 21 deletions(-)

diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 8cb6a6f042..5c8744d316 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -1321,7 +1321,8 @@ LWLockAcquire(LWLock *lock, LWLockMode mode)
 #endif
 
 		LWLockReportWaitStart(lock);
-		TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), mode);
+		TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), mode, lock,
+lock->tranche);
 
 		for (;;)
 		{
@@ -1343,7 +1344,8 @@ LWLockAcquire(LWLock *lock, LWLockMode mode)
 		}
 #endif
 
-		TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), mode);
+		TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), mode, lock,
+lock->tranche);
 		LWLockReportWaitEnd();
 
 		LOG_LWDEBUG("LWLockAcquire", lock, "awakened");
@@ -1352,7 +1354,7 @@ LWLockAcquire(LWLock *lock, LWLockMode mode)
 		result = false;
 	}
 
-	TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(lock), mode);
+	TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(lock), mode, lock, lock->tranche);
 
 	/* Add lock to list of locks held by this backend */
 	held_lwlocks[num_held_lwlocks].lock = lock;
@@ -1403,14 +1405,16 @@ LWLockConditionalAcquire(LWLock *lock, LWLockMode mode)
 		RESUME_INTERRUPTS();
 
 		LOG_LWDEBUG("LWLockConditionalAcquire", lock, "failed");
-		TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE_FAIL(T_NAME(lock), mode);
+		TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE_FAIL(T_NAME(lock), mode, lock,
+lock->tranche);
 	}
 	else
 	{
 		/* Add lock to list of locks held by this backend */
 		held_lwlocks[num_held_lwlocks].lock = lock;
 		held_lwlocks[num_held_lwlocks++].mode = mode;
-		TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE(T_NAME(lock), mode);
+		TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE(T_NAME(lock), mode, lock,
+lock->tranche);
 	}
 	return !mustwait;
 }
@@ -1482,7 +1486,8 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode)
 #endif
 
 			LWLockReportWaitStart(lock);
-			TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), mode);
+			TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), mode, lock,
+	lock->tranche);
 
 			for (;;)
 			{
@@ -1500,7 +1505,8 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode)
 Assert(nwaiters < MAX_BACKENDS);
 			}
 #endif
-			TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), mode);
+			TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), mode, lock,
+	lock->tranche);
 			LWLockReportWaitEnd();
 
 			LOG_LWDEBUG("LWLockAcquireOrWait", lock, "awakened");
@@ -1530,7 +1536,8 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode)
 		/* Failed to get lock, so release interrupt holdoff */
 		RESUME_INTERRUPTS();
 		LOG_LWDEBUG("LWLockAcquireOrWait", lock, "failed");
-		TRACE_POSTGRESQL_LWLOCK_ACQUIRE_OR_WAIT_FAIL(T_NAME(lock), mode);
+		TRACE_POSTGRESQL_LWLOCK_ACQUIRE_OR_WAIT_FAIL(T_NAME(lock), mode, lock,
+lock->tranche);
 	}
 	else
 	{
@@ -1538,7 +1545,8 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode)
 		/* Add lock to list of locks held by this backend */
 		held_lwlocks[num_held_lwlocks].lock = lock;
 		held_lwlocks[num_held_lwlocks++].mode = mode;
-		TRACE_POSTGRESQL_LWLOCK_ACQUIRE_OR_WAIT(T_NAME(lock), mode);
+		TRACE_POSTGRESQL_LWLOCK_ACQUIRE_OR_WAIT(T_NAME(lock), mode, lock,
+lock->tranche);
 	}
 
 	return !mustwait;
@@ -1698,7 +1706,8 @@ LWLockWaitForVar(LWLock *lock, uint64 *valptr, uint64 oldval, uint64 *newval)
 #endif
 
 		LWLockReportWaitStart(lock);
-		

Re: [HACKERS] logical decoding of two-phase transactions

2021-03-09 Thread Peter Smith
On Tue, Mar 9, 2021 at 9:55 PM Amit Kapila  wrote:
>
> On Tue, Mar 9, 2021 at 3:22 PM Ajin Cherian  wrote:
> >
>
> Few comments:
> ==
>
> 3. In prepare_spoolfile_replay_messages(), it is better to free the
> memory allocated for temporary strings buffer and s2.

I guess this was suggested because it is what the
apply_handle_stream_commit() function does for very similar code.
But the same approach cannot work for the *_replay_messages() function,
because those buffers are allocated in TopTransactionContext and they are
already freed as a side-effect when the last psf message (the
LOGICAL_REP_MSG_PREPARE) is replayed/dispatched, ending the transaction. So
attempting to free them again causes a segmentation violation (I already fixed
this exact problem last week, when the pfree code was still in the code).

> 5. I think prepare_spoolfile_close can be extended to take PsfFile as
> input and then it can be also used from
> prepare_spoolfile_replay_messages.

No, the *_close() is intended only for ending the "current" psf (the
global psf_cur) which was being spooled. The function comment says the
same. The *_close() is paired with the *_create() which created psf_cur.

Whereas the replay fd closed at commit time is just a locally opened fd
unrelated to psf_cur. This close is deliberately self-contained in the
*_replay_messages() function, which is not dissimilar to what the other
streaming spool file code does; e.g., notice that the
apply_handle_stream_commit function simply closes its own fd using
BufFileClose rather than delegating to stream_close_file().

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Confusing behavior of psql's \e

2021-03-09 Thread Tom Lane
Laurenz Albe  writes:
> On Thu, 2021-03-04 at 16:51 +, Jacob Champion wrote:
>> You could backdate the temporary file, so that any save is guaranteed
>> to move the timestamp forward. That should work even if the filesystem
>> has extremely poor precision.

> Ah, of course, that is the way to go.

I took a quick look at this.  I don't have an opinion yet about the
question of changing the when-to-discard-the-buffer rules, but I agree
that trying to get rid of the race condition inherent in the existing
file mtime test would be a good idea.  However, I've got some
portability-related gripes about how you are doing the latter:

1. There is no principled reason to assume that the epoch date is in the
past.  IIRC, Postgres' timestamp epoch of 2000-01-01 was in the future
at the time we set it.  More relevant to the immediate issue, I clearly
recall a discussion at Red Hat in which one of the principal glibc
maintainers (likely Ulrich Drepper, though I'm not quite sure) argued
that 32-bit time_t could be used indefinitely by redefining the epoch
forward by 2^32 seconds every often; which would require intervals of
circa 68 years in which time_t was seen as a negative offset from a
future epoch date, rather than an unsigned offset from a past date.
Now, I thought he was nuts then and I still think that 32-bit hardware
will be ancient history by 2038 ... but there may be systems that do it
like that.  glibc hates ABI breakage.

2. Putting an enormously old date on a file that was just created will
greatly confuse onlookers, some of whom (such as backup or antivirus
daemons) might not react pleasantly.

Between #1 and #2, it's clearly worth the extra one or two lines of
code to set the file dates to, say, "time(NULL) - 1", rather than
assuming that zero is good enough.

3. I wonder about the portability of utime(2).  I see that we are using
it to cause updates of socket and lock file times, but our expectations
for it there are rock-bottom low.  I think that failing the edit command
if utime() fails is an overreaction even with optimistic assumptions about
its reliability.  Doing that makes things strictly worse than simply not
doing anything, because 99% of the time this refinement is unnecessary.

In short, I think the relevant code ought to be more like

else
{
struct utimbuf ut;

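/* Backdate the file so that any subsequent save is sure to advance its mtime */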
ut.modtime = ut.actime = time(NULL) - 1;
(void) utime(fname, &ut);
}

(plus some comments of course)

regards, tom lane

PS: I seem to recall that some Microsoft filesystems have 2-second
resolution on file mod times, so maybe it needs to be "time(NULL) - 2"?




Logical Replication - improve error message while adding tables to the publication in check_publication_add_relation

2021-03-09 Thread Bharath Rupireddy
Hi,

While providing thoughts on [1], I observed that the error messages
that are emitted while adding foreign, temporary and unlogged tables
can be improved a bit from the existing [2] to [3]. For instance, the
existing message emitted when one tries to add a foreign table to a
publication ("f1" is not a table) looks odd, because it says that the
foreign table is not a table at all.

Attaching a small patch. Thoughts?

[1] - 
https://www.postgresql.org/message-id/CALj2ACWAxO3vSToT0o5nXL%3Drz5cNx90zaV-at%3DcvM14Tag4%3DcQ%40mail.gmail.com
[2] - t1 is a temporary table:
postgres=# CREATE PUBLICATION testpub FOR TABLE t1;
ERROR:  table "t1" cannot be replicated
DETAIL:  Temporary and unlogged relations cannot be replicated.

t1 is an unlogged table:
postgres=# CREATE PUBLICATION testpub FOR TABLE t1;
ERROR:  table "t1" cannot be replicated
DETAIL:  Temporary and unlogged relations cannot be replicated.

f1 is a foreign table:
postgres=# CREATE PUBLICATION testpub FOR TABLE f1;
ERROR:  "f1" is not a table
DETAIL:  Only tables can be added to publications.

[3] - t1 is a temporary table:
postgres=# CREATE PUBLICATION testpub FOR TABLE t1;
ERROR:  temporary table "t1" cannot be replicated
DETAIL:  Temporary, unlogged and foreign relations cannot be replicated.

t1 is an unlogged table:
postgres=# CREATE PUBLICATION testpub FOR TABLE t1;
ERROR:  unlogged table "t1" cannot be replicated
DETAIL:  Temporary, unlogged and foreign relations cannot be replicated.

f1 is a foreign table:
postgres=# CREATE PUBLICATION testpub FOR TABLE f1;
ERROR:  foreign table "f1" cannot be replicated
DETAIL:  Temporary, unlogged and foreign relations cannot be replicated.
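
For clarity, the change amounts to picking a relkind-specific message inside
check_publication_add_relation(). A rough sketch of the foreign-table branch
(illustrative only, not the exact patch text):

if (targetrel->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
	ereport(ERROR,
			(errcode(ERRCODE_WRONG_OBJECT_TYPE),
			 errmsg("foreign table \"%s\" cannot be replicated",
					RelationGetRelationName(targetrel)),
			 errdetail("Temporary, unlogged and foreign relations cannot be replicated.")));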

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


v1-0001-Improve-error-message-while-adding-tables-to-publ.patch
Description: Binary data


Re: About to add WAL write/fsync statistics to pg_stat_wal view

2021-03-09 Thread Masahiro Ikeda

On 2021-03-09 17:51, Fujii Masao wrote:

On 2021/03/05 8:38, Masahiro Ikeda wrote:

On 2021-03-05 01:02, Fujii Masao wrote:

On 2021/03/04 16:14, Masahiro Ikeda wrote:

On 2021-03-03 20:27, Masahiro Ikeda wrote:

On 2021-03-03 16:30, Fujii Masao wrote:

On 2021/03/03 14:33, Masahiro Ikeda wrote:

On 2021-02-24 16:14, Fujii Masao wrote:

On 2021/02/15 11:59, Masahiro Ikeda wrote:

On 2021-02-10 00:51, David G. Johnston wrote:

On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
 wrote:


I pgindented the patches.


... XLogWrite, which is invoked during an
XLogFlush request (see ...).  This is 
also

incremented by the WAL receiver during replication.

("which normally called" should be "which is normally called" 
or
"which normally is called" if you want to keep true to the 
original)
You missed the adding the space before an opening parenthesis 
here and

elsewhere (probably copy-paste)

is ether -> is either
"This parameter is off by default as it will repeatedly query 
the

operating system..."
", because" -> "as"


Thanks, I fixed them.

wal_write_time and the sync items also need the note: "This is 
also

incremented by the WAL receiver during replication."


I skipped changing it since I separated the stats for the WAL 
receiver

in pg_stat_wal_receiver.

"The number of times it happened..." -> " (the tally of this 
event is
reported in wal_buffers_full in) This is undesirable 
because ..."


Thanks, I fixed it.

I notice that the patch for WAL receiver doesn't require 
explicitly
computing the sync statistics but does require computing the 
write
statistics.  This is because of the presence of 
issue_xlog_fsync but
absence of an equivalent pg_xlog_pwrite.  Additionally, I 
observe that
the XLogWrite code path calls pgstat_report_wait_*() while the 
WAL
receiver path does not.  It seems technically straight-forward 
to
refactor here to avoid the almost-duplicated logic in the two 
places,
though I suspect there may be a trade-off for not adding 
another
function call to the stack given the importance of WAL 
processing
(though that seems marginalized compared to the cost of 
actually
writing the WAL).  Or, as Fujii noted, go the other way and 
don't have
any shared code between the two but instead implement the WAL 
receiver

one to use pg_stat_wal_receiver instead.  In either case, this
half-and-half implementation seems undesirable.


OK, as Fujii-san mentioned, I separated the WAL receiver stats.
(v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)


Thanks for updating the patches!


I added the infrastructure code to communicate the WAL receiver 
stats messages between the WAL receiver and the stats 
collector, and

the stats for WAL receiver is counted in pg_stat_wal_receiver.
What do you think?


On second thought, this idea seems not good, because those stats are
collected across multiple walreceivers, while the other values in
pg_stat_wal_receiver are only related to the walreceiver process running
at that moment. IOW, it seems strange that some values show dynamic
stats and the others show collected stats, even though they are in
the same view pg_stat_wal_receiver. Thoughts?


OK, I fixed it.
The stats collected in the WAL receiver is exposed in pg_stat_wal 
view in v11 patch.


Thanks for updating the patches! I'm now reading 001 patch.

+    /* Check whether the WAL file was synced to disk right now */
+    if (enableFsync &&
+    (sync_method == SYNC_METHOD_FSYNC ||
+ sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
+ sync_method == SYNC_METHOD_FDATASYNC))
+    {

Isn't it better to make issue_xlog_fsync() return immediately
if enableFsync is off or sync_method is open_sync or open_datasync,
to simplify the code more?


Thanks for the comments.
I added the above code in v12 patch.



+    /*
+ * Send WAL statistics only if WalWriterDelay has elapsed 
to minimize

+ * the overhead in WAL-writing.
+ */
+    if (rc & WL_TIMEOUT)
+    pgstat_send_wal();

On second thought, this change means that it always takes wal_writer_delay
before the walwriter's WAL stats are sent after XLogBackgroundFlush()
is called. For example, if wal_writer_delay is set to several seconds, some
values in pg_stat_wal would be meaninglessly out of date for those seconds.
So I'm thinking of withdrawing my previous comment; it's OK to send
the stats every time XLogBackgroundFlush() is called. Thoughts?


Thanks, I didn't notice that.

Although PGSTAT_STAT_INTERVAL is 500 msec, wal_writer_delay's
default value is 200 msec, and it may be set to a shorter time.


Yeah, if wal_writer_delay is set to very small value, there is a risk
that the WAL stats are sent too frequently. I agree that's a problem.



Why don't we check the timestamp in another way?

+   /*
+    * Don't send a message unless it's been at least
PGSTAT_STAT_INTERVAL
+    * msec since we last sent one
+    */
+   now = 

RE: libpq debug log

2021-03-09 Thread iwata....@fujitsu.com
Hi all,

Following all the reviewers' advice, I have created a new patch.

In this patch, I add only two tracing entry points; I call
pqTraceOutputMsg(PGconn *conn, int msgCursor, PGCommSource commsource) in
pqParseInput3() and pqPutMsgEnd() to output the log.
The argument contains the message's first-byte offset, called msgCursor,
because it is simpler to specify the buffer pointer in the caller.

In pqTraceOutputMsg(), the content common to all protocol messages (the
timestamp, < or >, the first byte, and the length) is output first, and after
that each protocol message's contents are output. I referred to
pqParseInput3(), fe-exec.c and the documentation page for that part.

This fix almost eliminates the if (conn->Pfdebug) checks that were built into
the code for other features.

Regards,
Aya Iwata

> -Original Message-
> From: alvhe...@alvh.no-ip.org 
> Sent: Friday, March 5, 2021 10:41 PM
> To: Tsunakawa, Takayuki/綱川 貴之 
> Cc: 'Tom Lane' ; Iwata, Aya/岩田 彩
> ; Jamison, Kirk/ジャミソン カーク
> ; 'Kyotaro Horiguchi' ;
> pgsql-hackers@lists.postgresql.org
> Subject: Re: libpq debug log
> 
> On 2021-Mar-05, tsunakawa.ta...@fujitsu.com wrote:
> 
> > From: Tom Lane 
> > > But I think passing the message start address explicitly might be
> > > better than having it understand the buffering behavior in enough
> > > detail to know where to find the message.  Part of the point here
> > > (IMO) is to decouple the tracing logic from the core libpq logic, in
> > > hopes of not having common-mode bugs.
> >
> > Ouch, you're perfectly right.  Then let's make the signature:
> >
> > void pqLogMessage(PGconn *conn, const char *message, bool
> > toServer);
> 
> Yeah, looks good!  I agree that going this route will result in more 
> trustworthy
> trace output.
> 
> --
> Álvaro Herrera39°49'30"S 73°17'W


v24-libpq-trace-log.patch
Description: v24-libpq-trace-log.patch


Re: pgbench: option delaying queries till connections establishment?

2021-03-09 Thread Thomas Munro
On Mon, Mar 8, 2021 at 3:18 PM Thomas Munro  wrote:
> David Rowley kindly tested this for me on Windows and told me how to
> fix one of the macros that had incorrect error checking on that OS.
> So here's a new version.  I'm planning to commit 0001 and 0002 soon,
> if there are no objections.  0003 needs some more review.

I made a few mostly cosmetic changes, pgindented and pushed all these patches.




Re: New Table Access Methods for Multi and Single Inserts

2021-03-09 Thread Bharath Rupireddy
On Tue, Mar 9, 2021 at 1:45 PM Bharath Rupireddy
 wrote:
> On Mon, Mar 8, 2021 at 6:37 PM Dilip Kumar  wrote:
>>
> > Why do we need to invent a new version table_insert_v2?  And also why
> > it is named table_insert* instead of table_tuple_insert*?
>
> New version, because we changed the input parameters, now passing the
> params via TableInsertState but existing table_tuple_insert doesn't do
> that. If okay, I can change table_insert_v2  to table_tuple_insert_v2?
> Thoughts?

Changed table_insert_v2 to table_tuple_insert_v2. And also, rebased
the patches on to the latest master.

Attaching the v4 patch set. Please review it further.
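
To illustrate the intended call pattern for the single-insert path, here is a
rough caller-side sketch (the table_insert_begin()/table_insert_end() wrapper
names below are just shorthand for this sketch; only table_tuple_insert_v2 is
the name settled on above):

TableInsertState *istate;

/* set up the insert state once for the target relation */
istate = table_insert_begin(rel, GetCurrentCommandId(true),
							0 /* options */ , false /* is_multi */ );

/* insert a single slot; the insertion parameters travel inside the state */
table_tuple_insert_v2(istate, slot);

/* clean up the state (for multi inserts this is also where buffered
 * tuples would be flushed) */
table_insert_end(istate);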

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
From 6518212583e24b017375512701d9fefa6de20e42 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Wed, 10 Mar 2021 09:53:48 +0530
Subject: [PATCH v4 1/3] New Table AMs for Multi and Single Inserts

This patch introduces new table access methods for multi and
single inserts. Also implements/rearranges the outside code for
heap am into these new APIs.

The main design goal of these new APIs is to give table AM developers
the flexibility to implement multi-insert logic suited to the
underlying storage engine. Currently, for all the underlying storage
engines, we follow the same multi-insert logic (when and how to flush
the buffered tuples, tuple size calculation, and so on), and this
logic doesn't take the underlying storage engine's capabilities into
account.

We can also avoid duplicating multi insert code (for existing COPY,
and upcoming CTAS, CREATE/REFRESH MAT VIEW and INSERT SELECTs). We
can also move bulk insert state allocation and deallocation inside
these APIs.
---
 src/backend/access/heap/heapam.c | 212 +++
 src/backend/access/heap/heapam_handler.c |   5 +
 src/backend/access/table/tableamapi.c|   7 +
 src/backend/executor/execTuples.c|  83 -
 src/include/access/heapam.h  |  49 +-
 src/include/access/tableam.h |  87 ++
 src/include/executor/tuptable.h  |   1 +
 7 files changed, 438 insertions(+), 6 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 3b435c107d..d8bfe17f22 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -67,6 +67,7 @@
 #include "utils/datum.h"
 #include "utils/inval.h"
 #include "utils/lsyscache.h"
+#include "utils/memutils.h"
 #include "utils/relcache.h"
 #include "utils/snapmgr.h"
 #include "utils/spccache.h"
@@ -2669,6 +2670,217 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 	pgstat_count_heap_insert(relation, ntuples);
 }
 
+/*
+ * heap_insert_begin - allocate and initialize TableInsertState
+ *
+ * For single inserts:
+ *  1) Specify is_multi as false, then multi insert state will be NULL.
+ *
+ * For multi inserts:
+ *  1) Specify is_multi as true, then multi insert state will be allocated and
+ * 	   initialized.
+ *
+ *  Other input parameters i.e. relation, command id, options are common for
+ *  both single and multi inserts.
+ */
+TableInsertState*
+heap_insert_begin(Relation rel, CommandId cid, int options, bool is_multi)
+{
+	TableInsertState *state;
+
+	state = palloc(sizeof(TableInsertState));
+	state->rel = rel;
+	state->cid = cid;
+	state->options = options;
+	/* Below parameters are not used for single inserts. */
+	state->mi_slots = NULL;
+	state->mistate = NULL;
+	state->mi_cur_slots = 0;
+	state->flushed = false;
+
+	if (is_multi)
+	{
+		HeapMultiInsertState *mistate;
+
+		mistate = palloc(sizeof(HeapMultiInsertState));
+		state->mi_slots =
+palloc0(sizeof(TupleTableSlot *) * MAX_BUFFERED_TUPLES);
+		mistate->max_slots = MAX_BUFFERED_TUPLES;
+		mistate->max_size = MAX_BUFFERED_BYTES;
+		mistate->cur_size = 0;
+		/*
+		 * Create a temporary memory context so that we can reset once per
+		 * multi insert batch.
+		 */
+		mistate->context = AllocSetContextCreate(CurrentMemoryContext,
+ "heap_multi_insert",
+ ALLOCSET_DEFAULT_SIZES);
+		state->mistate = mistate;
+	}
+
+	return state;
+}
+
+/*
+ * heap_insert_v2 - insert single tuple into a heap
+ *
+ * Insert tuple from slot into table. This is like heap_insert(), the only
+ * difference is that the parameters for insertion are inside table insert
+ * state structure.
+ */
+void
+heap_insert_v2(TableInsertState *state, TupleTableSlot *slot)
+{
+	bool		shouldFree = true;
+	HeapTuple	tuple = ExecFetchSlotHeapTuple(slot, true, &shouldFree);
+
+	Assert(state);
+
+	/* Update tuple with table oid */
+	slot->tts_tableOid = RelationGetRelid(state->rel);
+	tuple->t_tableOid = slot->tts_tableOid;
+
+	/* Perform insertion, and copy the resulting ItemPointer */
+	heap_insert(state->rel, tuple, state->cid, state->options, state->bistate);
+	ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
+
+	if (shouldFree)
+		pfree(tuple);
+}
+
+/*
+ * heap_multi_insert_v2 - insert multiple tuples into a heap
+ *
+ * Compute size 

Re: a verbose option for autovacuum

2021-03-09 Thread Masahiko Sawada
On Tue, Mar 9, 2021 at 12:58 AM Euler Taveira  wrote:
>
> On Mon, Mar 8, 2021, at 2:32 AM, Masahiko Sawada wrote:
>
> * Proposed idea
> LOG:  automatic vacuum of table "postgres.public.test": index scans: 1
> pages: 0 removed, 443 remain, 0 skipped due to pins, 0 skipped frozen
> tuples: 1000 removed, 99000 remain, 0 are dead but not yet removable,
> oldest xmin: 545
> indexes: "postgres.public.test_idx1" 276 pages, 0 newly deleted, 0
> currently deleted, 0 reusable.
> "postgres.public.test_idx2" 300 pages, 10 newly deleted, 0 currently
> deleted, 3 reusable.
> "postgres.public.test_idx2" 310 pages, 4 newly deleted, 0 currently
> deleted, 0 reusable.
>
> Instead of using "indexes:" and add a list of indexes (one on each line), it
> would be more parse-friendly if it prints one index per line using 'index
> "postgres.public.idxname" 123 pages, 45 newly deleted, 67 currently deleted, 8
> reusable.'.

Agreed.

Attached a patch. I've slightly modified the format for consistency
with heap statistics.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


index_stats_log.patch
Description: Binary data


Re: Removing vacuum_cleanup_index_scale_factor

2021-03-09 Thread Peter Geoghegan
On Mon, Mar 8, 2021 at 10:21 PM Masahiko Sawada  wrote:
> Thank you for the patches. I looked at 0001 patch and have a comment:
>
> +* We don't report to the stats collector here because the stats collector
> +* only tracks per-table stats.  Reset the changes_since_analyze counter
> +* only if we analyzed all columns; otherwise, there is still work for
> +* auto-analyze to do.
>
> I think the comment becomes clearer if we add "if doing inherited
> stats" at top of the above paragraph since we actually report to the
> stats collector in !inh case.

I messed the comment up. Oops. Fixed now.

> > Also included is a patch that removes the
> > vacuum_cleanup_index_scale_factor mechanism for triggering an index
> > scan during VACUUM -- that's what the second patch does (this depends
> > on the first patch, really).
>
> 0002 patch looks good to me.

Great.

The attached revision has a bit more polish. It includes new commit
messages which explain what we're really trying to fix here. I also
included backpatchable versions for Postgres 13 -- that's the other
significant change compared to the last version.

My current plan is to commit everything within the next day or two.
This includes backpatching to Postgres 13 only. I am now leaning
against doing anything in Postgres 11 and 12, for the closely related
btm_last_cleanup_num_heap_tuples VACUUM accuracy issue. There have
been no complaints from users using Postgres 11 or 12, so I'll leave
them alone. (Sorry for changing my mind again and again.)

To be clear: I plan on disabling (though not removing) the
vacuum_cleanup_index_scale_factor GUC and storage parameter on
Postgres 13, even though that is a stable release. This approach is
unorthodox, but it has a kind of a precedent -- the recheck_on_update
storage param was disabled on the Postgres 11 branch by commit
5d28c9bd. More importantly, it just happens to make sense, given the
specifics here.

--
Peter Geoghegan


v2-0002-VACUUM-ANALYZE-Always-update-pg_class.reltuples.patch
Description: Binary data


REL_13_STABLE-v2-0002-VACUUM-ANALYZE-Always-update-pg_class.reltuples.patch
Description: Binary data


v2-0001-Don-t-consider-newly-inserted-tuples-in-nbtree-VA.patch
Description: Binary data


REL_13_STABLE-v2-0001-Don-t-consider-newly-inserted-tuples-in-nbtree-VA.patch
Description: Binary data


Re: Add some tests for pg_stat_statements compatibility verification under contrib

2021-03-09 Thread Julien Rouhaud
Hi Erica,

On Wed, Mar 10, 2021 at 11:14:52AM +0800, Erica Zhang wrote:
> Hi Julien,
> Thanks a lot for the quick review. Please see my answer below in blue. 
> Attached is the new patch.

Thanks!

>> The upgrade scripts are already tested as postgres will install 1.4 and 
>> perform
>> all upgrades to reach the default version.
> Thanks for pointing out that the upgrade paths are covered by the upgrade
> script tests. Since I don't need to test the upgrade, I will test the
> installation of different versions directly; any concern?

I think you should keep your previous approach.  The result will be the same
but it will consume less resources for that which is always good.

>> +SELECT * FROM pg_available_extensions WHERE name = 'pg_stat_statements' and 
>> installed_version = '1.4';
>> 
>> 
>> What is this supposed to test? All those tests will break every time 
>> we change
>> the default version, which will add maintenance efforts. It could be 
>> good to
>> have one test breaking when changing the version to remind us to add a test 
>> for
>> the new version, but not more.
> Here I just want to verify that the installed version is the expected
> version. But, as you mentioned, that approach will add maintenance effort.
>
> So I prefer to keep one such test, which can remind us to add a new version.
> As for the others, just check count(*) to make sure the installation
> succeeded, such as: SELECT count(*) FROM pg_available_extensions WHERE name =
> 'pg_stat_statements' and installed_version = '1.4'; What do you think?

How about tweaking your previous query so only the last execution fails when
pg_stat_statements default version is updated?  Something like:

SELECT installed_version = default_version, installed_version
FROM pg_available_extensions
WHERE name = 'pg_stat_statements';

This way the same query can be reused for both older versions and current
version.

Also, can you register your patch for the next commitfest at
https://commitfest.postgresql.org/33/, to make sure it won't be forgotten?




RE: Avoid CommandCounterIncrement in RI trigger when INSERT INTO referencing table

2021-03-09 Thread houzj.f...@fujitsu.com
> Attaching the first version of the patch, which avoids CCI in the RI trigger
> when inserting into the referencing table.

After thinking some more about how to support parallel insert into an FK
relation, it seems we do not have a cheap way to implement this feature.
Please see the explanation below:

In RI_FKey_check(), PostgreSQL currently executes a "SELECT xx FOR KEY SHARE"
query to check that the foreign key exists in the PK table.
However, "SELECT FOR UPDATE/SHARE" is considered parallel unsafe. It may be
dangerous to do this in parallel mode, so we may want to change this.

Also, "SELECT FOR UPDATE/SHARE" is considered "not read only", which will
force readonly = false in _SPI_execute_plan().
So, it seems we would have to completely change the implementation of
RI_FKey_check().

At the same time, the "simplifying foreign key/RI checks" thread is trying to
replace "SELECT xx FOR KEY SHARE" with index_beginscan() + table_tuple_lock()
(which I think is parallel safe).
Maybe we can try to support parallel insert into FK relations after the
"simplifying foreign key/RI checks" patch is applied?

Best regards,
houzj





Re: Columns correlation and adaptive query optimization

2021-03-09 Thread Tomas Vondra
Hello Konstantin,


Sorry for not responding to this thread earlier. I definitely agree the
features proposed here are very interesting and useful, and I appreciate
you kept rebasing the patch.

I think the patch improving join estimates can be treated as separate,
and I see it already has a separate CF entry - it however still points
to this thread, which will be confusing. I suggest we start a different
thread for it, to keep the discussions separate.

I'll focus on the auto_explain part here.


I did have some ideas about adaptive query optimization too, although
maybe in a slightly different form. My plan was to collect information
about estimated / actual cardinalities, and then use this knowledge to
directly tweak the estimates. Directly, without creating extended stats,
but treat the collected info about estimates / row counts as a kind of
ad hoc statistics. (Not sure if this is what the AQE extension does.)


What is being proposed here - an extension suggesting which statistics
to create (and possibly creating them automatically) is certainly
useful, but I'm not sure I'd call it "adaptive query optimization". I
think "adaptive" means the extension directly modifies the estimates
based on past executions. So I propose calling it maybe "statistics
advisor" or something like that.


A couple additional points:

1) I think we should create a new extension for this.

auto_explain has a fairly well defined purpose, I don't think this is
consistent with it. It's quite likely it'll require stuff like shared
memory, etc. which auto_explain does not (and should not) need.

Let's call it statistics_advisor, or something like that. It will use
about the same planner/executor callbacks as auto_explain, but that's
fine I think.


2) I'm not sure creating statistics automatically based on a single
query execution is a good idea. I think we'll need to collect data from
multiple runs (in shared memory), and do suggestions based on that.


3) I wonder if it should also consider duration of the query (who cares
about estimates if it still executed in 10ms)? Similarly, it probably
should require some minimal number of rows (1 vs. 10 rows is likely
different from 1M vs. 10M rows, but both is 10x difference).


4) Ideally it'd evaluate impact of the improved estimates on the whole
query plan (you may fix one node, but the cost difference for the whole
query may be negligible). But that seems very hard/expensive :-(


5) I think AddMultiColumnStatisticsForQual() needs refactoring - it
mixes stuff at many different levels of abstraction (generating names,
deciding which statistics to build, ...). I think it'll also need some
improvements to better identify which Vars to consider for statistics,
and once we get support for statistics on expressions committed (which
seems to be fairly close now) also to handle expressions.

BTW Why is "qual" in

  static void
  AddMultiColumnStatisticsForQual(void* qual, ExplainState *es)

declared as "void *"? Shouldn't that be "List *"?


6) I'm not sure about automatically creating the stats. I can't imagine
anyone actually enabling that on production, TBH (I myself probably
would not do that). I suggest we instead provide an easy way to show
which statistics are suggested.

For one execution that might be integrated into EXPLAIN ANALYZE, I guess
(through some callback, which seems fairly easy to do).

For many executions (you can leave it running for a couple of days, then
see what the suggestion is based on X runs) we could have a view or
something. This would also work for read-only replicas, where just
creating the statistics is impossible.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Boundary value check in lazy_tid_reaped()

2021-03-09 Thread Masahiko Sawada
On Tue, Mar 9, 2021 at 9:57 AM Masahiko Sawada  wrote:
>
> On Mon, Mar 8, 2021 at 7:16 PM Peter Eisentraut
>  wrote:
> >
> > On 21.01.21 14:11, Masahiko Sawada wrote:
> > > Agreed. bsearch with bound check showed a reasonable improvement in my
> > > evaluation in terms of performance. Regarding memory efficiency, we
> > > can experiment with other methods later.
> > >
> > > I've attached the patch that adds a bound check for encoded
> > > itempointers before bsearch() in lazy_tid_reaped() and inlines the
> > > function.
> >
> > Do you have any data showing the effect of inlining lazy_tid_reaped()?
> > I mean, it probably won't hurt, but it wasn't part of the original patch
> > that you tested, so I wonder whether it has any noticeable effect.
>
> I've done some benchmarks while changing the distribution of where
> dead tuples exist within the table. The table size is 4GB and 20% of
> total tuples are dirty. Here are the results of index vacuum execution
> time:
>
> 1. Updated evenly the table (every block has at least one dead tuple).
> master  : 8.15
> inlining  : 4.84
> not-inlining  : 5.01
>
> 2. Updated the middle of the table.
> master  : 8.71
> inlining  : 3.51
> not-inlining  : 3.58
>
> 3. Updated both the beginning and the tail of the table.
> master  : 8.44
> inlining  : 3.46
> not-inlining  : 3.50
>
> There is no noticeable effect of inlining lazy_tid_reaped(). So it
> would be better to not do that.

Attached the patch that doesn't inline lazy_tid_reaped().
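
For anyone skimming the thread, the idea is simply to reject keys that fall
outside the min/max of the sorted dead-tuple array before paying for a
bsearch(). A small standalone illustration (not the patch itself):

#include <stdio.h>
#include <stdlib.h>

/* comparator for bsearch() over an array of long long keys */
static int
cmp_int64(const void *a, const void *b)
{
	long long	x = *(const long long *) a;
	long long	y = *(const long long *) b;

	return (x > y) - (x < y);
}

/*
 * Return 1 if key is present in the sorted array.  The cheap range check
 * skips the bsearch() entirely for keys outside [arr[0], arr[n-1]]; in the
 * patch the array holds the encoded dead-tuple TIDs collected by VACUUM.
 */
static int
key_is_present(long long key, const long long *arr, size_t n)
{
	if (n == 0 || key < arr[0] || key > arr[n - 1])
		return 0;
	return bsearch(&key, arr, n, sizeof(long long), cmp_int64) != NULL;
}

int
main(void)
{
	long long	dead[] = {3, 7, 9, 15, 42};

	printf("%d %d\n", key_is_present(9, dead, 5), key_is_present(100, dead, 5));
	return 0;
}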

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


bound_check_lazy_vacuum_noinline.patch
Description: Binary data


Re: [HACKERS] logical decoding of two-phase transactions

2021-03-09 Thread Ajin Cherian
I ran a 5-instance cascaded pub/sub setup on the latest patch set, which starts
pgbench on the first server and waits till the data on the fifth server
matches the first.
This is based on a test script created by Erik Rijkers. The tests run fine
and the 5th server achieves data consistency in around a minute.
Attaching the test script and the test run log.

regards,
Ajin Cherian
Fujitsu Australia
#!/bin/bash

#  I compile postgres server versions into dir:
#  $HOME/pg_stuff/pg_installations/pgsql.$project   ( where project is a name )
#
#  This script assumes that projects HEAD and head0 are present
#  $HOME/pg_stuff/pg_installations/pgsql.HEAD   --> git master as of today - friday 12 febr 2021
#  $HOME/pg_stuff/pg_installations/pgsql.head0  --> 3063eb17593c  so that's 11 febr, before the replication changes
#
#  variables 'project' and 'BIN' reflect my set up (so should probably be changed)
#

#  my instance HEAD has the bug: it keeps looping with 'NOK' (=Not OK): primary not identical to all replicas
#  my instance head0 is ok - finishes in 20 s - with the replicas identical to primary
#
#  Erik Rijkers


unset PGPORT PGSERVICE
export PGDATABASE=postgres

#if   [[ "$1" == "HEAD"  ]] ; then project=HEAD   # new HEAD = 12 febr
#elif [[ "$1" == "head0" ]] ; then project=head0  # older: 3063eb17593c = 11 febr
#else
# echo "arg1 missing -  "
# exit
#fi


#root_dir=/tmp/cascade/$project
project=two_phase
root_dir=/home/ajin/twophasetest/mar09

mkdir -p $root_dir

#BIN=$HOME/pg_stuff/pg_installations/pgsql.$project/bin
BIN=/home/ajin/install-oss/bin

export PATH=$BIN:$PATH

  initdb=$BIN/initdb
postgres=$BIN/postgres
  pg_ctl=$BIN/pg_ctl

baseport=6525
   port1=$(( $baseport + 0 )) 
   port2=$(( $baseport + 1 ))
   port3=$(( $baseport + 2 ))
   port4=$(( $baseport + 3 ))
   port5=$(( $baseport + 4 ))
 appname=$project

num_instances=5   # or 2

scale=1   #  pgbench -s
  #clients=64  #  pgbench -c
  clients=1  #  pgbench -c
 duration=1   #  how long to run: pgbench -T $duration 
 wait=1

if [[ -d $root_dir/instance1 ]]; then rm -rf $root_dir/instance1; fi
if [[ -d $root_dir/instance2 ]]; then rm -rf $root_dir/instance2; fi
if [[ -d $root_dir/instance3 ]]; then rm -rf $root_dir/instance3; fi
if [[ -d $root_dir/instance4 ]]; then rm -rf $root_dir/instance4; fi
if [[ -d $root_dir/instance5 ]]; then rm -rf $root_dir/instance5; fi
if [[ -d $root_dir/instance1 ]]; then exit ; fi
if [[ -d $root_dir/instance2 ]]; then exit ; fi
if [[ -d $root_dir/instance3 ]]; then exit ; fi
if [[ -d $root_dir/instance4 ]]; then exit ; fi
if [[ -d $root_dir/instance5 ]]; then exit ; fi

# devel_file=/tmp/bugs
# echo filterbug>$devel_file

for n in `seq 1 $num_instances`
do
  instance=instance$n
  server_dir=$root_dir/$instance
  data_dir=$server_dir/data
  port=$(( $baseport + $n -1 ))
  logfile=$server_dir/logfile.$port
# echo "-- $initdb --pgdata=$data_dir --encoding=UTF8 --pwfile=$devel_file  #  $port"
#  $initdb --pgdata=$data_dir --encoding=UTF8 --pwfile=$devel_file  &> /dev/null
  echo "-- $initdb --pgdata=$data_dir --encoding=UTF8   #  $port"
   $initdb --pgdata=$data_dir --encoding=UTF8   &> /dev/null

  ( $postgres  -D $data_dir -p $port \
--wal_level=logical --logging_collector=on \
--client_min_messages=warning \
--log_directory=$server_dir --log_filename=logfile.${port} \
--log_replication_commands=on & ) &> /dev/null
done 

# dbsize=$(echo " $scale * 10 " | bc )

# sleep 1

echo "
  drop table if exists pgbench_accounts;
  drop table if exists pgbench_branches;
  drop table if exists pgbench_tellers;
  drop table if exists pgbench_history;" | psql -qXp $port1 \
&& echo "
  drop table if exists pgbench_accounts;
  drop table if exists pgbench_branches;
  drop table if exists pgbench_tellers;
  drop table if exists pgbench_history;" | psql -qXp $port2 \
&& echo "
  drop table if exists pgbench_accounts;
  drop table if exists pgbench_branches;
  drop table if exists pgbench_tellers;
  drop table if exists pgbench_history;" | psql -qXp $port3 \
&& echo "
  drop table if exists pgbench_accounts;
  drop table if exists pgbench_branches;
  drop table if exists pgbench_tellers;
  drop table if exists pgbench_history;" | psql -qXp $port4 \
&& pgbench -p $port1 -qis $scale \
&& echo "alter table pgbench_history add column hid serial primary key;" \
  | psql -q1Xp $port1 && pg_dump -F c -p $port1 \
 --exclude-table-data=pgbench_history  \
 --exclude-table-data=pgbench_accounts \
 --exclude-table-data=pgbench_branches \
 --exclude-table-data=pgbench_tellers  \
   -t pgbench_history -t pgbench_accounts \
   -t pgbench_branches -t pgbench_tellers \
  | pg_restore -1 -p $port2 -d postgres

if [[ $num_instances -eq 5 ]]
then
  pg_dump -F c -p $port1 \
 --exclude-table-data=pgbench_history  \
 --exclude-table-data=pgbench_accounts \
 --exclude-table-data=pgbench_branches \
 

Re: [patch] [doc] Minor variable related cleanup and rewording of plpgsql docs

2021-03-09 Thread David Steele

On 3/9/21 1:08 PM, Tom Lane wrote:

David Steele  writes:

1) PL/SQL seems to be used in a few places where I believe PL/pgSQL is
meant. This was pre-existing but now seems like a good opportunity to
fix it, unless I am misunderstanding.


PL/SQL is Oracle's function language, which PL/pgSQL is modeled on.
At least some of the mentions of PL/SQL are probably intentional,
so you'll have to look closely not just search-and-replace.


Ah, yes. That's what I get for just reading the patch and not looking at 
the larger context.


Regards,
--
-David
da...@pgmasters.net




Re: partial heap only tuples

2021-03-09 Thread Bruce Momjian
On Tue, Mar  9, 2021 at 09:33:31PM +, Bossart, Nathan wrote:
> I'm cautiously optimistic that index creation and deletion will not
> require too much extra work.  For example, if a new index needs to
> point to a partial heap only tuple, it can do so (unlike HOT, which
> would require that the new index point to the root of the chain).  The
> modified-columns bitmaps could include the entire set of modified
> columns (not just the indexed ones), so no additional changes would
> need to be made there.  Furthermore, I'm anticipating that the
> modified-columns bitmaps will end up only being used with the
> redirected LPs to help reduce heap bloat after single-page vacuuming.
> In that case, new indexes would probably avoid the existing bitmaps
> anyway.

Yes, that would probably work, sure.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Huge memory consumption on partitioned table with FKs

2021-03-09 Thread Tom Lane
Amit Langote  writes:
> On Fri, Mar 5, 2021 at 6:00 AM Tom Lane  wrote:
>> This claim seems false on its face:
>>> All child constraints of a given foreign key constraint can use the
>>> same RI query and the resulting plan, that is, no need to create as
>>> many copies of the query and the plan as there are partitions, as
>>> happens now due to the child constraint OID being used in the key
>>> for ri_query_cache.

> The quoted comment could have been written to be clearer about this,
> but it's not talking about the table that is to be queried, but the
> table whose RI trigger is being executed.  In all the cases except one
> (mentioned below), the table that is queried is the same irrespective
> of which partition's trigger is being executed, so we're basically
> creating the same plan separately for each partition.

Hmm.  So, the key point is that the values coming from the partitioned
child table are injected into the test query as parameters, not as
column references, thus it doesn't matter *to the test query* what
numbers the referencing columns have in that child.  We just have to
be sure we pass the right parameter values.  But ... doesn't the code
use riinfo->fk_attnums[] to pull out the values to be passed?

IOW, I now get the point about being able to share the SPI plans,
but I'm still dubious about sharing the RI_ConstraintInfo cache entries.

It looks to me like the v4 patch is now actually not sharing the
cache entries, ie their hash key is just the child constraint OID
same as before; but the comments are pretty confused about this.

It might be simpler if you add just one new field which is the OID of
the constraint that we're building the SPI query from, which might be
either equal to constraint_id, or the OID of some parent constraint.
In particular it's not clear to me why we need both constraint_parent
and constraint_root_id.
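
Roughly what I have in mind (the second field name is invented here, just for
illustration):

typedef struct RI_ConstraintInfo
{
	Oid			constraint_id;	/* hash key: this (child) constraint's OID */
	Oid			query_constraint_id;	/* constraint the SPI query is built
										 * from: either constraint_id itself,
										 * or the OID of an ancestor */
	/* ... rest of the existing fields ... */
} RI_ConstraintInfo;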

regards, tom lane




Re: POC: GROUP BY optimization

2021-03-09 Thread Tomas Vondra
Hi,

I take a look at the patch today. Attached is the v12 and a separate
patch with some comment tweaks and review comments.


1) I see the patch results in some plan changes in postgres_fdw. I
assume it's somehow related to the sort costing changes, but I find it a
bit suspicious. Why should the plan change from this:

  Foreign Scan
Remote SQL: ... ORDER BY ... OFFSET 100 LIMIT 10;

to

  Limit
Foreign Scan
  Remote SQL: ... ORDER BY ...

But even if this is due to changes to order by costing, postponing the
limit seems like a net loss - the sort happens before the limit, and by
postponing the limit after foreign scan we need to transfer 100 more
rows. So what's causing this?

The difference in plan cost seems fairly substantial (old: 196.52, new:
241.94). But maybe that's OK and expected.


2) I wonder how much more expensive (in terms of CPU) is the new sort
costing code. It's iterative, and it's calling functions to calculate
number of groups etc. I wonder what's the impact on simple queries?


3) It's not clear to me why we need "fake var" protection? Why would the
estimate_num_groups() fail for that?


4) In one of the places a comment says DEFAULT_NUM_DISTINCT is not used
because we're dealing with multiple columns. That makes sense, but maybe
we should use DEFAULT_NUM_DISTINCT at least to limit the range. That is,
with N columns we should restrict the nGroups estimate by:

min = DEFAULT_NUM_DISTINCT
max = Min(pow(DEFAULT_NUM_DISTINCT, ncolumns), tuples)

Also, it's not clear what's the logic behind the formula:

nGroups = ceil(2.0 + sqrt(tuples) *
   list_length(pathkeyExprs) / list_length(pathkeys));

A comment explaining that would be helpful.
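
To be concrete, something along these lines is what I mean (untested, and
the variable names are simply reused from the snippet above, so treat it
as pseudo-code):

nGroups = ceil(2.0 + sqrt(tuples) *
               list_length(pathkeyExprs) / list_length(pathkeys));

/* never assume fewer groups than the default ndistinct estimate ... */
nGroups = Max(nGroups, DEFAULT_NUM_DISTINCT);

/* ... and never more than the multi-column cap or the number of rows */
nGroups = Min(nGroups,
              Min(pow(DEFAULT_NUM_DISTINCT, list_length(pathkeyExprs)),
                  tuples));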


5) The compute_cpu_sort_cost comment includes two references (in a quite
mixed-up way), and then there's another reference in the code. I suggest
we list all of them in the function comment.


6) There's a bunch of magic values (various multipliers etc.). It's not
clear which values are meant to be equal, etc. Let's introduce nicer
constants with reasonable names.


7) The comment at get_cheapest_group_keys_order is a bit misleading,
because it talks about "uniqueness" - but that's not what we're looking
at here (I mean, is the column unique or not). We're looking at number
of distinct values in the column, which is a different thing. Also, it'd
be good to roughly explain what get_cheapest_group_keys_order does - how
it calculates the order (permutations up to 4 values, etc.).


8) The comment at compute_cpu_sort_cost should also explain basics of
the algorithm. I tried doing something like that, but I'm not sure I got
all the details right and it probably needs further improvements.


9) The main concern I have is still about the changes in planner.c, and
I think I've already shared it before. The code takes the grouping
pathkeys, and just reshuffles them to what it believes is a better order
(cheaper, ...). That is, there's one input pathkey list, and it gets
transformed into another pathkey list. The problem is that even if this
optimizes the sort, it may work against some optimization in the upper
part of the plan.

Imagine a query that does something like this:

   SELECT a, b, count(*) FROM (
  SELECT DISTINCT a, b, c, d FROM x
   ) GROUP BY a, b;

Now, from the costing perspective, it may be cheaper to do the inner
grouping by "c, d, a, b" for example. But that works directly against
the second grouping, which would benefit from having the input sorted by
"a, b". How exactly would this behaves depends on the number of distinct
values in the columns, how expensive the comparisons are for each
column, and so on. But you get the idea.

I admit I haven't managed to construct a reasonable query that'd be
significantly harmed by this, but I haven't spent a lot of time on it.

I'm pretty sure I used this trick in the past, when doing aggregations
on large data sets, where it was much better to "pipe" the data through
multiple GroupAggregate nodes.

For this reason I think the code generating paths should work a bit like
get_useful_pathkeys_for_relation and generate_useful_gather_paths.

* group_keys_reorder_by_pathkeys should not produce just one list of
  pathkeys, but multiple interesting options. At the moment it should
  produce at least the input grouping pathkeys (as specified in the
  query), and the "optimal" pathkeys (by lowest cost).

* the places calling group_keys_reorder_by_pathkeys should loop on the
  result, and generate separate path for each option.

I'd guess in the future we'll "peek forward" in the plan and see if
there are other interesting pathkeys (that's the expectation for
get_useful_pathkeys_for_relation).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
>From 0d8a7ddc441a3b26a498f6d3d4805fe5d305ebfc Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Tue, 9 Mar 2021 18:44:23 +0100
Subject: [PATCH 1/2] v12

---
 

Re: Lowering the ever-growing heap->pd_lower

2021-03-09 Thread Peter Geoghegan
On Tue, Mar 9, 2021 at 1:54 PM Peter Geoghegan  wrote:
> It occurs to me that we should also mark the hole in the middle of the
> page (which includes the would-be LP_UNUSED line pointers at the end
> of the original line pointer array space) as undefined to Valgrind
> within PageRepairFragmentation().

It would probably also make sense to memset() the space in question to
a sequence of 0x7F bytes in CLOBBER_FREED_MEMORY builds. That isn't
quite as good as what Valgrind will do in some ways, but it has a
major advantage: it will usually visibly break code where the
PageRepairFragmentation() calls made by VACUUM happen to take place
inside another backend.

Valgrind instrumentation of PageRepairFragmentation() along the lines
I've described won't recognize the "hole in the middle of the page"
area as undefined when it was marked undefined in another backend.
It's as if shared memory is private to each process as far as the
memory poisoning/undefined-to-Valgrind stuff is concerned. In other
words, it deals with memory mappings, not memory.
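
To make this concrete, I'm imagining something like the following at the
end of PageRepairFragmentation(), once the tuples have been compacted (a
rough sketch only -- I haven't checked the exact bounds against the code,
and the variable names are made up):

PageHeader ph = (PageHeader) page;
char      *gapstart = (char *) page + ph->pd_lower;
Size       gaplen = ph->pd_upper - ph->pd_lower;

#ifdef CLOBBER_FREED_MEMORY
/* make stale line pointers and tuple data visibly bogus */
memset(gapstart, 0x7F, gaplen);
#endif

/* tell Valgrind the hole is allocated but holds no meaningful values */
VALGRIND_MAKE_MEM_UNDEFINED(gapstart, gaplen);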

-- 
Peter Geoghegan




Re: Optimising latch signals

2021-03-09 Thread Thomas Munro
On Tue, Mar 9, 2021 at 1:09 PM Thomas Munro  wrote:
> On Tue, Mar 9, 2021 at 12:20 PM Alvaro Herrera  
> wrote:
> > Hi, I don't know if you realized but we have two new Illumos members
> > now (haddock and hake), and they're both failing initdb on signalfd()
> > problems.

> I'll wait a short time while he tries to see if that can be fixed (I

They're green now.  For the record: https://www.illumos.org/issues/13613




Re: WIP: BRIN multi-range indexes

2021-03-09 Thread Tomas Vondra



On 3/9/21 9:51 PM, John Naylor wrote:
> 
> On Sun, Mar 7, 2021 at 8:53 PM Tomas Vondra
> <tomas.von...@enterprisedb.com> wrote:
> [v20210308b]
> 
> I managed to trap an assertion that somehow doesn't happen during the
> regression tests. The callers of fill_expanded_ranges() do math like this:
> 
> /* both ranges and points are expanded into a separate element */
> neranges = ranges->nranges + ranges->nvalues;
> 
> but inside fill_expanded_ranges() we have this assertion:
> 
> /* Check that the output array has the right size. */
> Assert(neranges == (2 * ranges->nranges + ranges->nvalues));
> 

Yeah, that assert is bogus. It's calculating the number of boundary
values (and ranges have two), but in ExpandedRanges we still
represent each range as a single element. So the Assert should be just

Assert(neranges == (ranges->nranges + ranges->nvalues));

But maybe it's just an overkill and we don't really need it.

> In the regression test data, a debugging elog() shows that nranges is
> most often zero, so in that case, the math happens to be right either
> way. I can reliably get nranges above zero by running
> 
> update brintest_multi set int8col = int8col - 1;
> 
> a few times, at which point I get the crash.
>

Hmm, so maybe we should do something like this in regression tests too?
It's not good that we don't trigger the "nranges > 0" case at all.

> 
> Aside from that, the new changes look good. Just a couple small things:
> 
> +    allowed parameters.  Only the bloom operator class
> +    allows specifying parameters:
> 
> minmax-multi aren't mentioned here, but are mentioned further down.
> 

I forgot to add this bit. Will fix.

> + * Addresses from different families are consider to be in maximum
> 
> (comment above brin_minmax_multi_distance_inet)
> s/consider/considered/
> 

Will fix.

>> > 2) moving minmax/inclusion changes from 0002 to a separate patch 0003
>> >
>> > I think we should either ditch the 0003 (i.e. keep the existing
>> > opclasses unchanged) or commit 0003 (in which case I'd vote to just stop
>> > supporting the old signature of the consistent function).
>> >
>>
>> Still not sure what to do about this. I'm leaning towards keeping 0003
>> and just removing the "old" signature entirely, to keep the API cleaner.
>> It might cause some breakage in out-of-core BRIN opclasses, but that
>> seems like a reasonable price. Moreover, the opclasses may need some
>> updating anyway, because of the changes in handling NULL scan keys (0004
>> moves that from the opclass to the bringetbitmap function).
> 
> Keeping 0003 seems reasonable, given the above.
> 

And do you agree with removing the old signature entirely? That might
break some out-of-core opclasses, but we're not aware of any, and
they'll be broken anyway. Seems fine to me.

>> > The remaining part that didn't get much review is the very last patch,
>> > adding an option to ignore correlation for some BRIN opclases. This is
>> > needed as the regular BRIN costing is quite sensitive to correlation,
>> > and the cost gets way too high for poorly correlated data, making it
>> > unlikely the index will be used. But handling such data sets efficiently
>> > is the main point of those new opclasses. Any opinions on this?
>> >
>>
>> Not sure about this.
> 
> I hadn't given it much thought (nor tested), but I just took a look at
> brincostestimate(). If the table is badly correlated, I'm thinking the
> number of ranges that need to be scanned will increase regardless. Maybe
> rather than ignoring correlation, we could clamp it or otherwise tweak
> it. Not sure about the details, though, that would require some testing.
> 

Well, maybe. In any case we need to do something about this, otherwise
the new opclasses won't be used even in cases where it's perfectly OK.
And it needs to be opclass-dependent, in some way.

I'm pretty sure even the simple examples you've used to test
minmax-multi (with updating a fraction of tuples to low/high value)
would be affected by this.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Lowering the ever-growing heap->pd_lower

2021-03-09 Thread Tom Lane
Peter Geoghegan  writes:
> It occurs to me that we should also mark the hole in the middle of the
> page (which includes the would-be LP_UNUSED line pointers at the end
> of the original line pointer array space) as undefined to Valgrind
> within PageRepairFragmentation().

+1

regards, tom lane




Re: Lowering the ever-growing heap->pd_lower

2021-03-09 Thread Tom Lane
Mark Dilger  writes:
>> On Mar 9, 2021, at 1:35 PM, Tom Lane  wrote:
>> So, to accept a patch that shortens the line pointer array, what we need
>> to do is verify that every such code path checks for an out-of-range
>> offset before trying to fetch the target line pointer.

> Much as Pavan asked [1], I'm curious how we wouldn't already be in trouble if 
> such code exists?  In such a scenario, what stops a dead line pointer from 
> being reused (rather than garbage collected by this patch) prior to such 
> hypothetical code using an outdated TID?

The line pointer very well *could* be re-used before the in-flight
reference gets to it.  That's okay though, because whatever tuple now
occupies the TID would have to have xmin too new to match the snapshot
that such a reference is scanning with.

(Back when we had non-MVCC snapshots to contend with, a bunch of
additional arm-waving was needed to argue that such situations were
safe.  Possibly the proposed change wouldn't have flown back then.)

regards, tom lane




Re: Lowering the ever-growing heap->pd_lower

2021-03-09 Thread Peter Geoghegan
On Tue, Mar 9, 2021 at 1:36 PM Tom Lane  wrote:
>
> Matthias van de Meent  writes:
> > The only two existing mechanisms that I could find (in the access/heap
> > directory) that possibly could fail on shrunken line pointer arrays;
> > being xlog recovery (I do not have enough knowledge on recovery to
> > determine if that may touch pages that have shrunken line pointer
> > arrays, or if those situations won't exist due to never using dirtied
> > pages in recovery) and backwards table scans on non-MVCC snapshots
> > (which would be fixed in the attached patch).
>
> I think you're not visualizing the problem properly.
>
> The case I was concerned about back when is that there are various bits of
> code that may visit a page with a predetermined TID in mind to look at.
> An index lookup is an obvious example, and another one is chasing an
> update chain's t_ctid link.  You might argue that if the tuple was dead
> enough to be removed, there should be no such in-flight references to
> worry about, but I think such an assumption is unsafe.  There is not, for
> example, any interlock that ensures that a process that has obtained a TID
> from an index will have completed its heap visit before a VACUUM that
> subsequently removed that index entry also removes the heap tuple.
>
> So, to accept a patch that shortens the line pointer array, what we need
> to do is verify that every such code path checks for an out-of-range
> offset before trying to fetch the target line pointer.  I believed
> back in 2007 that there were, or once had been, code paths that omitted
> such a range check, assuming that they could trust the TID they had
> gotten from $wherever to point at an extant line pointer array entry.
> Maybe things are all good now, but I think you should run around and
> examine every place that checks for tuple deadness to see if the offset
> it used is known to be within the current page bounds.

It occurs to me that we should also mark the hole in the middle of the
page (which includes the would-be LP_UNUSED line pointers at the end
of the original line pointer array space) as undefined to Valgrind
within PageRepairFragmentation(). This is not to be confused with
marking them inaccessible to Valgrind, which just poisons the bytes.
Rather, it represents that the bytes in question are considered safe
to copy around but not safe to rely on being any particular value. So
Valgrind will complain if the bytes in question influence control
flow, directly or indirectly.

Obviously the code should also be audited. Even then, there may still
be bugs. I think that we need to bite the bullet here -- line pointer
bloat is a significant problem.

-- 
Peter Geoghegan




Re: Lowering the ever-growing heap->pd_lower

2021-03-09 Thread Mark Dilger



> On Mar 9, 2021, at 1:35 PM, Tom Lane  wrote:
> 
> So, to accept a patch that shortens the line pointer array, what we need
> to do is verify that every such code path checks for an out-of-range
> offset before trying to fetch the target line pointer.  I believed
> back in 2007 that there were, or once had been, code paths that omitted
> such a range check, assuming that they could trust the TID they had
> gotten from $wherever to point at an extant line pointer array entry.
> Maybe things are all good now, but I think you should run around and
> examine every place that checks for tuple deadness to see if the offset
> it used is known to be within the current page bounds.

Much as Pavan asked [1], I'm curious how we wouldn't already be in trouble if 
such code exists?  In such a scenario, what stops a dead line pointer from 
being reused (rather than garbage collected by this patch) prior to such 
hypothetical code using an outdated TID?

I'm not expressing a view here, just asking questions.

[1] 
https://www.postgresql.org/message-id/2e78013d0709130832t31244e79k9488a3e4eb00d64c%40mail.gmail.com

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Lowering the ever-growing heap->pd_lower

2021-03-09 Thread Tom Lane
Matthias van de Meent  writes:
> The only two existing mechanisms that I could find (in the access/heap
> directory) that possibly could fail on shrunken line pointer arrays;
> being xlog recovery (I do not have enough knowledge on recovery to
> determine if that may touch pages that have shrunken line pointer
> arrays, or if those situations won't exist due to never using dirtied
> pages in recovery) and backwards table scans on non-MVCC snapshots
> (which would be fixed in the attached patch).

I think you're not visualizing the problem properly.

The case I was concerned about back when is that there are various bits of
code that may visit a page with a predetermined TID in mind to look at.
An index lookup is an obvious example, and another one is chasing an
update chain's t_ctid link.  You might argue that if the tuple was dead
enough to be removed, there should be no such in-flight references to
worry about, but I think such an assumption is unsafe.  There is not, for
example, any interlock that ensures that a process that has obtained a TID
from an index will have completed its heap visit before a VACUUM that
subsequently removed that index entry also removes the heap tuple.

So, to accept a patch that shortens the line pointer array, what we need
to do is verify that every such code path checks for an out-of-range
offset before trying to fetch the target line pointer.  I believed
back in 2007 that there were, or once had been, code paths that omitted
such a range check, assuming that they could trust the TID they had
gotten from $wherever to point at an extant line pointer array entry.
Maybe things are all good now, but I think you should run around and
examine every place that checks for tuple deadness to see if the offset
it used is known to be within the current page bounds.
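
To be explicit about what I mean by a range check: any code arriving at a
heap page with a TID it got from somewhere else (an index entry, a t_ctid
link, etc) ought to be doing something like this before touching the item
(illustrative only; the variable names are whatever the caller has at hand):

offnum = ItemPointerGetOffsetNumber(tid);
if (offnum < FirstOffsetNumber || offnum > PageGetMaxOffsetNumber(page))
    return false;   /* line pointer array may have shrunk since we got the TID */

lp = PageGetItemId(page, offnum);
if (!ItemIdIsNormal(lp))
    return false;   /* slot is unused, dead, or redirected */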

regards, tom lane




Re: partial heap only tuples

2021-03-09 Thread Bossart, Nathan
On 3/9/21, 8:24 AM, "Bruce Momjian"  wrote:
> On Mon, Feb 15, 2021 at 08:19:40PM +, Bossart, Nathan wrote:
>> Yeah, this is something I'm concerned about.  I think adding a bitmap
>> of modified columns to the header of PHOT-updated tuples improves
>> matters quite a bit, even for single-page vacuuming.  Following is a
>> strategy I've been developing (there may still be some gaps).  Here's
>> a basic PHOT chain where all tuples are visible and the last one has
>> not been deleted or updated:
>>
>> idx1    0   1   2   3
>> idx2    0   1   2
>> idx3    0   2   3
>> lp  1   2   3   4   5
>> tuple   (0,0,0) (0,1,1) (2,2,1) (2,2,2) (3,2,3)
>> bitmap  -xx xx- --x x-x
>
> First, I want to continue encouraging you to work on this because I
> think it can yield big improvements.  Second, I like the wiki you
> created.  Third, the diagram above seems to be more meaningful if read
> from the bottom-up.  I suggest you reorder it on the wiki so it can be
> read top-down, maybe:
>
>> lp  1   2   3   4   5
>> tuple   (0,0,0) (0,1,1) (2,2,1) (2,2,2) (3,2,3)
>> bitmap  -xx xx- --x x-x
>> idx1    0   1   2   3
>> idx2    0   1   2
>> idx3    0   2   3

I appreciate the feedback and the words of encouragement.  I'll go
ahead and flip the diagrams like you suggested.  I'm planning on
publishing a larger round of edits to the wiki once the patch set is
ready to share.  There are a few changes to the design that I've
picked up along the way.

> Fourth, I know in the wiki you said create/drop index needs more
> research, but I suggest you avoid any design that will be overly complex
> for create/drop index.  For example, a per-row bitmap that is based on
> what indexes exist at time of row creation might cause unacceptable
> problems in handling create/drop index.  Would you number indexes?  I am
> not saying you have to solve all the problems now, but you have to keep
> your eye on obstacles that might block your progress later.

I am agreed on avoiding an overly complex design.  This project
introduces a certain amount of inherent complexity, so one of my main
goals is ensuring that it's easy to reason about each piece.

I'm cautiously optimistic that index creation and deletion will not
require too much extra work.  For example, if a new index needs to
point to a partial heap only tuple, it can do so (unlike HOT, which
would require that the new index point to the root of the chain).  The
modified-columns bitmaps could include the entire set of modified
columns (not just the indexed ones), so no additional changes would
need to be made there.  Furthermore, I'm anticipating that the
modified-columns bitmaps will end up only being used with the
redirected LPs to help reduce heap bloat after single-page vacuuming.
In that case, new indexes would probably avoid the existing bitmaps
anyway.

Nathan



Re: WIP: BRIN multi-range indexes

2021-03-09 Thread John Naylor
On Sun, Mar 7, 2021 at 8:53 PM Tomas Vondra 
wrote:
[v20210308b]

I managed to trap an assertion that somehow doesn't happen during the
regression tests. The callers of fill_expanded_ranges() do math like this:

/* both ranges and points are expanded into a separate element */
neranges = ranges->nranges + ranges->nvalues;

but inside fill_expanded_ranges() we have this assertion:

/* Check that the output array has the right size. */
Assert(neranges == (2 * ranges->nranges + ranges->nvalues));

In the regression test data, a debugging elog() shows that nranges is most
often zero, so in that case, the math happens to be right either way. I can
reliably get nranges above zero by running

update brintest_multi set int8col = int8col - 1;

a few times, at which point I get the crash.


Aside from that, the new changes look good. Just a couple small things:

+allowed parameters.  Only the bloom operator class
+allows specifying parameters:

minmax-multi aren't mentioned here, but are mentioned further down.

+ * Addresses from different families are consider to be in maximum

(comment above brin_minmax_multi_distance_inet)
s/consider/considered/

> > 2) moving minmax/inclusion changes from 0002 to a separate patch 0003
> >
> > I think we should either ditch the 0003 (i.e. keep the existing
> > opclasses unchanged) or commit 0003 (in which case I'd vote to just stop
> > supporting the old signature of the consistent function).
> >
>
> Still not sure what to do about this. I'm leaning towards keeping 0003
> and just removing the "old" signature entirely, to keep the API cleaner.
> It might cause some breakage in out-of-core BRIN opclasses, but that
> seems like a reasonable price. Moreover, the opclasses may need some
> updating anyway, because of the changes in handling NULL scan keys (0004
> moves that from the opclass to the bringetbitmap function).

Keeping 0003 seems reasonable, given the above.

> > The remaining part that didn't get much review is the very last patch,
> > adding an option to ignore correlation for some BRIN opclases. This is
> > needed as the regular BRIN costing is quite sensitive to correlation,
> > and the cost gets way too high for poorly correlated data, making it
> > unlikely the index will be used. But handling such data sets efficiently
> > is the main point of those new opclasses. Any opinions on this?
> >
>
> Not sure about this.

I hadn't given it much thought (nor tested), but I just took a look at
brincostestimate(). If the table is badly correlated, I'm thinking the
number of ranges that need to be scanned will increase regardless. Maybe
rather than ignoring correlation, we could clamp it or otherwise tweak it.
Not sure about the details, though, that would require some testing.
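
For illustration, the sort of clamp I have in mind is no more than this
(a toy helper -- the names, the opclass flag, and the 0.5 floor are all
invented, and I haven't tested how it interacts with the rest of
brincostestimate()):

static double
brin_effective_correlation(double varCorrelation, bool ignore_correlation)
{
    if (ignore_correlation)
        return 1.0;             /* behave as if perfectly correlated */

    /* otherwise just keep the estimate away from zero */
    return Max(fabs(varCorrelation), 0.5);
}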

--
John Naylor
EDB: http://www.enterprisedb.com


Re: New IndexAM API controlling index vacuum strategies

2021-03-09 Thread Peter Geoghegan
On Mon, Mar 8, 2021 at 7:34 PM Peter Geoghegan  wrote:
> > One possible
> > consequence that I'm concerned about is sequential scan performance.
> > For an index scan, you just jump to the line pointer you want and then
> > go get the tuple, but a sequential scan has to loop over all the line
> > pointers on the page, and skipping a lot of dead ones can't be
> > completely free. A small increase in MaxHeapTuplesPerPage probably
> > wouldn't matter, but the proposed increase of almost 10x (291 -> 2042)
> > is a bit scary.
>
> I agree. Maybe the real problem here is that MaxHeapTuplesPerPage is a
> generic constant. Perhaps it should be something that can vary by
> table, according to practical table-level considerations such as
> projected tuple width given the "shape" of tuples for that table, etc.

Speaking of line pointer bloat (and "irreversible" bloat), I came
across something relevant today. I believe that this recent patch from
Matthias van de Meent is a relatively easy way to improve the
situation:

https://www.postgresql.org/message-id/flat/CAEze2WjgaQc55Y5f5CQd3L%3DeS5CZcff2Obxp%3DO6pto8-f0hC4w%40mail.gmail.com

-- 
Peter Geoghegan




Re: Lowering the ever-growing heap->pd_lower

2021-03-09 Thread Peter Geoghegan
On Tue, Mar 9, 2021 at 7:13 AM Matthias van de Meent
 wrote:
> The shrinking of the line pointer array is already common practice in
> indexes (in which all LP_UNUSED items are removed), but this specific
> implementation cannot be used for heap pages due to ItemId
> invalidation. One available implementation, however, is that we
> truncate the end of this array, as mentioned in [1]. There was a
> warning at the top of PageRepairFragmentation about not removing
> unused line pointers, but I believe that was about not removing
> _intermediate_ unused line pointers (which would imply moving in-use
> line pointers); as far as I know there is nothing that relies on only
> growing page->pd_lower, and nothing keeping us from shrinking it
> whilst holding a pin on the page.

Sounds like a good idea to me.

-- 
Peter Geoghegan




Re: pg_upgrade failing for 200+ million Large Objects

2021-03-09 Thread Justin Pryzby
On Wed, Mar 03, 2021 at 11:36:26AM +, Tharakan, Robins wrote:
> While reviewing a failed upgrade from Postgres v9.5 (to v9.6) I saw that the
> instance had ~200 million (in-use) Large Objects. I was able to reproduce
> this on a test instance which too fails with a similar error.

If pg_upgrade can't handle millions of objects/transactions/XIDs, that seems
like a legitimate complaint, since apparently the system is working okay
otherwise.

But it also seems like you're using it outside the range of its intended use
(See also [1]).  I'm guessing that not many people are going to spend time
running tests of pg_upgrade, each of which takes 25hr, not to mention some
multiple of 128GB RAM+swap.

Creating millions of large objects was too slow for me to test like this:
| time { echo 'begin;'; for a in `seq 1 9`; do echo '\lo_import /dev/null'; done; echo 'commit;'; } |psql -qh /tmp postgres&

This seems to be enough for what's needed:
| ALTER SYSTEM SET fsync=no; ALTER SYSTEM SET full_page_writes=no; SELECT pg_reload_conf();
| INSERT INTO pg_largeobject_metadata SELECT a, 0 FROM generate_series(10, 200111222)a;

Now, testing the pg_upgrade was killed after runnning 100min and using 60GB
RAM, so you might say that's a problem too.  I converted getBlobs() to use a
cursor, like dumpBlobs(), but it was still killed.  I think a test case and a
way to exercizes this failure with a more reasonable amount of time and
resources might be a prerequisite for a patch to fix it.

pg_upgrade is meant for "immediate" upgrades, frequently allowing upgrade in
minutes, where pg_dump |pg_restore might take hours or days.  There's two
components to consider: the catalog/metadata part, and the data part.  If the
data is large (let's say more than 100GB), then pg_upgrade is expected to be an
improvement over the "dump and restore" process, which is usually infeasible
for large DBs measured in TB.  There's two

But the *catalog* part is large, and pg_upgrade still has to run pg_dump, and
pg_restore.  The time to do this might dominate over the data part.  Our own
customers DBs are 100s of GB to 10TB.  For large customers, pg_upgrade takes
45min.  In the past, we had tables with many column defaults, which caused the
dump+restore to be slow at a larger fraction of customers.

If it were me, in an EOL situation, I would look at either: 1) find a way to do
dump+restore rather than pg_upgrade; and/or, 2) separately pg_dump the large
objects, drop as many as you can, then pg_upgrade the DB, then restore the
large objects.  (And find a better way to store them in the future).

I was able to hack pg_upgrade to call pg_restore --single (with a separate
invocation to handle --create).  That passes tests...but I can't say much
beyond that.

Regarding your existing patch: "make check" only tests SQL features.
For development, you'll want to configure like:
|./configure --enable-debug --enable-cassert --enable-tap-tests
And then use "make check-world", and in particular:
time make check -C src/bin/pg_resetwal
time make check -C src/bin/pg_upgrade

I don't think pg_restore needs a user-facing option for XIDs.  I think it
should "just work", since a user might be as likely to shoot themselves in the
foot with a commandline option as they are to make an upgrade succeed that
would otherwise fail.  pg_upgrade has a --check mode, and if that passes, the
upgrade is intended to work, and not fail halfway through between the schema
dump and restore, with the expectation that the user know to rerun with some
commandline flags.  If you pursue the patch with setting a different XID
threshold, maybe you could count the number of objects to be created, or
transactions to be used, and use that as the argument to resetxlog ?  I'm not
sure, but pg_restore -l might be a good place to start looking.

I think a goal for this patch should be to allow an increased number of
objects to be handled by pg_upgrade.  Large objects may be a special case, and
increasing the number of other objects to be restored to the 100s of millions
might be unimportant.

-- 
Justin

[1] https://www.postgresql.org/message-id/502641.1606334432%40sss.pgh.pa.us
| Does pg_dump really have sane performance for that situation, or
| are we soon going to be fielding requests to make it not be O(N^2)
| in the number of listed tables?
>From cfc7400bb021659d49170e8b17d067c8e1b9fa33 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Tue, 9 Mar 2021 14:06:17 -0600
Subject: [PATCH] pg_dump: use a cursor in getBlobs..

..to mitigate huge memory use in the case of millions of large objects
---
 src/bin/pg_dump/pg_dump.c | 96 +--
 1 file changed, 52 insertions(+), 44 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index aa02ada079..3fd7f48605 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -,7 +,7 @@ getBlobs(Archive *fout)
 	BlobInfo   *binfo;
 	DumpableObject *bdata;
 	PGresult   *res;
-	int			ntups;

Re: [patch] [doc] Minor variable related cleanup and rewording of plpgsql docs

2021-03-09 Thread David G. Johnston
On Tue, Mar 9, 2021 at 10:45 AM Pavel Stehule 
wrote:

>
>
> On Tue, Mar 9, 2021 at 18:03, David Steele wrote:
>
>> On 11/30/20 10:37 AM, Pavel Stehule wrote:
>> > po 30. 11. 2020 v 16:06 odesílatel David G. Johnston
>> >
>> > ok
>> This patch looks reasonable to me overall.
>>
>> A few comments:
>>
>> 1) PL/SQL seems to be used in a few places where I believe PL/pgSQL is
>> meant. This was pre-existing but now seems like a good opportunity to
>> fix it, unless I am misunderstanding.
>>
>
> +1
>

I vaguely recall looking for this back in October and not finding anything
that needed fixing in the area I was working in.  The ready-for-commit can
stand without further investigation.  Feel free to look for and fix
oversights of this nature if you feel they exist.


>
>> 2) I think:
>>
>> + makes the command behave like SELECT, which is
>> described
>>
>> flows a little better as:
>>
>> + makes the command behave like SELECT, as
>> described
>>
>
> I am not a native speaker, so I am not able to evaluate it.
>

"which is described" is perfectly valid.  I don't know that "as described"
is materially better from a flow perspective (I agree it reads a tiny bit
better) but either seems to adequately communicate the intended point so I
wouldn't gripe if it was changed during commit.

I intend to leave the patch as-is though, since as written it is
committable; this second comment is just style and the first is scope creep.

David J.


Re: Allow batched insert during cross-partition updates

2021-03-09 Thread Georgios Kokolatos
Hi,

thanks for the patch. I had a first look and played around with the code.

The code seems clean, complete, and does what it says on the tin. I will
need a bit more time to acclimatise with all the use cases for a more
thorough review.

A small question, though: why expose PartitionTupleRouting rather than add
a couple of functions to get the necessary info? If I have read the code
correctly, the only members that actually need to be exposed are num_partitions
and partitions. Not a critique, I am just curious.
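
For example, something along these lines (just a sketch; the function names
and exact return types are invented) would seem to be enough, without moving
the struct definition out of execPartition.c:

int
ExecPartitionRoutingNumPartitions(PartitionTupleRouting *proute)
{
    return proute->num_partitions;
}

ResultRelInfo **
ExecPartitionRoutingPartitions(PartitionTupleRouting *proute)
{
    return proute->partitions;
}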

Cheers,
//Georgios

Re: Procedures versus the "fastpath" API

2021-03-09 Thread Joe Conway
On 3/9/21 2:15 PM, Tom Lane wrote:
> So the question on the table is what to do about this.  As far as
> window functions go, it seems clear that fastpath.c should just reject
> any attempt to call a window function that way (or an aggregate for
> that matter; aggregates fail already, but with relatively obscure
> error messages).  Perhaps there's also an argument that window
> functions should have run-time tests, not just assertions, that
> they're called in a valid way.
> 
> As for procedures, I'm of the opinion that we should just reject those
> too, but some other security team members were not happy with that
> idea.  Conceivably we could attempt to make the case actually work,
> but is it worth the trouble?  Given the lack of field complaints
> about the "invalid transaction termination" failure, it seems unlikely
> that it's occurred to anyone to try to call procedures this way.
> We'd need special infrastructure to test the case, too, since psql
> offers no access to fastpath calls.

+1

> A compromise suggestion was to prohibit calling procedures via
> fastpath as of HEAD, but leave existing releases as they are,
> in case anyone is using a case that happens to work.
> 
> Thoughts?

My vote would be to reject using fastpath for procedures in all relevant branches.
If someday someone cares enough to make it work, it is a new feature for a new
major release.

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development




Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-09 Thread Joel Jacobson
On Tue, Mar 9, 2021, at 17:42, Tom Lane wrote:
> "Joel Jacobson"  writes:
> > Tom - can you please give details on your unpleasant experiences with 
> > parallel arrays?
> 
> The problems I can recall running into were basically down to not having
> an easy way to iterate through parallel arrays.  There are ways to do
> that in SQL, certainly, but they all constrain how you write the query,
> and usually force ugly stuff like splitting it into sub-selects.

I see now what you mean, many thanks for explaining.

> 
> As an example, presuming that regexp_positions is defined along the
> lines of
> 
> regexp_positions(str text, pat text, out starts int[], out lengths int[])
> returns setof record

+1

I think this is the most feasible best option so far.

Attached is a patch implementing it this way.

I changed the start to begin at 1, since this is how position ( substring text 
IN string text ) → integer works.

SELECT * FROM regexp_positions('foobarbequebaz', '^', 'g');
 starts | lengths
--------+---------
 {1}    | {0}
(1 row)

SELECT * FROM regexp_positions('foobarbequebaz', 'ba.', 'g');
 starts | lengths
--------+---------
 {4}    | {3}
 {12}   | {3}
(2 rows)

Mark's examples:

SELECT * FROM regexp_positions('foObARbEqUEbAz', $re$(?=beque)$re$, 'i');
 starts | lengths
--------+---------
 {7}    | {0}
(1 row)


SELECT * FROM regexp_positions('foobarbequebaz', '(?<=z)', 'g');
 starts | lengths
--------+---------
 {15}   | {0}
(1 row)

I've also tested your template queries:

> 
> then to actually get the identified substrings you'd have to do something
> like
> 
> select
>   substring([input string] from starts[i] for lengths[i])
> from
>   regexp_positions([input string], [pattern]) r,
>   lateral
> generate_series(1, array_length(starts, 1)) i;

select
  substring('foobarbequebaz' from starts[i] for lengths[i])
from
  regexp_positions('foobarbequebaz', 'ba.', 'g') r,
  lateral
generate_series(1, array_length(starts, 1)) i;

 substring
-----------
 bar
 baz
(2 rows)

> I think the last time I confronted this, we didn't have multi-array
> UNNEST.  Now that we do, we can get rid of the generate_series(),
> but it's still not beautiful:
> 
> select
>   substring([input string] from s for l)
> from
>   regexp_positions([input string], [pattern]) r,
>   lateral
> unnest(starts, lengths) u(s,l);

select
  substring('foobarbequebaz' from s for l)
from
  regexp_positions('foobarbequebaz', 'ba.', 'g') r,
  lateral
unnest(starts, lengths) u(s,l);

 substring
-----------
 bar
 baz
(2 rows)

> Having said that, the other alternative with a 2-D array:
> 
> regexp_positions(str text, pat text) returns setof int[]
> 
> seems to still need UNNEST, though now it's not the magic multi-array
> UNNEST but this slicing version:
> 
> select
>   substring([input string] from u[1] for u[2])
> from
>   regexp_positions([input string], [pattern]) r,
>   lateral
> unnest_slice(r, 1) u;

Unable to test this one since there is no unnest_slice() (yet)

> 
> Anyway, I'd counsel trying to write out SQL implementations
> of regexp_matches() and other useful things based on any
> particular regexp_positions() API you might be thinking about.
> Can we do anything useful without a LATERAL UNNEST thingie?
> Are some of them more legible than others?

Hmm, I cannot think of a way.


/Joel

0004-regexp-positions.patch
Description: Binary data


Re: Lowering the ever-growing heap->pd_lower

2021-03-09 Thread Matthias van de Meent
On Tue, 9 Mar 2021 at 17:21, Mark Dilger  wrote:
>
> For a prior discussion on this topic:
>
> https://www.postgresql.org/message-id/2e78013d0709130606l56539755wb9dbe17225ffe90a%40mail.gmail.com

Thanks for the reference! I note that that thread mentions the
old-style VACUUM FULL as a reason why this would be unsafe, but
old-style VACUUM FULL was removed quite a few versions ago (in v9.0).

The only two existing mechanisms that I could find (in the access/heap
directory) that could possibly fail on shrunken line pointer arrays
are xlog recovery (I do not have enough knowledge of recovery to
determine whether it may touch pages that have shrunken line pointer
arrays, or whether those situations won't exist due to never using
dirtied pages in recovery) and backwards table scans on non-MVCC
snapshots (which would be fixed in the attached patch).

With regards,

Matthias van de Meent
From ef3db93a172ceb2bcab5516336462117d3c99fd3 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent 
Date: Tue, 9 Mar 2021 14:42:52 +0100
Subject: [PATCH v2] Truncate a pages' line pointer array when it has trailing
 unused ItemIds.

This will allow reuse of what is effectively free space for data as well as
new line pointers, instead of keeping it reserved for line pointers only.

An additional benefit is that the HasFreeLinePointers hint-bit optimization
now doesn't hint for free line pointers at the end of the array, slightly
increasing the specificity of where the free lines are; and saving us from
needing to search to the end of the array if all other entries are already
filled.
---
 src/backend/access/heap/heapam.c   |  9 -
 src/backend/storage/page/bufpage.c | 13 -
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 3b435c107d..ae218cfac6 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -635,8 +635,15 @@ heapgettup(HeapScanDesc scan,
 		}
 		else
 		{
+			/* 
+			 * The previous returned tuple may have been vacuumed since the
+			 * previous scan when we use a non-MVCC snapshot, so we must
+			 * re-establish the lineoff <= PageGetMaxOffsetNumber(dp)
+			 * invariant
+			 */
 			lineoff =			/* previous offnum */
-OffsetNumberPrev(ItemPointerGetOffsetNumber(&(tuple->t_self)));
+Min(lines,
+	OffsetNumberPrev(ItemPointerGetOffsetNumber(&(tuple->t_self))));
 		}
 		/* page and lineoff now reference the physically previous tid */
 
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 9ac556b4ae..10d8f26ad0 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -672,7 +672,11 @@ compactify_tuples(itemIdCompact itemidbase, int nitems, Page page, bool presorte
  * PageRepairFragmentation
  *
  * Frees fragmented space on a page.
- * It doesn't remove unused line pointers! Please don't change this.
+ * It doesn't remove intermediate unused line pointers (that would mean
+ * moving ItemIds, and that would imply invalidating indexed values), but it
+ * does truncate the page->pd_linp array to the last unused line pointer, so
+ * that this space may also be reused for data, instead of only for line
+ * pointers.
  *
  * This routine is usable for heap pages only, but see PageIndexMultiDelete.
  *
@@ -691,6 +695,7 @@ PageRepairFragmentation(Page page)
 	int			nline,
 nstorage,
 nunused;
+	OffsetNumber lastUsed = InvalidOffsetNumber;
 	int			i;
 	Size		totallen;
 	bool		presorted = true;	/* For now */
@@ -724,6 +729,7 @@ PageRepairFragmentation(Page page)
 		lp = PageGetItemId(page, i);
 		if (ItemIdIsUsed(lp))
 		{
+			lastUsed = i;
 			if (ItemIdHasStorage(lp))
 			{
 itemidptr->offsetindex = i - 1;
@@ -771,6 +777,11 @@ PageRepairFragmentation(Page page)
 		compactify_tuples(itemidbase, nstorage, page, presorted);
 	}
 
+	if (lastUsed != nline) {
+		((PageHeader) page)->pd_lower = SizeOfPageHeaderData + (sizeof(ItemIdData) * lastUsed);
+		nunused = nunused - (nline - lastUsed);
+	}
+
 	/* Set hint bit for PageAddItem */
 	if (nunused > 0)
 		PageSetHasFreeLinePointers(page);
-- 
2.20.1



Procedures versus the "fastpath" API

2021-03-09 Thread Tom Lane
The security team received a report from Theodor-Arsenij
Larionov-Trichkin of PostgresPro that it's possible to crash the
backend with an assertion or null-pointer dereference by trying to
call a window function via the "fast path function call" protocol
message.  fastpath.c doesn't set up any WindowObject function context,
of course, but most of the built-in window functions just assume there
will be one.  We concluded that there's no possibility of anything
more interesting than an immediate core dump, so per our usual rules
this isn't a CVE-grade bug.  However, we poked around to see if there
were any related problems, and soon found that fastpath.c will happily
attempt to call procedures as well as functions.  That seems to work,
accidentally IMO, for simple procedures --- but if the procedure tries
to COMMIT or ROLLBACK then you get "ERROR: invalid transaction
termination".  (There might be other edge-case problems; I've not
tried subtransactions or OUT parameters for example.)

So the question on the table is what to do about this.  As far as
window functions go, it seems clear that fastpath.c should just reject
any attempt to call a window function that way (or an aggregate for
that matter; aggregates fail already, but with relatively obscure
error messages).  Perhaps there's also an argument that window
functions should have run-time tests, not just assertions, that
they're called in a valid way.

As for procedures, I'm of the opinion that we should just reject those
too, but some other security team members were not happy with that
idea.  Conceivably we could attempt to make the case actually work,
but is it worth the trouble?  Given the lack of field complaints
about the "invalid transaction termination" failure, it seems unlikely
that it's occurred to anyone to try to call procedures this way.
We'd need special infrastructure to test the case, too, since psql
offers no access to fastpath calls.
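
If we do decide to reject these cases, I'm imagining nothing more than a
check in fastpath.c right after the pg_proc lookup, along these lines
("pform" here stands for whatever Form_pg_proc pointer that code has in
hand, and the error wording is only a placeholder):

if (pform->prokind != PROKIND_FUNCTION)
    ereport(ERROR,
            (errcode(ERRCODE_WRONG_OBJECT_TYPE),
             errmsg("cannot call %s via the fastpath interface",
                    pform->prokind == PROKIND_PROCEDURE ? "a procedure" :
                    pform->prokind == PROKIND_AGGREGATE ? "an aggregate function" :
                    "a window function")));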

A compromise suggestion was to prohibit calling procedures via
fastpath as of HEAD, but leave existing releases as they are,
in case anyone is using a case that happens to work.

Thoughts?

regards, tom lane




Re: Proposal: Save user's original authenticated identity for logging

2021-03-09 Thread Jacob Champion
On Tue, 2021-03-09 at 18:03 +, Jacob Champion wrote:
> v4 removes the TODO and the extra allocation for peer_user. I'll hold
> off on the other two suggestions pending that conversation.

And v5 is rebased over this morning's SSL test changes.

--Jacob
From 1159ef7f8fb846649c1c36bb1ecd17bd4b687668 Mon Sep 17 00:00:00 2001
From: Jacob Champion 
Date: Wed, 3 Feb 2021 11:04:42 -0800
Subject: [PATCH v5 1/3] test/kerberos: only search forward in logs

The log content tests searched through the entire log file, from the
beginning, for every match. If two tests shared the same expected log
content, it was possible for the second test to get a false positive by
matching against the first test's output. (This could be seen by
modifying one of the expected-failure tests to expect the same output as
a previous happy-path test.)

Store the offset of the previous match, and search forward from there
during the next match, resetting the offset every time the log file
changes. This could still result in false positives if one test spits
out two or more matching log lines, but it should be an improvement over
what's there now.
---
 src/test/kerberos/t/001_auth.pl | 24 +++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/src/test/kerberos/t/001_auth.pl b/src/test/kerberos/t/001_auth.pl
index 079321bbfc..c721d24dbd 100644
--- a/src/test/kerberos/t/001_auth.pl
+++ b/src/test/kerberos/t/001_auth.pl
@@ -182,6 +182,9 @@ $node->safe_psql('postgres', 'CREATE USER test1;');
 
 note "running tests";
 
+my $current_log_name = '';
+my $current_log_position;
+
 # Test connection success or failure, and if success, that query returns true.
 sub test_access
 {
@@ -221,18 +224,37 @@ sub test_access
 		$lfname =~ s/^stderr //;
 		chomp $lfname;
 
+		if ($lfname ne $current_log_name)
+		{
+			$current_log_name = $lfname;
+
+			# Search from the beginning of this new file.
+			$current_log_position = 0;
+			note "current_log_position = $current_log_position";
+		}
+
 		# might need to retry if logging collector process is slow...
 		my $max_attempts = 180 * 10;
 		my $first_logfile;
 		for (my $attempts = 0; $attempts < $max_attempts; $attempts++)
 		{
 			$first_logfile = slurp_file($node->data_dir . '/' . $lfname);
-			last if $first_logfile =~ m/\Q$expect_log_msg\E/;
+
+			# Don't include previously matched text in the search.
+			$first_logfile = substr $first_logfile, $current_log_position;
+			if ($first_logfile =~ m/\Q$expect_log_msg\E/g)
+			{
+$current_log_position += pos($first_logfile);
+last;
+			}
+
 			usleep(100_000);
 		}
 
 		like($first_logfile, qr/\Q$expect_log_msg\E/,
 			 'found expected log file content');
+
+		note "current_log_position = $current_log_position";
 	}
 
 	return;
-- 
2.25.1

From 5ac20ba6cac339bc4ffc0d765c67bfbf6583425c Mon Sep 17 00:00:00 2001
From: Jacob Champion 
Date: Mon, 8 Feb 2021 10:53:20 -0800
Subject: [PATCH v5 2/3] ssl: store client's DN in port->peer_dn

Original patch by Andrew Dunstan:

https://www.postgresql.org/message-id/fd96ae76-a8e3-ef8e-a642-a592f5b76771%40dunslane.net

but I've taken out the clientname=DN functionality; all that will be
needed for the next patch is the DN string.
---
 src/backend/libpq/be-secure-openssl.c | 53 +++
 src/include/libpq/libpq-be.h  |  1 +
 2 files changed, 47 insertions(+), 7 deletions(-)

diff --git a/src/backend/libpq/be-secure-openssl.c b/src/backend/libpq/be-secure-openssl.c
index 8c37381add..a7b09534a8 100644
--- a/src/backend/libpq/be-secure-openssl.c
+++ b/src/backend/libpq/be-secure-openssl.c
@@ -546,22 +546,25 @@ aloop:
 	/* Get client certificate, if available. */
 	port->peer = SSL_get_peer_certificate(port->ssl);
 
-	/* and extract the Common Name from it. */
+	/* and extract the Common Name / Distinguished Name from it. */
 	port->peer_cn = NULL;
+	port->peer_dn = NULL;
 	port->peer_cert_valid = false;
 	if (port->peer != NULL)
 	{
 		int			len;
+		X509_NAME  *x509name = X509_get_subject_name(port->peer);
+		char	   *peer_cn;
+		char	   *peer_dn;
+		BIO		   *bio = NULL;
+		BUF_MEM*bio_buf = NULL;
 
-		len = X509_NAME_get_text_by_NID(X509_get_subject_name(port->peer),
-		NID_commonName, NULL, 0);
+		len = X509_NAME_get_text_by_NID(x509name, NID_commonName, NULL, 0);
 		if (len != -1)
 		{
-			char	   *peer_cn;
-
 			peer_cn = MemoryContextAlloc(TopMemoryContext, len + 1);
-			r = X509_NAME_get_text_by_NID(X509_get_subject_name(port->peer),
-		  NID_commonName, peer_cn, len + 1);
+			r = X509_NAME_get_text_by_NID(x509name, NID_commonName, peer_cn,
+		  len + 1);
 			peer_cn[len] = '\0';
 			if (r != len)
 			{
@@ -585,6 +588,36 @@ aloop:
 
 			port->peer_cn = peer_cn;
 		}
+
+		bio = BIO_new(BIO_s_mem());
+		if (!bio)
+		{
+			pfree(port->peer_cn);
+			port->peer_cn = NULL;
+			return -1;
+		}
+		/* use commas instead of slashes */
+		X509_NAME_print_ex(bio, x509name, 0, XN_FLAG_RFC2253);
+		BIO_get_mem_ptr(bio, &bio_buf);
+		peer_dn = 

Re: Online checksums patch - once again

2021-03-09 Thread Daniel Gustafsson
> On 9 Mar 2021, at 19:12, Bruce Momjian  wrote:

Since this patch is de-facto rejected I'll mark it withdrawn in the CF app to
save on cfbot bandwidth.

> Do we support or document the ability to create a standby with checksums
> from a primary without it, and is that a better approach?

Michael Banck started a new thread for that forking off of this one on message
id 8f193f949b39817b9c642623e1fe7ccb94137ce4.ca...@credativ.de so it's probably
better to continue the discussion of that over there.

--
Daniel Gustafsson   https://vmware.com/





Re: Fwd: Row description Metadata information

2021-03-09 Thread Bruce Momjian
On Mon, Feb 15, 2021 at 05:25:55PM -0800, Aleksei Ivanov wrote:
> Not sure that previous email was sent correctly. If it was duplicated, sorry
> for the inconvenience.
> 
> Hi, hackers,
> 
> I have one question related to returned information in the row description for
> prepared statement.
> 
> For example Select $1 * 2 and then Bind 1.6 to it.
> The returned result is correct and equal to 3.2, but type modifier in the row
> description is equal to -1, which is not correct.
> 
> Does someone know where this modifier is calculated? Is this a bug or 
> intention
> behavior?

Postgres can't always propagate the type modifier for all expressions, so
it basically doesn't even try.  For example, the modifier for || would
be very complex.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Online checksums patch - once again

2021-03-09 Thread Bruce Momjian
On Mon, Feb 15, 2021 at 02:02:02PM +0100, Daniel Gustafsson wrote:
> > On 11 Feb 2021, at 14:10, Bruce Momjian  wrote:
> > I don't think anyone has done anything wrong --- rather, it is what we
> > are _trying_ to do that is complex.
> 
> Global state changes in a cluster are complicated, and are unlikely to ever
> be otherwise.  By writing patches that attempt such state changes we can see which
> pieces of infrastructure we're lacking to reduce complexity.  A good example 
> is
> the ProcSignalBarrier work that Andres and Robert did, inspired in part by 
> this
> patch IIUC.  The more we do, the more we learn.

Do we support or document the ability to create a standby with checksums
from a primary without it, and is that a better approach?

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: [patch] [doc] Minor variable related cleanup and rewording of plpgsql docs

2021-03-09 Thread Tom Lane
David Steele  writes:
> 1) PL/SQL seems to be used in a few places where I believe PL/pgSQL is 
> meant. This was pre-existing but now seems like a good opportunity to 
> fix it, unless I am misunderstanding.

PL/SQL is Oracle's function language, which PL/pgSQL is modeled on.
At least some of the mentions of PL/SQL are probably intentional,
so you'll have to look closely not just search-and-replace.

regards, tom lane




Re: default result formats setting

2021-03-09 Thread Tom Lane
David Steele  writes:
> Andrew, Tom, does the latest patch address your concerns?

[ reads patch quickly... ]  I think the definition is fine now,
modulo possible bikeshedding on the GUC name.  (I have no
great suggestion on that right now, but the current proposal
seems mighty verbose.)

The implementation feels weird though, mainly in that I don't like
Peter's choices for where to put the code.  pquery.c is not where
I would have expected to find the support for this, and I do not
have any confidence that applying the format conversion while
filling portal->formats[] is enough to cover all cases.  I'd have
thought that access/common/printtup.c or somewhere near there
would be where to do it.

Likewise, the code associated with caching the results of the type
OID lookups seems like it should be someplace where you'd be more
likely to find (a) type name lookup and (b) caching logic.  I'm
not quite sure about the best place for that, but we could do
worse than put it in parse_type.c.  (As I recall, the parser
already has some caching related to operator lookup, so doing
part (b) there isn't too much of a stretch.)

Also, if we need YA string-splitting function, please put it
beside the ones that already exist (SplitIdentifierString etc in
varlena.c).  That way (a) it's available if some other code needs
it, and (b) when somebody gets around to refactoring all the
splitters, they won't have to dig into nooks and crannies to find
them.

Having said that, I wonder if we should define the parameter's
contents this way, i.e. as things that parseTypeString will
accept.  At best that's overspecification, e.g. should people
expect that varchar(7) and varchar(9) are different things
(and, perhaps, that such entries *don't* match varchars of other
lengths?)  I think a case could be made for requiring the entries
to be identifiers exactly matching pg_type.typname, possibly with
schema qualification.  This'd allow tighter verification of the
GUC value's format in the GUC check hook.

Or we could drop all of that and go back to having it be a list
of type OIDs, which would remove a *whole lot* of the complexity,
and I'm not sure that it's materially less friendly.  Applications
have had to deal with type OIDs in the protocol since forever.

BTW, I wonder whether we still need to restrict the GUC to not
be settable from postgresql.conf.  The fact that the client has
to explicitly pass -1 seems to reduce any security issues quite
a bit.

regards, tom lane




Re: Proposal: Save user's original authenticated identity for logging

2021-03-09 Thread Jacob Champion
On Mon, 2021-03-08 at 22:16 +, Jacob Champion wrote:
> On Sat, 2021-03-06 at 18:33 +0100, Magnus Hagander wrote:
> > As for log escaping, we report port->user_name already unescaped --
> > surely this shouldn't be a worse case than that?
> 
> Ah, that's a fair point. I'll remove the TODO.

v4 removes the TODO and the extra allocation for peer_user. I'll hold
off on the other two suggestions pending that conversation.

--Jacob
commit c323ba88d5a9823765c68cd4c2c169493e4c269a
Author: Jacob Champion 
Date:   Tue Mar 9 09:53:03 2021 -0800

fixup! Log authenticated identity from all auth backends

diff --git a/src/backend/libpq/auth.c b/src/backend/libpq/auth.c
index 65d10a5ad2..bfcffbdb3b 100644
--- a/src/backend/libpq/auth.c
+++ b/src/backend/libpq/auth.c
@@ -354,8 +354,6 @@ set_authn_id(Port *port, const char *id)
 {
Assert(id);
 
-   /* TODO: should the identity strings be escaped before being logged? */
-
if (port->authn_id)
{
/*
@@ -2002,7 +2000,6 @@ auth_peer(hbaPort *port)
gid_t   gid;
 #ifndef WIN32
struct passwd *pw;
-   char   *peer_user;
int ret;
 #endif
 
@@ -2038,12 +2035,9 @@ auth_peer(hbaPort *port)
 * Make a copy of static getpw*() result area. This is our authenticated
 * identity.
 */
-   peer_user = pstrdup(pw->pw_name);
-   set_authn_id(port, peer_user);
-
-   ret = check_usermap(port->hba->usermap, port->user_name, peer_user, 
false);
+   set_authn_id(port, pw->pw_name);
 
-   pfree(peer_user);
+   ret = check_usermap(port->hba->usermap, port->user_name, 
port->authn_id, false);
 
return ret;
 #else
From 22518e1ca0be717ce001f0c79214924fc62eb75c Mon Sep 17 00:00:00 2001
From: Jacob Champion 
Date: Wed, 3 Feb 2021 11:04:42 -0800
Subject: [PATCH v4 1/3] test/kerberos: only search forward in logs

The log content tests searched through the entire log file, from the
beginning, for every match. If two tests shared the same expected log
content, it was possible for the second test to get a false positive by
matching against the first test's output. (This could be seen by
modifying one of the expected-failure tests to expect the same output as
a previous happy-path test.)

Store the offset of the previous match, and search forward from there
during the next match, resetting the offset every time the log file
changes. This could still result in false positives if one test spits
out two or more matching log lines, but it should be an improvement over
what's there now.
---
 src/test/kerberos/t/001_auth.pl | 24 +++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/src/test/kerberos/t/001_auth.pl b/src/test/kerberos/t/001_auth.pl
index 079321bbfc..c721d24dbd 100644
--- a/src/test/kerberos/t/001_auth.pl
+++ b/src/test/kerberos/t/001_auth.pl
@@ -182,6 +182,9 @@ $node->safe_psql('postgres', 'CREATE USER test1;');
 
 note "running tests";
 
+my $current_log_name = '';
+my $current_log_position;
+
 # Test connection success or failure, and if success, that query returns true.
 sub test_access
 {
@@ -221,18 +224,37 @@ sub test_access
 		$lfname =~ s/^stderr //;
 		chomp $lfname;
 
+		if ($lfname ne $current_log_name)
+		{
+			$current_log_name = $lfname;
+
+			# Search from the beginning of this new file.
+			$current_log_position = 0;
+			note "current_log_position = $current_log_position";
+		}
+
 		# might need to retry if logging collector process is slow...
 		my $max_attempts = 180 * 10;
 		my $first_logfile;
 		for (my $attempts = 0; $attempts < $max_attempts; $attempts++)
 		{
 			$first_logfile = slurp_file($node->data_dir . '/' . $lfname);
-			last if $first_logfile =~ m/\Q$expect_log_msg\E/;
+
+			# Don't include previously matched text in the search.
+			$first_logfile = substr $first_logfile, $current_log_position;
+			if ($first_logfile =~ m/\Q$expect_log_msg\E/g)
+			{
+				$current_log_position += pos($first_logfile);
+				last;
+			}
+
 			usleep(100_000);
 		}
 
 		like($first_logfile, qr/\Q$expect_log_msg\E/,
 			 'found expected log file content');
+
+		note "current_log_position = $current_log_position";
 	}
 
 	return;
-- 
2.25.1

From f3e5040fd29c0143b7d9991d90c94d4151f5622a Mon Sep 17 00:00:00 2001
From: Jacob Champion 
Date: Mon, 8 Feb 2021 10:53:20 -0800
Subject: [PATCH v4 2/3] ssl: store client's DN in port->peer_dn

Original patch by Andrew Dunstan:

https://www.postgresql.org/message-id/fd96ae76-a8e3-ef8e-a642-a592f5b76771%40dunslane.net

but I've taken out the clientname=DN functionality; all that will be
needed for the next patch is the DN string.
---
 src/backend/libpq/be-secure-openssl.c | 53 +++
 src/include/libpq/libpq-be.h  |  1 +
 2 files changed, 47 insertions(+), 7 deletions(-)

diff --git a/src/backend/libpq/be-secure-openssl.c b/src/backend/libpq/be-secure-openssl.c
index 4c4f025eb1..d0184a2ce2 100644
--- 

Re: [PATCH] pg_permissions

2021-03-09 Thread Joel Jacobson
On Tue, Mar 9, 2021, at 04:01, Chapman Flack wrote:
> On Sat, Mar 06, 2021 at 08:03:17PM +0100, Joel Jacobson wrote:
> >    regclass   |  obj_desc   | grantor | grantee | privilege_type | is_grantable
> > --------------+-------------+---------+---------+----------------+--------------
> 
> 1. Is there a reason not to make 'grantor' and 'grantee' of type regrole?
>In other words, s/rolname/oid::regrole/ throughout the view definition.
>It looks the same visually, but should be easier to build on in a larger
>query.
> 
>Hmm, ok, a grantee of 'public' can't be expressed as a regrole. This
>seems an annoying little corner.[1] It can be represented by 0::regrole,
>but that displays as '-'. Hmm again, you can even '-'::regrole and get 0.
> 
> 
> 2. Also to facilitate use in a larger query, how about columns for the
>objid and objsubid, in addition to the human-friendly obj_desc?
>And I'm not sure about using pg_attribute as the regclass for
>attributes; it's nice to look at, but could also plant the wrong idea
>that attributes have pg_attribute as their classid, when it's really
>pg_class with an objsubid. Anyway, there's the human-friendly obj_desc
>to tell you it's a column.

Thanks for coming up with these two good ideas. I was wrong, they are great.

Both have now been implemented.

New patch attached.

Example usage:

CREATE ROLE test_user;
CREATE ROLE test_group;
CREATE ROLE test_owner;
CREATE SCHEMA test AUTHORIZATION test_owner;
GRANT ALL ON SCHEMA test TO test_group;
GRANT test_group TO test_user;

SELECT * FROM pg_permissions WHERE grantor = 'test_owner'::regrole;
   classid    | objid | objsubid |   objdesc   |  grantor   |  grantee   | privilege_type | is_grantable
--------------+-------+----------+-------------+------------+------------+----------------+--------------
 pg_namespace | 16390 |        0 | schema test | test_owner | test_owner | USAGE          | f
 pg_namespace | 16390 |        0 | schema test | test_owner | test_owner | CREATE         | f
 pg_namespace | 16390 |        0 | schema test | test_owner | test_group | USAGE          | f
 pg_namespace | 16390 |        0 | schema test | test_owner | test_group | CREATE         | f
(4 rows)

SET ROLE TO test_user;
CREATE TABLE test.a ();
RESET ROLE;

SELECT * FROM pg_ownerships WHERE owner = 'test_owner'::regrole;
   classid    | objid | objsubid |   objdesc   |   owner
--------------+-------+----------+-------------+------------
 pg_namespace | 16390 |        0 | schema test | test_owner
(1 row)

ALTER TABLE test.a OWNER TO test_owner;

SELECT * FROM pg_ownerships WHERE owner = 'test_owner'::regrole;
   classid    | objid | objsubid |   objdesc   |   owner
--------------+-------+----------+-------------+------------
 pg_class     | 16391 |        0 | table a     | test_owner
 pg_namespace | 16390 |        0 | schema test | test_owner
 pg_type      | 16393 |        0 | type a      | test_owner
 pg_type      | 16392 |        0 | type a[]    | test_owner
(4 rows)

GRANT INSERT ON test.a TO test_group;

SELECT * FROM pg_permissions WHERE grantee = 'test_group'::regrole;
   classid    | objid | objsubid |   objdesc   |  grantor   |  grantee   | privilege_type | is_grantable
--------------+-------+----------+-------------+------------+------------+----------------+--------------
 pg_class     | 16391 |        0 | table a     | test_owner | test_group | INSERT         | f
 pg_namespace | 16390 |        0 | schema test | test_owner | test_group | USAGE          | f
 pg_namespace | 16390 |        0 | schema test | test_owner | test_group | CREATE         | f
(3 rows)

/Joel

0003-pg_permissions-and-pg_ownerships.patch
Description: Binary data


Re: [patch] [doc] Minor variable related cleanup and rewording of plpgsql docs

2021-03-09 Thread Pavel Stehule
On Tue, Mar 9, 2021 at 18:03, David Steele wrote:

> On 11/30/20 10:37 AM, Pavel Stehule wrote:
> > On Mon, Nov 30, 2020 at 16:06, David G. Johnston
> >
> > ok
> This patch looks reasonable to me overall.
>
> A few comments:
>
> 1) PL/SQL seems to be used in a few places where I believe PL/pgSQL is
> meant. This was pre-existing but now seems like a good opportunity to
> fix it, unless I am misunderstanding.
>

+1


> 2) I think:
>
> + makes the command behave like SELECT, which is described
>
> flows a little better as:
>
> + makes the command behave like SELECT, as described
>

I am not a native speaker, so I am not able to evaluate it.

Regards

Pavel


> Regards,
> --
> -David
> da...@pgmasters.net
>


Re: Online verification of checksums

2021-03-09 Thread David Steele

On 11/30/20 6:38 PM, David Steele wrote:

On 11/30/20 9:27 AM, Stephen Frost wrote:

* Michael Paquier (mich...@paquier.xyz) wrote:

On Fri, Nov 27, 2020 at 11:15:27AM -0500, Stephen Frost wrote:

* Magnus Hagander (mag...@hagander.net) wrote:
On Thu, Nov 26, 2020 at 8:42 AM Michael Paquier 
 wrote:

But here the checksum is broken, so while the offset is something we
can rely on how do you make sure that the LSN is fine?  A broken
checksum could perfectly mean that the LSN is actually *not* fine if
the page header got corrupted.


Of course that could be the case, but it gets to be a smaller and
smaller chance by checking that the LSN read falls within reasonable
bounds.


FWIW, I find that scary.


There's ultimately different levels of 'scary' and the risk here that
something is actually wrong following these checks strikes me as being
on the same order as random bits being flipped in the page and still
getting a valid checksum (which is entirely possible with our current
checksum...), or maybe even less. 


I would say a lot less. First you'd need to corrupt one of the eight 
bytes that make up the LSN (pretty likely since corruption will probably 
affect the entire block) and then it would need to be updated to a value 
that falls within the current backup range, a 1 in 16 million chance if 
a terabyte of WAL is generated during the backup. Plus, the corruption 
needs to happen during the backup since we are going to check for that, 
and the corrupted LSN needs to be ascending, and the LSN originally read 
needs to be within the backup range (another 1 in 16 million chance) 
since pages written before the start backup checkpoint should not be torn.


So as far as I can see there are more likely to be false negatives from 
the checksum itself.


It would also be easy to add a few rounds of checks, i.e. test if the 
LSN ascends but stays in the backup LSN range N times.
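
As a sketch only (not code from any of the posted patches; reread_page and the
LSN bounds are assumed inputs, and a real implementation would also re-verify
the checksum on each read), that retry idea could look roughly like this:

/*
 * Re-read a page whose checksum failed and decide whether it merely looks
 * torn because it is being written concurrently.  Returns true if every
 * re-read shows an ascending LSN that stays inside the backup's LSN window.
 */
static bool
page_looks_concurrently_written(Page page, XLogRecPtr backup_start_lsn,
								XLogRecPtr backup_stop_lsn, int nretries,
								bool (*reread_page) (Page page))
{
	XLogRecPtr	prev_lsn = PageGetLSN(page);

	for (int i = 0; i < nretries; i++)
	{
		XLogRecPtr	lsn;

		if (!reread_page(page))
			return false;		/* cannot re-read: report as corruption */

		lsn = PageGetLSN(page);

		/* LSN must keep ascending and stay within the backup window */
		if (lsn <= prev_lsn ||
			lsn < backup_start_lsn ||
			lsn > backup_stop_lsn)
			return false;

		prev_lsn = lsn;
	}
	return true;
}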


Honestly, I'm much more worried about corruption zeroing the entire 
page. I don't know how likely that is, but I know none of our proposed 
solutions would catch it.


Andres, since you brought this issue up originally perhaps you'd like to 
weigh in?


I had another look at this patch and though I think my suggestions above 
would improve the patch, I have no objections to going forward as is (if 
that is the consensus) since this seems an improvement over what we have 
now.


It comes down to whether you prefer false negatives or false positives. 
With the LSN checking Stephen and I advocate it is theoretically 
possible to have a false negative but the chances of the LSN ascending N 
times but staying within the backup LSN range due to corruption seems to 
be approaching zero.


I think Michael's method is unlikely to throw false positives, but it 
seems at least possible that a block would be hot enough to appear 
torn N times in a row. Torn pages themselves are really easy to reproduce.


If we do go forward with this method I would likely propose the 
LSN-based approach as a future improvement.


Regards,
--
-David
da...@pgmasters.net




Re: non-HOT update not looking at FSM for large tuple update

2021-03-09 Thread John Naylor
I wrote:

> That seems like the proper fix, and I see you've started a thread for
> that. I don't think that change in behavior would be backpatchable, but
> the patch here might have a chance at that.

I remembered after the fact that truncating line pointers would only allow
for omitting the 2% slack logic (and has other benefits), but the rest of
this patch would be needed regardless.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: non-HOT update not looking at FSM for large tuple update

2021-03-09 Thread John Naylor
On Tue, Mar 9, 2021 at 9:40 AM Floris Van Nee 
wrote:
>
> Hi,
>
> >
> > This patch fails to consider that len may be bigger than
> > MaxHeapTupleSize * 0.98, which in this case triggers a reproducible
> > PANIC:
>
> Good catch! I've adapted the patch with your suggested fix.

Thank you both for testing and for the updated patch. It seems we should
add a regression test, but it's not clear which file it belongs in.
Possibly insert.sql?

> > One different question I have, though, is why we can't "just" teach vacuum
> > to clean up trailing unused line pointers. As in, can't we trim the line pointer
> > array when vacuum detects that the trailing line pointers on the page are all
> > unused?

That seems like the proper fix, and I see you've started a thread for that.
I don't think that change in behavior would be backpatchable, but the patch
here might have a chance at that.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: WIP: document the hook system

2021-03-09 Thread Bruce Momjian
On Sat, Mar  6, 2021 at 08:32:43PM -0500, Tom Lane wrote:
> I think that the best you should hope for here is that people are
> willing to add a short, not-too-detailed para to a markup-free
> plain-text README file that lists all the hooks.  As soon as it
> gets any more complex than that, either the doco aspect will be
> ignored, or there simply won't be any more hooks.
> 
> (I'm afraid I likewise don't believe in the idea of carrying a test
> module for each hook.  Again, requiring that is a good way to
> ensure that new hooks just won't happen.)

Agreed.  If you document the hooks too much, it allows them to drift
away from matching the code, which makes the hook documentation actually
worse than having no hook documentation at all.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: [patch] [doc] Minor variable related cleanup and rewording of plpgsql docs

2021-03-09 Thread David Steele

On 11/30/20 10:37 AM, Pavel Stehule wrote:
On Mon, Nov 30, 2020 at 16:06, David G. Johnston


ok

This patch looks reasonable to me overall.

A few comments:

1) PL/SQL seems to be used in a few places where I believe PL/pgSQL is 
meant. This was pre-existing but now seems like a good opportunity to 
fix it, unless I am misunderstanding.


2) I think:

+ makes the command behave like SELECT, which is described


flows a little better as:

+ makes the command behave like SELECT, as described

Regards,
--
-David
da...@pgmasters.net




Re: Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-09 Thread Tom Lane
"Joel Jacobson"  writes:
> Tom - can you please give details on your unpleasant experiences with 
> parallel arrays?

The problems I can recall running into were basically down to not having
an easy way to iterate through parallel arrays.  There are ways to do
that in SQL, certainly, but they all constrain how you write the query,
and usually force ugly stuff like splitting it into sub-selects.

As an example, presuming that regexp_positions is defined along the
lines of

regexp_positions(str text, pat text, out starts int[], out lengths int[])
returns setof record

then to actually get the identified substrings you'd have to do something
like

select
  substring([input string] from starts[i] for lengths[i])
from
  regexp_positions([input string], [pattern]) r,
  lateral
generate_series(1, array_length(starts, 1)) i;

I think the last time I confronted this, we didn't have multi-array
UNNEST.  Now that we do, we can get rid of the generate_series(),
but it's still not beautiful:

select
  substring([input string] from s for l)
from
  regexp_positions([input string], [pattern]) r,
  lateral
unnest(starts, lengths) u(s,l);

Having said that, the other alternative with a 2-D array:

regexp_positions(str text, pat text) returns setof int[]

seems to still need UNNEST, though now it's not the magic multi-array
UNNEST but this slicing version:

select
  substring([input string] from u[1] for u[2])
from
  regexp_positions([input string], [pattern]) r,
  lateral
unnest_slice(r, 1) u;

Anyway, I'd counsel trying to write out SQL implementations
of regexp_matches() and other useful things based on any
particular regexp_positions() API you might be thinking about.
Can we do anything useful without a LATERAL UNNEST thingie?
Are some of them more legible than others?

regards, tom lane




Re: [PATCH] pg_permissions

2021-03-09 Thread Chapman Flack
On 03/09/21 11:11, Joel Jacobson wrote:
> On Tue, Mar 9, 2021, at 07:34, Joel Jacobson wrote:
>> On Tue, Mar 9, 2021, at 04:01, Chapman Flack wrote:
>>> 1. Is there a reason not to make 'grantor' and 'grantee' of type regrole?
> 
> Having digested your idea, I actually agree with you.
> 
> Since we have the regrole-type, I agree we should use it,
> even though we need to cast, no biggie.

This does highlight [topicshift] one sort of
inconvenience I've observed before in other settings: how fussy
it may be to write WHERE grantee = 'bob' when there is no user 'bob'.

A simple cast 'bob'::regrole raises undefined_object (in class
"Syntax Error or Access Rule Violation") rather than just returning
no rows because no grantee is bob.

It's a more general issue: I first noticed it when I had proudly
implemented my first PostgreSQL type foo that would only accept
valid foos as values, and the next thing that happened was my
colleague in frontend development wrote mean Python comments about me
because he couldn't simply search for a foo in a table without either
first duplicating the validation of the value or trapping the error
if the user had entered a non-foo to search for.

We could solve that, of course, by implementing = and <> (foo,text)
to simply return false (resp. true) if the text arg isn't castable
to foo.

But the naïve way of writing such an operator repeats the castability
test for every row compared. If I were to build such an operator now,
I might explore whether a planner support function could be used
to check the castability once, and replace the whole comparison with
constant false if that fails.
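
A rough sketch of that idea follows (the type foo, its operator function, and
the foo_is_valid() helper are hypothetical; only the SupportRequestSimplify
machinery is real):

Datum
foo_text_eq_support(PG_FUNCTION_ARGS)
{
	Node	   *rawreq = (Node *) PG_GETARG_POINTER(0);
	Node	   *ret = NULL;

	if (IsA(rawreq, SupportRequestSimplify))
	{
		SupportRequestSimplify *req = (SupportRequestSimplify *) rawreq;
		Node	   *arg2 = (Node *) lsecond(req->fcall->args);

		/* Only a constant, non-null text argument can be checked at plan time */
		if (IsA(arg2, Const) && !((Const *) arg2)->constisnull)
		{
			char	   *txt = TextDatumGetCString(((Const *) arg2)->constvalue);

			/* foo_is_valid() is the hypothetical "is this castable?" test */
			if (!foo_is_valid(txt))
				ret = (Node *) makeBoolConst(false, false);
		}
	}

	PG_RETURN_POINTER(ret);
}

Returning NULL tells the planner that no simplification applies, so the
run-time comparison still has to tolerate invalid input gracefully.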

And this strikes me as a situation that might be faced often enough
to wonder if some kind of meta-support-function would be worth supplying
that could do that for any type foo.
[/topicshift]

Regards,
-Chap




Re: partial heap only tuples

2021-03-09 Thread Bruce Momjian
On Mon, Feb 15, 2021 at 08:19:40PM +, Bossart, Nathan wrote:
> Yeah, this is something I'm concerned about.  I think adding a bitmap
> of modified columns to the header of PHOT-updated tuples improves
> matters quite a bit, even for single-page vacuuming.  Following is a
> strategy I've been developing (there may still be some gaps).  Here's
> a basic PHOT chain where all tuples are visible and the last one has
> not been deleted or updated:
> 
> idx1    0   1   2   3
> idx2    0   1   2
> idx3    0   2   3
> lp  1   2   3   4   5
> tuple   (0,0,0) (0,1,1) (2,2,1) (2,2,2) (3,2,3)
> bitmap  -xx xx- --x x-x

First, I want to continue encouraging you to work on this because I
think it can yield big improvements.  Second, I like the wiki you
created.  Third, the diagram above seems to be more meaningful if read
from the bottom-up.  I suggest you reorder it on the wiki so it can be
read top-down, maybe:

> lp  1   2   3   4   5
> tuple   (0,0,0) (0,1,1) (2,2,1) (2,2,2) (3,2,3)
> bitmap  -xx xx- --x x-x
> idx1    0   1   2   3
> idx2    0   1   2
> idx3    0   2   3

Fourth, I know in the wiki you said create/drop index needs more
research, but I suggest you avoid any design that will be overly complex
for create/drop index.  For example, a per-row bitmap that is based on
what indexes exist at time of row creation might cause unacceptable
problems in handling create/drop index.  Would you number indexes?  I am
not saying you have to solve all the problems now, but you have to keep
your eye on obstacles that might block your progress later.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: A problem about partitionwise join

2021-03-09 Thread David Steele

On 11/27/20 7:05 AM, Ashutosh Bapat wrote:

On Tue, Nov 10, 2020 at 2:43 PM Richard Guo  wrote:


To recap, the problem we are fixing here is when generating join clauses
from equivalence classes, we only select the joinclause with the 'best
score', or the first joinclause with a score of 3. This may cause us to
miss some joinclause on partition keys and thus fail to generate
partitionwise join.

The initial idea for the fix is to create all the RestrictInfos from ECs
in order to check whether there exist equi-join conditions involving
pairs of matching partition keys of the relations being joined for all
partition keys. And then Tom proposed a much better idea which leverages
function exprs_known_equal() to tell whether the partkeys can be found
in the same eclass, which is the current implementation in the latest
patch.


In the example you gave earlier, the equi join on partition key was
there but it was replaced by individual constant assignment clauses.
So if we keep the original restrictclause in there with a new flag
indicating that it's redundant, have_partkey_equi_join will still be
able to use it without much change. Depending upon where all we need
to use avoid restrictclauses with the redundant flag, this might be an
easier approach. However, with Tom's idea partition-wise join may be
used even when there is no equi-join between partition keys but there
are clauses like pk = const for all tables involved and const is the
same for all such tables.

In the spirit of small improvement made to the performance of
have_partkey_equi_join(), pk_has_clause should be renamed as
pk_known_equal and pks_known_equal as num_equal_pks.

The loop traversing the partition keys at a given position, may be
optimized further if we pass lists to exprs_known_equal() which in
turn checks whether one expression from each list is a member of a
given EC. This will avoid traversing all equivalence classes for each
partition key expression, which can be a huge improvement when there
are many ECs. But I think if one of the partition key expressions at a
given position is a member of an equivalence class, all the other
partition key expressions at that position should be part of that
equivalence class, since there should be an equi-join between those. So
the loop in loop may not be required to start with.


Richard, any thoughts on Ashutosh's comments?

Regards,
--
-David
da...@pgmasters.net




Re: Lowering the ever-growing heap->pd_lower

2021-03-09 Thread Mark Dilger



> On Mar 9, 2021, at 7:13 AM, Matthias van de Meent 
>  wrote:
> 
> Hi,
> 
> The heap AMs' pages only grow their pd_linp array, and never shrink
> when trailing entries are marked unused. This means that up to 14% of
> free space (=291 unused line pointers) on a page could be unusable for
> data storage, which I think is a shame. With a patch in the works that
> allows the line pointer array to grow up to one third of the size of
> the page [0], it would be quite catastrophic for the available data
> space on old-and-often-used pages if this could not ever be reused for
> data.
> 
> The shrinking of the line pointer array is already common practice in
> indexes (in which all LP_UNUSED items are removed), but this specific
> implementation cannot be used for heap pages due to ItemId
> invalidation. One available implementation, however, is that we
> truncate the end of this array, as mentioned in [1]. There was a
> warning at the top of PageRepairFragmentation about not removing
> unused line pointers, but I believe that was about not removing
> _intermediate_ unused line pointers (which would imply moving in-use
> line pointers); as far as I know there is nothing that relies on only
> growing page->pd_lower, and nothing keeping us from shrinking it
> whilst holding a pin on the page.
> 
> Please find attached a fairly trivial patch which detects the last
> unused entry on a page, and truncates the pd_linp array to that entry,
> effectively freeing 4 bytes per line pointer truncated away (up to
> 1164 bytes for pages with MaxHeapTuplesPerPage unused lp_unused
> lines).
> 
> One unexpected benefit from this patch is that the PD_HAS_FREE_LINES
> hint bit optimization can now be false more often, increasing the
> chances of not having to check the whole array to find an empty spot.
> 
> Note: This does _not_ move valid ItemIds, it only removes invalid
> (unused) ItemIds from the end of the space reserved for ItemIds on a
> page, keeping valid linepointers intact.
> 
> 
> Enjoy,
> 
> Matthias van de Meent
> 
> [0] 
> https://www.postgresql.org/message-id/flat/cad21aod0ske11fmw4jd4renawbmcw1wasvnwpjvw3tvqpoq...@mail.gmail.com
> [1] 
> https://www.postgresql.org/message-id/CAEze2Wjf42g8Ho%3DYsC_OvyNE_ziM0ZkXg6wd9u5KVc2nTbbYXw%40mail.gmail.com
> 

For a prior discussion on this topic:

https://www.postgresql.org/message-id/2e78013d0709130606l56539755wb9dbe17225ffe90a%40mail.gmail.com

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-09 Thread Tom Lane
"Joel Jacobson"  writes:
> On Tue, Mar 9, 2021, at 10:18, Pavel Stehule wrote:
>> you can do unnest(array1, array2, ...)

> Right, I had forgotten about that variant.
> But isn't this a bit surprising then:
> ...
> Should there be an entry there showing the VARIADIC anyelement version as 
> well?

No, because there's no such pg_proc entry.  Multi-argument UNNEST is
special-cased by the parser, cf transformRangeFunction().

(Which is something I'd momentarily forgotten.  Forget my suggestion
that we could define unnest(anyarray, int) ... it has to be another
name.)

regards, tom lane




Re: [PATCH] pg_permissions

2021-03-09 Thread Joel Jacobson
On Tue, Mar 9, 2021, at 07:34, Joel Jacobson wrote:
> On Tue, Mar 9, 2021, at 04:01, Chapman Flack wrote:
>> 1. Is there a reason not to make 'grantor' and 'grantee' of type regrole?

Having digested your idea, I actually agree with you.

Since we have the regrole-type, I agree we should use it,
even though we need to cast, no biggie.

I realized my arguments were silly since I already exposed the class as 
regclass,
which has the same problem.

I'll send a new patch soon.

/Joel

Re: DROP relation IF EXISTS Docs and Tests - Bug Fix

2021-03-09 Thread David Steele

On 3/9/21 10:08 AM, David G. Johnston wrote:


On Tuesday, March 9, 2021, David Steele wrote:


Further, I think we should close this entry at the end of the CF if
it does not attract committer interest. Tom is not in favor of the
patch and it appears Alexander decided not to commit it.

Pavel re-reviewed it and was fine with ready-to-commit so that status 
seems fine.


Ah yes, that was my mistake.

Regards,
--
-David
da...@pgmasters.net




Lowering the ever-growing heap->pd_lower

2021-03-09 Thread Matthias van de Meent
Hi,

The heap AMs' pages only grow their pd_linp array, and never shrink
when trailing entries are marked unused. This means that up to 14% of
free space (=291 unused line pointers) on a page could be unusable for
data storage, which I think is a shame. With a patch in the works that
allows the line pointer array to grow up to one third of the size of
the page [0], it would be quite catastrophic for the available data
space on old-and-often-used pages if this could not ever be reused for
data.

The shrinking of the line pointer array is already common practice in
indexes (in which all LP_UNUSED items are removed), but this specific
implementation cannot be used for heap pages due to ItemId
invalidation. One available implementation, however, is that we
truncate the end of this array, as mentioned in [1]. There was a
warning at the top of PageRepairFragmentation about not removing
unused line pointers, but I believe that was about not removing
_intermediate_ unused line pointers (which would imply moving in-use
line pointers); as far as I know there is nothing that relies on only
growing page->pd_lower, and nothing keeping us from shrinking it
whilst holding a pin on the page.

Please find attached a fairly trivial patch which detects the last
unused entry on a page, and truncates the pd_linp array to that entry,
effectively freeing 4 bytes per line pointer truncated away (up to
1164 bytes for pages with MaxHeapTuplesPerPage unused lp_unused
lines).

One unexpected benefit from this patch is that the PD_HAS_FREE_LINES
hint bit optimization can now be false more often, increasing the
chances of not having to check the whole array to find an empty spot.

Note: This does _not_ move valid ItemIds, it only removes invalid
(unused) ItemIds from the end of the space reserved for ItemIds on a
page, keeping valid linepointers intact.


Enjoy,

Matthias van de Meent

[0] 
https://www.postgresql.org/message-id/flat/cad21aod0ske11fmw4jd4renawbmcw1wasvnwpjvw3tvqpoq...@mail.gmail.com
[1] 
https://www.postgresql.org/message-id/CAEze2Wjf42g8Ho%3DYsC_OvyNE_ziM0ZkXg6wd9u5KVc2nTbbYXw%40mail.gmail.com
From f9be3079cf0ff26b8ef603a9b0c8bc5d27561499 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent 
Date: Tue, 9 Mar 2021 14:42:52 +0100
Subject: [PATCH v1] Truncate a pages' line pointer array when it has trailing
 unused ItemIds.

This will allow reuse of what is effectively free space for data as well as
new line pointers, instead of keeping it reserved for line pointers only.

An additional benefit is that the HasFreeLinePointers hint-bit optimization
now doesn't hint for free line pointers at the end of the array, slightly
increasing the specificity of where the free lines are; and saving us from
needing to search to the end of the array if all other entries are already
filled.
---
 src/backend/storage/page/bufpage.c | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 9ac556b4ae..10d8f26ad0 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -672,7 +672,11 @@ compactify_tuples(itemIdCompact itemidbase, int nitems, Page page, bool presorte
  * PageRepairFragmentation
  *
  * Frees fragmented space on a page.
- * It doesn't remove unused line pointers! Please don't change this.
+ * It doesn't remove intermediate unused line pointers (that would mean
+ * moving ItemIds, and that would imply invalidating indexed values), but it
+ * does truncate the page->pd_linp array to the last unused line pointer, so
+ * that this space may also be reused for data, instead of only for line
+ * pointers.
  *
  * This routine is usable for heap pages only, but see PageIndexMultiDelete.
  *
@@ -691,6 +695,7 @@ PageRepairFragmentation(Page page)
 	int			nline,
 nstorage,
 nunused;
+	OffsetNumber lastUsed = InvalidOffsetNumber;
 	int			i;
 	Size		totallen;
 	bool		presorted = true;	/* For now */
@@ -724,6 +729,7 @@ PageRepairFragmentation(Page page)
 		lp = PageGetItemId(page, i);
 		if (ItemIdIsUsed(lp))
 		{
+			lastUsed = i;
 			if (ItemIdHasStorage(lp))
 			{
 itemidptr->offsetindex = i - 1;
@@ -771,6 +777,11 @@ PageRepairFragmentation(Page page)
 		compactify_tuples(itemidbase, nstorage, page, presorted);
 	}
 
+	if (lastUsed != nline) {
+		((PageHeader) page)->pd_lower = SizeOfPageHeaderData + (sizeof(ItemIdData) * lastUsed);
+		nunused = nunused - (nline - lastUsed);
+	}
+
 	/* Set hint bit for PageAddItem */
 	if (nunused > 0)
 		PageSetHasFreeLinePointers(page);
-- 
2.20.1



Re: DROP relation IF EXISTS Docs and Tests - Bug Fix

2021-03-09 Thread David G. Johnston
On Tuesday, March 9, 2021, David Steele  wrote:

>
> Further, I think we should close this entry at the end of the CF if it
> does not attract committer interest. Tom is not in favor of the patch and
> it appears Alexander decided not to commit it.
>

Pavel re-reviewed it and was fine with ready-to-commit so that status seems
fine.

Frankly, I am hoping for a bit more constructive feedback and even
collaboration from a committer, specifically Tom, on this one given the
outstanding user complaints received on the topic, our disagreement
regarding fixing it (which motivates the patch to better document and add
tests), and professional courtesy given to a fellow consistent community
contributor.

So, no, making it just go away because one of the dozens of committers
can’t make time to try and make it work doesn’t sit well with me.  If a
committer wants to actively reject the patch with an explanation then so be
it.

David J.


Re: row filtering for logical replication

2021-03-09 Thread Rahila Syed
Hi Euler,

Please find some comments below:

1. If the WHERE clause contains non-replica-identity columns, a delete
performed on a replicated row using
    DELETE FROM pub_tab WHERE repl_ident_col = n;
is not replicated, as logical replication has no information about whether
the column has to be filtered or not.
Shouldn't a warning be thrown in this case to notify the user that the
delete is not replicated?

2. Same for UPDATE: even if I update a row to match the quals on the
publisher, it is still not replicated to the subscriber (if the quals
contain non-replica-identity columns). I think for UPDATE at least, the
new value of the non-replica-identity column is available and can be used
to filter and replicate the update.

3. 0001.patch,
Why is the name of the existing ExclusionWhereClause node being changed, if
the exact same definition is being used?

For 0002.patch,
4.   +
 +   memset(lrel, 0, sizeof(LogicalRepRelation));

Is this needed? Apart from the above, the patch does not use or update lrel
at all in that function.

5.  PublicationRelationQual and PublicationTable have similar fields, can
PublicationTable
be used in place of PublicationRelationQual instead of defining a new
struct?

Thank you,
Rahila Syed


Re: authtype parameter in libpq

2021-03-09 Thread Peter Eisentraut

On 08.03.21 10:57, Peter Eisentraut wrote:

On 04.03.21 16:06, Daniel Gustafsson wrote:
authtype is completely dead in terms of reading back the value, to the
point of it being a memleak if it indeed was found in an environment
variable.

But I tend to think we should remove them both altogether (modulo ABI 
and API preservation).


No disagreement from me, the attached takes a stab at that to get an
idea what it would look like.  PQtty is left to maintain API stability
but the parameters are removed from the conn object as that's internal
to libpq.


This looks like the right idea to me.


committed, with some tweaks




Re: default result formats setting

2021-03-09 Thread David Steele

On 11/25/20 2:06 AM, Peter Eisentraut wrote:

On 2020-11-16 16:15, Andrew Dunstan wrote:

I think this is conceptually OK, although it feels a bit odd.

Might it be better to have the values as typename={binary,text} pairs
instead of oid={0,1} pairs, which are fairly opaque? That might make
things easier for things like UDTs where the oid might not be known or
constant.


Yes, type names would be better.  I was hesitant because of all the 
parsing work involved, but I bit the bullet and did it in the new patch.


To simplify the format, I changed the parameter so it's just a list of 
types that you want in binary, rather than type=value pairs.  If we ever 
want to add another format, we would revisit this, but it seems unlikely 
in the near future.


Also, I have changed the naming of the parameter since this is no longer 
the "default" but something you choose explicitly.  I'm thinking in the 
direction of "auto" mode for the naming.  Obviously, the name is easy to 
tweak in any case.


Andrew, Tom, does the latest patch address your concerns?

Regards,
--
-David
da...@pgmasters.net




Re: Any objections to implementing LogicalDecodeMessageCB for pgoutput?

2021-03-09 Thread David Steele

Hi David,

On 11/24/20 10:28 PM, Euler Taveira wrote:


I also reviewed your patch. This feature would be really useful for
replication scenarios. Supporting this feature means that you don't need
to use a table to pass messages from one node to another one. Here are a
few comments/ideas.


Do you know when you'll have a chance to look at Euler's suggestions? 
Also, have Andres' suggestions/concerns upthread been addressed?


Marked Waiting on Author.

Regards,
--
-David
da...@pgmasters.net




Re: Nicer error when connecting to standby with hot_standby=off

2021-03-09 Thread Fujii Masao




On 2021/03/09 23:19, James Coleman wrote:

On Tue, Mar 9, 2021 at 9:17 AM Alvaro Herrera  wrote:


On 2021-Mar-09, James Coleman wrote:


Yes, I think they both agreed on the "DETAIL:  Hot standby mode is
disabled." message, but that alternative meant not needing to add any
new signals and pm states, correct?


Ah, I see!  I was thinking that you still needed the state and signal in
order to print the correct message in hot-standby mode, but that's
(obviously!) wrong.  So you're right that no signal/state are needed.


Cool. And yes, I'm planning to update the patch soon.


+1. Thanks!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: shared-memory based stats collector

2021-03-09 Thread Fujii Masao




On 2021/03/09 16:51, Kyotaro Horiguchi wrote:

At Sat, 6 Mar 2021 00:32:07 +0900, Fujii Masao  
wrote in



On 2021/03/05 17:18, Kyotaro Horiguchi wrote:

At Thu, 21 Jan 2021 12:03:48 +0900 (JST), Kyotaro Horiguchi
 wrote in

Commit 960869da08 (database statistics) conflicted with this. Rebased.

I'm concerned about the behavior that pgstat_update_connstats calls
GetCurrentTimestamp() every time stats update happens (with intervals
of 10s-60s in this patch). But I didn't change that design since that
happens with about 0.5s intervals in master and the rate is largely
reduced in this patch, to make this patch simpler.

I stepped on my foot, and another commit coflicted. Just rebased.


Thanks for rebasing the patches!

I think that the 0003 patch is self-contained and useful; for example,
it enables us to monitor the archiver process in pg_stat_activity. So IMO
it's worth pushing the 0003 patch first.


I'm not sure the archiver process is worth real-time monitoring, but I
agree that the patch makes it possible.  Anyway it is required in this
patchset and I'm happy to see it committed beforehand.

Thanks for the review.


Here are the review comments for 0003 patch.

+   /* Archiver process's latch */
+   Latch  *archiverLatch;
+   /* Current shared estimate of appropriate spins_per_delay value */

The last line in the above seems not necessary.


Oops. It seems like a garbage after a past rebasing. Removed.


In proc.h, NUM_AUXILIARY_PROCS needs to be incremented.


Right.  Increased to 5 and rewrote the comment.


 /* --
  * Functions called from postmaster
  * --
  */
 extern int pgarch_start(void);

In pgarch.h, the above is not necessary.


Removed.


+extern void XLogArchiveWakeup(void);

This seems no longer necessary.

+extern void XLogArchiveWakeupStart(void);
+extern void XLogArchiveWakeupEnd(void);
+extern void XLogArchiveWakeup(void);

These seem also no longer necessary.


Sorry for many garbages. Removed all of them.


PgArchPID = 0;
if (!EXIT_STATUS_0(exitstatus))
-   LogChildExit(LOG, _("archiver process"),
-pid, exitstatus);
-   if (PgArchStartupAllowed())
-   PgArchPID = pgarch_start();
+   HandleChildCrash(pid, exitstatus,
+_("archiver 
process"));

I don't think that we should treat a non-zero exit condition as a crash;
it should be handled as before. Otherwise, when archive_command fails on a
signal, the archiver emits a FATAL error, which leads to a server restart.


Sounds reasonable. Now the archiver is treated the same way as the wal
receiver.  Specifically, exit(1) doesn't cause a server restart.


Thanks!

-   if (PgArchStartupAllowed())
-   PgArchPID = pgarch_start();

In the latest patch, why did you remove the code to restart a new archiver
in reaper()? When the archiver dies, I think a new archiver should be restarted
like the current reaper() does. Otherwise, the restart of the archiver can be
delayed until the next cycle of ServerLoop, which may take time.





- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, archiver or background worker.
   *
   * The objectives here are to clean up our local state about the child
   * process, and to signal all other remaining children to quickdie.
@@ -3609,6 +3606,18 @@ HandleChildCrash(int pid, int exitstatus, const
char *procname)
signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
}
  + /* Take care of the archiver too */
+   if (pid == PgArchPID)
+   PgArchPID = 0;
+   else if (PgArchPID != 0 && take_action)
+   {
+   ereport(DEBUG2,
+   (errmsg_internal("sending %s to process %d",
+(SendStop ? "SIGSTOP" : 
"SIGQUIT"),
+(int) 
PgArchPID)));
+   signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
+   }
+

Same as above.


Mmm.  In the first place, I found that I forgot to remove existing
code to handle archiver... Removed it instead of the above, which is
added by this patch.  Since the process becomes an auxiliary process,
no reason to differentiate from other auxiliary processes in handling?


Yes, ok.





In xlogarchive.c, "#include "storage/pmsignal.h"" is no longer
necessary.


Removed.


pgarch_forkexec() should be removed from pgarch.c because it's no
longer used.


Right. Removed. EXEC_BACKEND still fails for another reason, a
prototype mismatch of PgArchiverMain. Fixed it together.


 /* 
  * Public functions called from postmaster follow
  * 

Re: Nicer error when connecting to standby with hot_standby=off

2021-03-09 Thread James Coleman
On Tue, Mar 9, 2021 at 9:17 AM Alvaro Herrera  wrote:
>
> On 2021-Mar-09, James Coleman wrote:
>
> > Yes, I think they both agreed on the "DETAIL:  Hot standby mode is
> > disabled." message, but that alternative meant not needing to add any
> > new signals and pm states, correct?
>
> Ah, I see!  I was thinking that you still needed the state and signal in
> order to print the correct message in hot-standby mode, but that's
> (obviously!) wrong.  So you're right that no signal/state are needed.

Cool. And yes, I'm planning to update the patch soon.

Thanks,
James




Re: Nicer error when connecting to standby with hot_standby=off

2021-03-09 Thread Alvaro Herrera
On 2021-Mar-09, James Coleman wrote:

> Yes, I think they both agreed on the "DETAIL:  Hot standby mode is
> disabled." message, but that alternative meant not needing to add any
> new signals and pm states, correct?

Ah, I see!  I was thinking that you still needed the state and signal in
order to print the correct message in hot-standby mode, but that's
(obviously!) wrong.  So you're right that no signal/state are needed.

Thanks

-- 
Álvaro Herrera   Valdivia, Chile
Si no sabes adonde vas, es muy probable que acabes en otra parte.




Re: Enhance traceability of wal_level changes for backup management

2021-03-09 Thread David Steele

On 3/7/21 9:45 PM, osumi.takami...@fujitsu.com wrote:

On Sun, Mar 7, 2021 3:48 AM Peter Eisentraut 
 wrote:

On 28.01.21 01:44, osumi.takami...@fujitsu.com wrote:

(1) writing the time or LSN in the control file to indicate
when/where wal_level is changed to 'minimal'
from upper level to invalidate the old backups or make alerts to users.

I attached the first patch which implementes this idea.
It was aligned by pgindent and shows no regression.


It's not clear to me what this is supposed to accomplish.  I read the thread,
but it's still not clear.
What is one supposed to do with this information?

OK. The basic idea is to enable backup management
tools to recognize a wal_level drop between *snapshots*.
When you have a snapshot of the cluster at one time and another one
at a different time, this new parameter lets you see whether
anything that causes a discontinuity due to the drop happened
between the two snapshots, without having to look at the WAL
in between.


As a backup software author, I don't see this feature as very useful.

The problem is that there are lots of ways for WAL to go missing so  
monitoring the WAL archive for gaps is essential and this feature would  
not replace that requirement. The only extra information you'd get is  
the ability to classify the most recent gap as "intentional", maybe.


So, -1 from me.

Regards,
--
-David
da...@pgmasters.net




Re: Nicer error when connecting to standby with hot_standby=off

2021-03-09 Thread Magnus Hagander
On Tue, Mar 9, 2021 at 3:07 PM Alvaro Herrera  wrote:
>
> On 2021-Mar-09, James Coleman wrote:
>
> > On Tue, Mar 9, 2021 at 8:47 AM Alvaro Herrera  
> > wrote:
> > >
> > > On 2021-Mar-07, Magnus Hagander wrote:
> > >
> > > > On Sun, Mar 7, 2021 at 3:39 PM Fujii Masao 
> > > >  wrote:
> > >
> > > Great, so we're agreed on the messages to emit.  James, are you updating
> > > your patch, considering Fujii's note about the new signal and pmstate
> > > that need to be added?
> >
> > Perhaps I'm missing something, but I was under the impression the
> > "prefer the former message" meant we were not adding a new signal and
> > pmstate?
>
> Eh, I read that differently.  I was proposing two options for the DETAIL
> line in that case:
>
> >  DETAIL:  Hot standby mode is disabled.
> >  or maybe
> >  DETAIL:  Consistent state has not yet been reached, and hot standby mode 
> > is disabled.
>
> and both Fujii and Magnus said they prefer the first option over the
> second option.  I don't read any of them as saying that they would like
> to do something else (including not doing anything).
>
> Maybe I misinterpreted them?

That is indeed what I meant as well.

The reference to "the former" as being the "first or the two new
options", not the "old option". That is, "DETAIL:  Hot standby mode is
disabled.".

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: Nicer error when connecting to standby with hot_standby=off

2021-03-09 Thread James Coleman
On Tue, Mar 9, 2021 at 9:07 AM Alvaro Herrera  wrote:
>
> On 2021-Mar-09, James Coleman wrote:
>
> > On Tue, Mar 9, 2021 at 8:47 AM Alvaro Herrera  
> > wrote:
> > >
> > > On 2021-Mar-07, Magnus Hagander wrote:
> > >
> > > > On Sun, Mar 7, 2021 at 3:39 PM Fujii Masao 
> > > >  wrote:
> > >
> > > Great, so we're agreed on the messages to emit.  James, are you updating
> > > your patch, considering Fujii's note about the new signal and pmstate
> > > that need to be added?
> >
> > Perhaps I'm missing something, but I was under the impression the
> > "prefer the former message" meant we were not adding a new signal and
> > pmstate?
>
> Eh, I read that differently.  I was proposing two options for the DETAIL
> line in that case:
>
> >  DETAIL:  Hot standby mode is disabled.
> >  or maybe
> >  DETAIL:  Consistent state has not yet been reached, and hot standby mode 
> > is disabled.
>
> and both Fujii and Magnus said they prefer the first option over the
> second option.  I don't read any of them as saying that they would like
> to do something else (including not doing anything).
>
> Maybe I misinterpreted them?

Yes, I think they both agreed on the "DETAIL:  Hot standby mode is
disabled." message, but that alternative meant not needing to add any
new signals and pm states, correct?

Thanks,
James




Re: Nicer error when connecting to standby with hot_standby=off

2021-03-09 Thread Alvaro Herrera
On 2021-Mar-09, James Coleman wrote:

> On Tue, Mar 9, 2021 at 8:47 AM Alvaro Herrera  wrote:
> >
> > On 2021-Mar-07, Magnus Hagander wrote:
> >
> > > On Sun, Mar 7, 2021 at 3:39 PM Fujii Masao  
> > > wrote:
> >
> > Great, so we're agreed on the messages to emit.  James, are you updating
> > your patch, considering Fujii's note about the new signal and pmstate
> > that need to be added?
> 
> Perhaps I'm missing something, but I was under the impression the
> "prefer the former message" meant we were not adding a new signal and
> pmstate?

Eh, I read that differently.  I was proposing two options for the DETAIL
line in that case:

>  DETAIL:  Hot standby mode is disabled.
>  or maybe
>  DETAIL:  Consistent state has not yet been reached, and hot standby mode is 
> disabled.

and both Fujii and Magnus said they prefer the first option over the
second option.  I don't read any of them as saying that they would like
to do something else (including not doing anything).

Maybe I misinterpreted them?

-- 
Álvaro Herrera   Valdivia, Chile




Re: Nicer error when connecting to standby with hot_standby=off

2021-03-09 Thread James Coleman
On Tue, Mar 9, 2021 at 8:47 AM Alvaro Herrera  wrote:
>
> On 2021-Mar-07, Magnus Hagander wrote:
>
> > On Sun, Mar 7, 2021 at 3:39 PM Fujii Masao  
> > wrote:
> >
> > > > Here's an idea:
> > > >
> > > > * hot_standby=on, before reaching consistent state
> > > >FATAL:  database is not accepting connections
> > > >DETAIL:  Consistent state has not yet been reached.
> > > >
> > > > * hot_standby=off, past consistent state
> > > >FATAL:  database is not accepting connections
> > > >DETAIL:  Hot standby mode is disabled.
> > > >
> > > > * hot_standby=off, before reaching consistent state
> > > >FATAL:  database is not accepting connections
> [...]
> > > >DETAIL:  Hot standby mode is disabled.
>
> > > I prefer the former message. Because the latter message meams that
> > > we need to output the different messages based on whether the consistent
> > > state is reached or not, and the followings would be necessary to 
> > > implement
> > > that. This looks a bit overkill to me against the purpose, at least for 
> > > me.
> >
> > Agreed. If hot standby is off, why would the admin care about whether
> > it's consistent yet or not?
>
> Great, so we're agreed on the messages to emit.  James, are you updating
> your patch, considering Fujii's note about the new signal and pmstate
> that need to be added?

Perhaps I'm missing something, but I was under the impression the
"prefer the former message" meant we were not adding a new signal and
pmstate?

James




Re: Nicer error when connecting to standby with hot_standby=off

2021-03-09 Thread Alvaro Herrera
On 2021-Mar-07, Magnus Hagander wrote:

> On Sun, Mar 7, 2021 at 3:39 PM Fujii Masao  
> wrote:
>
> > > Here's an idea:
> > >
> > > * hot_standby=on, before reaching consistent state
> > >FATAL:  database is not accepting connections
> > >DETAIL:  Consistent state has not yet been reached.
> > >
> > > * hot_standby=off, past consistent state
> > >FATAL:  database is not accepting connections
> > >DETAIL:  Hot standby mode is disabled.
> > >
> > > * hot_standby=off, before reaching consistent state
> > >FATAL:  database is not accepting connections
[...]
> > >DETAIL:  Hot standby mode is disabled.

> > I prefer the former message. Because the latter message meams that
> > we need to output the different messages based on whether the consistent
> > state is reached or not, and the followings would be necessary to implement
> > that. This looks a bit overkill to me against the purpose, at least for me.
> 
> Agreed. If hot standby is off, why would the admin care about whether
> it's consistent yet or not?

Great, so we're agreed on the messages to emit.  James, are you updating
your patch, considering Fujii's note about the new signal and pmstate
that need to be added?
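
(For reference, a minimal sketch of how the agreed-upon message could be
raised on the backend side; the actual patch may place and word this
differently:)

	ereport(FATAL,
			(errcode(ERRCODE_CANNOT_CONNECT_NOW),
			 errmsg("database is not accepting connections"),
			 errdetail("Hot standby mode is disabled.")));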

-- 
Álvaro Herrera39°49'30"S 73°17'W




Re: DROP relation IF EXISTS Docs and Tests - Bug Fix

2021-03-09 Thread David Steele

Hi David,

On 11/23/20 3:31 PM, Anastasia Lubennikova wrote:

On 30.09.2020 05:00, David G. Johnston wrote:


v5 attached, looking at this fresh and with some comments to consider.

I ended up just combining both patches into one.

I did away with the glossary changes altogether, and the invention of 
the new term.  I ended up limiting "type's type" to just domain usage 
but did a couple of additional tweaks that tried to treat domains as 
not being actual types even though, at least in PostgreSQL, they are 
(at least as far as DROP TYPE is concerned - and since I don't have 
any understanding of the SQL Standard's decision to separate out 
create domain and create type I'll just stick to the implementation in 
front of me.)


Reminder from a CF manager, as this thread was inactive for a while.
Alexander, I see you signed up as a committer for this entry. Are you 
going to continue this work?


This patch was marked Ready for Committer on July 14 but received a 
significant update on August 30. So, I have marked it Needs Review.


Further, I think we should close this entry at the end of the CF if it 
does not attract committer interest. Tom is not in favor of the patch 
and it appears Alexander decided not to commit it.


Regards,
--
-David
da...@pgmasters.net




RE: non-HOT update not looking at FSM for large tuple update

2021-03-09 Thread Floris Van Nee
Hi,

> 
> This patch fails to consider that len may be bigger than MaxHeapTupleSize *
> 0.98, which in this case triggers a reproducible
> PANIC:

Good catch! I've adapted the patch with your suggested fix.

> 
> One different question I have, though, is why we can't "just" teach vacuum
> to clean up trailing unused line pointers. As in, can't we trim the line 
> pointer
> array when vacuum detects that the trailing line pointers on the page are all
> unused?
> 
> The only documentation that I could find that this doesn't happen is in the
> comment on PageIndexTupleDelete and PageRepairFragmentation, both not
> very descriptive on why we can't shrink the page->pd_linp array. One is
> "Unlike heap pages, we compact out the line pointer for the removed tuple."
> (Jan. 2002), and the other is "It doesn't remove unused line pointers! Please
> don't change this." (Oct. 2000), but I can't seem to find the documentation /
> conversations on the implications that such shrinking would have.
> 

This is an interesting alternative indeed. I also can't find any 
documentation/conversation about this and the message is rather cryptic.
Hopefully someone on the list still remembers the reasoning behind this rather 
cryptic comment in PageRepairFragmentation.

-Floris


v3-Allow-inserting-tuples-into-almost-empty-pages.patch
Description: v3-Allow-inserting-tuples-into-almost-empty-pages.patch


Re: cleanup temporary files after crash

2021-03-09 Thread Euler Taveira
On Tue, Mar 9, 2021, at 9:31 AM, Michael Paquier wrote:
> On Tue, Mar 09, 2021 at 02:28:43AM +0100, Tomas Vondra wrote:
> > Let's move this patch forward. Based on the responses, I agree the
> > default behavior should be to remove the temp files, and I think we
> > should have the GUC (on the off chance that someone wants to preserve
> > the temporary files for debugging or whatever other reason).
> 
> Thanks for taking care of this.  I am having some second-thoughts
> about changing this behavior by default, still that's much more useful
> this way.
> 
> > I propose to rename the GUC to remove_temp_files_after_crash, I think
> > "remove" is a bit clearer than "cleanup". I've also reworded the sgml
> > docs a little bit.
> 
> "remove" sounds fine to me.
+1.

> > Attached is a patch with those changes. Barring objections, I'll get
> > this committed in the next couple days.
> 
> +When set to on, PostgreSQL will automatically
> Nit: using a  markup for the "on" value.
> 
> +#remove_temp_files_after_crash = on# remove temporary files after
> +#  # backend crash?
> The indentation of the second line is incorrect here (Incorrect number
> of spaces in tabs perhaps?), and there is no need for the '#' at the
> beginning of the line.
> --
> Michael
That was my fault. Editor automatically added #.

I'm not sure Tomas will include the tests. If so, the terminology should be 
adjusted too.

+++ b/src/test/recovery/t/022_crash_temp_files.pl
@@ -0,0 +1,194 @@
+#
+# Test cleanup of temporary files after a crash.

s/cleanup/remove/


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-09 Thread Joel Jacobson
On Tue, Mar 9, 2021, at 13:16, Pavel Stehule wrote:
> On Tue, Mar 9, 2021 at 11:32, Joel Jacobson wrote:
>> __On Thu, Mar 4, 2021, at 16:40, Tom Lane wrote:
>>> My experience with working with parallel arrays in SQL has been unpleasant.
>> 
>> Could you please give an example on such an unpleasant experience?
> 
> It was a more complex application with 3D data for some points in a 2D array.
> Everywhere there was a[d, 0], a[d, 1], a[d, 2] instead of a[d] or a[d].x, ...

Not sure I understand, but my question was directed to Tom, who wrote about his 
experiences in a previous message up thread.

Tom - can you please give details on your unpleasant experiences with parallel 
arrays?

/Joel

Re: ResourceOwner refactoring

2021-03-09 Thread Heikki Linnakangas

On 08/03/2021 18:47, Ibrar Ahmed wrote:

The patchset does not apply successfully, there are some hunk failures.

http://cfbot.cputube.org/patch_32_2834.log 



v6-0002-Make-resowners-more-easily-extensible.patch

1 out of 6 hunks FAILED -- saving rejects to file 
src/backend/utils/cache/plancache.c.rej
2 out of 15 hunks FAILED -- saving rejects to file 
src/backend/utils/resowner/resowner.c.rej



Can we get a rebase?


Here you go.

- Heikki
>From 8b852723212c51d3329267fcbf8e7a28f49dbbdd Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 13 Jan 2021 12:21:28 +0200
Subject: [PATCH v7 1/3] Move a few ResourceOwnerEnlarge() calls for safety and
 clarity.

These are functions where quite a lot of things happen between the
ResourceOwnerEnlarge and ResourceOwnerRemember calls. It's important that
there are no unrelated ResourceOwnerRemember() calls in the code
inbetween, otherwise the entry reserved by the ResourceOwnerEnlarge() call
might be used up by the intervening ResourceOwnerRemember() and not be
available at the intended ResourceOwnerRemember() call anymore. The longer
the code path between them is, the harder it is to verify that.

In bufmgr.c, there is a function similar to ResourceOwnerEnlarge(),
to ensure that the private refcount array has enough space. The
ReservePrivateRefCountEntry() calls, analogous to ResourceOwnerEnlarge(),
were made at different places than the ResourceOwnerEnlarge() calls.
Move the ResourceOwnerEnlarge() calls together with the
ReservePrivateRefCountEntry() calls for consistency.
---
 src/backend/storage/buffer/bufmgr.c   | 39 +++
 src/backend/storage/buffer/localbuf.c |  3 +++
 src/backend/utils/cache/catcache.c| 13 ++---
 3 files changed, 28 insertions(+), 27 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 561c212092f..cde869e7d64 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -738,9 +738,6 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 	*hit = false;
 
-	/* Make sure we will have room to remember the buffer pin */
-	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
-
 	isExtend = (blockNum == P_NEW);
 
 	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
@@ -1093,9 +1090,11 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	{
 		/*
 		 * Ensure, while the spinlock's not yet held, that there's a free
-		 * refcount entry.
+		 * refcount entry and that the resource owner has room to remember the
+		 * pin.
 		 */
 		ReservePrivateRefCountEntry();
+		ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
 
 		/*
 		 * Select a victim buffer.  The buffer is returned with its header
@@ -1595,8 +1594,6 @@ ReleaseAndReadBuffer(Buffer buffer,
  * taking the buffer header lock; instead update the state variable in loop of
  * CAS operations. Hopefully it's just a single CAS.
  *
- * Note that ResourceOwnerEnlargeBuffers must have been done already.
- *
  * Returns true if buffer is BM_VALID, else false.  This provision allows
  * some callers to avoid an extra spinlock cycle.
  */
@@ -1607,6 +1604,8 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 	bool		result;
 	PrivateRefCountEntry *ref;
 
+	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
 	ref = GetPrivateRefCountEntry(b, true);
 
 	if (ref == NULL)
@@ -1687,7 +1686,8 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
  * The spinlock is released before return.
  *
  * As this function is called with the spinlock held, the caller has to
- * previously call ReservePrivateRefCountEntry().
+ * previously call ReservePrivateRefCountEntry() and
+ * ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
  *
  * Currently, no callers of this function want to modify the buffer's
  * usage_count at all, so there's no need for a strategy parameter.
@@ -1857,9 +1857,6 @@ BufferSync(int flags)
 	int			mask = BM_DIRTY;
 	WritebackContext wb_context;
 
-	/* Make sure we can handle the pin inside SyncOneBuffer */
-	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
-
 	/*
 	 * Unless this is a shutdown checkpoint or we have been explicitly told,
 	 * we write only permanent, dirty buffers.  But at shutdown or end of
@@ -2334,9 +2331,6 @@ BgBufferSync(WritebackContext *wb_context)
 	 * requirements, or hit the bgwriter_lru_maxpages limit.
 	 */
 
-	/* Make sure we can handle the pin inside SyncOneBuffer */
-	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
-
 	num_to_scan = bufs_to_lap;
 	num_written = 0;
 	reusable_buffers = reusable_buffers_est;
@@ -2418,8 +2412,6 @@ BgBufferSync(WritebackContext *wb_context)
  *
  * (BUF_WRITTEN could be set in error if FlushBuffer finds the buffer clean
  * after locking it, but we don't care all that much.)
- *
- * Note: caller must have done ResourceOwnerEnlargeBuffers.
  */
 static int
 SyncOneBuffer(int buf_id, bool 

Re: cleanup temporary files after crash

2021-03-09 Thread Michael Paquier
On Tue, Mar 09, 2021 at 02:28:43AM +0100, Tomas Vondra wrote:
> Let's move this patch forward. Based on the responses, I agree the
> default behavior should be to remove the temp files, and I think we
> should have the GUC (on the off chance that someone wants to preserve
> the temporary files for debugging or whatever other reason).

Thanks for taking care of this.  I am having some second-thoughts
about changing this behavior by default, still that's much more useful
this way.

> I propose to rename the GUC to remove_temp_files_after_crash, I think
> "remove" is a bit clearer than "cleanup". I've also reworded the sgml
> docs a little bit.

"remove" sounds fine to me.

> Attached is a patch with those changes. Barring objections, I'll get
> this committed in the next couple days.

+When set to on, PostgreSQL will 
automatically
Nit: using a <literal> markup for the "on" value.

+#remove_temp_files_after_crash = on# remove temporary files after
+#  # backend crash?
The indentation of the second line is incorrect here (Incorrect number
of spaces in tabs perhaps?), and there is no need for the '#' at the
beginning of the line.
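
Concretely, following the usual postgresql.conf.sample convention, the sample
entry would look something like this (continuation line indented with tabs and
without a leading '#'):

#remove_temp_files_after_crash = on	# remove temporary files after
					# backend crash?
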
--
Michael




Re: SQL-standard function body

2021-03-09 Thread Peter Eisentraut

On 05.03.21 06:58, Jaime Casanova wrote:

I was making some tests with this patch and found this problem:

"""
CREATE OR REPLACE FUNCTION public.make_table()
  RETURNS void
  LANGUAGE sql
BEGIN ATOMIC
   CREATE TABLE created_table AS SELECT * FROM int8_tbl;
END;
ERROR:  unrecognized token: "?"
CONTEXT:  SQL function "make_table"
"""


I see.  The problem is that we don't have serialization and 
deserialization support for most utility statements.  I think I'll need 
to add that eventually.  For now, I have added code to prevent utility 
statements.  I think it's still useful that way for now.


From 29de4ec1ae12d3f9c1f6a31cf626a19b2421ae7a Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 9 Mar 2021 13:25:39 +0100
Subject: [PATCH v9] SQL-standard function body

This adds support for writing CREATE FUNCTION and CREATE PROCEDURE
statements for language SQL with a function body that conforms to the
SQL standard and is portable to other implementations.

Instead of the PostgreSQL-specific AS $$ string literal $$ syntax,
this allows writing out the SQL statements making up the body
unquoted, either as a single statement:

CREATE FUNCTION add(a integer, b integer) RETURNS integer
LANGUAGE SQL
RETURN a + b;

or as a block

CREATE PROCEDURE insert_data(a integer, b integer)
LANGUAGE SQL
BEGIN ATOMIC
  INSERT INTO tbl VALUES (a);
  INSERT INTO tbl VALUES (b);
END;

The function body is parsed at function definition time and stored as
expression nodes in a new pg_proc column prosqlbody.  So at run time,
no further parsing is required.

However, this form does not support polymorphic arguments, because
there is no more parse analysis done at call time.

Dependencies between the function and the objects it uses are fully
tracked.

A new RETURN statement is introduced.  This can only be used inside
function bodies.  Internally, it is treated much like a SELECT
statement.

psql needs some new intelligence to keep track of function body
boundaries so that it doesn't send off statements when it sees
semicolons that are inside a function body.

Also, per SQL standard, LANGUAGE SQL is the default, so it does not
need to be specified anymore.

Discussion: 
https://www.postgresql.org/message-id/flat/1c11f1eb-f00c-43b7-799d-2d44132c0...@2ndquadrant.com
---
 doc/src/sgml/catalogs.sgml|  10 +
 doc/src/sgml/ref/create_function.sgml | 126 +-
 doc/src/sgml/ref/create_procedure.sgml|  62 -
 src/backend/catalog/pg_aggregate.c|   1 +
 src/backend/catalog/pg_proc.c | 116 +
 src/backend/commands/aggregatecmds.c  |   2 +
 src/backend/commands/functioncmds.c   | 130 --
 src/backend/commands/typecmds.c   |   4 +
 src/backend/executor/functions.c  |  79 +++---
 src/backend/nodes/copyfuncs.c |  15 ++
 src/backend/nodes/equalfuncs.c|  13 +
 src/backend/nodes/outfuncs.c  |  12 +
 src/backend/nodes/readfuncs.c |   1 +
 src/backend/optimizer/util/clauses.c  | 126 +++---
 src/backend/parser/analyze.c  |  35 +++
 src/backend/parser/gram.y | 129 +++---
 src/backend/tcop/postgres.c   |   3 +-
 src/backend/utils/adt/ruleutils.c | 132 +-
 src/bin/pg_dump/pg_dump.c |  45 +++-
 src/bin/psql/describe.c   |  15 +-
 src/fe_utils/psqlscan.l   |  23 +-
 src/include/catalog/pg_proc.dat   |   4 +
 src/include/catalog/pg_proc.h |   6 +-
 src/include/commands/defrem.h |   2 +
 src/include/executor/functions.h  |  15 ++
 src/include/fe_utils/psqlscan_int.h   |   2 +
 src/include/nodes/nodes.h |   1 +
 src/include/nodes/parsenodes.h|  13 +
 src/include/parser/kwlist.h   |   2 +
 src/include/tcop/tcopprot.h   |   1 +
 src/interfaces/ecpg/preproc/ecpg.addons   |   6 +
 src/interfaces/ecpg/preproc/ecpg.trailer  |   4 +-
 .../regress/expected/create_function_3.out| 231 +-
 .../regress/expected/create_procedure.out |  65 +
 src/test/regress/sql/create_function_3.sql|  96 +++-
 src/test/regress/sql/create_procedure.sql |  33 +++
 36 files changed, 1346 insertions(+), 214 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index b1de6d0674..146bf0f0b0 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -5980,6 +5980,16 @@ pg_proc Columns
   
  
 
+ 
+  
+   prosqlbody pg_node_tree
+  
+  
+   Pre-parsed SQL function body.  This will be used for language SQL
+   functions if the body is not specified as a string constant.
+   
+ 
+
  
   
proconfig text[]
diff --git 

Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-09 Thread Pavel Stehule
On Tue, Mar 9, 2021 at 11:32 AM Joel Jacobson  wrote:

> On Thu, Mar 4, 2021, at 16:40, Tom Lane wrote:
>
> My experience with working with parallel arrays in SQL has been unpleasant.
>
>
> Could you please give an example of such an unpleasant experience?
>

It was a more complex application with 3D data for a set of points stored in a
2D array. Everywhere there was a[d, 0], a[d, 1], a[d, 2] instead of a[d] or
a[d].x, ...



> I can see a problem if the arrays could possibly have different
> dimensionality/cardinality,
> but regexp_positions() could guarantee they won't, so I don't see a
> problem here,
> but there is probably something I'm missing here?
>

I think the functions based on arrays can work, why not. But the
semantics are lost.
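
To illustrate what gets lost: with parallel arrays the pairing between
elements lives only in the shared subscript, so a consumer ends up zipping
the arrays by position, for example (a made-up query, not from the thread):

SELECT starts[i] AS start, lengths[i] AS length
FROM (SELECT ARRAY[2,7] AS starts, ARRAY[3,4] AS lengths) AS m,
     generate_subscripts(m.starts, 1) AS i;

whereas a composite or range value keeps each start/length pair together.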


> /Joel
>


Outdated comments about proc->sem in lwlock.c

2021-03-09 Thread Thomas Munro
Hi,

In passing I noticed that lwlock.c contains 3 comments about bogus
wakeups due to sharing proc->sem with the heavyweight lock manager and
ProcWaitForSignal.  Commit 675f55e (9.5) switched those things
from proc->sem to proc->procLatch.  ProcArrayGroupClearXid() and
TransactionGroupUpdateXidStatus() also use proc->sem though, and I
haven't studied how those might overlap with LWLockWait(), so I'm
not sure what change to suggest.




Questions about CommandIsReadOnly

2021-03-09 Thread houzj.f...@fujitsu.com
Hi hackers,

When reading the code, I found that in the function CommandIsReadOnly [1],
"select for update/share" is treated as "not read only".
[1]-
if (pstmt->rowMarks != NIL)
return false;   /* SELECT FOR [KEY] UPDATE/SHARE */
-

And from the comment [2], I think it means we need to do a
CommandCounterIncrement (CCI) for "select for update/share". I am not very
familiar with this; is there some reason that we have to do CCI for
"select for update/share"? Or did I misunderstand?

[2]-
* the query must be *in truth* read-only, because the caller wishes
* not to do CommandCounterIncrement for it.
-

Best regards,
houzj






Re: [POC] verifying UTF-8 using SIMD instructions

2021-03-09 Thread John Naylor
On Tue, Mar 9, 2021 at 5:00 AM Amit Khandekar 
wrote:
>
> Hi,
>
> Just a quick question before I move on to review the patch ... The
> improvement looks like it is only meant for x86 platforms.

Actually it's meant to be faster for all platforms, since the C fallback is
quite a bit different from HEAD. I've found it to be faster on ppc64le. An
earlier version of the patch was a loser on 32-bit Arm because of alignment
issues, but if you could run the test script attached to [1] on 64-bit Arm,
I'd be curious to see how it does on 0002, and whether 0003 and 0004 make
things better or worse. If there is trouble building on non-x86 platforms,
I'd want to fix that also.

(Note: 0001 is not my patch, and I just include it for the tests)
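
For anyone reading along without the patches applied, the usual shape of the
ASCII fast path in such a C fallback is to check a machine word at a time
before dropping back to the byte-wise UTF-8 validator. A generic sketch, not
the actual code in 0002:

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* length of the leading pure-ASCII prefix of s */
static size_t
ascii_prefix_len(const unsigned char *s, size_t len)
{
	size_t		i = 0;

	while (i + sizeof(uint64_t) <= len)
	{
		uint64_t	chunk;

		memcpy(&chunk, s + i, sizeof(chunk));
		if (chunk & UINT64_C(0x8080808080808080))
			break;				/* some byte has its high bit set */
		i += sizeof(chunk);
	}
	while (i < len && s[i] < 0x80)
		i++;
	return i;
}

Everything past that prefix then has to go through the full multibyte checks.
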

> Can this be
> done in a portable way by arranging for auto-vectorization ? Something
> like commit 88709176236caf. This way it would benefit other platforms
> as well.

I'm fairly certain that the author of a compiler capable of doing that in
this case would be eligible for some kind of AI prize. :-)

[1]
https://www.postgresql.org/message-id/06d45421-61b8-86dd-e765-f1ce527a5...@iki.fi
--
John Naylor
EDB: http://www.enterprisedb.com


Enlarge IOS vm cache

2021-03-09 Thread Arseny Sher
Hi,

Our customer experienced a significant slowdown on queries involving
Index Only Scan. As it turned out, the problem was constant pin-unpin of
the visibility map page. IOS caches only one vm page; with 2 bits per heap
block, one vm page covers 8192 * 8 / 2 = 32768 heap blocks, i.e.
32768 * 8192 bytes = 256 MB of data. If the table is larger and the order of
access (index order) doesn't match the order of the data, the vm page will be
replaced for each tuple processed. That's costly. The attached ios.sql script
emulates this worst-case behaviour. On current master, the select takes

[local]:5432 ars@postgres:21052=# explain analyse select * from test order by 
id;
QUERY PLAN
---
 Index Only Scan using test_idx on test  (cost=0.44..1159381.24 rows=59013120 
width=8) (actual time=0.015..9094.532 rows=59013120 loops=1)
   Heap Fetches: 0
 Planning Time: 0.043 ms
 Execution Time: 10508.576 ms


The attached straightforward patch increases the cache to 64 pages (16 GB of
data). With it, we get

[local]:5432 ars@postgres:17427=# explain analyse select * from test order by 
id;
QUERY PLAN
---
 Index Only Scan using test_idx on test  (cost=0.44..1159381.24 rows=59013120 
width=8) (actual time=0.040..3469.299 rows=59013120 loops=1)
   Heap Fetches: 0
 Planning Time: 0.118 ms
 Execution Time: 4871.124 ms

(I believe the whole index is cached in these tests)

You might say 16 GB is also a somewhat arbitrary border. Well, it is. We
could make it GUC-able, but I'm not sure about that, as the setting is rather
low-level; at the same time, having several dozen additionally
pinned buffers doesn't sound too criminal, i.e. I doubt there is a real
risk of "no unpinned buffers available" or the like (e.g. even the default
32MB of shared_buffers contains 4096 pages). However, forcing IOS to be
inefficient whenever the table is larger is also silly. Any thoughts?

(the code is by K. Knizhnik, testing by M. Zhilin and R. Zharkov; I've
only polished things up)
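
In case the attachment gets lost, the same worst case can be reproduced with
something along these lines (not the attached ios.sql, just the same idea: a
table whose heap order is uncorrelated with the index order, so consecutive
index entries fall under different visibility map pages; scale the row count
down if needed):

CREATE TABLE test (id int8);
INSERT INTO test
  SELECT i FROM generate_series(1, 59013120) AS g(i) ORDER BY random();
CREATE INDEX test_idx ON test (id);
VACUUM (ANALYZE) test;      -- set the all-visible bits so IOS can skip heap fetches
SET enable_seqscan = off;
EXPLAIN ANALYZE SELECT * FROM test ORDER BY id;
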



ios.sql
Description: application/sql
commit 3726b6ffdba5461dcbf82cce5de13760a2642468
Author: Konstantin Knizhnik 
Date:   Fri Jan 15 12:57:46 2021 +0300

Increase IOS vm cache.

diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index e198df65d82..98e249b0c2a 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -99,24 +99,6 @@
 
 /*#define TRACE_VISIBILITYMAP */
 
-/*
- * Size of the bitmap on each visibility map page, in bytes. There's no
- * extra headers, so the whole page minus the standard page header is
- * used for the bitmap.
- */
-#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
-
-/* Number of heap blocks we can represent in one byte */
-#define HEAPBLOCKS_PER_BYTE (BITS_PER_BYTE / BITS_PER_HEAPBLOCK)
-
-/* Number of heap blocks we can represent in one visibility map page. */
-#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)
-
-/* Mapping from heap block number to the right bit in the visibility map */
-#define HEAPBLK_TO_MAPBLOCK(x) ((x) / HEAPBLOCKS_PER_PAGE)
-#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
-#define HEAPBLK_TO_OFFSET(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)
-
 /* Masks for counting subsets of bits in the visibility map. */
 #define VISIBLE_MASK64	UINT64CONST(0x5555555555555555) /* The lower bit of each
 		 * bit pair */
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 0754e28a9aa..47c6e8929a3 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -101,7 +101,9 @@ IndexOnlyNext(IndexOnlyScanState *node)
 
 		/* Set it up for index-only scan */
 		node->ioss_ScanDesc->xs_want_itup = true;
-		node->ioss_VMBuffer = InvalidBuffer;
+
+		for (int i = 0; i < VMBUF_SIZE; i++)
+			node->ioss_VMBuffer[i] = InvalidBuffer;
 
 		/*
 		 * If no run-time keys to calculate or they are ready, go ahead and
@@ -121,6 +123,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
 	while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
 	{
 		bool		tuple_from_heap = false;
+		Buffer		*vm_buf;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -158,9 +161,11 @@ IndexOnlyNext(IndexOnlyScanState *node)
 		 * It's worth going through this complexity to avoid needing to lock
 		 * the VM buffer, which could cause significant contention.
 		 */
+		vm_buf = &node->ioss_VMBuffer[HEAPBLK_TO_MAPBLOCK(
+			ItemPointerGetBlockNumber(tid)) % VMBUF_SIZE];
 		if (!VM_ALL_VISIBLE(scandesc->heapRelation,
 			ItemPointerGetBlockNumber(tid),
-			&node->ioss_VMBuffer))
+			vm_buf))
 		{
 			/*
 			 * Rats, we have to visit 

Re: [HACKERS] logical decoding of two-phase transactions

2021-03-09 Thread Amit Kapila
On Tue, Mar 9, 2021 at 3:22 PM Ajin Cherian  wrote:
>

Few comments:
==

1.
+/*
+ * Handle the PREPARE spoolfile (if any)
+ *
+ * It can be necessary to redirect the PREPARE messages to a spoolfile (see
+ * apply_handle_begin_prepare) and then replay them back at the COMMIT PREPARED
+ * time. If needed, this is the common function to do that file redirection.
+ *

I think the last sentence ("If needed, this is the ...") in the above
comment is not required.

2.
+prepare_spoolfile_exists(char *path)
+{
+ bool found;
+
+ File fd = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
+
+ found = fd >= 0;
+ if (fd >= 0)
+ FileClose(fd);

Can we avoid using the bool variable in the above code, with something like below?

File fd = PathNameOpenFile(path, O_RDONLY | PG_BINARY);

if (fd >= 0)
{
FileClose(fd);
return true;
}

return false;

3. In prepare_spoolfile_replay_messages(), it is better to free the
memory allocated for temporary strings buffer and s2.

4.
+ /* check if the file already exists. */
+ file_found = prepare_spoolfile_exists(path);
+
+ if (!file_found)
+ {
+ elog(DEBUG1, "Not found file \"%s\". Create it.", path);
+ psf_cur.vfd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_TRUNC | PG_BINARY);
+ if (psf_cur.vfd < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create file \"%s\": %m", path)));
+ }
+ else
+ {
+ /*
+ * Open the file and seek to the beginning because we always want to
+ * create/overwrite this file.
+ */
+ elog(DEBUG1, "Found file \"%s\". Overwrite it.", path);
+ psf_cur.vfd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_TRUNC | PG_BINARY);
+ if (psf_cur.vfd < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m", path)));
+ }

Here, whether the file exists or not, you are using the same flags to
open it, which seems correct to me, but the code looks a bit odd. Why do
we even bother to check whether it exists in this case? Is it just for the
DEBUG message? If so, I am not sure that is worth it. I am also wondering
why not have a function prepare_spoolfile_open, similar to *_close, and call
it from all the places, with a mode indicating whether you want to create
or open the file.

5. I think prepare_spoolfile_close can be extended to take PsfFile as
input and then it can be also used from
prepare_spoolfile_replay_messages.

6. I think we should also write some commentary about prepared
transactions atop worker.c, as we have done for streamed
transactions.

-- 
With Regards,
Amit Kapila.




Re: Improvements and additions to COPY progress reporting

2021-03-09 Thread Josef Šimánek
On Tue, Mar 9, 2021 at 6:34 AM Michael Paquier  wrote:
>
> On Mon, Mar 08, 2021 at 05:33:40PM +0100, Matthias van de Meent wrote:
> > Seems reasonable. PFA updated patches. I've renamed the previous 0003
> > to 0002 to keep git-format-patch easy.
>
> Thanks for updating the patch.  0001 has been applied, after tweaking
> a bit comments, indentation and the docs.
>
> > This is keeping current behaviour of the implementation as committed
> > with 8a4f618e, with the rationale of that patch being that this number
> > should mirror the number returned by the copy command.
> >
> > I am not opposed to adding another column for `tuples_inserted` and
> > changing the logic accordingly (see prototype 0003), but that was not
> > in the intended scope of this patchset. Unless you think that this
> > should be included in this current patchset, I'll spin that patch out
> > into a different thread, but I'm not sure that would make it into
> > pg14.
>
> Okay, point taken.  If there is demand for it in the future, we could
> extend the existing set of columns.  After thinking more about it, the
> use case is not completely clear to me from a monitoring point of
> view.
>
> I have not looked at 0002 in details yet, but I am wondering first if
> the size estimations in the expected output are actually portable.
> Second, I doubt a bit that the extra cycles spent on that are actually
> worth the coverage, even if the trick with an AFTER INSERT trigger is
> interesting.

Those extra cycles are there to keep at least parts of the COPY progress
reporting from being accidentally broken. I have seen various patches
modifying the COPY command that are currently in progress. It would be nice
to ensure at least the basic functionality works well in an automated way.
On my machine there is no noticeable overhead added by those tests
(they finish almost instantly).

> --
> Michael




Re: Make stream_prepare an optional callback

2021-03-09 Thread Amit Kapila
On Tue, Mar 9, 2021 at 3:41 PM Markus Wanner
 wrote:
>
> On 09.03.21 10:37, Amit Kapila wrote:
> I guess I don't quite understand the initial motivation for the patch.
> It states: "This allows plugins to not allow the enabling of streaming
> and two_phase at the same time in logical replication."   That's beyond
> me ... "allows [..] to not allow"?  Why not, an output plugin can still
> reasonably request both.  And that's a good thing, IMO.  What problem
> does the patch try to solve?
>

AFAIU, Ajin doesn't want to mandate streaming with the two_pc option. But
maybe you are right that it doesn't make sense for the user to provide
both options without providing the stream_prepare callback, and giving
an error in such a case should be fine. I think if we have to follow
Ajin's idea then we need to skip 2PC in such a case (both prepare and
commit prepared) and make this a regular transaction.

-- 
With Regards,
Amit Kapila.




Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-09 Thread Joel Jacobson
On Thu, Mar 4, 2021, at 16:40, Tom Lane wrote:
> My experience with working with parallel arrays in SQL has been unpleasant.

Could you please give an example of such an unpleasant experience?

I can see a problem if the arrays could possibly have different 
dimensionality/cardinality,
but regexp_positions() could guarantee they won't, so I don't see a problem 
here,
but there is probably something I'm missing here?

/Joel

Re: PROXY protocol support

2021-03-09 Thread Magnus Hagander
On Sat, Mar 6, 2021 at 5:30 PM Magnus Hagander  wrote:
>
> On Sat, Mar 6, 2021 at 4:17 PM Magnus Hagander  wrote:
> >
> > On Fri, Mar 5, 2021 at 8:11 PM Jacob Champion  wrote:
> > >
> > > On Fri, 2021-03-05 at 10:22 +0100, Magnus Hagander wrote:
> > > > On Fri, Mar 5, 2021 at 12:21 AM Jacob Champion  
> > > > wrote:
> > > > > The original-host logging isn't working for me:
> > > > >
> > > > > [...]
> > > >
> > > > That's interesting -- it works perfectly fine here. What platform are
> > > > you testing on?
> > >
> > > Ubuntu 20.04.
> >
> > Curious. It doesn't show up on my debian.
> >
> > But either way -- it was clearly wrong :)
> >
> >
> > > > (I sent for sizeof(SockAddr) to make it
> > > > easier to read without having to look things up, but the net result is
> > > > the same)
> > >
> > > Cool. Did you mean to attach a patch?
> >
> > I didn't, I had some other hacks that were broken :) I've attached one
> > now which includes those changes.
> >
> >
> > > == More Notes ==
> > >
> > > (Stop me if I'm digging too far into a proof of concept patch.)
> >
> > Definitely not -- much appreciated, and just what was needed to take
> > it from poc to a proper one!
> >
> >
> > > > + proxyaddrlen = pg_ntoh16(proxyheader.len);
> > > > +
> > > > + if (proxyaddrlen > sizeof(proxyaddr))
> > > > + {
> > > > + ereport(COMMERROR,
> > > > + (errcode(ERRCODE_PROTOCOL_VIOLATION),
> > > > +  errmsg("oversized proxy packet")));
> > > > + return STATUS_ERROR;
> > > > + }
> > >
> > > I think this is not quite right -- if there's additional data beyond
> > > the IPv6 header size, that just means there are TLVs tacked onto the
> > > header that we should ignore. (Or, eventually, use.)
> >
> > Yeah, you're right. Fallout of too much moving around. I think in the
> > end that code should just be removed, in favor of the discard path as
> > you mentioned below.
> >
> >
> > > Additionally, we need to check for underflow as well. A misbehaving
> > > proxy might not send enough data to fill up the address block for the
> > > address family in use.
> >
> > I used to have that check. I seem to have lost it in restructuring. Added 
> > back!
> >
> >
> > > > + /* If there is any more header data present, skip past it */
> > > > + if (proxyaddrlen > sizeof(proxyaddr))
> > > > + pq_discardbytes(proxyaddrlen - sizeof(proxyaddr));
> > >
> > > This looks like dead code, given that we'll error out for the same
> > > check above -- but once it's no longer dead code, the return value of
> > > pq_discardbytes should be checked for EOF.
> >
> > Yup.
> >
> >
> > > > + else if (proxyheader.fam == 0x11)
> > > > + {
> > > > + /* TCPv4 */
> > > > + port->raddr.addr.ss_family = AF_INET;
> > > > + port->raddr.salen = sizeof(struct sockaddr_in);
> > > > + ((struct sockaddr_in *) &port->raddr.addr)->sin_addr.s_addr = 
> > > > proxyaddr.ip4.src_addr;
> > > > + ((struct sockaddr_in *) &port->raddr.addr)->sin_port = 
> > > > proxyaddr.ip4.src_port;
> > > > + }
> > >
> > > I'm trying to reason through the fallout of setting raddr and not
> > > laddr. I understand why we're not setting laddr -- several places in
> > > the code rely on the laddr to actually refer to a machine-local address
> > > -- but the fact that there is no actual connection from raddr to laddr
> > > could cause shenanigans. For example, the ident auth protocol will just
> > > break (and it might be nice to explicitly disable it for PROXY
> > > connections). Are there any other situations where a "faked" raddr
> > > could throw off Postgres internals?
> >
> > That's a good point to discuss. I thought about it initially and
> > figured it'd be even worse to actually copy over laddr since that
> > would then suddenly have the IP address belonging to a different
> > machine. And then I forgot to enumerate the other cases.
> >
> > For ident, disabling the method seems reasonable.
> >
> > Another thing that shows up with added support for running the proxy
> > protocol over Unix sockets, is that PostgreSQL refuses to do SSL over
> > Unix sockets. So that check has to be updated to allow it over proxy
> > connections. Same for GSSAPI.
> >
> > An interesting thing is what to do about
> > inet_server_addr/inet_server_port. That sort of loops back up to the
> > original question of where/how to expose the information about the
> > proxy in general (since right now it just logs). Right now you can
> > actually use inet_server_port() to see if the connection was proxied
> > (as long as it was over tcp).
> >
> > Attached is an updated, which covers your comments, as well as adds
> > unix socket support (per your question and Alvaros confirmed usecase).
> > It allows proxy connections over unix sockets, but I saw no need to
> > get into unix sockets over the proxy protocol (dealing with paths
> > between machines etc).
> >
> > I 

Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

2021-03-09 Thread Joel Jacobson
On Tue, Mar 9, 2021, at 10:18, Pavel Stehule wrote:
> What do you mean?
>> More than one unnest() in the same query, e.g. SELECT unnest(..), unnest(..)?
> 
> you can do unnest(array1, array2, ...)

Right, I had forgotten about that variant.
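
For reference, a minimal example of that form (it is accepted only in the
FROM clause):

SELECT * FROM unnest(ARRAY[1,2,3], ARRAY['a','b']) AS t(n, s);
 n | s
---+---
 1 | a
 2 | b
 3 |
(3 rows)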

But isn't this a bit surprising then:

\df unnest
List of functions
   Schema   |  Name  | Result data type |                                Argument data types                                | Type
------------+--------+------------------+-----------------------------------------------------------------------------------+------
 pg_catalog | unnest | SETOF anyelement | anyarray                                                                            | func
 pg_catalog | unnest | SETOF record     | tsvector tsvector, OUT lexeme text, OUT positions smallint[], OUT weights text[]   | func
(2 rows)

Should there be an entry there showing the VARIADIC anyelement version as well?

I know it's a documented feature, but \df seems out-of-sync with the docs.

/Joel


Re: Make stream_prepare an optional callback

2021-03-09 Thread Markus Wanner

On 09.03.21 10:37, Amit Kapila wrote:

AFAICS, the error is removed by the patch as per below change:


Ah, well, that does not seem right, then.  We cannot just silently 
ignore the callback but not skip the prepare, IMO.  That would lead to 
the output plugin missing the PREPARE, but still seeing a COMMIT 
PREPARED for the transaction, potentially missing changes that went out 
with the prepare, no?



oh, right, in that case, it will skip the stream_prepare even though
that is required. I guess in FilterPrepare, we should check whether
rbtxn_is_streamed is set and stream_prepare_cb is not provided, and in
that case return true.


Except that FilterPrepare doesn't (yet) have access to a 
ReorderBufferTXN struct (see the other thread I just started).


Maybe we need to do a ReorderBufferTXNByXid lookup already prior to (or 
as part of) FilterPrepare, then also skip (rather than silently ignore) 
the prepare if no stream_prepare_cb callback is given (without even 
calling filter_prepare_cb, because the output plugin has already stated 
it cannot handle that by not providing the corresponding callback).


However, I also wonder what's the use case for an output plugin enabling 
streaming and two-phase commit, but not providing a stream_prepare_cb. 
Maybe the original ERROR is the simpler approach?  I.e. making the 
stream_prepare_cb mandatory, if and only if both are enabled (and 
filter_prepare doesn't skip).  (As in the original comment that says: 
"in streaming mode with two-phase commits, stream_prepare_cb is required").


I guess I don't quite understand the initial motivation for the patch. 
It states: "This allows plugins to not allow the enabling of streaming 
and two_phase at the same time in logical replication."   That's beyond 
me ... "allows [..] to not allow"?  Why not, an output plugin can still 
reasonably request both.  And that's a good thing, IMO.  What problem 
does the patch try to solve?


Regards

Markus



