Re: Release SPI plans for referential integrity with DISCARD ALL

2021-01-18 Thread yuzuko
Hi Corey,

Thank you for sharing.

> Amit's patch is now available in this thread [1]. I'm curious if it has any 
> effect on your memory pressure issue.
>
I just found that thread.  I'll check the patch.

-- 
Best regards,
Yuzuko Hosoya
NTT Open Source Software Center




Re: simplifying foreign key/RI checks

2021-01-18 Thread Amit Langote
On Tue, Jan 19, 2021 at 12:00 PM Zhihong Yu  wrote:
> +   if (mapped_partkey_attnums[i] == pk_attnums[j])
> +   {
> +   partkey_vals[i] = pk_vals[j];
> +   partkey_isnull[i] = pk_nulls[j] == 'n' ? true : false;
> +   i++;
> +   break;
>
> The way the counter (i) is incremented is not what I expected.
> In the rare case where some i doesn't have a corresponding pk_attnums[j],
> wouldn't there be an infinite loop?
>
> I think the goal of adding the assertion should be to avoid looping
> infinitely even if the invariant is not satisfied.
>
> I guess a counter other than i would be better for this purpose.

I have done that in v3.  Thanks.

-- 
Amit Langote
EDB: http://www.enterprisedb.com




Re: simplifying foreign key/RI checks

2021-01-18 Thread Amit Langote
On Tue, Jan 19, 2021 at 3:46 PM Corey Huinker  wrote:
> v2 patch applies and passes make check and make check-world. Perhaps, given
> that the missing break at line 418 caused no tests to fail, we could add
> another regression test if we're aiming for 100% code path coverage. As it
> is, I think the compiler warning was a sufficient alert.

Thanks for the review.  I will look into checking the coverage.

> The code is easy to read, and the comments touch on the major points of what 
> complexities arise from partitioned tables.
>
> A somewhat pedantic complaint I have brought up off-list is that this patch 
> continues the pattern of the variable and function names making the 
> assumption that the foreign key is referencing the primary key of the 
> referenced table. Foreign key constraints need only reference a unique index;
> it doesn't have to be the primary key. Granted, that unique index is behaving 
> exactly as a primary key would, so conceptually it is very similar, but 
> keeping with the existing naming (pk_rel, pk_type, etc) can lead a developer 
> to think that it would be just as correct to find the referenced relation and 
> get the primary key index from there, which would not always be correct. This 
> patch correctly grabs the index from the constraint itself, so no problem 
> there.

I decided not to deviate from the pk_ terminology so that the new code
doesn't look too different from the other code in the file.  That said,
we can at least call the main function ri_ReferencedKeyExists() instead
of ri_PrimaryKeyExists(), so I've changed that.

> I like that this patch changes the absolute minimum of the code in order to 
> get a very significant performance benefit. It does so in a way that should 
> reduce resource pressure found in other places [1]. This will in turn reduce 
> the performance penalty of "doing the right thing" in terms of defining 
> enforced foreign keys. It seems to get a clearer performance boost than was 
> achieved with previous efforts at statement level triggers.
>
> [1] 
> https://www.postgresql.org/message-id/cakkq508z6r5e3jdqhfpwszsajlpho3oyyoamfesaupto5vg...@mail.gmail.com

Thanks.  I hadn't noticed [1] before today, but after looking it over,
it seems that what is being proposed there can still be of use.  As
long as SPI is used in ri_triggers.c, it makes sense to consider any
tweaks addressing its negative impact, especially if they are not very
invasive.  There's this patch too from last month:
https://commitfest.postgresql.org/32/2930/

> This patch completely sidesteps the DELETE case, which has more insidious 
> performance implications, but is also far less common, and whose solution 
> will likely be very different.

Yeah, we should continue looking into ways to make the referenced-side
RI checks less bloated.

I've attached the updated patch.

-- 
Amit Langote
EDB: http://www.enterprisedb.com


v3-0002-Avoid-using-SPI-for-some-RI-checks.patch
Description: Binary data


v3-0001-Export-get_partition_for_tuple.patch
Description: Binary data


Re: search_plan_tree(): handling of non-leaf CustomScanState nodes causes segfault

2021-01-18 Thread David Geier

Hi,

On 18.01.21 23:42, Tom Lane wrote:
> David Geier  writes:
>> On 18.01.21 19:46, Tom Lane wrote:
>>> Hm.  I agree that we shouldn't simply assume that ss_currentRelation
>>> isn't null.  However, we cannot make search_plan_tree() descend
>>> through non-leaf CustomScan nodes, because we don't know what processing
>>> is involved there.  We need to find a scan that is guaranteed to return
>>> rows that are one-to-one with the cursor output.  This is why the function
>>> doesn't descend through join or aggregation nodes, and I see no argument
>>> by which we should assume we know more about what a customscan node will
>>> do than we know about those.
>>
>> That makes sense. Thanks for the explanation.
>
> OK, cool.  I was afraid you'd argue that you really needed your CustomScan
> node to be transparent in such cases.  We could imagine inventing an
> additional custom-scan-provider callback to embed the necessary knowledge,
> but I'd rather not add the complexity until someone has a use-case.

I was thinking about that. Generally, having such a possibility would be
very useful. Unfortunately, I believe that in my specific case even
having such functionality wouldn't allow the query to work with
CURRENT OF, because my CSP behaves similarly to a materialize node.

My understanding is that the only supported plans are those consisting of
nodes that guarantee that the tuple returned by the plan is the unmodified
tuple a scan leaf node is currently positioned on?

Still, if there's interest I would be happy to draft a patch. Instead of
a separate CSP callback, we could also provide an additional flag like
CUSTOMPATH_SUPPORT_CURRENT_OF. The advantage of the callback would be
that we could delay the decision until execution time, where potentially
more information is available.

>> I updated the patch to match your proposal.
>
> WFM, will push in a bit.
>
> regards, tom lane

Best regards,
David
Swarm64




RE: Parallel INSERT (INTO ... SELECT ...)

2021-01-18 Thread tsunakawa.ta...@fujitsu.com
From: Amit Kapila 
> Tsunakawa-San, does this address your concerns around locking the
> target relation in the required cases? It would be good to test but I
> don't see any problems in the scenarios you mentioned.

Thank you, understood.  RevalidateCachedQuery() does parse analysis;
that's the trick.


Regards
Takayuki Tsunakawa



Re: Release SPI plans for referential integrity with DISCARD ALL

2021-01-18 Thread Corey Huinker
On Wed, Jan 13, 2021 at 1:03 PM Corey Huinker 
wrote:

> In addition to that, the following case would be solved with this approach:
>> When many processes reference many tables that have foreign key
>> constraints defined, a huge amount of memory will be consumed
>> regardless of whether the referenced tables are partitioned or not.
>>
>> Attached the patch.  Any thoughts?
>>
>
> Amit Langote has done some great work at eliminating SPI from
> INSERT/UPDATE triggers entirely, thus reducing the number of cached plans
> considerably.
>
> I think he was hoping to have a patch formalized this week, if time
> allowed.
>
> It doesn't have DELETE triggers in it, so this patch might still have good
> value for deletes on a commonly used enumeration table.
>
> However, our efforts might be better focused on eliminating SPI from
> delete triggers as well, an admittedly harder task.
>

Amit's patch is now available in this thread [1]. I'm curious if it has any
effect on your memory pressure issue.

[1]
https://www.postgresql.org/message-id/ca+hiwqgkfjfydeq5vhph6eqpkjsbfpddy+j-kxyfepqedts...@mail.gmail.com
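
To make the memory scenario concrete, here is a minimal sketch (table
names are hypothetical): each foreign-key constraint exercised in a
backend caches its own SPI plan for the RI check, which is what the
proposed patch would release on DISCARD ALL.

create table color (id int primary key);
create table widget (color_id int references color (id));
insert into color values (1);

-- The RI trigger runs "SELECT 1 FROM color x WHERE ..." via SPI and
-- caches the plan in this backend.
insert into widget values (1);

-- With the proposed patch, this would also free the cached RI SPI plans.
DISCARD ALL;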


Re: ResourceOwner refactoring

2021-01-18 Thread Michael Paquier
On Mon, Jan 18, 2021 at 02:22:33PM +0200, Heikki Linnakangas wrote:
> diff --git a/src/common/cryptohash_openssl.c b/src/common/cryptohash_openssl.c
> index 551ec392b60..642e72d8c0f 100644
> --- a/src/common/cryptohash_openssl.c
> +++ b/src/common/cryptohash_openssl.c
[...]
> +/* ResourceOwner callbacks to hold JitContexts  */

Slight copy-paste error here.

>   /*
>* Ensure, while the spinlock's not yet held, that there's a free
> -  * refcount entry.
> +  * refcount entry and that the resoure owner has room to remember the
> +  * pin.
s/resoure/resource/.
--
Michael


signature.asc
Description: PGP signature


Re: simplifying foreign key/RI checks

2021-01-18 Thread Corey Huinker
On Mon, Jan 18, 2021 at 9:45 PM Amit Langote 
wrote:

> On Tue, Jan 19, 2021 at 2:47 AM Zhihong Yu  wrote:
> >
> > Hi,
> > I was looking at this statement:
> >
> > insert into f select generate_series(1, 200, 2);
> >
> > Since certain generated values (the second half) are not in table p,
> > wouldn't insertion for those values fail?
> > I tried a scaled-down version (1/1000th) of your example:
> >
> > yugabyte=# insert into f select generate_series(1, 2000, 2);
> > ERROR:  insert or update on table "f" violates foreign key constraint
> "f_a_fkey"
> > DETAIL:  Key (a)=(1001) is not present in table "p".
>
> Sorry, that was a bad copy-paste on my part.  Try this:
>
> create table p (a numeric primary key);
> insert into p select generate_series(1, 200);
> create table f (a bigint references p);
>
> -- Unpatched
> insert into f select generate_series(1, 200, 2);
> INSERT 0 100
> Time: 6527.652 ms (00:06.528)
>
> update f set a = a + 1;
> UPDATE 100
> Time: 8108.310 ms (00:08.108)
>
> -- Patched:
> insert into f select generate_series(1, 200, 2);
> INSERT 0 100
> Time: 3312.193 ms (00:03.312)
>
> update f set a = a + 1;
> UPDATE 100
> Time: 4292.807 ms (00:04.293)
>
> > For v1-0002-Avoid-using-SPI-for-some-RI-checks.patch :
> >
> > +* Collect partition key values from the unique key.
> >
> > At the end of the nested loop, should there be an assertion that
> > partkey->partnatts partition key values have been found?
> > This can be done by using a counter (initialized to 0) which is
> > incremented when a match is found by the inner loop.
>
> I've updated the patch to add the Assert.  Thanks for taking a look.
>
> --
> Amit Langote
> EDB: http://www.enterprisedb.com


v2 patch applies and passes make check and make check-world. Perhaps, given
that the missing break at line 418 caused no tests to fail, we could add
another regression test if we're aiming for 100% code path coverage. As it
is, I think the compiler warning was a sufficient alert.

The code is easy to read, and the comments touch on the major points of
what complexities arise from partitioned tables.

A somewhat pedantic complaint I have brought up off-list is that this patch
continues the pattern of the variable and function names making the
assumption that the foreign key is referencing the primary key of the
referenced table. Foreign key constraints need only reference a unique
index; it doesn't have to be the primary key. Granted, that unique index is
behaving exactly as a primary key would, so conceptually it is very
similar, but keeping with the existing naming (pk_rel, pk_type, etc) can
lead a developer to think that it would be just as correct to find the
referenced relation and get the primary key index from there, which would
not always be correct. This patch correctly grabs the index from the
constraint itself, so no problem there.
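
For illustration, a minimal example of the point (names hypothetical):
the constraint below references a plain unique index, so fetching the
referenced table's primary key index would find the wrong index.

create table account (
    id   int primary key,   -- the primary key index ...
    code text unique        -- ... and a separate unique index
);

-- The foreign key references the unique index on "code", not the
-- primary key, so the index must come from the constraint itself.
create table entry (
    account_code text references account (code)
);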

I like that this patch changes the absolute minimum of the code in order to
get a very significant performance benefit. It does so in a way that should
reduce resource pressure found in other places [1]. This will in turn
reduce the performance penalty of "doing the right thing" in terms of
defining enforced foreign keys. It seems to get a clearer performance boost
than was achieved with previous efforts at statement level triggers.

This patch completely sidesteps the DELETE case, which has more insidious
performance implications, but is also far less common, and whose solution
will likely be very different.

[1]
https://www.postgresql.org/message-id/cakkq508z6r5e3jdqhfpwszsajlpho3oyyoamfesaupto5vg...@mail.gmail.com


Re: [PATCH] ProcessInterrupts_hook

2021-01-18 Thread Craig Ringer
On Tue, 19 Jan 2021 at 12:44, Craig Ringer 
wrote:

> > We're about halfway there already, see 7e784d1dc.  I didn't do the
> > other half because it wasn't necessary to the problem, but exposing
> > the shutdown state more fully seems reasonable.
>
> Excellent, I'll take a look. Thanks.

That looks very handy already.

Extending it to be set before SIGTERM too would be handy.

My suggestion, which I'm happy to post in patch form if you think it's
reasonable:

* Change QuitSignalReason to ExitSignalReason to cover SIGTERM (smart) and
SIGINT (fast) as well as SIGQUIT (immediate)

* Rename PMQUIT_FOR_STOP to PMEXIT_IMMEDIATE_SHUTDOWN

* Add enumeration values PMEXIT_SMART_SHUTDOWN  and PMEXIT_FAST_SHUTDOWN

* For a fast shutdown, in pmdie()'s SIGINT case call
SetExitSignalReason(PMEXIT_FAST_SHUTDOWN), so that when
PostmasterStateMachine() calls SignalSomeChildren(SIGTERM, ...) in response
to PM_STOP_BACKENDS, the reason is already available.

* For smart shutdown, in pmdie()'s SIGTERM case call
SetExitSignalReason(PMEXIT_SMART_SHUTDOWN) and set the latch of every live
backend. There isn't any appropriate PROCSIG, and we probably don't want
to overload PROCSIG_WALSND_INIT_STOPPING (ick), but I think it'd generally
be sufficient to check GetExitSignalReason() in backend main loops.

The fast shutdown case seems like a no-brainer extension of your existing
patch.

I'm not entirely sure about the smart shutdown case. I don't want to add a
ProcSignal slot just for this and the info isn't urgent anyway. I think
that by checking for postmaster shutdown in the backend main loop we'd be
able to support eager termination of idle backends on smart shutdown
(immediately, or after an idle grace period), which is something I've
wanted for quite some time. It shouldn't be significantly expensive
especially in the idle loop.

Thoughts?

(Also I want a hook in PostgresMain's idle loop for things like this).


Re: Is Recovery actually paused?

2021-01-18 Thread Dilip Kumar
On Tue, Jan 19, 2021 at 10:30 AM Kyotaro Horiguchi
 wrote:
>
> At Tue, 19 Jan 2021 11:41:18 +0900, Yugo NAGATA  wrote in
> > On Sun, 17 Jan 2021 11:33:52 +0530
> > Dilip Kumar  wrote:
> > > > > > >
> > > > > > > I think the current pg_is_wal_replay_paused() already has another
> > > > > > > purpose; this waits for recovery to actually get paused. If we
> > > > > > > want to limit this API's purpose only to returning the pause
> > > > > > > state, it seems better to fix this to return the actual state at
> > > > > > > the cost of losing backward compatibility. If we want to know
> > > > > > > whether a pause is requested, we may add a new API like
> > > > > > > pg_is_wal_replay_pause_requested(). Also, if we want to wait for
> > > > > > > recovery to actually get paused, we may add an option to
> > > > > > > pg_wal_replay_pause() for this purpose.
> > > > > > >
> > > > > > > However, this might be bikeshedding. If no one cares that
> > > > > > > pg_is_wal_replay_paused() can make the user wait for a long time,
> > > > > > > I don't care either.
> > > > > >
> > > > > > I don't think that it will ever be blocked, because
> > > > > > pg_wal_replay_pause sends WakeupRecovery(), which means the
> > > > > > recovery process will not be stuck waiting for WAL.
> > > >
> > > > Yes, it doesn't get stuck waiting for WAL. However, it can get stuck
> > > > while resolving a recovery conflict. The process could wait for up to
> > > > max_standby_streaming_delay or max_standby_archive_delay before
> > > > recovery gets completely paused.
> > >
> > > Okay, I agree that it is possible, so for handling this we have a
> > > couple of options.
> > > 1. The pg_is_wal_replay_paused() interface will wait for recovery to
> > > actually get paused, but the user has an option to cancel that.  So I
> > > agree that there is currently no way to just learn that a recovery
> > > pause has been requested without waiting for it to actually get
> > > paused.  One option is to provide another interface, as you mentioned,
> > > pg_is_wal_replay_pause_requested(), which can just return the request
> > > status.  I am not sure how useful it is.
> >
> > If it is acceptable that pg_is_wal_replay_paused() makes users wait,
> > I'm OK with the current interface. I don't feel the need for
> > pg_is_wal_replay_pause_requested().
>
> FWIW, the name "pg_is_wal_replay_paused" suggests "to know whether
> recovery is paused or not at present", and it would be surprising to see
> it wait for recovery to actually pause by default.
>
> I think there are no functions that wait for some situation, at least for
> now.  If we wanted to wait for some condition to hold, we would loop over
> check-and-wait using plpgsql.
>
> If you want a function to wait for replay to pause, I would do that by
> adding a parameter to the function.
>
> pg_is_wal_replay_paused(OPTIONAL bool wait_for_pause)

This seems like a fair point to me.  So I will add an option to the API:
if it is passed as true, then we will wait for recovery to actually get
paused; otherwise, it will just return true if the pause is requested,
the same as the current behavior.
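
For reference, a rough sketch of the check-and-wait loop mentioned
upthread, written in plpgsql and assuming pg_is_wal_replay_paused()
reports the actual pause state (rather than just the request), which is
what is being discussed here:

select pg_wal_replay_pause();   -- request the pause

do $$
begin
    -- Poll until replay has actually stopped.
    while not pg_is_wal_replay_paused() loop
        perform pg_sleep(0.1);
    end loop;
end;
$$;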

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Additional Chapter for Tutorial - arch-dev.sgml

2021-01-18 Thread Jürgen Purtz

On 18.01.21 15:13, Heikki Linnakangas wrote:
> On 20/11/2020 23:52, Erik Rijkers wrote:
>> (smallish) Changes to arch-dev.sgml
>
> This looks good to me. One little complaint:
>
> @@ -125,7 +122,7 @@
>      use a supervisor process (also
>      master process) that spawns a new
>      server process every time a connection is requested. This supervisor
> -    process is called postgres and listens at a
> +    process is called postgres (formerly 'postmaster') and listens at a
>      specified TCP/IP port for incoming connections. Whenever a request
>      for a connection is detected the postgres
>      process spawns a new server process. The server tasks
>
> I believe we still call it the postmaster process. We renamed the
> binary a long time ago (commit 5266f221a2), and the above text was
> changed as part of that commit. I think that was a mistake, and this
> should say simply:
>
>      ... This supervisor process is called postmaster and ...
>
> like it did before we renamed the binary.
>
> Barring objections, I'll commit this with that change (as attached).
>
> - Heikki


I fear that the patch 'Additional chapter for Tutorial' is growing beyond
manageable limits. It has been running for nearly a year, its 228 KB size
is very large, and many people have made significant contributions, but a
commit still seems far off. Having said that, I'm pleased with Heikki's
proposal to split the changes to the existing file 'arch-dev.sgml' from
the rest of the patch and commit them separately.

But I have some concerns about the chapter '51.2. How Connections Are
Established'. It uses central terms like 'client process', 'server
process', 'supervisor process', 'master process', 'server tasks',
'backend (server)', 'frontend (client)', 'server', and 'client'. Some
months ago, we clarified this terminology in the new 'glossary' chapter.
As long as it leads to readable text, we should use the glossary terms
instead of the current ones, and we should include some links to the
glossary.

I propose to start a new thread that contains only changes to
'arch-dev.sgml'. On the pgsql-hackers or the pgsql-docs list? Started by
Heikki or by me?


--

Jürgen Purtz




RE: Global snapshots

2021-01-18 Thread tsunakawa.ta...@fujitsu.com
Hello, Andrey-san, all,


Based on the request at HighGo's sharding meeting, I'm re-sending the 
information on Commitment Ordering that could be used for global visibility.  
Their patents have already expired.



--
Has anyone examined the following Multiversion Commitment Ordering (MVCO)?
Although I haven't understood it yet, it insists that no concurrency control
information, including timestamps, needs to be exchanged among the cluster
nodes.  I'd appreciate it if someone could give an opinion.



Commitment Ordering Based Distributed Concurrency Control for Bridging Single 
and Multi Version Resources.
 Proceedings of the Third IEEE International Workshop on Research Issues on 
Data Engineering: Interoperability in Multidatabase Systems (RIDE-IMS), Vienna, 
Austria, pp. 189-198, April 1993. (also DEC-TR 853, July 1992)
https://ieeexplore.ieee.org/document/281924?arnumber=281924



The author of the above paper, Yoav Raz, seems to have been passionate, at
least until 2011, about convincing people of the power of Commitment
Ordering (CO) for global serializability.  However, he complains (sadly)
that almost all researchers ignore his theory, as described on his site
below and on the Wikipedia page for Commitment Ordering.  Does anyone know
why CO is ignored?


--
* Or, maybe we can use the following Commitment Ordering, which doesn't
require timestamps or any other information to be transferred among the
cluster nodes.  However, it seems to have to track the order of read and
write operations among concurrent transactions to ensure the correct commit
order, so I'm not sure about the performance.  The MVCO paper seems to
present the information we need, but I haven't understood it well yet (it's
difficult).  Could anybody kindly interpret this?



Commitment ordering (CO) - yoavraz2
https://sites.google.com/site/yoavraz2/the_principle_of_co



--
Could you please try interpreting MVCO and see whether there is any hope in
it?  This doesn't fit in my small brain; I'll catch up on understanding it
when I have time.



MVCO - Technical report - IEEE RIDE-IMS 93 (PDF; revised version of DEC-TR 853)
https://sites.google.com/site/yoavraz2/MVCO-WDE.pdf



MVCO is a multiversion member of the Commitment Ordering family of algorithms described below:



Commitment ordering (CO) - yoavraz2
https://sites.google.com/site/yoavraz2/the_principle_of_co



Commitment ordering - Wikipedia
https://en.wikipedia.org/wiki/Commitment_ordering



Related patents are as follows.  The last one is MVCO.



US5504900A - Commitment ordering for guaranteeing serializability across 
distributed transactions
https://patents.google.com/patent/US5504900A/en?oq=US5504900



US5504899A - Guaranteeing global serializability by applying commitment 
ordering selectively to global transactions
https://patents.google.com/patent/US5504899A/en?oq=US5504899



US5701480A - Distributed multi-version commitment ordering protocols for 
guaranteeing serializability during transaction processing
https://patents.google.com/patent/US5701480A/en?oq=US5701480


Regards
Takayuki Tsunakawa



Re: simplifying foreign key/RI checks

2021-01-18 Thread Amit Langote
On Tue, Jan 19, 2021 at 2:56 PM Corey Huinker  wrote:
>> In file included from 
>> /home/japin/Codes/postgresql/Debug/../src/include/postgres.h:47:0,
>>  from 
>> /home/japin/Codes/postgresql/Debug/../src/backend/utils/adt/ri_triggers.c:24:
>> /home/japin/Codes/postgresql/Debug/../src/backend/utils/adt/ri_triggers.c: 
>> In function ‘ri_PrimaryKeyExists’:
>> /home/japin/Codes/postgresql/Debug/../src/include/utils/elog.h:134:5: 
>> warning: this statement may fall through [-Wimplicit-fallthrough=]
>>   do { \
>>  ^
>> /home/japin/Codes/postgresql/Debug/../src/include/utils/elog.h:156:2: note: 
>> in expansion of macro ‘ereport_domain’
>>   ereport_domain(elevel, TEXTDOMAIN, __VA_ARGS__)
>>   ^~
>> /home/japin/Codes/postgresql/Debug/../src/include/utils/elog.h:229:2: note: 
>> in expansion of macro ‘ereport’
>>   ereport(elevel, errmsg_internal(__VA_ARGS__))
>>   ^~~
>> /home/japin/Codes/postgresql/Debug/../src/backend/utils/adt/ri_triggers.c:417:5:
>>  note: in expansion of macro ‘elog’
>>  elog(ERROR, "unexpected table_tuple_lock status: %u", res);
>>  ^~~~
>> /home/japin/Codes/postgresql/Debug/../src/backend/utils/adt/ri_triggers.c:419:4:
>>  note: here
>> default:
>> ^~~

Thanks, will fix it.

-- 
Amit Langote
EDB: http://www.enterprisedb.com




Re: POC: postgres_fdw insert batching

2021-01-18 Thread Amit Langote
On Tue, Jan 19, 2021 at 2:06 PM tsunakawa.ta...@fujitsu.com
 wrote:
> From: Amit Langote 
> > I apologize in advance for being maybe overly pedantic, but I noticed
> > that, in ExecInitModifyTable(), you decided to place the call outside
> > the loop that goes over resultRelations (shown below), although my
> > intent was to ask to place it next to the BeginForeignModify() in that
> > loop.
>
> Actually, I tried to do it (adding the GetModifyBatchSize() call after
> BeginForeignModify()), but it failed, because
> postgresfdwGetModifyBatchSize() wants to know if RETURNING is specified, and
> ResultRelInfo->projectReturning is created after the above part.  Considering
> the context where GetModifyBatchSize() implementations may want to know the
> environment, I placed the call as late as possible in the initialization
> phase.  As for future(?) multi-target DML statements, I think we can change
> this together with the many(?) other parts that assume a single target
> table.

Okay, sometime later then.

I wasn't sure if bringing it up here would be appropriate, but there's
a patch by me to refactor ModifyTable result relation allocation that
will have to remember to move this code along to an appropriate place
[1].  Thanks for the tip about the dependency on how RETURNING is
handled.  I will remember it when rebasing my patch over this.

--
Amit Langote
EDB: http://www.enterprisedb.com

[1] https://commitfest.postgresql.org/31/2621/




How to expose session vs txn lock info in pg_locks view?

2021-01-18 Thread Craig Ringer
Presently there doesn't seem to be a way to tell whether a lock is
session-level or transaction-level in the pg_locks view.
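
Advisory locks make the gap easy to demonstrate (a minimal sketch): a
session-level and a transaction-level advisory lock produce pg_locks rows
that cannot be told apart.

begin;
select pg_advisory_lock(1);        -- session-level lock
select pg_advisory_xact_lock(2);   -- transaction-level lock

-- Both rows show locktype 'advisory'; no column says which lock survives
-- COMMIT and which is released at end of transaction.
select locktype, objid, mode, granted
from pg_locks
where locktype = 'advisory';
commit;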

I was expecting this to be a quick patch, but the comment on the definition
of PROCLOCKTAG in lock.h notes that shmem state for heavyweight locks does
not track whether the lock is session-level or txn-level. That explains why
it's not already exposed in pg_locks.

AFAICS it'd be necessary to expand PROCLOCK to expose this in shmem.
Probably by adding a small bitfield where bit 0 is set if there's a txn
level lock and bit 1 is set if there's a session level lock. But I'm not
convinced that expanding PROCLOCK is justifiable for this. sizeof(PROCLOCK)
is 64 on a typical x64 machine. Adding anything to it increases it to 72
bytes.

(gdb) ptype /o struct PROCLOCK
/* offset|  size */  type = struct PROCLOCK {
/*0  |16 */PROCLOCKTAG tag;
/*   16  | 8 */PGPROC *groupLeader;
/*   24  | 4 */LOCKMASK holdMask;
/*   28  | 4 */LOCKMASK releaseMask;
/*   32  |16 */SHM_QUEUE lockLink;
/*   48  |16 */SHM_QUEUE procLink;
/*   64  | 1 */unsigned char locktypes;
/* XXX  7-byte padding  */

   /* total size (bytes):   72 */
 }

Going over 64 sets off possible alarm bells about cache line sizing to me,
but maybe it's not that critical? It'd also require (8 * max_locks_per_xact
* (MaxBackends+max_prepared_xacts)) extra shmem space; that could end up
being 128k on a default setup and a couple of megabytes on a big system.
Not huge, but not insignificant if it's hot data.

It's frustrating to be unable to tell the difference between session-level
and txn-level locks in diagnostic output. And the deadlock detector has no
way to tell the difference when selecting a victim for a deadlock abort -
it'd probably make sense to prefer to send a deadlock abort for txn-only
lockers. But I'm not sure I see a sensible way to add the info - PROCLOCK
is already free of any padding, and I wouldn't want to use hacks like
pointer-tagging.

Thoughts anyone?


Re: Parallel INSERT (INTO ... SELECT ...)

2021-01-18 Thread Amit Kapila
On Tue, Jan 19, 2021 at 10:32 AM Greg Nancarrow  wrote:
>
> On Tue, Jan 19, 2021 at 1:37 PM tsunakawa.ta...@fujitsu.com
>  wrote:
> >
> > > The table has ALREADY been locked (by the caller) during the
> > > parse/analyze phase.
> >
> > Isn't there any case where planning is done but parse analysis is not done 
> > immediately before?  e.g.
> >
> > * Alteration of objects invalidates cached query plans, and the next 
> > execution of the plan rebuilds it.  (But it seems that parse analysis is 
> > done in this case in plancache.c.)
> >
> > * Execute a prepared statement with a different parameter value, which 
> > builds a new custom plan or a generic plan.
> >
>
> I don't know, but since NoLock is used in other parts of the planner,
> I'd expect those to fail if such cases existed.
>

I think I know how, for both the above cases, we ensure that the locks
are acquired before we reach the planner. It seems we will call
GetCachedPlan during these scenarios, which will acquire the required
locks in RevalidateCachedQuery, both when the cached plan is invalid
and when it is valid. So, we should be fine even when a custom/generic
plan needs to be formed due to a different parameter.
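
For illustration, a minimal sketch of the prepared-statement case (table
name hypothetical):

prepare ins (int) as insert into target_tbl values ($1);

execute ins (1);  -- GetCachedPlan -> RevalidateCachedQuery acquires the
                  -- required locks before the plan is (re)built
execute ins (2);  -- the same revalidation and locking happens on every
                  -- execution, whether a custom or generic plan is used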

> > Is the cached query plan invalidated when some alteration is done to change 
> > the parallel safety, such as adding a trigger with a parallel-unsafe 
> > trigger action?
> >
>
> Needs to be tested,
>

Yeah, it would be good to test it but I think even if the plan is
invalidated, we will reacquire the required locks as mentioned above.

Tsunakawa-San, does this address your concerns around locking the
target relation in the required cases? It would be good to test but I
don't see any problems in the scenarios you mentioned.

-- 
With Regards,
Amit Kapila.




Re: [PATCH 1/1] Initial mach based shared memory support.

2021-01-18 Thread Tom Lane
James Hilliard  writes:
> from my understanding, due to mach semantics, a number of the sanity checks
> we are doing for sysv shmem are probably unnecessary when using mach
> APIs, because the mach port task bindings (mach_task_self()) effectively
> ensure our maps are already task-bound and won't conflict with other tasks.

Really?  If so, this whole patch is pretty much dead in the water,
because the fact that sysv shmem is globally accessible is exactly
why we use it (well, that and the fact that you can find out how many
processes are attached to it).  It's been a long time since we cared
about sysv shmem as a source of shared storage.  What we really use
it for is as a form of mutual exclusion, i.e. being able to detect
whether any live children remain from a dead postmaster.  That is,
PGSharedMemoryIsInUse is not some minor ancillary check, it's the main
reason this module exists at all.  If POSIX-style mmap'd shmem had the
same capability we'd likely have dropped sysv altogether long ago.

I've occasionally wondered if we should take another look at file locking
as a mechanism for ensuring only one postmaster+children process group
can access a data directory.  Back in the day it was too untrustworthy,
but maybe that has changed.

regards, tom lane




Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2021-01-18 Thread Fujii Masao




On 2021/01/19 9:53, Hou, Zhijie wrote:
>>>>> +1 to add it after "dropped (Note )", how about as follows
>>>>> with slight changes?
>>>>>
>>>>> dropped (Note that server name of an invalid connection can be NULL
>>>>> if the server is dropped), and then such .
>>>>
>>>> Yes, I like this one. One question: is "should" or "is" better than
>>>> "can" in this case, because the server name of an invalid connection
>>>> is always NULL when its server is dropped?
>>>
>>> I think "dropped (Note that server name of an invalid connection will
>>> be NULL if the server is dropped), and then such ."
>>
>> Sounds good to me. So patch attached.
>
> +1

Thanks! I pushed the patch.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: CheckpointLock needed in CreateCheckPoint()?

2021-01-18 Thread Michael Paquier
On Mon, Jan 18, 2021 at 02:36:48PM -0500, Robert Haas wrote:
> Here's a patch to remove CheckpointLock completely. For
> ProcessInterrupts() to do anything, one of the following things would
> have to be true:
>
> [...]
> 
> So I don't see any problem with just ripping this out entirely, but
> I'd like to know if anybody else does.

Agreed, +1.
--
Michael


signature.asc
Description: PGP signature


Re: simplifying foreign key/RI checks

2021-01-18 Thread Corey Huinker
>
>
> In file included from
> /home/japin/Codes/postgresql/Debug/../src/include/postgres.h:47:0,
>  from
> /home/japin/Codes/postgresql/Debug/../src/backend/utils/adt/ri_triggers.c:24:
> /home/japin/Codes/postgresql/Debug/../src/backend/utils/adt/ri_triggers.c:
> In function ‘ri_PrimaryKeyExists’:
> /home/japin/Codes/postgresql/Debug/../src/include/utils/elog.h:134:5:
> warning: this statement may fall through [-Wimplicit-fallthrough=]
>   do { \
>  ^
> /home/japin/Codes/postgresql/Debug/../src/include/utils/elog.h:156:2:
> note: in expansion of macro ‘ereport_domain’
>   ereport_domain(elevel, TEXTDOMAIN, __VA_ARGS__)
>   ^~
> /home/japin/Codes/postgresql/Debug/../src/include/utils/elog.h:229:2:
> note: in expansion of macro ‘ereport’
>   ereport(elevel, errmsg_internal(__VA_ARGS__))
>   ^~~
> /home/japin/Codes/postgresql/Debug/../src/backend/utils/adt/ri_triggers.c:417:5:
> note: in expansion of macro ‘elog’
>  elog(ERROR, "unexpected table_tuple_lock status: %u", res);
>  ^~~~
> /home/japin/Codes/postgresql/Debug/../src/backend/utils/adt/ri_triggers.c:419:4:
> note: here
> default:
> ^~~
>
> --
> Regrads,
> Japin Li.
> ChengDu WenWu Information Technology Co.,Ltd.
>

I also get this warning. Adding a "break;" at line 418 resolves the warning.


Re: [patch] Help information for pg_dump

2021-01-18 Thread Bharath Rupireddy
On Tue, Jan 19, 2021 at 9:07 AM Zhang, Jie  wrote:
>
> Hi all
>
> After executing the command [pg_dump -?], some of the help information is as follows.
>
> pg_dump -?
> -
>   -N, --exclude-schema=PATTERN do NOT dump the specified schema(s)  ※
>   -T, --exclude-table=PATTERN  do NOT dump the specified table(s)   ※
>   -x, --no-privileges  do not dump privileges (grant/revoke)
>   --exclude-table-data=PATTERN do NOT dump data for the specified table(s)  ※
>   --no-commentsdo not dump comments
>   --no-publicationsdo not dump publications
>   --no-security-labels do not dump security label assignments
>   --no-subscriptions   do not dump subscriptions
>   --no-synchronized-snapshots  do not use synchronized snapshots in parallel 
> jobs
>   --no-tablespaces do not dump tablespace assignments
>   --no-unlogged-table-data do not dump unlogged table data
> 
>
> I think it would be better to change [do NOT dump] to [do not dump].
>
> Here is a patch.

+1. It looks like SQL keywords are the things mentioned in capital letters
in both the pg_dump and pg_dumpall cases, so changing "do NOT" to "do not"
seems okay to me.

Patch LGTM.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: [PATCH 1/1] Initial mach based shared memory support.

2021-01-18 Thread James Hilliard
On Mon, Jan 18, 2021 at 5:29 PM Tom Lane  wrote:
>
> James Hilliard  writes:
> > OSX implements sysv shmem support via a mach wrapper, however the mach
> > sysv shmem wrapper has some severe restrictions that prevent us from
> > allocating enough memory segments in some cases.
> > ...
> > In order to avoid hitting these limits we can bypass the wrapper layer
> > and just use mach directly.
>
> I wanted to review this, but it's impossible because the kernel calls
> you're using have you've-got-to-be-kidding documentation like this:
>
> https://developer.apple.com/documentation/kernel/1402504-mach_vm_page_info?language=objc
FYI, there's a good chance I'm not using some of these info APIs correctly.
From my understanding, due to mach semantics, a number of the sanity checks
we are doing for sysv shmem are probably unnecessary when using mach
APIs, because the mach port task bindings (mach_task_self()) effectively
ensure our maps are already task-bound and won't conflict with other tasks.

For our use case, the mach vm API does appear to be superior to the shmem
API, beyond not being bound by the sysv wrapper limits: it is natively
task-bound, whereas normal sysv shared memory maps appear to be system-wide
global maps and thus require lots of key conflict checks, which should be
entirely unnecessary when using task-bound mach vm maps.

For example, from my understanding, PGSharedMemoryIsInUse() should always
return false in the mach vm version, as it is already task-scoped.

I'm not 100% sure here, however, as I'm not very familiar with all the ways
postgres uses shared memory. Is it true that postgresql shared memory is
strictly used by child processes fork()'d from the parent postmaster? If
that is the case, I think that as long as we ensure all our map operations
are bound to the parent postmaster's mach_task_self() port, we should never
conflict with another postmaster's process tree's memory maps.

I tried to largely copy the existing sysv shmem semantics in this first
proof of concept, as I'm largely unfamiliar with postgresql internals, but
I suspect we can safely remove much of the map key conflict checking code
entirely.

Here's a few additional references I found which may be helpful.

This one is oriented more toward kernel programming; however, much of the
info appears to be relevant to the userspace interface as well:
https://developer.apple.com/library/archive/documentation/Darwin/Conceptual/KernelProgramming/vm/vm.html#//apple_ref/doc/uid/TP3905-CH210-TPXREF109

This book gives an overview of the mach userspace interface as well; I
don't think it's a fully complete API reference, but it covers some of the
basic use cases:
https://flylib.com/books/en/3.126.1.89/1/

Apparently the Hurd kernel uses a virtual memory interface that is based on
the mach kernel, and it has more detailed API docs. Note that these tend
not to have the "mach_" prefix but appear to function largely the same:
https://www.gnu.org/software/hurd/gnumach-doc/Virtual-Memory-Interface.html

There is additional documentation from the OSFMK kernel (which was combined
into XNU by Apple, from my understanding) related to the virtual memory
interface; like the Hurd docs, these tend not to have the "mach_" prefix:
http://web.mit.edu/darwin/src/modules/xnu/osfmk/man/
>
> Google finds the source code too, but that's equally bereft of useful
> documentation.  So I wonder where you obtained the information necessary
> to write this patch.
>
> I did notice however that mach_shmem.c seems to be 95% copy-and-paste from
> sysv_shmem.c, including a lot of comments that don't feel particularly
> true or relevant here.  I wonder whether we need a separate file as
> opposed to a few #ifdef's around the kernel calls.
>
> regards, tom lane




Re: Odd, intermittent failure in contrib/pageinspect

2021-01-18 Thread Michael Paquier
On Mon, Jan 18, 2021 at 05:47:40PM -0500, Tom Lane wrote:
> Right, then we don't need any strange theories about autovacuum,
> just bad timing luck.  whelk does seem pretty slow, so it's not
> much of a stretch to imagine that it's more susceptible to this
> corner case than faster machines.
>
> So, do we have any other tests that are invoking a manual vacuum
> and assuming it won't skip any pages?  By this theory, they'd
> all be failures waiting to happen.

That looks possible by looking at the code around lazy_scan_heap(),
but that's narrow.

check_heap.sql and heap_surgery.sql have one VACUUM FREEZE each, and it
seems to me that we had better make sure that no pages are skipped in
their cases?

The duplicated query result looks to be an oversight from 58b4cb3 when
the thing got rewritten, so it can just go away.  Good catch.
--
Michael


signature.asc
Description: PGP signature


Re: Parallel INSERT (INTO ... SELECT ...)

2021-01-18 Thread Amit Kapila
On Tue, Jan 19, 2021 at 9:19 AM Greg Nancarrow  wrote:
>
> On Tue, Jan 19, 2021 at 2:03 PM Amit Kapila  wrote:
> >
> > > >
> > > > You have not raised a WARNING for the second case.
> > >
> > > The same checks in current Postgres code also don't raise a WARNING
> > > for that case, so I'm just being consistent with existing Postgres
> > > code (which itself isn't consistent for those two cases).
> > >
> >
> > Search for the string "too few entries in indexprs list" and you will
> > find a lot of places in code raising ERROR for the same condition.
> >
>
> Yes, but raising an ERROR stops processing (not just logs an error
> message). Raising a WARNING logs a warning message and continues
> processing. It's a big difference.
> So, for the added parallel-safety-checking code, it was suggested by
> Vignesh (and agreed by me) that, for these rare and highly unlikely
> conditions, it would be best not to just copy the error-handling code
> verbatim from other cases in the Postgres code (as I had originally
> done) and just stop processing dead with an error, but to instead
> return PARALLEL_UNSAFE, so that processing continues as it would for
> current non-parallel processing, which would most likely error-out
> anyway along the current error-handling checks and paths when those
> bad attributes/fields are referenced.
> I will add some Asserts() and don't mind adding a WARNING message for
> the 2nd case.
> If you really feel strongly about this, I can just restore the
> original code, which will stop dead with an ERROR in the middle of
> parallel-safety checking should one of these rare conditions ever
> occur.
>

I am expecting that we either raise a WARNING and return
parallel-unsafe for all such (shouldn't-reach) checks in the patch, or
simply raise an ERROR as we do in other parts of the patch. I
personally prefer the latter alternative, but I am fine with the former
one as well.

-- 
With Regards,
Amit Kapila.




Re: Parallel INSERT (INTO ... SELECT ...)

2021-01-18 Thread Amit Kapila
On Mon, Jan 18, 2021 at 3:50 PM Greg Nancarrow  wrote:
>
> On Mon, Jan 18, 2021 at 8:10 PM Amit Kapila  wrote:
> >
> > > 1)
> > >
> > > >Here, it seems we are accessing the relation descriptor without any
> > > >lock on the table which is dangerous considering the table can be
> > > >modified in a parallel session. Is there a reason why you think this
> > > >is safe? Did you check anywhere else such a coding pattern?
> > >
> > > Yes, there's a very good reason and I certainly have checked for the
> > > same coding pattern elsewhere, and not just randomly decided that
> > > locking can be ignored.
> > > The table has ALREADY been locked (by the caller) during the
> > > parse/analyze phase.
> > >
> >
> > Fair enough. I suggest adding a comment saying the same so that the
> > reader doesn't get confused about the same.
> >
>
> OK, I'll add a comment.
>
> > > (This is not the case for a partition, in which case the patch code
> > > uses AccessShareLock, as you will see).
> >
> > Okay, but I see you release the lock on partition rel immediately
> > after checking parallel-safety. What if a user added some
> > parallel-unsafe constraint (via Alter Table) after that check?
> >
> > >
>
> I'm not sure. But there would be a similar concern for current
> Parallel SELECT functionality, right?
>

I don't think so because, for Selects, we take locks on the required
partitions and don't release them immediately. We do parallel safety
checks after acquiring those locks. From the code perspective, we lock
individual partitions via
expand_inherited_rtentry->expand_partitioned_rtentry and then check
parallel-safety at a later point via
set_append_rel_size->set_rel_consider_parallel. Also, I am not sure if
there is anything we check for Selects at each partition relation
level that can be changed by a concurrent session. Do you have a
different understanding?

Similarly, we do retain locks on indexes, see get_relation_info, which
we are not doing in the patch.

> My recollection is that ALTER TABLE obtains an exclusive lock on the
> table which it retains until the end of the transaction, so that will
> result in blocking at certain points, during parallel-checks and
> execution, but there may still be a window.
>

Once the Select has acquired locks in the above code path, I don't
think Alter for a particular partition would be able to proceed unless
those locks are non-conflicting.

-- 
With Regards,
Amit Kapila.




RE: POC: postgres_fdw insert batching

2021-01-18 Thread tsunakawa.ta...@fujitsu.com
From: Amit Langote 
> I apologize in advance for being maybe overly pedantic, but I noticed
> that, in ExecInitModifyTable(), you decided to place the call outside
> the loop that goes over resultRelations (shown below), although my
> intent was to ask to place it next to the BeginForeignModify() in that
> loop.

Actually, I tried to do it (adding the GetModifyBatchSize() call after
BeginForeignModify()), but it failed, because postgresfdwGetModifyBatchSize()
wants to know if RETURNING is specified, and ResultRelInfo->projectReturning
is created after the above part.  Considering the context where
GetModifyBatchSize() implementations may want to know the environment, I
placed the call as late as possible in the initialization phase.  As for
future(?) multi-target DML statements, I think we can change this together
with the many(?) other parts that assume a single target table.


Regards
Takayuki Tsunakawa



Re: Parallel INSERT (INTO ... SELECT ...)

2021-01-18 Thread Greg Nancarrow
On Tue, Jan 19, 2021 at 1:37 PM tsunakawa.ta...@fujitsu.com
 wrote:
>
> > The table has ALREADY been locked (by the caller) during the
> > parse/analyze phase.
>
> Isn't there any case where planning is done but parse analysis is not done 
> immediately before?  e.g.
>
> * Alteration of objects invalidates cached query plans, and the next 
> execution of the plan rebuilds it.  (But it seems that parse analysis is done 
> in this case in plancache.c.)
>
> * Execute a prepared statement with a different parameter value, which builds 
> a new custom plan or a generic plan.
>

I don't know, but since NoLock is used in other parts of the planner,
I'd expect those to fail if such cases existed.

> Is the cached query plan invalidated when some alteration is done to change 
> the parallel safety, such as adding a trigger with a parallel-unsafe trigger 
> action?
>

Needs to be tested, but I'd expect the cached plan to get invalidated
in this case - surely the same potential issue exists in Postgres for
the current Parallel SELECT functionality - for example, for a column
with a default value that is an expression (which could be altered
from being parallel-safe to parallel-unsafe).
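
As a sketch of the kind of alteration in question (names hypothetical):

create function gen_code() returns int
    language sql parallel safe
    as 'select 42';

create table items (id int default gen_code(), name text);

-- After this, any cached plan that expanded the default expression needs
-- to be re-checked; whether invalidation actually happens is what needs
-- testing.
alter function gen_code() parallel unsafe;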

Regards,
Greg Nancarrow
Fujitsu Australia




Re: Is Recovery actually paused?

2021-01-18 Thread Kyotaro Horiguchi
At Tue, 19 Jan 2021 11:41:18 +0900, Yugo NAGATA  wrote in 
> On Sun, 17 Jan 2021 11:33:52 +0530
> Dilip Kumar  wrote:
> > > > > >
> > > > > > I think the current pg_is_wal_replay_paused() already has another
> > > > > > purpose; this waits for recovery to actually get paused. If we
> > > > > > want to limit this API's purpose only to returning the pause
> > > > > > state, it seems better to fix this to return the actual state at
> > > > > > the cost of losing backward compatibility. If we want to know
> > > > > > whether a pause is requested, we may add a new API like
> > > > > > pg_is_wal_replay_pause_requested(). Also, if we want to wait for
> > > > > > recovery to actually get paused, we may add an option to
> > > > > > pg_wal_replay_pause() for this purpose.
> > > > > >
> > > > > > However, this might be bikeshedding. If no one cares that
> > > > > > pg_is_wal_replay_paused() can make the user wait for a long time,
> > > > > > I don't care either.
> > > > >
> > > > > I don't think that it will ever be blocked, because
> > > > > pg_wal_replay_pause sends WakeupRecovery(), which means the
> > > > > recovery process will not be stuck waiting for WAL.
> > >
> > > Yes, it doesn't get stuck waiting for WAL. However, it can get stuck
> > > while resolving a recovery conflict. The process could wait for up to
> > > max_standby_streaming_delay or max_standby_archive_delay before
> > > recovery gets completely paused.
> > 
> > Okay, I agree that it is possible, so for handling this we have a
> > couple of options.
> > 1. The pg_is_wal_replay_paused() interface will wait for recovery to
> > actually get paused, but the user has an option to cancel that.  So I
> > agree that there is currently no way to just learn that a recovery
> > pause has been requested without waiting for it to actually get
> > paused.  One option is to provide another interface, as you mentioned,
> > pg_is_wal_replay_pause_requested(), which can just return the request
> > status.  I am not sure how useful it is.
> 
> If it is acceptable that pg_is_wal_replay_paused() makes users wait,
> I'm OK with the current interface. I don't feel the need for
> pg_is_wal_replay_pause_requested().

FWIW, the name "pg_is_wal_replay_paused" suggests "to know whether
recovery is paused or not at present", and it would be surprising to see
it wait for recovery to actually pause by default.

I think there are no functions that wait for some situation, at least for
now.  If we wanted to wait for some condition to hold, we would loop over
check-and-wait using plpgsql.

If you want a function to wait for replay to pause, I would do that by
adding a parameter to the function.

pg_is_wal_replay_paused(OPTIONAL bool wait_for_pause)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: POC: postgres_fdw insert batching

2021-01-18 Thread Amit Langote
Tsunakawa-san,

On Tue, Jan 19, 2021 at 12:50 PM tsunakawa.ta...@fujitsu.com
 wrote:
> From: Tomas Vondra 
> > OK. Can you prepare a final patch, squashing all the commits into a
> > single one, and perhaps use the function in create_foreign_modify?
>
> Attached, including the message fix pointed out by Zhihong-san.

Thanks for adopting my suggestions regarding GetForeignModifyBatchSize().

I apologize in advance for being maybe overly pedantic, but I noticed
that, in ExecInitModifyTable(), you decided to place the call outside
the loop that goes over resultRelations (shown below), although my
intent was to ask to place it next to the BeginForeignModify() in that
loop.

resultRelInfo = mtstate->resultRelInfo;
i = 0;
forboth(l, node->resultRelations, l1, node->plans)
{
...
/* Also let FDWs init themselves for foreign-table result rels */
if (!resultRelInfo->ri_usesFdwDirectModify &&
resultRelInfo->ri_FdwRoutine != NULL &&
resultRelInfo->ri_FdwRoutine->BeginForeignModify != NULL)
{
List   *fdw_private = (List *) list_nth(node->fdwPrivLists, i);

resultRelInfo->ri_FdwRoutine->BeginForeignModify(mtstate,
 resultRelInfo,
 fdw_private,
 i,
 eflags);
}

Maybe it's fine today because we only care about inserts and there's
always only one entry in the resultRelations list in that case, but
that may not remain the case in the future.

-- 
Amit Langote
EDB: http://www.enterprisedb.com




Re: Wrong usage of RelationNeedsWAL

2021-01-18 Thread Kyotaro Horiguchi
At Mon, 18 Jan 2021 17:30:22 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Sun, 17 Jan 2021 23:02:18 -0800, Noah Misch  wrote in 
> > On Sun, Jan 17, 2021 at 10:36:31PM -0800, Noah Misch wrote:
> > > I wrote the above based on the "PageGetLSN(page) > (snapshot)->lsn" check 
> > > in
> > > TestForOldSnapshot().  If the LSN isn't important, what else explains
> > > RelationAllowsEarlyPruning() checking RelationNeedsWAL()?
> > 
> > Thinking about it more, some RelationAllowsEarlyPruning() callers need
> > different treatment.  Above, I was writing about the case of deciding 
> > whether
> > to do early pruning.  The other RelationAllowsEarlyPruning() call sites are
> > deciding whether the relation might be lacking old data.  For that case, we
> > should check RELPERSISTENCE_PERMANENT, not RelationNeedsWAL().  Otherwise, 
> > we
> > could fail to report an old-snapshot error in a case like this:
> > 
A> > setup: CREATE TABLE t AS SELECT ...;
B> > xact1: BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1;  -- take snapshot
C> > xact2: DELETE FROM t;
D> > (plenty of time passes)
E> > xact3: SELECT count(*) FROM t;  -- early pruning
F> > xact1: SAVEPOINT q; SELECT count(*) FROM t; ROLLBACK TO q;  -- "snapshot 
too old"
G> > xact1: ALTER TABLE t SET TABLESPACE something;  -- start skipping WAL
H> > xact1: SELECT count(*) FROM t;  -- no error, wanted "snapshot too old"
> > 
> > Is that plausible?
> 
> Thank you for the consideration, and yes. But I get "snapshot too old"
> from the last query with the patched version. Maybe I'm missing
> something. I'm going to investigate the case.

Ah. I took that the wrong way around (I was caught by the discussion on
page LSN). OK, the "current" code works that way, so we need to perform
the check the new way in RelationAllowsEarlyPruning. In short, the answer
to the last question with regard to the reference side is yes.

I understand that you are suggesting that at least
TransactionIdLimitedForOldSnapshots should follow not only relation
persistence but also RelationNeedsWAL, based on the discussion on the
LSN check in TestForOldSnapshot().

I still don't think that the LSN in a WAL-skipping-created relfilenode
does harm.  ALTER TABLE SET TABLESPACE always makes a dead copy of every
block (except the checksum), including the page LSN, regardless of
wal_level. In the scenario above, the last select at H correctly reads
the page LSN set by E then copied by G, which is larger than the snapshot
LSN at B. So it doesn't take the fast-return path before the actual check
performed by RelationAllowsEarlyPruning.

As a result, RelationAllowsEarlyPruning is still changed to check only
the persistence of the table in this version. (In other words, all the
callers of the function work based on the same criteria.)

- Removed RelationIsWalLoggeed, which I forgot to remove in the
  last version.

- Parenthesized the parameter of RelationAllowsEarlyPruning.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From d688abbc04459b11bc2801f3c7f1955a86ef7a51 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Mon, 18 Jan 2021 14:47:21 +0900
Subject: [PATCH v3] Do not use RelationNeedsWAL to identify relation
 persistence

RelationNeedsWAL() may return false for a permanent relation when
wal_level=minimal and the relation is created or truncated in the
current transaction. Directly examine relpersistence instead of using
the function to know relation persistence.
---
 src/backend/catalog/pg_publication.c   |  2 +-
 src/backend/optimizer/util/plancat.c   |  3 ++-
 src/include/utils/rel.h|  2 +-
 src/include/utils/snapmgr.h|  2 +-
 src/test/subscription/t/001_rep_changes.pl | 20 +++-
 5 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/src/backend/catalog/pg_publication.c b/src/backend/catalog/pg_publication.c
index 5f8e1c64e1..84d2efcfd2 100644
--- a/src/backend/catalog/pg_publication.c
+++ b/src/backend/catalog/pg_publication.c
@@ -67,7 +67,7 @@ check_publication_add_relation(Relation targetrel)
  errdetail("System tables cannot be added to publications.")));
 
 	/* UNLOGGED and TEMP relations cannot be part of publication. */
-	if (!RelationNeedsWAL(targetrel))
+	if (targetrel->rd_rel->relpersistence != RELPERSISTENCE_PERMANENT)
 		ereport(ERROR,
 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
  errmsg("table \"%s\" cannot be replicated",
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index da322b453e..177e6e336a 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -126,7 +126,8 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
 	relation = table_open(relationObjectId, NoLock);
 
 	/* Temporary and unlogged relations are inaccessible during recovery. */
-	if (!RelationNeedsWAL(relation) && RecoveryInProgress())
+	if (relation->rd_rel->relpersistence != RELPERSISTENCE_PERMANENT &&
+		RecoveryInProgress())
 		ereport(ERROR,
 

Re: [PATCH] ProcessInterrupts_hook

2021-01-18 Thread Craig Ringer
On Tue, 19 Jan 2021, 02:01 Robert Haas,  wrote:

> On Mon, Jan 18, 2021 at 11:56 AM Tom Lane  wrote:
> > > I've wanted this in the past, too, so +1 from me.
> >
> > I dunno, this seems pretty scary and easily abusable.  There's not all
> > that much that can be done safely in ProcessInterrupts(), and we should
> > not be encouraging extensions to think they can add random processing
> > there.
>
> We've had this disagreement before about other things, and I just
> don't agree. If somebody uses a hook for something wildly unsafe, that
> will break their stuff, not ours.


Generally yeah.

And we have no shortage of hooks with plenty of error or abuse potential
and few safeguards already. I'd argue that in C code any external code is
inherently unsafe anyway. So it's mainly down to whether the hook actively
encourages unsafe actions without providing commensurate benefits, and
whether there's a better/safer way to achieve the same thing.

> That's not to say I endorse adding hooks for random purposes in random
> places. In particular, if it's impossible to use a particular hook in
> a reasonably safe way, that's a sign that the hook is badly-designed
> and that we should not have it.

Yep. Agreed.

Any hook is possible to abuse or write incorrectly, from simple fmgr
loadable functions right on up.

The argument that a hook could be abused would apply just as well to
exposing pqsignal() itself to extensions. Probably more so. Also to
anything like ProcessUtility_hook.


> > We're about halfway there already, see 7e784d1dc.  I didn't do the
> > other half because it wasn't necessary to the problem, but exposing
> > the shutdown state more fully seems reasonable.
>

Excellent, I'll take a look. Thanks.


Re: simplifying foreign key/RI checks

2021-01-18 Thread Pavel Stehule
On Tue, Jan 19, 2021 at 3:08, Amit Langote  wrote:

> On Tue, Jan 19, 2021 at 3:01 AM Pavel Stehule 
> wrote:
> > On Mon, Jan 18, 2021 at 13:40, Amit Langote  wrote:
> >> I started with the check that's performed when inserting into or
> >> updating the referencing table to confirm that the new row points to a
> >> valid row in the referenced relation.  The corresponding SQL is this:
> >>
> >> SELECT 1 FROM pk_rel x WHERE x.pkey = $1 FOR KEY SHARE OF x
> >>
> >> $1 is the value of the foreign key of the new row.  If the query
> >> returns a row, all good.  Thanks to SPI, or its use of plan caching,
> >> the query is re-planned only a handful of times before making a
> >> generic plan that is then saved and reused, which looks like this:
> >>
> >>   QUERY PLAN
> >> --
> >>  LockRows
> >>->  Index Scan using pk_pkey on pk x
> >>  Index Cond: (a = $1)
> >> (3 rows)
> >
> >
> > What is the performance when the referenced table is small? A lot of
> > codebooks are small, between 1,000 and 10K rows.
>
> I see the same ~2x improvement.
>
> create table p (a numeric primary key);
> insert into p select generate_series(1, 1000);
> create table f (a bigint references p);
>
> Unpatched:
>
> insert into f select i%1000+1 from generate_series(1, 100) i;
> INSERT 0 100
> Time: 5461.377 ms (00:05.461)
>
>
> Patched:
>
> insert into f select i%1000+1 from generate_series(1, 100) i;
> INSERT 0 100
> Time: 2357.440 ms (00:02.357)
>
> That's expected because the overhead of using SPI to check the PK
> table, which the patch gets rid of, is the same no matter the size of
> the index to be scanned.
>

It looks very good.

Regards

Pavel


> --
> Amit Langote
> EDB: http://www.enterprisedb.com
>


RE: POC: postgres_fdw insert batching

2021-01-18 Thread tsunakawa.ta...@fujitsu.com
From: Tomas Vondra 
> OK. Can you prepare a final patch, squashing all the commits into a
> single one, and perhaps use the function in create_foreign_modify?

Attached, including the message fix pointed out by Zhihong-san.


Regards
Takayuki Tsunakawa



v12-0001-Add-bulk-insert-for-foreign-tables.patch
Description: v12-0001-Add-bulk-insert-for-foreign-tables.patch


Re: Parallel INSERT (INTO ... SELECT ...)

2021-01-18 Thread Greg Nancarrow
On Tue, Jan 19, 2021 at 2:03 PM Amit Kapila  wrote:
>
> > >
> > > You have not raised a WARNING for the second case.
> >
> > The same checks in current Postgres code also don't raise a WARNING
> > for that case, so I'm just being consistent with existing Postgres
> > code (which itself isn't consistent for those two cases).
> >
>
> Search for the string "too few entries in indexprs list" and you will
> find a lot of places in code raising ERROR for the same condition.
>

Yes, but raising an ERROR stops processing (not just logs an error
message). Raising a WARNING logs a warning message and continues
processing. It's a big difference.
So, for the added parallel-safety-checking code, Vignesh suggested (and
I agreed) that, for these rare and highly unlikely conditions, it would
be best not to just copy the error-handling code verbatim from other
cases in the Postgres code (as I had originally done) and stop
processing dead with an error. Instead, the code returns
PARALLEL_UNSAFE, so that processing continues as it would for current
non-parallel processing, which would most likely error out anyway along
the existing error-handling checks and paths when those bad
attributes/fields are referenced.
I will add some Asserts, and I don't mind adding a WARNING message for
the second case.
If you really feel strongly about this, I can just restore the
original code, which will stop dead with an ERROR in the middle of
parallel-safety checking should one of these rare conditions ever
occur.

Regards,
Greg Nancarrow
Fujitsu Australia




Reply: Re: Cache relation sizes?

2021-01-18 Thread 陈佳昕(步真)
Hi Thomas
I want to share a patch with you: it changes the replacement algorithm
from FIFO to a simple LRU.

Buzhen

0001-update-fifo-to-lru-to-sweep-a-valid-cache.patch
Description: Binary data


[patch] Help information for pg_dump

2021-01-18 Thread Zhang, Jie
Hi all

After executing the command [pg_dump -?], part of the help output is as follows.

pg_dump -?
-
  -N, --exclude-schema=PATTERN do NOT dump the specified schema(s)  ※
  -T, --exclude-table=PATTERN  do NOT dump the specified table(s)   ※
  -x, --no-privileges  do not dump privileges (grant/revoke)
  --exclude-table-data=PATTERN do NOT dump data for the specified table(s)  ※
  --no-commentsdo not dump comments
  --no-publicationsdo not dump publications
  --no-security-labels do not dump security label assignments
  --no-subscriptions   do not dump subscriptions
  --no-synchronized-snapshots  do not use synchronized snapshots in parallel 
jobs
  --no-tablespaces do not dump tablespace assignments
  --no-unlogged-table-data do not dump unlogged table data


I think it would be better to change [do NOT dump] to [do not dump].

Here is a patch.

Best Regards!

pg_dump.patch
Description: pg_dump.patch


Re: simplifying foreign key/RI checks

2021-01-18 Thread japin


On Tue, 19 Jan 2021 at 10:45, Amit Langote  wrote:
> On Tue, Jan 19, 2021 at 2:47 AM Zhihong Yu  wrote:
>>
>> Hi,
>> I was looking at this statement:
>>
>> insert into f select generate_series(1, 200, 2);
>>
>> Since certain generated values (the second half) are not in table p, 
>> wouldn't insertion for those values fail ?
>> I tried a scaled down version (1000th) of your example:
>>
>> yugabyte=# insert into f select generate_series(1, 2000, 2);
>> ERROR:  insert or update on table "f" violates foreign key constraint 
>> "f_a_fkey"
>> DETAIL:  Key (a)=(1001) is not present in table "p".
>
> Sorry, a wrong copy-paste by me.  Try this:
>
> create table p (a numeric primary key);
> insert into p select generate_series(1, 200);
> create table f (a bigint references p);
>
> -- Unpatched
> insert into f select generate_series(1, 200, 2);
> INSERT 0 100
> Time: 6527.652 ms (00:06.528)
>
> update f set a = a + 1;
> UPDATE 100
> Time: 8108.310 ms (00:08.108)
>
> -- Patched:
> insert into f select generate_series(1, 200, 2);
> INSERT 0 100
> Time: 3312.193 ms (00:03.312)
>
> update f set a = a + 1;
> UPDATE 100
> Time: 4292.807 ms (00:04.293)
>
>> For v1-0002-Avoid-using-SPI-for-some-RI-checks.patch :
>>
>> +* Collect partition key values from the unique key.
>>
>> At the end of the nested loop, should there be an assertion that 
>> partkey->partnatts partition key values have been found ?
>> This can be done by using a counter (initialized to 0) which is incremented 
>> when a match is found by the inner loop.
>
> I've updated the patch to add the Assert.  Thanks for taking a look.

After applying the v2 patches, here are some warnings:

In file included from 
/home/japin/Codes/postgresql/Debug/../src/include/postgres.h:47:0,
 from 
/home/japin/Codes/postgresql/Debug/../src/backend/utils/adt/ri_triggers.c:24:
/home/japin/Codes/postgresql/Debug/../src/backend/utils/adt/ri_triggers.c: In 
function ‘ri_PrimaryKeyExists’:
/home/japin/Codes/postgresql/Debug/../src/include/utils/elog.h:134:5: warning: 
this statement may fall through [-Wimplicit-fallthrough=]
  do { \
 ^
/home/japin/Codes/postgresql/Debug/../src/include/utils/elog.h:156:2: note: in 
expansion of macro ‘ereport_domain’
  ereport_domain(elevel, TEXTDOMAIN, __VA_ARGS__)
  ^~
/home/japin/Codes/postgresql/Debug/../src/include/utils/elog.h:229:2: note: in 
expansion of macro ‘ereport’
  ereport(elevel, errmsg_internal(__VA_ARGS__))
  ^~~
/home/japin/Codes/postgresql/Debug/../src/backend/utils/adt/ri_triggers.c:417:5:
 note: in expansion of macro ‘elog’
 elog(ERROR, "unexpected table_tuple_lock status: %u", res);
 ^~~~
/home/japin/Codes/postgresql/Debug/../src/backend/utils/adt/ri_triggers.c:419:4:
 note: here
default:
^~~

-- 
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2021-01-18 Thread Bharath Rupireddy
On Mon, Jan 18, 2021 at 9:11 PM Fujii Masao  wrote:
> > Attaching v12 patch set. 0001 is for postgres_fdw_disconnect()
> > function, 0002 is for keep_connections GUC and 0003 is for
> > keep_connection server level option.
>
> Thanks!
>
> >
> > Please review it further.
>
> +   server = GetForeignServerByName(servername, true);
> +
> +   if (!server)
> +   ereport(ERROR,
> +   
> (errcode(ERRCODE_CONNECTION_DOES_NOT_EXIST),
> +errmsg("foreign server \"%s\" does 
> not exist", servername)));
>
> ISTM we can simplify this code as follows.
>
>  server = GetForeignServerByName(servername, false);

Done.

> +   hash_seq_init(, ConnectionHash);
> +   while ((entry = (ConnCacheEntry *) hash_seq_search()))
>
> When the server name is specified, even if its connection is successfully
> closed, postgres_fdw_disconnect() scans through all the entries to check
> whether there are active connections. But if "result" is true and
> active_conn_exists is true, we can get out of this loop to avoid unnecessary
> scans.

My initial thought was that it's possible to have two entries with the
same foreign server name but with different user mappings, but it looks
like that's not possible. I tried associating a foreign server with two
different user mappings [1]; the cache entry initially gets associated
with the user mapping that comes first in pg_user_mappings, and if that
user mapping is dropped the cache entry gets invalidated, so the second
user mapping is used next time.

Since there's no way we can have two cache entries with the same
foreign server name, we can get out of the loop when we find the cache
entry matching the given server. I made the changes.

[1]
postgres=# select * from pg_user_mappings;
 umid  | srvid |  srvname  | umuser | usename | umoptions
-------+-------+-----------+--------+---------+-----------
 16395 | 16394 | loopback1 |     10 | bharath |            -> cache entry is initially made with this user mapping
 16399 | 16394 | loopback1 |      0 | public  |            -> if the above user mapping is dropped, then the cache entry is made with this user mapping

> +   /*
> +* Destroy the cache if we discarded all active connections i.e. if 
> there
> +* is no single active connection, which we can know while scanning 
> the
> +* cached entries in the above loop. Destroying the cache is better 
> than to
> +* keep it in the memory with all inactive entries in it to save some
> +* memory. Cache can get initialized on the subsequent queries to 
> foreign
> +* server.
>
> How much memory is assumed to be saved by destroying the cache in
> many cases? I'm not sure if it's really worth destroying the cache to save
> the memory.

I removed the cache-destroying code; if somebody complains in the
future (after the feature is committed), we can revisit it then.

> +  a warning is issued and false is returned. 
> false
> +  is returned when there are no open connections. When there are some 
> open
> +  connections, but there is no connection for the given foreign server,
> +  then false is returned. When no foreign server 
> exists
> +  with the given name, an error is emitted. Example usage of the 
> function:
>
> When a non-existent server name is specified, postgres_fdw_disconnect()
> emits an error if there is at least one open connection, but just returns
> false otherwise. At least for me, this behavior looks inconsistent and strange.
> In that case, IMO the function always should emit an error.

Done.

Attaching v13 patch set, please review it further.
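
For reference, a minimal usage sketch of the function as described in
the commit message below (the server name is from the earlier example):

    -- discard the cached connection for one foreign server
    SELECT postgres_fdw_disconnect('loopback1');

    -- discard all cached connections of the current session
    SELECT postgres_fdw_disconnect();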

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
From 0731c6244ac228818916d62cc51ea1434178c5be Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Tue, 19 Jan 2021 08:23:55 +0530
Subject: [PATCH v13] postgres_fdw function to discard cached connections

This patch introduces a new function postgres_fdw_disconnect().
When called with a foreign server name, it discards the associated
connection with the server. When called without any argument, it
discards all the existing cached connections.
---
 contrib/postgres_fdw/connection.c | 147 ++
 .../postgres_fdw/expected/postgres_fdw.out|  93 +++
 .../postgres_fdw/postgres_fdw--1.0--1.1.sql   |  10 ++
 contrib/postgres_fdw/sql/postgres_fdw.sql |  35 +
 doc/src/sgml/postgres-fdw.sgml|  59 +++
 5 files changed, 344 insertions(+)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index a1404cb6bb..287a047c80 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -80,6 +80,7 @@ static bool xact_got_connection = false;
  * SQL functions
  */
 PG_FUNCTION_INFO_V1(postgres_fdw_get_connections);
+PG_FUNCTION_INFO_V1(postgres_fdw_disconnect);
 
 /* prototypes 

Re: Is Recovery actually paused?

2021-01-18 Thread Dilip Kumar
On Tue, 19 Jan 2021 at 8:12 AM, Yugo NAGATA  wrote:

> On Sun, 17 Jan 2021 11:33:52 +0530
> Dilip Kumar  wrote:
>
> > On Thu, Jan 14, 2021 at 6:49 PM Yugo NAGATA  wrote:
> > >
> > > On Wed, 13 Jan 2021 17:49:43 +0530
> > > Dilip Kumar  wrote:
> > >
> > > > On Wed, Jan 13, 2021 at 3:35 PM Dilip Kumar 
> wrote:
> > > > >
> > > > > On Wed, Jan 13, 2021 at 3:27 PM Yugo NAGATA 
> wrote:
> > > > > >
> > > > > > On Thu, 10 Dec 2020 11:25:23 +0530
> > > > > > Dilip Kumar  wrote:
> > > > > >
> > > > > > > > > However, I wonder users don't expect
> pg_is_wal_replay_paused to wait.
> > > > > > > > > Especially, if max_standby_streaming_delay is -1, this
> will be blocked forever,
> > > > > > > > > although this setting may not be usual. In addition, some
> users may set
> > > > > > > > > recovery_min_apply_delay for a large.  If such users call
> pg_is_wal_replay_paused,
> > > > > > > > > it could wait for a long time.
> > > > > > > > >
> > > > > > > > > At least, I think we need some descriptions on document to
> explain
> > > > > > > > > pg_is_wal_replay_paused could wait while a time.
> > > > > > > >
> > > > > > > > Ok
> > > > > > >
> > > > > > > Fixed this, added some comments in .sgml as well as in
> function header
> > > > > >
> > > > > > Thank you for fixing this.
> > > > > >
> > > > > > Also, is it better to fix the description of pg_wal_replay_pause
> from
> > > > > > "Pauses recovery." to "Request to pause recovery." in according
> with
> > > > > > pg_is_wal_replay_paused?
> > > > >
> > > > > Okay
> > > > >
> > > > > >
> > > > > > > > > Also, how about adding a new boolean argument to
> pg_is_wal_replay_paused to
> > > > > > > > > control whether this waits for recovery to get paused or
> not? By setting its
> > > > > > > > > default value to true or false, users can use the old
> format for calling this
> > > > > > > > > and the backward compatibility can be maintained.
> > > > > > > >
> > > > > > > > So basically, if the wait_recovery_pause flag is false then
> we will
> > > > > > > > immediately return true if the pause is requested?  I agree
> that it is
> > > > > > > > good to have an API to know whether the recovery pause is
> requested or
> > > > > > > > not but I am not sure is it good idea to make this API serve
> both the
> > > > > > > > purpose?  Anyone else have any thoughts on this?
> > > > > > > >
> > > > > >
> > > > > > I think the current pg_is_wal_replay_paused() already has
> another purpose;
> > > > > > this waits recovery to actually get paused. If we want to limit
> this API's
> > > > > > purpose only to return the pause state, it seems better to fix
> this to return
> > > > > > the actual state at the cost of lacking the backward
> compatibility. If we want
> > > > > > to know whether pause is requested, we may add a new API like
> > > > > > pg_is_wal_replay_paluse_requeseted(). Also, if we want to wait
> recovery to actually
> > > > > > get paused, we may add an option to pg_wal_replay_pause() for
> this purpose.
> > > > > >
> > > > > > However, this might be a bikeshedding. If anyone don't care that
> > > > > > pg_is_wal_replay_paused() can make user wait for a long time, I
> don't care either.
> > > > >
> > > > > I don't think that it will be blocked ever, because
> > > > > pg_wal_replay_pause is sending the WakeupRecovery() which means the
> > > > > recovery process will not be stuck on waiting for the WAL.
> > >
> > > Yes, there is no stuck on waiting for the WAL. However, it can be
> stuck during resolving
> > > a recovery conflict. The process could wait for
> max_standby_streaming_delay or
> > > max_standby_archive_delay at most before recovery get completely
> paused.
> >
> > Okay, I agree that it is possible so for handling this we have a
> > couple of options
> > 1. pg_is_wal_replay_paused(), interface will wait for recovery to
> > actually get paused, but user have an option to cancel that.  So I
> > agree that there is currently no option to just know that recovery
> > pause is requested without waiting for its actually get paused if it
> > is requested.  So one option is we can provide an another interface as
> > you mentioned pg_is_wal_replay_paluse_requeseted(), which can just
> > return the request status.  I am not sure how useful it is.
>
> If it is acceptable that pg_is_wal_replay_paused() makes users wait,
> I'm OK with the current interface. I don't feel the need for
> pg_is_wal_replay_paluse_requeseted().
>
> >
> > 2. Pass an option to pg_is_wal_replay_paused whether to wait for
> > recovery to actually get paused or not.
> >
> > 3. Pass an option to pg_wal_replay_pause(), whether to wait for
> > recovery pause or just request and return.
> >
> > I like the option 1, any other opinion on this?
> >
> > > Also, it could wait for recovery_min_apply_delay if it has a valid
> value. It is possible
> > > that a user set this parameter to a large value, so it could wait for
> a long time. However,
> > > this will be avoided by calling recoveryPausesHere() or
> 

Re: Parallel INSERT (INTO ... SELECT ...)

2021-01-18 Thread Amit Kapila
On Mon, Jan 18, 2021 at 3:50 PM Greg Nancarrow  wrote:
>
> On Mon, Jan 18, 2021 at 8:10 PM Amit Kapila  wrote:
> >
> > > >It is not clear why the above two are shouldn't happen cases and if so
> > > >why we want to treat them as unsafe. Ideally, there should be an
> > > >Assert if these can't happen but it is difficult to decide without
> > > >knowing why you have considered them unsafe?
> > >
> > > The checks being done here for "should never happen" cases are THE
> > > SAME as other parts of the Postgres code.
> > > For example, search Postgres code for "null conbin" and you'll find 6
> > > other places in the Postgres code which actually ERROR out if conbin
> > > (binary representation of the constraint) in a pg_constraint tuple is
> > > found to be null.
> > > The cases that you point out in the patch used to also error out in
> > > the same way, but Vignesh suggested changing them to just return
> > > parallel-unsafe instead of erroring-out, which I agree with.
> > >
> >
> > You have not raised a WARNING for the second case.
>
> The same checks in current Postgres code also don't raise a WARNING
> for that case, so I'm just being consistent with existing Postgres
> code (which itself isn't consistent for those two cases).
>

Search for the string "too few entries in indexprs list" and you will
find a lot of places in the code raising an ERROR for the same condition.

-- 
With Regards,
Amit Kapila.




Re: list of extended statistics on psql

2021-01-18 Thread Tatsuro Yamada

Hi,


The above query is overly simple, so we had better use the following query:

# This query works on PG10 or later
SELECT
     es.stxnamespace::pg_catalog.regnamespace::text AS "Schema",
     es.stxname AS "Name",
     pg_catalog.format('%s FROM %s',
     (SELECT pg_catalog.string_agg(pg_catalog.quote_ident(a.attname),', ')
  FROM pg_catalog.unnest(es.stxkeys) s(attnum)
  JOIN pg_catalog.pg_attribute a
  ON (es.stxrelid = a.attrelid
  AND a.attnum = s.attnum
  AND NOT a.attisdropped)),
     es.stxrelid::regclass) AS "Definition",
     CASE WHEN 'd' = any(es.stxkind) THEN 'defined'
     END AS "Ndistinct",
     CASE WHEN 'f' = any(es.stxkind) THEN 'defined'
     END AS "Dependencies",
     CASE WHEN 'm' = any(es.stxkind) THEN 'defined'
     END AS "MCV"
FROM pg_catalog.pg_statistic_ext es
ORDER BY 1, 2;

  Schema |    Name    |    Definition    | Ndistinct | Dependencies |   MCV
 --------+------------+------------------+-----------+--------------+---------
  public | hoge1_ext  | a, b FROM hoge1  | defined   | defined      | defined
  public | hoge_t_ext | a, b FROM hoge_t | defined   | defined      | defined
 (2 rows)


I'm going to create the WIP patch to use the above query.
Any comments welcome. :-D



The attached patch is a WIP patch.

The changes are:
  - Use pg_statistic_ext only
  - Remove these statuses: "required" and "built"
  - Add new status: "defined"
  - Remove the size columns
  - Fix document

I'll create and send the regression test in the next patch if there is
no objection. Is that okay?
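
For instance, with the patch applied, the interaction would look
roughly like this (reusing the hoge1 example above):

    =# CREATE STATISTICS hoge1_ext ON a, b FROM hoge1;
    =# \dX
     Schema |   Name    |   Definition    | Ndistinct | Dependencies |   MCV
    --------+-----------+-----------------+-----------+--------------+---------
     public | hoge1_ext | a, b FROM hoge1 | defined   | defined      | defined
    (1 row)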

Regards,
Tatsuro Yamada

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index 221a967bfe..36a79d9e3f 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -1918,6 +1918,26 @@ testdb=
 
 
   
+  
+  
+\dX [ pattern ]
+
+
+Lists extended statistics.
+If pattern
+is specified, only those extended statistics whose names match the
+pattern are listed.
+
+
+
+The column of the kind of extended stats (e.g. Ndistinct) shows its 
status.
+NULL means that it doesn't exist. "defined" means that it is declared.
+You can use pg_stats_ext if you'd like to know whether 
+ANALYZE was run and statistics are available 
to the
+planner.
+
+
+  
 
   
 \dy[+] [ pattern ]
diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index 303e7c3ad8..c98e3d31d0 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -928,6 +928,9 @@ exec_command_d(PsqlScanState scan_state, bool 
active_branch, const char *cmd)
else
success = listExtensions(pattern);
break;
+   case 'X':   /* Extended Statistics 
*/
+   success = listExtendedStats(pattern);
+   break;
case 'y':   /* Event Triggers */
success = listEventTriggers(pattern, 
show_verbose);
break;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index caf97563f4..899fe5d85c 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -4392,6 +4392,89 @@ listEventTriggers(const char *pattern, bool verbose)
return true;
 }
 
+/*
+ * \dX
+ *
+ * Describes extended statistics.
+ */
+bool
+listExtendedStats(const char *pattern)
+{
+   PQExpBufferData buf;
+   PGresult   *res;
+   printQueryOpt myopt = pset.popt;
+
+   if (pset.sversion < 10)
+   {
+   charsverbuf[32];
+
+   pg_log_error("The server (version %s) does not support extended 
statistics.",
+formatPGVersionNumber(pset.sversion, 
false,
+   
   sverbuf, sizeof(sverbuf)));
+   return true;
+   }
+
+   initPQExpBuffer();
+   printfPQExpBuffer(,
+ "SELECT \n"
+ 
"es.stxnamespace::pg_catalog.regnamespace::text AS \"%s\", \n"
+ "es.stxname AS \"%s\", \n"
+ "pg_catalog.format('%%s FROM %%s', \n"
+ "  (SELECT 
pg_catalog.string_agg(pg_catalog.quote_ident(a.attname),', ') \n"
+ "   FROM 
pg_catalog.unnest(es.stxkeys) s(attnum) \n"
+ "   JOIN pg_catalog.pg_attribute a \n"
+ "   ON (es.stxrelid = a.attrelid \n"
+ "   AND a.attnum = s.attnum \n"
+   

Re: simplifying foreign key/RI checks

2021-01-18 Thread Zhihong Yu
Thanks for the quick response.

+   if (mapped_partkey_attnums[i] == pk_attnums[j])
+   {
+   partkey_vals[i] = pk_vals[j];
+   partkey_isnull[i] = pk_nulls[j] == 'n' ? true : false;
+   i++;
+   break;

The way the counter (i) is incremented is not what I expected.
In the rare case where some i doesn't have a corresponding
pk_attnums[j], wouldn't there be an infinite loop?

I think the goal of adding the assertion should be to not loop
infinitely even if the invariant is not satisfied.

I guess a counter other than i would be better for this purpose.

Cheers

On Mon, Jan 18, 2021 at 6:45 PM Amit Langote 
wrote:

> On Tue, Jan 19, 2021 at 2:47 AM Zhihong Yu  wrote:
> >
> > Hi,
> > I was looking at this statement:
> >
> > insert into f select generate_series(1, 200, 2);
> >
> > Since certain generated values (the second half) are not in table p,
> wouldn't insertion for those values fail ?
> > I tried a scaled down version (1000th) of your example:
> >
> > yugabyte=# insert into f select generate_series(1, 2000, 2);
> > ERROR:  insert or update on table "f" violates foreign key constraint
> "f_a_fkey"
> > DETAIL:  Key (a)=(1001) is not present in table "p".
>
> Sorry, a wrong copy-paste by me.  Try this:
>
> create table p (a numeric primary key);
> insert into p select generate_series(1, 200);
> create table f (a bigint references p);
>
> -- Unpatched
> insert into f select generate_series(1, 200, 2);
> INSERT 0 100
> Time: 6527.652 ms (00:06.528)
>
> update f set a = a + 1;
> UPDATE 100
> Time: 8108.310 ms (00:08.108)
>
> -- Patched:
> insert into f select generate_series(1, 200, 2);
> INSERT 0 100
> Time: 3312.193 ms (00:03.312)
>
> update f set a = a + 1;
> UPDATE 100
> Time: 4292.807 ms (00:04.293)
>
> > For v1-0002-Avoid-using-SPI-for-some-RI-checks.patch :
> >
> > +* Collect partition key values from the unique key.
> >
> > At the end of the nested loop, should there be an assertion that
> partkey->partnatts partition key values have been found ?
> > This can be done by using a counter (initialized to 0) which is
> incremented when a match is found by the inner loop.
>
> I've updated the patch to add the Assert.  Thanks for taking a look.
>
> --
> Amit Langote
> EDB: http://www.enterprisedb.com
>


Re: list of extended statistics on psql

2021-01-18 Thread Tomas Vondra




On 1/19/21 1:44 AM, Tatsuro Yamada wrote:

Hi Tomas,


As for how to deal with this, I can think of about three ways:

1) simplify the command to only print information from
pg_statistic_ext (so no information about which stats are built or
sizes)


2) extend pg_stats_ext with necessary information (e.g. sizes)

3) create a new system view, with necessary information (so that 
pg_stats_ext does not need to be modified)


4) add functions returning the necessary information, possibly only 
for statistics the user can access (similarly to what pg_stats_ext 
does)


Options 2-4 have the obvious disadvantage that this won't work on 
older releases (we can't add views or functions there). So I'm 
leaning towards #1 even if that means we have to remove some of the 
details. We can consider adding that for new releases, though.



Thanks for the useful advice. I'll go with option 1).
The following query is created by using pg_stats_ext instead of
pg_statistic_ext and pg_statistic_ext_data. However, I was confused
about how to write the part of the query that calculates the MCV size,
because there are four columns related to MCV: most_common_vals,
most_common_val_nulls, most_common_freqs, and most_common_base_freqs.
Currently, I don't know how to calculate the size of MCV by using the
four columns. Thoughts? :-)


Well, my suggestion was to use pg_statistic_ext, because that lists 
all statistics, while pg_stats_ext is filtering statistics depending 
on access privileges. I think that's more appropriate for \dX, the 
contents should not change depending on the user.


Also, let me clarify - with option (1) we'd not show the sizes at all. 
The size of the formatted statistics may be very different from the 
on-disk representation, so I see no point in showing it in \dX.


We might show other stats (e.g. number of MCV items, or the fraction 
of data represented by the MCV list), but the user can inspect 
pg_stats_ext if needed.


What we might do is to show those stats when a superuser is running 
this command, but I'm not sure that's a good idea (or how difficult 
would it be to implement).



Thanks for clarifying.
I see that your suggestion was to use pg_statistic_ext, not pg_stats_ext.
And we don't need the size of stats.

If that's the case, we also can't get the status of stats on PG12 or
later because we can't use pg_statistic_ext_data, as you know.
Therefore, it would be better to replace the query with the old query
that I sent five months ago, like this:


# the old query
SELECT
     stxnamespace::pg_catalog.regnamespace AS "Schema",
     stxrelid::pg_catalog.regclass AS "Table",
     stxname AS "Name",
     (SELECT pg_catalog.string_agg(pg_catalog.quote_ident(attname),', ')
  FROM pg_catalog.unnest(stxkeys) s(attnum)
  JOIN pg_catalog.pg_attribute a ON (stxrelid = a.attrelid AND
  a.attnum = s.attnum AND NOT attisdropped)) AS "Columns",
     'd' = any(stxkind) AS "Ndistinct",
     'f' = any(stxkind) AS "Dependencies",
     'm' = any(stxkind) AS "MCV"
FROM pg_catalog.pg_statistic_ext stat
ORDER BY 1,2;

  Schema | Table  |    Name    | Columns | Ndistinct | Dependencies | MCV
+++-+---+--+-
  public | hoge1  | hoge1_ext  | a, b    | t | t    | t
  public | hoge_t | hoge_t_ext | a, b    | t | t    | t
(2 rows)


The above query is overly simple, so we had better use the
following query:


# This query works on PG10 or later
SELECT
     es.stxnamespace::pg_catalog.regnamespace::text AS "Schema",
     es.stxname AS "Name",
     pg_catalog.format('%s FROM %s',
     (SELECT 
pg_catalog.string_agg(pg_catalog.quote_ident(a.attname),', ')

  FROM pg_catalog.unnest(es.stxkeys) s(attnum)
  JOIN pg_catalog.pg_attribute a
  ON (es.stxrelid = a.attrelid
  AND a.attnum = s.attnum
  AND NOT a.attisdropped)),
     es.stxrelid::regclass) AS "Definition",
     CASE WHEN 'd' = any(es.stxkind) THEN 'defined'
     END AS "Ndistinct",
     CASE WHEN 'f' = any(es.stxkind) THEN 'defined'
     END AS "Dependencies",
     CASE WHEN 'm' = any(es.stxkind) THEN 'defined'
     END AS "MCV"
FROM pg_catalog.pg_statistic_ext es
ORDER BY 1, 2;

  Schema |    Name    |    Definition    | Ndistinct | Dependencies |   MCV
 --------+------------+------------------+-----------+--------------+---------
  public | hoge1_ext  | a, b FROM hoge1  | defined   | defined      | defined
  public | hoge_t_ext | a, b FROM hoge_t | defined   | defined      | defined
 (2 rows)


I'm going to create the WIP patch to use the above query.
Any comments welcome. :-D



Yes, I think using this simpler query makes sense. If we decide we need
something more elaborate, we can improve it in future PostgreSQL
versions (after adding a view/function to core), but I'd leave that as
work for the future.


Apologies for all the extra work - I hadn't realized this flaw when
pushing

Re: simplifying foreign key/RI checks

2021-01-18 Thread Amit Langote
On Tue, Jan 19, 2021 at 2:47 AM Zhihong Yu  wrote:
>
> Hi,
> I was looking at this statement:
>
> insert into f select generate_series(1, 200, 2);
>
> Since certain generated values (the second half) are not in table p, wouldn't 
> insertion for those values fail ?
> I tried a scaled down version (1000th) of your example:
>
> yugabyte=# insert into f select generate_series(1, 2000, 2);
> ERROR:  insert or update on table "f" violates foreign key constraint 
> "f_a_fkey"
> DETAIL:  Key (a)=(1001) is not present in table "p".

Sorry, that was a wrong copy-paste on my part.  Try this:

create table p (a numeric primary key);
insert into p select generate_series(1, 200);
create table f (a bigint references p);

-- Unpatched
insert into f select generate_series(1, 200, 2);
INSERT 0 100
Time: 6527.652 ms (00:06.528)

update f set a = a + 1;
UPDATE 100
Time: 8108.310 ms (00:08.108)

-- Patched:
insert into f select generate_series(1, 200, 2);
INSERT 0 100
Time: 3312.193 ms (00:03.312)

update f set a = a + 1;
UPDATE 100
Time: 4292.807 ms (00:04.293)

> For v1-0002-Avoid-using-SPI-for-some-RI-checks.patch :
>
> +* Collect partition key values from the unique key.
>
> At the end of the nested loop, should there be an assertion that 
> partkey->partnatts partition key values have been found ?
> This can be done by using a counter (initialized to 0) which is incremented 
> when a match is found by the inner loop.

I've updated the patch to add the Assert.  Thanks for taking a look.

-- 
Amit Langote
EDB: http://www.enterprisedb.com


v2-0001-Export-get_partition_for_tuple.patch
Description: Binary data


v2-0002-Avoid-using-SPI-for-some-RI-checks.patch
Description: Binary data


configurable the threshold for warning due to run out of transaction ID

2021-01-18 Thread Masahiro Ikeda

Hi,

PostgreSQL has a feature that warns about running out of transaction IDs.
The following message is an example.

```
2021-01-19 10:59:27 JST [client backend] WARNING:  database "postgres" 
must be vacuumed within xxx transactions
2021-01-19 10:59:27 JST [client backend] HINT:  To avoid a database 
shutdown, execute a database-wide VACUUM in that database.
You might also need to commit or roll back old prepared 
transactions, or drop stale replication slots.

```

However, the threshold for the warning is not configurable.
The value is hard-coded to 40M.

```
varsup.c
	/*
	 * We'll start complaining loudly when we get within 40M transactions of
	 * data loss.  This is kind of arbitrary, but if you let your gas gauge
	 * get down to 2% of full, would you be looking for the next gas station?
	 * We need to be fairly liberal about this number because there are lots
	 * of scenarios where most transactions are done by automatic clients that
	 * won't pay attention to warnings.  (No, we're not gonna make this
	 * configurable.  If you know enough to configure it, you know enough to
	 * not get in this kind of trouble in the first place.)
	 */
```

I think it would be useful to make the threshold for the warning about
running out of transaction IDs configurable, like the
"checkpoint_warning" parameter.

Actually, when a user's workload is very write-heavy,
there was a case where we wanted to get the warning message earlier.


I understand that there is another way to handle this:
for example, monitoring the frozen transaction ID by executing the
following query and checking whether a custom threshold is exceeded.

```
SELECT max(age(datfrozenxid)) FROM pg_database;
```
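
A slightly more detailed, per-database variant of that check might look
like this (the threshold value here is only an example):

```
SELECT datname,
       age(datfrozenxid) AS xid_age
  FROM pg_database
 WHERE age(datfrozenxid) > 150000000   -- example custom threshold
 ORDER BY xid_age DESC;
```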

But I think warning in the server log is a simpler way.


I would like to know your opinion.
If it's useful for us, I'll make patches.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION




Re: Is Recovery actually paused?

2021-01-18 Thread Yugo NAGATA
On Sun, 17 Jan 2021 11:33:52 +0530
Dilip Kumar  wrote:

> On Thu, Jan 14, 2021 at 6:49 PM Yugo NAGATA  wrote:
> >
> > On Wed, 13 Jan 2021 17:49:43 +0530
> > Dilip Kumar  wrote:
> >
> > > On Wed, Jan 13, 2021 at 3:35 PM Dilip Kumar  wrote:
> > > >
> > > > On Wed, Jan 13, 2021 at 3:27 PM Yugo NAGATA  wrote:
> > > > >
> > > > > On Thu, 10 Dec 2020 11:25:23 +0530
> > > > > Dilip Kumar  wrote:
> > > > >
> > > > > > > > However, I wonder users don't expect pg_is_wal_replay_paused to 
> > > > > > > > wait.
> > > > > > > > Especially, if max_standby_streaming_delay is -1, this will be 
> > > > > > > > blocked forever,
> > > > > > > > although this setting may not be usual. In addition, some users 
> > > > > > > > may set
> > > > > > > > recovery_min_apply_delay for a large.  If such users call 
> > > > > > > > pg_is_wal_replay_paused,
> > > > > > > > it could wait for a long time.
> > > > > > > >
> > > > > > > > At least, I think we need some descriptions on document to 
> > > > > > > > explain
> > > > > > > > pg_is_wal_replay_paused could wait while a time.
> > > > > > >
> > > > > > > Ok
> > > > > >
> > > > > > Fixed this, added some comments in .sgml as well as in function 
> > > > > > header
> > > > >
> > > > > Thank you for fixing this.
> > > > >
> > > > > Also, is it better to fix the description of pg_wal_replay_pause from
> > > > > "Pauses recovery." to "Request to pause recovery." in according with
> > > > > pg_is_wal_replay_paused?
> > > >
> > > > Okay
> > > >
> > > > >
> > > > > > > > Also, how about adding a new boolean argument to 
> > > > > > > > pg_is_wal_replay_paused to
> > > > > > > > control whether this waits for recovery to get paused or not? 
> > > > > > > > By setting its
> > > > > > > > default value to true or false, users can use the old format 
> > > > > > > > for calling this
> > > > > > > > and the backward compatibility can be maintained.
> > > > > > >
> > > > > > > So basically, if the wait_recovery_pause flag is false then we 
> > > > > > > will
> > > > > > > immediately return true if the pause is requested?  I agree that 
> > > > > > > it is
> > > > > > > good to have an API to know whether the recovery pause is 
> > > > > > > requested or
> > > > > > > not but I am not sure is it good idea to make this API serve both 
> > > > > > > the
> > > > > > > purpose?  Anyone else have any thoughts on this?
> > > > > > >
> > > > >
> > > > > I think the current pg_is_wal_replay_paused() already has another 
> > > > > purpose;
> > > > > this waits recovery to actually get paused. If we want to limit this 
> > > > > API's
> > > > > purpose only to return the pause state, it seems better to fix this 
> > > > > to return
> > > > > the actual state at the cost of lacking the backward compatibility. 
> > > > > If we want
> > > > > to know whether pause is requested, we may add a new API like
> > > > > pg_is_wal_replay_paluse_requeseted(). Also, if we want to wait 
> > > > > recovery to actually
> > > > > get paused, we may add an option to pg_wal_replay_pause() for this 
> > > > > purpose.
> > > > >
> > > > > However, this might be a bikeshedding. If anyone don't care that
> > > > > pg_is_wal_replay_paused() can make user wait for a long time, I don't 
> > > > > care either.
> > > >
> > > > I don't think that it will be blocked ever, because
> > > > pg_wal_replay_pause is sending the WakeupRecovery() which means the
> > > > recovery process will not be stuck on waiting for the WAL.
> >
> > Yes, there is no stuck on waiting for the WAL. However, it can be stuck 
> > during resolving
> > a recovery conflict. The process could wait for max_standby_streaming_delay 
> > or
> > max_standby_archive_delay at most before recovery get completely paused.
> 
> Okay, I agree that it is possible so for handling this we have a
> couple of options
> 1. pg_is_wal_replay_paused(), interface will wait for recovery to
> actually get paused, but user have an option to cancel that.  So I
> agree that there is currently no option to just know that recovery
> pause is requested without waiting for its actually get paused if it
> is requested.  So one option is we can provide an another interface as
> you mentioned pg_is_wal_replay_paluse_requeseted(), which can just
> return the request status.  I am not sure how useful it is.

If it is acceptable that pg_is_wal_replay_paused() makes users wait,
I'm OK with the current interface. I don't feel the need for
pg_is_wal_replay_paluse_requeseted().
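
For reference, a sketch of the option-1 semantics being discussed
(the function names are the existing ones):

    SELECT pg_wal_replay_pause();       -- only requests the pause
    SELECT pg_is_wal_replay_paused();   -- with the patch, waits until
                                        -- recovery has actually paused
    SELECT pg_wal_replay_resume();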

> 
> 2. Pass an option to pg_is_wal_replay_paused whether to wait for
> recovery to actually get paused or not.
> 
> 3. Pass an option to pg_wal_replay_pause(), whether to wait for
> recovery pause or just request and return.
> 
> I like the option 1, any other opinion on this?
> 
> > Also, it could wait for recovery_min_apply_delay if it has a valid value. 
> > It is possible
> > that a user set this parameter to a large value, so it could wait for a 
> > long time. However,
> > this will be avoided by calling recoveryPausesHere() 

RE: Parallel INSERT (INTO ... SELECT ...)

2021-01-18 Thread tsunakawa.ta...@fujitsu.com
From: Greg Nancarrow 
> On Fri, Jan 15, 2021 at 7:39 PM Amit Kapila  wrote:
> >Here, it seems we are accessing the relation descriptor without any
> >lock on the table which is dangerous considering the table can be
> >modified in a parallel session. Is there a reason why you think this
> >is safe? Did you check anywhere else such a coding pattern?
> 
> Yes, there's a very good reason and I certainly have checked for the
> same coding pattern elsewhere, and not just randomly decided that
> locking can be ignored.
> The table has ALREADY been locked (by the caller) during the
> parse/analyze phase.

Isn't there any case where planning is done but parse analysis is not done 
immediately before?  e.g.

* Alteration of objects invalidates cached query plans, and the next execution 
of the plan rebuilds it.  (But it seems that parse analysis is done in this 
case in plancache.c.)

* Execute a prepared statement with a different parameter value, which builds a 
new custom plan or a generic plan.

Is the cached query plan invalidated when some alteration is done to change the 
parallel safety, such as adding a trigger with a parallel-unsafe trigger action?
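
A concrete sketch of that scenario (the object names here are hypothetical):

    CREATE TABLE src (a int);
    CREATE TABLE target (a int);
    PREPARE ins AS INSERT INTO target SELECT * FROM src WHERE a = $1;
    EXECUTE ins(1);   -- a parallel INSERT plan may be built and cached here

    CREATE FUNCTION trg_fn() RETURNS trigger LANGUAGE plpgsql PARALLEL UNSAFE
    AS $$ BEGIN RETURN NEW; END; $$;
    CREATE TRIGGER t BEFORE INSERT ON target
        FOR EACH ROW EXECUTE FUNCTION trg_fn();

    EXECUTE ins(2);   -- is the cached plan invalidated by the new trigger?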


Regards
Takayuki Tsunakawa



Re: POC: postgres_fdw insert batching

2021-01-18 Thread Tomas Vondra

On 1/19/21 2:28 AM, tsunakawa.ta...@fujitsu.com wrote:

From: Tomas Vondra 

I took a look at this - there's a bit of bitrot due to
708d165ddb92c, so attached is a rebased patch (0001) fixing that.

0002 adds a couple comments and minor tweaks

0003 addresses a couple shortcomings related to explain - we
haven't been showing the batch size for EXPLAIN (VERBOSE), because
there'd be no FdwState, so this tries to fix that. Furthermore,
there were no tests for EXPLAIN output with batch size, so I added
a couple.


Thank you, good additions.  They all look good. Only one point: I
think the code for retrieving batch_size in create_foreign_modify()
can be replaced with a call to the new function in 0003.



OK. Can you prepare a final patch, squashing all the commits into a 
single one, and perhaps use the function in create_foreign_modify?
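
(For context, the EXPLAIN coverage added in 0003 would produce output
along these lines - the table names and the exact labels here are an
illustrative sketch, not copied from the patch:)

    EXPLAIN (VERBOSE, COSTS OFF)
    INSERT INTO ft_target SELECT * FROM local_src;

     Insert on public.ft_target
       Remote SQL: INSERT INTO public.target(a) VALUES ($1)
       Batch Size: 100
       ->  Seq Scan on public.local_src
             Output: local_src.a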



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pgindent for worker.c

2021-01-18 Thread Amit Kapila
On Tue, Jan 19, 2021 at 6:53 AM Peter Smith  wrote:
>
> PSA a trivial patch just to pgindent the file
> src/backend/replication/logical/worker.c
>
> (I am modifying this file in a separate patch, but every time I used
> pgindent for my own code I would keep seeing these existing format
> problems).
>

Sorry for the inconvenience. This seems to be a leftover from my
commit 0926e96c49, so I will take care of this. I think we need to
change this file in the upcoming patches for logical replication of
2PC so, I'll push this change separately.

-- 
With Regards,
Amit Kapila.




Re: [PATCH 1/1] Initial mach based shared memory support.

2021-01-18 Thread James Hilliard
On Mon, Jan 18, 2021 at 5:29 PM Tom Lane  wrote:
>
> James Hilliard  writes:
> > OSX implements sysv shmem support via a mach wrapper, however the mach
> > sysv shmem wrapper has some severe restrictions that prevent us from
> > allocating enough memory segments in some cases.
> > ...
> > In order to avoid hitting these limits we can bypass the wrapper layer
> > and just use mach directly.
>
> I wanted to review this, but it's impossible because the kernel calls
> you're using have you've-got-to-be-kidding documentation like this:
>
> https://developer.apple.com/documentation/kernel/1402504-mach_vm_page_info?language=objc
Yeah, the documentation situation with Apple tends to be some degree of terrible
in general. Ultimately I was referencing the sysv shmem wrapper code
itself a lot:
https://github.com/apple/darwin-xnu/blob/master/bsd/kern/sysv_shm.c

I also used the browser shmem implementations as references:
https://github.com/chromium/chromium/blob/master/base/memory/platform_shared_memory_region_mac.cc
https://github.com/mozilla/gecko-dev/blob/master/ipc/glue/SharedMemoryBasic_mach.mm
>
> Google finds the source code too, but that's equally bereft of useful
> documentation.  So I wonder where you obtained the information necessary
> to write this patch.
Mostly trial and error while using the above references.
>
> I did notice however that mach_shmem.c seems to be 95% copy-and-paste from
> sysv_shmem.c, including a lot of comments that don't feel particularly
> true or relevant here.  I wonder whether we need a separate file as
> opposed to a few #ifdef's around the kernel calls.
Well, I basically started with copying sysv_shmem.c and then started
replacing the sysv shmem calls with their mach equivalents. I kept it
as a separate file since the mach calls use different semantics, and
that's also how the other projects seem to separate out their mach
shared memory handlers. I left a number of the original comments since
I wasn't entirely sure what best to replace them with.
>
> regards, tom lane




Re: simplifying foreign key/RI checks

2021-01-18 Thread Amit Langote
On Tue, Jan 19, 2021 at 3:01 AM Pavel Stehule  wrote:
> On Mon, Jan 18, 2021 at 13:40, Amit Langote  wrote:
>> I started with the check that's performed when inserting into or
>> updating the referencing table to confirm that the new row points to a
>> valid row in the referenced relation.  The corresponding SQL is this:
>>
>> SELECT 1 FROM pk_rel x WHERE x.pkey = $1 FOR KEY SHARE OF x
>>
>> $1 is the value of the foreign key of the new row.  If the query
>> returns a row, all good.  Thanks to SPI, or its use of plan caching,
>> the query is re-planned only a handful of times before making a
>> generic plan that is then saved and reused, which looks like this:
>>
>>   QUERY PLAN
>> --
>>  LockRows
>>->  Index Scan using pk_pkey on pk x
>>  Index Cond: (a = $1)
>> (3 rows)
>
>
> What is the performance when the referenced table is small? A lot of
> codebooks are small, between 1,000 and 10K rows.

I see the same ~2x improvement.

create table p (a numeric primary key);
insert into p select generate_series(1, 1000);
create table f (a bigint references p);

Unpatched:

insert into f select i%1000+1 from generate_series(1, 100) i;
INSERT 0 100
Time: 5461.377 ms (00:05.461)


Patched:

insert into f select i%1000+1 from generate_series(1, 100) i;
INSERT 0 100
Time: 2357.440 ms (00:02.357)

That's expected because the overhead of using SPI to check the PK
table, which the patch gets rid of, is the same no matter the size of
the index to be scanned.

-- 
Amit Langote
EDB: http://www.enterprisedb.com




Re: psql \df choose functions by their arguments

2021-01-18 Thread Ian Lawrence Barwick
On Fri, Jan 15, 2021 at 1:46, Greg Sabino Mullane  wrote:

> Thanks for the feedback: new version v5 (attached) has int8, plus the
> suggested code formatting.
>
> Cheers,
> Greg
>

Thanks for the update.

In my preceding mail I meant we should add int2, int4 and int8 for
completeness (apologies, I was a bit unclear there), as AFAICS that
covers all the aliases, even if these three are less widely used.

FWIW one place where these do get used in substantial numbers is in the
regression tests themselves:

  $ for L in 2 4 8; do git grep int$L src/test/regress/ | wc -l; done
  544
  2332
  1353
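
As a usage sketch - assuming the patch's syntax of listing argument
types after the function-name pattern:

    \df generate_series int8 int8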

Regards

Ian Barwick

-- 
EnterpriseDB: https://www.enterprisedb.com


Re: PoC/WIP: Extended statistics on expressions

2021-01-18 Thread Tomas Vondra



On 1/18/21 4:48 PM, Dean Rasheed wrote:

Looking through extended_stats.c, I found a corner case that can lead
to a seg-fault:

CREATE TABLE foo();
CREATE STATISTICS s ON (1) FROM foo;
ANALYSE foo;

This crashes in lookup_var_attr_stats(), because it isn't expecting
nvacatts to be 0. I can't think of any case where building stats on a
table with no analysable columns is useful, so it should probably just
exit early in that case.


In BuildRelationExtStatistics(), it looks like min_attrs should be
declared assert-only.


In evaluate_expressions():

+   /* set the pointers */
+   result = (ExprInfo *) ptr;
+   ptr += sizeof(ExprInfo);

I think that should probably have a MAXALIGN().



Thanks, I'll fix all of that.



A slightly bigger issue that I don't like is the way it assigns
attribute numbers for expressions starting from
MaxHeapAttributeNumber+1, so the first expression has an attnum of
1601. That leads to pretty inefficient use of Bitmapsets, since most
tables only contain a handful of columns, and there's a large block of
unused space in the middle the Bitmapset.

An alternative approach might be to use regular attnums for columns
and use negative indexes -1, -2, -3, ... for expressions in the stored
stats. Then when storing and retrieving attnums from Bitmapsets, it
could just offset by STATS_MAX_DIMENSIONS (8) to avoid negative values
in the Bitmapsets, since there can't be more than that many
expressions (just like other code stores system attributes using
FirstLowInvalidHeapAttributeNumber).

That would be a somewhat bigger change, but hopefully fairly
mechanical, and then some code like add_expressions_to_attributes()
would go away.



Well, I tried this but unfortunately it's not that simple. We still need 
to build the bitmaps, so I don't think add_expression_to_attributes can 
be just removed. I mean, we need to do the offsetting somewhere, even if 
we change how we do it.


But the main issue is that in some cases the number of expressions is 
not really limited by STATS_MAX_DIMENSIONS - for example when applying 
functional dependencies, we "merge" multiple statistics, so we may end 
up with more expressions. So we can't just use STATS_MAX_DIMENSIONS.


Also, if we offset regular attnums by STATS_MAX_DIMENSIONS, that inverts 
the order of processing (so far we've assumed expressions are after 
regular attnums). So the changes are more extensive - I tried doing that 
anyway, and I'm still struggling with crashes and regression failures. 
Of course, that doesn't mean we shouldn't do it, but it's far from 
mechanical. (Some of that is probably a sign this code needs a bit more 
work to polish.)


But I wonder if it'd be easier to just calculate the actual max attnum 
and then use it instead of MaxHeapAttributeNumber ...




Looking at the new view pg_stats_ext_exprs, I noticed that it fails to
show expressions until the statistics have been built. For example:

CREATE TABLE foo(a int, b int);
CREATE STATISTICS s ON (a+b), (a*b) FROM foo;
SELECT statistics_name, tablename, expr, n_distinct FROM pg_stats_ext_exprs;

  statistics_name | tablename | expr | n_distinct
-+---+--+
  s   | foo   |  |
(1 row)

but after populating and analysing the table, this becomes:

  statistics_name | tablename |  expr   | n_distinct
-+---+-+
  s   | foo   | (a + b) | 11
  s   | foo   | (a * b) | 11
(2 rows)

I think it should show the expressions even before the stats have been built.

Another issue is that it returns rows for non-expression stats as
well. For example:

CREATE TABLE foo(a int, b int);
CREATE STATISTICS s ON a, b FROM foo;
SELECT statistics_name, tablename, expr, n_distinct FROM pg_stats_ext_exprs;

  statistics_name | tablename | expr | n_distinct
-+---+--+
  s   | foo   |  |
(1 row)

and those values will never be populated, since they're not
expressions, so I would expect them to not be shown in the view.

So basically, instead of

+ LEFT JOIN LATERAL (
+     SELECT *
+     FROM (
+         SELECT unnest(pg_get_statisticsobjdef_expressions(s.oid)) AS expr,
+                unnest(sd.stxdexpr)::pg_statistic AS a
+     ) x
+ ) stat ON sd.stxdexpr IS NOT NULL;

perhaps just

+ JOIN LATERAL (
+     SELECT *
+     FROM (
+         SELECT unnest(pg_get_statisticsobjdef_expressions(s.oid)) AS expr,
+                unnest(sd.stxdexpr)::pg_statistic AS a
+     ) x
+ ) stat ON true;


Will fix.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




RE: POC: postgres_fdw insert batching

2021-01-18 Thread tsunakawa.ta...@fujitsu.com
Tomas-san, Zhihong-san,

From: Zhihong Yu  
> +   if (batch_size <= 0)
> +   ereport(ERROR,
> +   (errcode(ERRCODE_SYNTAX_ERROR),
> +errmsg("%s requires a non-negative integer value",
> 
> It seems the message doesn't match the check w.r.t. the batch size of 0.

Ah, "non-negative" should be "positive".  The message for the existing 
fetch_size should be fixed too.  Tomas-san, could you include this as well?  
I'm sorry to trouble you.
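
(To illustrate: with the check "batch_size <= 0", a value of 0 - which
is non-negative - is still rejected, hence "positive" is the right
word. A hypothetical reproduction, with an assumed server name:)

    ALTER SERVER loopback OPTIONS (ADD batch_size '0');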


> +   int numInserted = numSlots;
> 
> Since numInserted is filled by ExecForeignBatchInsert(), the initialization 
> can be done with 0.

No, the code is correct, since the batch function requires the number of rows 
to insert as input.


Regards
Takayuki Tsunakawa



Paint some PG_USED_FOR_ASSERTS_ONLY in inline functions of ilist.h and bufpage.h

2021-01-18 Thread Michael Paquier
Hi all,

The following functions in ilist.h and bufpage.h use some arguments
only in assertions:
- dlist_next_node
- dlist_prev_node
- slist_has_next
- slist_next_node
- PageValidateSpecialPointer

Without PG_USED_FOR_ASSERTS_ONLY, this can lead to compilation
warnings when not using assertions, and one example of that is
plpgsql_check.  We don't have examples on HEAD where
PG_USED_FOR_ASSERTS_ONLY is used on arguments, but that seems to work
properly with gcc.

Thoughts?
--
Michael
diff --git a/src/include/lib/ilist.h b/src/include/lib/ilist.h
index aa196428ed..0c5e1d375f 100644
--- a/src/include/lib/ilist.h
+++ b/src/include/lib/ilist.h
@@ -418,7 +418,8 @@ dlist_has_prev(dlist_head *head, dlist_node *node)
  * Return the next node in the list (there must be one).
  */
 static inline dlist_node *
-dlist_next_node(dlist_head *head, dlist_node *node)
+dlist_next_node(dlist_head *head PG_USED_FOR_ASSERTS_ONLY,
+dlist_node *node)
 {
 	Assert(dlist_has_next(head, node));
 	return node->next;
@@ -428,7 +429,8 @@ dlist_next_node(dlist_head *head, dlist_node *node)
  * Return previous node in the list (there must be one).
  */
 static inline dlist_node *
-dlist_prev_node(dlist_head *head, dlist_node *node)
+dlist_prev_node(dlist_head *head PG_USED_FOR_ASSERTS_ONLY,
+dlist_node *node)
 {
 	Assert(dlist_has_prev(head, node));
 	return node->prev;
@@ -608,7 +610,8 @@ slist_pop_head_node(slist_head *head)
  * Check whether 'node' has a following node.
  */
 static inline bool
-slist_has_next(slist_head *head, slist_node *node)
+slist_has_next(slist_head *head PG_USED_FOR_ASSERTS_ONLY,
+			   slist_node *node)
 {
 	slist_check(head);
 
@@ -619,7 +622,8 @@ slist_has_next(slist_head *head, slist_node *node)
  * Return the next node in the list (there must be one).
  */
 static inline slist_node *
-slist_next_node(slist_head *head, slist_node *node)
+slist_next_node(slist_head *head PG_USED_FOR_ASSERTS_ONLY,
+slist_node *node)
 {
 	Assert(slist_has_next(head, node));
 	return node->next;
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 359b749f7f..85414a53d6 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -310,7 +310,7 @@ typedef PageHeaderData *PageHeader;
  * specifics from the macro failure within this function.
  */
 static inline bool
-PageValidateSpecialPointer(Page page)
+PageValidateSpecialPointer(Page page PG_USED_FOR_ASSERTS_ONLY)
 {
 	Assert(PageIsValid(page));
 	Assert(((PageHeader) (page))->pd_special <= BLCKSZ);


signature.asc
Description: PGP signature


RE: POC: postgres_fdw insert batching

2021-01-18 Thread tsunakawa.ta...@fujitsu.com
From: Tomas Vondra 
> I took a look at this - there's a bit of bitrot due to 708d165ddb92c, so 
> attached is
> a rebased patch (0001) fixing that.
> 
> 0002 adds a couple comments and minor tweaks
> 
> 0003 addresses a couple shortcomings related to explain - we haven't been
> showing the batch size for EXPLAIN (VERBOSE), because there'd be no
> FdwState, so this tries to fix that. Furthermore, there were no tests for 
> EXPLAIN
> output with batch size, so I added a couple.

Thank you, good additions.  They all look good.
Only one point: I think the code for retrieving batch_size in 
create_foreign_modify() can be replaced with a call to the new function in 0003.

God bless us.


Regards
Takayuki Tsunakawa



pgindent for worker.c

2021-01-18 Thread Peter Smith
PSA a trivial patch just to pgindent the file
src/backend/replication/logical/worker.c

(I am modifying this file in a separate patch, but every time I used
pgindent for my own code I would keep seeing these existing format
problems).


Kind Regards,
Peter Smith.
Fujitsu Australia


pgindent_worker.patch
Description: Binary data


RE: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2021-01-18 Thread Hou, Zhijie
> >>> +1 to add it after "dropped (Note )", how about as follows
> >>> with slight changes?
> >>>
> >>> dropped (Note that server name of an invalid connection can be NULL
> >>> if the server is dropped), and then such .
> >>
> >> Yes, I like this one. One question is; "should" or "is" is better
> >> than "can" in this case because the server name of invalid connection
> >> is always NULL when its server is dropped?
> >
> > I think "dropped (Note that server name of an invalid connection will
> > be NULL if the server is dropped), and then such ."
> 
> Sounds good to me. So patch attached.

+1

Best regards,
houzj




Re: list of extended statistics on psql

2021-01-18 Thread Tatsuro Yamada

Hi Tomas,


As for how to deal with this, I can think of a few ways:

1) simplify the command to only print information from pg_statistic_ext (so no 
information about which stats are built, or sizes)

2) extend pg_stats_ext with necessary information (e.g. sizes)

3) create a new system view, with necessary information (so that pg_stats_ext 
does not need to be modified)

4) add functions returning the necessary information, possibly only for 
statistics the user can access (similarly to what pg_stats_ext does)

Options 2-4 have the obvious disadvantage that this won't work on older 
releases (we can't add views or functions there). So I'm leaning towards #1 
even if that means we have to remove some of the details. We can consider 
adding that for new releases, though.



Thanks for the useful advice. I'll go with option 1).
I created the following query using pg_stats_ext instead of
pg_statistic_ext and pg_statistic_ext_data. However, I was unsure how to
write the part of the query that calculates the MCV size, because there
are four columns related to MCV: most_common_vals, most_common_val_nulls,
most_common_freqs, and most_common_base_freqs. Currently, I don't know how
to calculate the size of the MCV list using these four columns. Thoughts? :-)


Well, my suggestion was to use pg_statistic_ext, because that lists all 
statistics, while pg_stats_ext is filtering statistics depending on access 
privileges. I think that's more appropriate for \dX, the contents should not 
change depending on the user.

Also, let me clarify - with option (1) we'd not show the sizes at all. The size 
of the formatted statistics may be very different from the on-disk 
representation, so I see no point in showing it in \dX.

We might show other stats (e.g. number of MCV items, or the fraction of data 
represented by the MCV list), but the user can inspect pg_stats_ext if needed.

What we might do is to show those stats when a superuser is running this 
command, but I'm not sure that's a good idea (or how difficult would it be to 
implement).



Thanks for clarifying.
I see that your suggestion was to use pg_statistic_ext, not pg_stats_ext.
And we don't need the size of stats.

If that's the case, we also can't get the built/unbuilt status of stats,
because we can't use pg_statistic_ext_data (it only exists on PG12 or later),
as you know. Therefore, it would be better to go back to the old query that
I sent five months ago, like this:

# the old query
SELECT
stxnamespace::pg_catalog.regnamespace AS "Schema",
stxrelid::pg_catalog.regclass AS "Table",
stxname AS "Name",
(SELECT pg_catalog.string_agg(pg_catalog.quote_ident(attname),', ')
 FROM pg_catalog.unnest(stxkeys) s(attnum)
 JOIN pg_catalog.pg_attribute a ON (stxrelid = a.attrelid AND
 a.attnum = s.attnum AND NOT attisdropped)) AS "Columns",
'd' = any(stxkind) AS "Ndistinct",
'f' = any(stxkind) AS "Dependencies",
'm' = any(stxkind) AS "MCV"
FROM pg_catalog.pg_statistic_ext stat
ORDER BY 1,2;

 Schema | Table  |    Name    | Columns | Ndistinct | Dependencies | MCV
--------+--------+------------+---------+-----------+--------------+-----
 public | hoge1  | hoge1_ext  | a, b    | t         | t            | t
 public | hoge_t | hoge_t_ext | a, b    | t         | t            | t
(2 rows)


The above query is a bit too simple, so it would be better to use the following query:

# This query works on PG10 or later
SELECT
es.stxnamespace::pg_catalog.regnamespace::text AS "Schema",
es.stxname AS "Name",
pg_catalog.format('%s FROM %s',
(SELECT pg_catalog.string_agg(pg_catalog.quote_ident(a.attname),', ')
 FROM pg_catalog.unnest(es.stxkeys) s(attnum)
 JOIN pg_catalog.pg_attribute a
 ON (es.stxrelid = a.attrelid
 AND a.attnum = s.attnum
 AND NOT a.attisdropped)),
es.stxrelid::regclass) AS "Definition",
CASE WHEN 'd' = any(es.stxkind) THEN 'defined'
END AS "Ndistinct",
CASE WHEN 'f' = any(es.stxkind) THEN 'defined'
END AS "Dependencies",
CASE WHEN 'm' = any(es.stxkind) THEN 'defined'
END AS "MCV"
FROM pg_catalog.pg_statistic_ext es
ORDER BY 1, 2;

 Schema |    Name    |    Definition    | Ndistinct | Dependencies |   MCV
--------+------------+------------------+-----------+--------------+---------
 public | hoge1_ext  | a, b FROM hoge1  | defined   | defined      | defined
 public | hoge_t_ext | a, b FROM hoge_t | defined   | defined      | defined
(2 rows)


I'm going to create a WIP patch using the above query.
Any comments welcome. :-D

Thanks,
Tatsuro Yamada





Re: [PATCH 1/1] Initial mach based shared memory support.

2021-01-18 Thread Tom Lane
James Hilliard  writes:
> OSX implements sysv shmem support via a mach wrapper, however the mach
> sysv shmem wrapper has some severe restrictions that prevent us from
> allocating enough memory segments in some cases.
> ...
> In order to avoid hitting these limits we can bypass the wrapper layer
> and just use mach directly.

I wanted to review this, but it's impossible because the kernel calls
you're using have you've-got-to-be-kidding documentation like this:

https://developer.apple.com/documentation/kernel/1402504-mach_vm_page_info?language=objc

Google finds the source code too, but that's equally bereft of useful
documentation.  So I wonder where you obtained the information necessary
to write this patch.

I did notice however that mach_shmem.c seems to be 95% copy-and-paste from
sysv_shmem.c, including a lot of comments that don't feel particularly
true or relevant here.  I wonder whether we need a separate file as
opposed to a few #ifdef's around the kernel calls.

regards, tom lane




Re: Is it worth accepting multiple CRLs?

2021-01-18 Thread Kyotaro Horiguchi
At Fri, 15 Jan 2021 08:56:27 +0100, Peter Eisentraut 
 wrote in 
> On 2020-08-31 11:03, Kyotaro Horiguchi wrote:
> > At Tue, 18 Aug 2020 16:43:47 +0900 (JST), Kyotaro Horiguchi
> >  wrote in
> >> Thank you very much. I'll do that after some polishing.
> >>
> >> A near-by discussion about OpenSSL3.0 conflicts with this but it's
> >> easy to follow.
> > Rebased. Fixed bogus tests and strange tentative API change of
> > SSLServer.pm.  Corrected a (maybe) spelling mistake.  I'm going to
> > register this to the coming CF.
> 
> Other systems that offer both a CRL file and a CRL directory usually
> specify those using two separate configuration settings.  Examples:
> 
> https://dev.mysql.com/doc/refman/5.7/en/server-system-variables.html#sysvar_ssl_crlpath
> https://httpd.apache.org/docs/current/mod/mod_ssl.html#sslcarevocationpath
> 
> These are then presumably both passed to X509_STORE_load_locations(),
> which supports specifying a file and directory concurrently.
> 
> I think that would be a preferable approach.  In practical terms, it
> would allow a user to introduce the directory method gradually without
> having to convert the existing CRL file at the same time.

Thank you for the information. The only reason for sharing the same
variable for both the file and the directory was to avoid adding another
variable. I'll post a new version that adds a new GUC, ssl_crl_path.

By the way, we could do the same thing for the CA file/directory, but I
personally think the benefit of specify-by-directory is far smaller for
CA files than for CRL files. So I'm not going to do this for CA files
for now.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: popcount

2021-01-18 Thread David Fetter
On Mon, Jan 18, 2021 at 10:34:10AM -0500, Tom Lane wrote:
> Peter Eisentraut  writes:
> > [ assorted nits ]

fixed, I think.

> At the level of bikeshedding ... I quite dislike using the name "popcount"
> for these functions.  I'm aware that some C compilers provide primitives
> of that name, but I wouldn't expect a SQL programmer to know that;
> without that context the name seems pretty random and unintuitive.
> Moreover, it invites confusion with SQL's use of "pop" to abbreviate
> "population" in the statistical aggregates, such as var_pop().
> 
> Perhaps something along the lines of count_ones() or count_set_bits()
> would be more apropos.

Done that way.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
From 92f3d5bf1718b1b52a5a4d792f5c75e0bcc7fb74 Mon Sep 17 00:00:00 2001
From: David Fetter 
Date: Wed, 30 Dec 2020 02:51:46 -0800
Subject: [PATCH v4] popcount
To: hackers


Now it's accessible to SQL for the BIT VARYING and BYTEA types.

diff --git doc/src/sgml/func.sgml doc/src/sgml/func.sgml
index aa99665e2e..7626edeee7 100644
--- doc/src/sgml/func.sgml
+++ doc/src/sgml/func.sgml
@@ -4030,6 +4030,23 @@ SELECT format('Testing %3$s, %2$s, %s', 'one', 'two', 'three');
   
  
 
+  
+   
+
+ count_set_bits
+
+count_set_bits ( bytes bytea )
+bigint
+   
+   
+Counts the number of bits set in a binary string.
+   
+   
+count_set_bits('\xdeadbeef'::bytea)
+24
+   
+  
+
   

 
@@ -4830,6 +4847,23 @@ SELECT format('Testing %3$s, %2$s, %s', 'one', 'two', 'three');

   
 
+  
+   
+
+ count_set_bits
+
+count_set_bits ( bits bit )
+bigint
+   
+   
+Counts the bits set in a bit string.
+   
+   
+count_set_bits(B'101010101010101010')
+9
+   
+  
+
   

 
@@ -4869,6 +4903,7 @@ SELECT format('Testing %3$s, %2$s, %s', 'one', 'two', 'three');
 101010001010101010

   
+
  
 

diff --git src/include/catalog/pg_proc.dat src/include/catalog/pg_proc.dat
index b5f52d4e4a..416ee0c2fb 100644
--- src/include/catalog/pg_proc.dat
+++ src/include/catalog/pg_proc.dat
@@ -1446,6 +1446,9 @@
 { oid => '752', descr => 'substitute portion of string',
   proname => 'overlay', prorettype => 'bytea',
   proargtypes => 'bytea bytea int4', prosrc => 'byteaoverlay_no_len' },
+{ oid => '8436', descr => 'count set bits',
+  proname => 'count_set_bits', prorettype => 'int8', proargtypes => 'bytea',
+  prosrc => 'byteapopcount'},
 
 { oid => '725',
   proname => 'dist_pl', prorettype => 'float8', proargtypes => 'point line',
@@ -3865,6 +3868,9 @@
 { oid => '3033', descr => 'set bit',
   proname => 'set_bit', prorettype => 'bit', proargtypes => 'bit int4 int4',
   prosrc => 'bitsetbit' },
+{ oid => '8435', descr => 'count set bits',
+  proname => 'count_set_bits', prorettype => 'int8', proargtypes => 'bit',
+  prosrc => 'bitpopcount'},
 
 # for macaddr type support
 { oid => '436', descr => 'I/O',
diff --git src/backend/utils/adt/varbit.c src/backend/utils/adt/varbit.c
index 2235866244..c9c6c73422 100644
--- src/backend/utils/adt/varbit.c
+++ src/backend/utils/adt/varbit.c
@@ -36,6 +36,7 @@
 #include "libpq/pqformat.h"
 #include "nodes/nodeFuncs.h"
 #include "nodes/supportnodes.h"
+#include "port/pg_bitutils.h"
 #include "utils/array.h"
 #include "utils/builtins.h"
 #include "utils/varbit.h"
@@ -1878,3 +1879,30 @@ bitgetbit(PG_FUNCTION_ARGS)
 	else
 		PG_RETURN_INT32(0);
 }
+
+/*
+ * bitpopcount
+ *
+ * Returns the number of bits set in a bit string.
+ *
+ */
+Datum
+bitpopcount(PG_FUNCTION_ARGS)
+{
+	/*
+	 * There's really no chance of an overflow here because to get to
+	 * INT64_MAX set bits, an object would have to be an exbibyte long,
+	 * exceeding what PostgreSQL can currently store by a factor of 2^28.
+	 */
+	int64		popcount;
+	VarBit		*arg1 = PG_GETARG_VARBIT_P(0);
+	bits8		*p;
+	int			len;
+
+	p = VARBITS(arg1);
+	len = (VARBITLEN(arg1) + BITS_PER_BYTE - 1) / BITS_PER_BYTE;
+
+	popcount = pg_popcount((char *)p, len);
+
+	PG_RETURN_INT64(popcount);
+}
diff --git src/backend/utils/adt/varlena.c src/backend/utils/adt/varlena.c
index 479ed9ae54..3f1179a0e8 100644
--- src/backend/utils/adt/varlena.c
+++ src/backend/utils/adt/varlena.c
@@ -3439,6 +3439,22 @@ bytea_overlay(bytea *t1, bytea *t2, int sp, int sl)
 	return result;
 }
 
+/*
+ * popcount
+ */
+Datum
+byteapopcount(PG_FUNCTION_ARGS)
+{
+	bytea	*t1 = PG_GETARG_BYTEA_PP(0);
+	int		len;
+	int64	result;
+
+	len = VARSIZE_ANY_EXHDR(t1);
+	result = pg_popcount(VARDATA_ANY(t1), len);
+
+	PG_RETURN_INT64(result);
+}

Re: Key management with tests

2021-01-18 Thread Bruce Momjian
On Mon, Jan 18, 2021 at 04:38:47PM -0500, Robert Haas wrote:
> To me, it wouldn't make sense to commit a full README for a TDE
> feature that we don't have yet with a key management patch, but the
> way that they'll interact with each other has to be clear. The
> doc/database-encryption.sgml file that Bruce included in the patch is
> a decent start on explaining the design, though I think it needs more
> work and more details, perhaps including some of the things Andres
> mentioned.

Sure.

> To be honest, after reading over that SGML documentation a bit, I'm
> somewhat skeptical about whether it really makes sense to think about
> committing the key management part separately. It seems to have no use
> independent of the main feature, and it in fact embeds very specific

For usefulness, it does enable passphrase prompting for the TLS private
key.

> details of how the main feature is expected to work. For example, the
> documentation says that key #0 will be used for data files, and key #1
> for WAL. There seems to be no suggestion that the key management
> portion of this can be used to manage encryption keys generally for
> whatever purposes someone might have; it's all about the needs of a
> particular TDE implementation. Typically, we would not commit

We originally were going to have SQL-level keys, but many felt they
weren't useful.

> something like that separately, or only once the main patch was done,
> with the two commits occurring in a relatively short time period.
> Otherwise, as Bruce already noted, we can end up with something that
> is documented and visible to users but doesn't actually work yet.

Yep, that is the risk.

> Some more specific comments on data-encryption.sgml:
> 
> * The documentation explains that the purpose of having a WAL key
> separate from the data file key is so that the data file keys can
> "eventually" be rotated. It's not clear whether this means that we
> might eventually have that feature or that we might eventually be able
> to rotate, after failing over. If this kind of thing is possible,
> we'll eventually need documentation on how to do it.

I have clarified that by saying "future release".

> * The reasons for using a DEK and a KEK are not explained. I realize
> it's not an uncommon practice and that other systems do it, but I
> think a few sentences of explanation wouldn't be a bad idea. Even if
> we are supposing that hackers who want to have input into this feature
> have to be knowledgeable about cryptography, I don't think we can
> reasonably suppose that for users.

I added a little about that in the docs.

> * "For example" is at one point followed by a period rather than a
> colon or comma.

Fixed.

> * In the "Internals" subsection, the last sentence doesn't seem to be
> grammatical. I wonder if it's missing the word "or".

Fixed.

> * The part about integrity-checking keys on startup isn't clear. It
> makes it sound like we still have a copy of the KEK lying around
> someplace against which we can compare, which I assume is not the case
> since it would be really insecure.

I reworded that entire section.  See if it is better now.

> * I think it's going to be pretty important that we can easily switch
> to other cryptographic algorithms as they are discovered, so I don't
> like the fact that this is tied specifically to AES. (In fact,
> kmgr_utils.h makes it sound like we're specifically locked into
> AES256, but that contradicts the documentation, so I guess there's
> some clarification needed here about what exactly KMGR_CLUSTER_KEY_LEN
> is doing.) As far as possible we should try to make this generic, like
> supporting any cipher that SSL has which has property X. It seems
> relatively inevitable that every currently popular cryptographic
> algorithm will at some point in the future be judged weak and
> worthless, just as has already happened with MD5 and some variants of
> SHA, both of which used to be considered state of the art. It seems
> equally inevitable that new and stronger algorithms will continue to
> be devised, and we'll want to adopt those easily.

That is a nifty idea.  Right now I just pass the integer length around
and store it in pg_control, but if we define macros, we can abstract
this and easily allow for new methods.  If others like that approach, I
will start on it now.
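As a rough illustration (purely a sketch with hypothetical names, not the
patch's code), the idea would be to describe each supported cipher once and
have pg_control record which entry is in use, instead of hard-wiring AES256:

typedef struct EncryptionCipher
{
	const char *name;		/* user-visible name, e.g. "AES256" */
	int			key_len;	/* key length in bytes */
	int			block_len;	/* cipher block length in bytes */
} EncryptionCipher;

static const EncryptionCipher encryption_ciphers[] = {
	{"AES128", 16, 16},
	{"AES256", 32, 16}
	/* future algorithms get added here without touching callers */
};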

> I'm not sure to what extent this is a serious flaw in the patch and to
> what extent it's just a matter of tweaking the wording of some things,
> but I think this is actually an extremely critical design point where
> we had better be certain we've got it right. Few things would be
> sadder than to get a TDE feature and then have to rip it out again
> because it couldn't be upgraded to work with newer crypto algorithms
> with reasonable effort.

Yep.

> Notes on other parts of the documentation:
> 
> * The documentation for initdb -K doesn't list the valid values of the
> parameter, only the default. Probably we should be specifying an

Fixed.

> algorithm 

Re: NOT VALID for Unique Indexes

2021-01-18 Thread japin


On Fri, 15 Jan 2021 at 00:22, Simon Riggs  wrote:
> As you may be aware the NOT VALID qualifier currently only applies to
> CHECK and FK constraints, but not yet to unique indexes. I have had
> customer requests to change that.
>
> It's a reasonably common requirement to be able to change an index
> to/from a unique index, i.e. Unique -> NonUnique or NonUnique to
> Unique. Previously, it was easy enough to do that using a catalog
> update, but with security concerns  and the fact that the optimizer
> uses the uniqueness to optimize queries means that there is a gap in
> our support. We obviously need to scan the index to see if it actually
> can be marked as unique.
>
> In terms of locking we need to exclude writes while we add uniqueness,
> so scanning the index to check it is unique would cause problems. So
> we need to do the same thing as we do with other constraint types: add
> the constraint NOT VALID in one transaction and then later validate it
> in a separate transaction (if ever).
>
> I present a WIP patch to show it's a small patch to change Uniqueness
> for an index, with docs and tests.
>
> ALTER INDEX SET [NOT] UNIQUE [NOT VALID]
> ALTER INDEX VALIDATE UNIQUE
>
> It doesn't do the index validation scan (yet), but I wanted to check
> acceptability, syntax and requirements before I do that.
>
> I can also add similar syntax for UNIQUE and PK constraints.
>
> Thoughts please?

Great! I have some questions.

1. In the patch, you add a new attribute named "induniquevalid" in pg_index,
   however, there is a "indisvalid" in pg_index, can we use "indisvalid"?

2. The foreign key and CHECK constraints are valid by using
   ALTER TABLE .. ADD table_constraint [ NOT VALID ]
   ALTER TABLE .. VALIDATE CONSTRAINT constraint_name

   Should we implement unique index valid/not valid same as foreign key and
   CHECK constraints?

3. If we use the syntax to valid/not valid the unique, should we support
   other constraints, such as foreign key and CHECK constraints?

-- 
Regrads,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




Re: [WIP] UNNEST(REFCURSOR): allowing SELECT to consume data from a REFCURSOR

2021-01-18 Thread Massimo Fidanza
This is an interesting feature, but it seems that the author has abandoned
development. What happens now? Will this be postponed from commitfest to
commitfest and never be taken over by anyone?

Massimo.

On Fri, 6 Mar 2020 at 22:36, Dent John wrote:

> > On 22 Feb 2020, at 10:38, Dent John  wrote:
> >
> >> On 18 Feb 2020, at 03:03, Thomas Munro  wrote:
> >>
> >> From the trivialities department, I see a bunch of warnings about
> >> local declaration placement (we're still using C90 rules for those by
> >> project policy):
> >>
> >> […]
> >
> > […]
>
> My bad. I missed one declaration.
>
> Another patch attached.
>
> d.
>


Re: Key management with tests

2021-01-18 Thread Tom Kincaid
I met with Bruce and Stephen this afternoon to discuss the feedback we
have received so far on this patch (prior to Robert's note, which I
haven't fully digested yet).

Here is what we plan to do:

1) Bruce is going to gather all the details from the Wiki and build a
README for the TDE Key Management patch. In addition, it will include
details about the implementation, the data structures involved, the
locks that are taken, and the general technical approach.

2) Stephen is going to write up the overall design of TDE.

Between these two patches, we hope to cover what Andres is asking for
and what Robert is asking for in his reply on this thread, which I
haven't fully digested yet.


Stephen's documentation patch will also make reference to Neil Chen's
TDE prototype for making use of this Key Management patch to encrypt
and
decrypt heap pages as well as index pages.

https://www.postgresql.org/message-id/CAA3qoJ=qto5jcsbjqfdbt9ikux9xkmc5bxcrd7ryse+xsme...@mail.gmail.com

3) Tom will work to find somebody who will sign up as a reviewer upon
the next submission of this patch. (Somebody who is not an author).

Could we get feedback on whether this feels like enough to get this
patch (which will include just the Key Management portion of TDE) to a
state where it can be reviewed and, assuming the review issues are
resolved with consensus, committed?

On Mon, Jan 18, 2021 at 2:00 PM Andres Freund  wrote:
>
> On 2021-01-18 13:58:20 -0500, Bruce Momjian wrote:
> > On Mon, Jan 18, 2021 at 09:42:54AM -0800, Andres Freund wrote:
> > > Personally, but I admit that there's legitimate reasons to differ on
> > > that note, I don't think it's reasonable for a feature this invasive to
> > > commit preliminary patches without the major subsequent patches being in
> > > a shape that allows reviewing the whole picture.
> >
> > OK, if that is a requirement, I can't help anymore since there are
> > already complaints that the patch is too large to review, even if broken
> > into pieces.  Please let me know what the community decides.
>
> Those aren't conflicting demands. Having later patches around to
> validate the design of earlier patches doesn't necessitate that the
> later patches need to be reviewed at the same time.



-- 
Thomas John Kincaid




Re: Odd, intermittent failure in contrib/pageinspect

2021-01-18 Thread Tom Lane
Alvaro Herrera  writes:
> On 2021-Jan-18, Tom Lane wrote:
>> [ thinks for a bit... ]  Does the checkpointer pin pages it's writing
>> out?  I guess it'd have to ...

> It does, per SyncOneBuffer(), called from BufferSync(), called from
> CheckPointBuffers().

Right, then we don't need any strange theories about autovacuum,
just bad timing luck.  whelk does seem pretty slow, so it's not
much of a stretch to imagine that it's more susceptible to this
corner case than faster machines.

So, do we have any other tests that are invoking a manual vacuum
and assuming it won't skip any pages?  By this theory, they'd
all be failures waiting to happen.

regards, tom lane




Re: search_plan_tree(): handling of non-leaf CustomScanState nodes causes segfault

2021-01-18 Thread Tom Lane
David Geier  writes:
> On 18.01.21 19:46, Tom Lane wrote:
>> Hm.  I agree that we shouldn't simply assume that ss_currentRelation
>> isn't null.  However, we cannot make search_plan_tree() descend
>> through non-leaf CustomScan nodes, because we don't know what processing
>> is involved there.  We need to find a scan that is guaranteed to return
>> rows that are one-to-one with the cursor output.  This is why the function
>> doesn't descend through join or aggregation nodes, and I see no argument
>> by which we should assume we know more about what a customscan node will
>> do than we know about those.

> That makes sense. Thanks for the explanation.

OK, cool.  I was afraid you'd argue that you really needed your CustomScan
node to be transparent in such cases.  We could imagine inventing an
additional custom-scan-provider callback to embed the necessary knowledge,
but I'd rather not add the complexity until someone has a use-case.

> I updated the patch to match your proposal.

WFM, will push in a bit.

regards, tom lane




Re: Odd, intermittent failure in contrib/pageinspect

2021-01-18 Thread Alvaro Herrera
On 2021-Jan-18, Tom Lane wrote:

> Right.  If that's the explanation, then adding DISABLE_PAGE_SKIPPING
> to the test's VACUUM options should fix it.  However, to believe that
> theory you have to have some reason to think that some other process
> might have the page pinned.  What would that be?  test1 only has one
> small tuple in it, so it doesn't seem credible that autovacuum or
> autoanalyze would have fired on it.

I guess the machine would have to be pretty constrained.  (It takes
almost seven minutes to go through the pg_upgrade test, so it does seem
small.)

> [ thinks for a bit... ]  Does the checkpointer pin pages it's writing
> out?  I guess it'd have to ...

It does, per SyncOneBuffer(), called from BufferSync(), called from
CheckPointBuffers().

-- 
Álvaro Herrera                            39°49'30"S 73°17'W




Re: Odd, intermittent failure in contrib/pageinspect

2021-01-18 Thread Tom Lane
Alvaro Herrera  writes:
> On 2021-Jan-18, Tom Lane wrote:
>> Searching the buildfarm logs turned up exactly one previous occurrence,
>> also on whelk [2].  So I'm not sure what to make of it.  Could the
>> immediately preceding VACUUM FREEZE command have silently skipped this
>> page for some reason?  That'd be a bug I should think.

> Hmm, doesn't vacuum skip pages when they are pinned?  I don't think
> VACUUM FREEZE would be treated especially -- only "aggressive"
> wraparound would be an exception, IIRC.

Right.  If that's the explanation, then adding DISABLE_PAGE_SKIPPING
to the test's VACUUM options should fix it.  However, to believe that
theory you have to have some reason to think that some other process
might have the page pinned.  What would that be?  test1 only has one
small tuple in it, so it doesn't seem credible that autovacuum or
autoanalyze would have fired on it.

[ thinks for a bit... ]  Does the checkpointer pin pages it's writing
out?  I guess it'd have to ...

regards, tom lane




Re: Odd, intermittent failure in contrib/pageinspect

2021-01-18 Thread Alvaro Herrera
On 2021-Jan-18, Tom Lane wrote:

> Searching the buildfarm logs turned up exactly one previous occurrence,
> also on whelk [2].  So I'm not sure what to make of it.  Could the
> immediately preceding VACUUM FREEZE command have silently skipped this
> page for some reason?  That'd be a bug I should think.

Hmm, doesn't vacuum skip pages when they are pinned?  I don't think
VACUUM FREEZE would be treated especially -- only "aggressive"
wraparound would be an exception, IIRC.  This would reflect in the
relfrozenxid for the table after vacuum, but I'm not sure if there's a
decent way to make the regression tests reflect that.

> Also, not really a bug, but why is this test script running exactly
> the same query twice in a row?  If that's of value, and not just a
> copy-and-paste error, the comments sure don't explain why.  But what
> it looks like is that these queries were different when first added,
> and then 58b4cb30a5b made them the same when it probably should have
> removed one.

Agreed.

-- 
Álvaro Herrera                            39°49'30"S 73°17'W




Re: CheckpointLock needed in CreateCheckPoint()?

2021-01-18 Thread Tom Lane
Robert Haas  writes:
> On Mon, Jan 18, 2021 at 3:25 PM Tom Lane  wrote:
>> If memory serves, the reason for the lock was that the CHECKPOINT
>> command used to run the checkpointing code directly in the calling
>> backend, so we needed it to keep more than one process from doing
>> that at once.  AFAICS, it's no longer possible for more than one
>> process to try to run that code concurrently, so we shouldn't need
>> the lock anymore.

> Interesting. I think that must have been a *very* long time ago.

Yeah.  Digging further, it looks like I oversimplified things above:
we once launched special background-worker-like processes for checkpoints,
and there could be more than one at a time.  That seems to have changed
here:

Author: Tom Lane 
Branch: master Release: REL8_0_BR [076a055ac] 2004-05-29 22:48:23 +

Separate out bgwriter code into a logically separate module, rather
than being random pieces of other files.  Give bgwriter responsibility
for all checkpoint activity (other than a post-recovery checkpoint);
so this child process absorbs the functionality of the former transient
checkpoint and shutdown subprocesses.

At the time I thought the CheckpointLock had become totally pro forma,
but there are later references to having to prevent a checkpoint from
running concurrently with a restartpoint, which was originally done
within the startup/WAL-replay process.  It looks like that changed here:

Author: Heikki Linnakangas 
Branch: master Release: REL8_4_BR [7e48b77b1] 2009-06-25 21:36:00 +

Fix some serious bugs in archive recovery, now that bgwriter is active
during it:

When bgwriter is active, the startup process can't perform mdsync() 
correctly
because it won't see the fsync requests accumulated in bgwriter's private
pendingOpsTable. Therefore make bgwriter responsible for the end-of-recovery
checkpoint as well, when it's active.

(The modern checkpointer process wasn't split out of bgwriter until
806a2aee3 of 2011-11-01.)

regards, tom lane




Odd, intermittent failure in contrib/pageinspect

2021-01-18 Thread Tom Lane
whelk failed today [1] with this surprising symptom:

--- snip ---
diff -w -U3 C:/buildfarm/buildenv/HEAD/pgsql.build/contrib/pageinspect/expected/page.out C:/buildfarm/buildenv/HEAD/pgsql.build/contrib/pageinspect/results/page.out
--- C:/buildfarm/buildenv/HEAD/pgsql.build/contrib/pageinspect/expected/page.out	2020-03-08 09:00:35.036254700 +0100
+++ C:/buildfarm/buildenv/HEAD/pgsql.build/contrib/pageinspect/results/page.out	2021-01-18 22:10:10.889655500 +0100
@@ -90,8 +90,8 @@
 FROM heap_page_items(get_raw_page('test1', 0)),
      LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2);
  t_infomask | t_infomask2 |                         raw_flags                          |   combined_flags   
-------------+-------------+------------------------------------------------------------+--------------------
-       2816 |           2 | {HEAP_XMIN_COMMITTED,HEAP_XMIN_INVALID,HEAP_XMAX_INVALID} | {HEAP_XMIN_FROZEN}
+------------+-------------+------------------------------------------+------------------
+       2304 |           2 | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
 (1 row)
 
 -- output the decoded flag HEAP_XMIN_FROZEN instead
@@ -99,8 +99,8 @@
 FROM heap_page_items(get_raw_page('test1', 0)),
      LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2);
  t_infomask | t_infomask2 |                         raw_flags                          |   combined_flags   
-------------+-------------+------------------------------------------------------------+--------------------
-       2816 |           2 | {HEAP_XMIN_COMMITTED,HEAP_XMIN_INVALID,HEAP_XMAX_INVALID} | {HEAP_XMIN_FROZEN}
+------------+-------------+------------------------------------------+------------------
+       2304 |           2 | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
 (1 row)
 
 -- tests for decoding of combined flags
--- snip ---
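For reference, decoding the two t_infomask values above (flag values as I
recall them from htup_details.h, so verify before relying on them):

/*
 * 2816 = 0x0B00 = HEAP_XMIN_COMMITTED (0x0100) | HEAP_XMIN_INVALID (0x0200)
 *                 | HEAP_XMAX_INVALID (0x0800)
 *                 -- XMIN_COMMITTED|XMIN_INVALID together mean XMIN_FROZEN
 * 2304 = 0x0900 = HEAP_XMIN_COMMITTED (0x0100) | HEAP_XMAX_INVALID (0x0800)
 *                 -- committed but not frozen, i.e. the tuple escaped freezing
 */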

Searching the buildfarm logs turned up exactly one previous occurrence,
also on whelk [2].  So I'm not sure what to make of it.  Could the
immediately preceding VACUUM FREEZE command have silently skipped this
page for some reason?  That'd be a bug I should think.

Also, not really a bug, but why is this test script running exactly
the same query twice in a row?  If that's of value, and not just a
copy-and-paste error, the comments sure don't explain why.  But what
it looks like is that these queries were different when first added,
and then 58b4cb30a5b made them the same when it probably should have
removed one.

regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=whelk&dt=2021-01-18%2020%3A42%3A13
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=whelk&dt=2020-04-17%2023%3A42%3A10




Re: CheckpointLock needed in CreateCheckPoint()?

2021-01-18 Thread Robert Haas
On Mon, Jan 18, 2021 at 3:25 PM Tom Lane  wrote:
> If memory serves, the reason for the lock was that the CHECKPOINT
> command used to run the checkpointing code directly in the calling
> backend, so we needed it to keep more than one process from doing
> that at once.  AFAICS, it's no longer possible for more than one
> process to try to run that code concurrently, so we shouldn't need
> the lock anymore.

Interesting. I think that must have been a *very* long time ago.
Perhaps 076a055acf3c55314de267c62b03191586d79cf6 from 2004?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Key management with tests

2021-01-18 Thread Robert Haas
On Mon, Jan 18, 2021 at 2:00 PM Tom Kincaid  wrote:
> Some of the missing things you mention above are about the design of
> TDE  feature in general. However, this patch is about Key Management
> which is going part of the larger TDE feature. So it feels as though
> there is the need for a general design document about the overall
> vision / approach for TDE and a specific design doc. for Key
> Management. Is it appropriate to include both of those in the same
> patch?

To me, it wouldn't make sense to commit a full README for a TDE
feature that we don't have yet with a key management patch, but the
way that they'll interact with each other has to be clear. The
doc/database-encryption.sgml file that Bruce included in the patch is
a decent start on explaining the design, though I think it needs more
work and more details, perhaps including some of the things Andres
mentioned.

To be honest, after reading over that SGML documentation a bit, I'm
somewhat skeptical about whether it really makes sense to think about
committing the key management part separately. It seems to have no use
independent of the main feature, and it in fact embeds very specific
details of how the main feature is expected to work. For example, the
documentation says that key #0 will be used for data files, and key #1
for WAL. There seems to be no suggestion that the key management
portion of this can be used to manage encryption keys generally for
whatever purposes someone might have; it's all about the needs of a
particular TDE implementation. Typically, we would not commit
something like that separately, or only once the main patch was done,
with the two commits occurring in a relatively short time period.
Otherwise, as Bruce already noted, we can end up with something that
is documented and visible to users but doesn't actually work yet.

Some more specific comments on data-encryption.sgml:

* The documentation explains that the purpose of having a WAL key
separate from the data file key is so that the data file keys can
"eventually" be rotated. It's not clear whether this means that we
might eventually have that feature or that we might eventually be able
to rotate, after failing over. If this kind of thing is possible,
we'll eventually need documentation on how to do it.

* The reasons for using a DEK and a KEK are not explained. I realize
it's not an uncommon practice and that other systems do it, but I
think a few sentences of explanation wouldn't be a bad idea (a
conceptual sketch follows these comments). Even if we are supposing
that hackers who want to have input into this feature have to be
knowledgeable about cryptography, I don't think we can reasonably
suppose that for users.

* "For example" is at one point followed by a period rather than a
colon or comma.

* In the "Internals" subsection, the last sentence doesn't seem to be
grammatical. I wonder if it's missing the word "or".

* The part about integrity-checking keys on startup isn't clear. It
makes it sound like we still have a copy of the KEK lying around
someplace against which we can compare, which I assume is not the case
since it would be really insecure.

* I think it's going to be pretty important that we can easily switch
to other cryptographic algorithms as they are discovered, so I don't
like the fact that this is tied specifically to AES. (In fact,
kmgr_utils.h makes it sound like we're specifically locked into
AES256, but that contradicts the documentation, so I guess there's
some clarification needed here about what exactly KMGR_CLUSTER_KEY_LEN
is doing.) As far as possible we should try to make this generic, like
supporting any cipher that SSL has which has property X. It seems
relatively inevitable that every currently popular cryptographic
algorithm will at some point in the future be judged weak and
worthless, just as has already happened with MD5 and some variants of
SHA, both of which used to be considered state of the art. It seems
equally inevitable that new and stronger algorithms will continue to
be devised, and we'll want to adopt those easily.

I'm not sure to what extent this is a serious flaw in the patch and to
what extent it's just a matter of tweaking the wording of some things,
but I think this is actually an extremely critical design point where
we had better be certain we've got it right. Few things would be
sadder than to get a TDE feature and then have to rip it out again
because it couldn't be upgraded to work with newer crypto algorithms
with reasonable effort.
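To make the DEK/KEK comment above concrete, here is a conceptual sketch
only -- the helper names are hypothetical, not the patch's API:

/*
 * The KEK is derived from the cluster passphrase and never written out;
 * each DEK is random and stored only in wrapped (encrypted) form, so
 * rotating the passphrase just re-wraps the DEKs without rewriting data.
 */
#define NUM_DEKS	2			/* key #0: data files, key #1: WAL */

static void
bootstrap_keys_sketch(const char *passphrase)
{
	uint8		kek[32];
	uint8		dek[NUM_DEKS][32];

	derive_kek_from_passphrase(passphrase, kek);	/* hypothetical */

	for (int i = 0; i < NUM_DEKS; i++)
	{
		fill_with_strong_random(dek[i], sizeof(dek[i]));	/* hypothetical */
		store_wrapped_key(i, wrap_key(kek, dek[i], sizeof(dek[i])));
	}
}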

Notes on other parts of the documentation:

* The documentation for initdb -K doesn't list the valid values of the
parameter, only the default. Probably we should be specifying an
algorithm here and not just a bit count. Otherwise, like I say above,
what happens when AES gives way to something else? It'd be easy to say
-K BFT256 instead of -K AES256, but if AES is assumed and it's no
longer what we want them we have problems. This kind of thing probably
needs to be cleaned up in a bunch of places.

* I don't see the 

Re: Deleting older versions in unique indexes to avoid page splits

2021-01-18 Thread Peter Geoghegan
On Mon, Jan 18, 2021 at 1:10 PM Victor Yegorov  wrote:
> I must admit that it's a bit difficult to understand you here (at least
> for me).
>
> I assume that by "bet" you mean a flagged tuple that we marked as such
> (should we implement the suggested case).
> As heapam will give up early in case there are no deletable tuples, why do
> you say that the "bet" is no longer low? In the end, we still decide between
> a page split (and index bloat) vs a beneficial space cleanup.

Well, as I said, there are various ways in which our inferences (say
the ones in nbtdedup.c) are likely to be wrong. You understand this
already. For example, obviously if there are two duplicate index
tuples pointing to the same heap page then it's unlikely that both
will be deletable, and there is even a fair chance that neither will
be (for many reasons). I think that it's important to justify why we
use stuff like that to drive our decisions -- the reasoning really
matters. It's very much not like the usual optimization problem thing.
It's a tricky thing to discuss.

I don't assume that I understand all workloads, or how I might
introduce regressions. It follows that I should be extremely
conservative about imposing new costs here. It's good that we
currently know of no workloads that the patch is likely to regress,
but absence of evidence isn't evidence of absence. I personally will
never vote for a theoretical risk with only a theoretical benefit. And
right now that's what the idea of doing bottom-up deletions in more
marginal cases (the page flag thing) looks like.

--
Peter Geoghegan




Re: search_plan_tree(): handling of non-leaf CustomScanState nodes causes segfault

2021-01-18 Thread David Geier

Hi,

On 18.01.21 19:46, Tom Lane wrote:

David Geier  writes:

search_plan_tree() assumes that
CustomScanState::ScanState::ss_currentRelation is never NULL. In my
understanding that only holds for CustomScanState nodes which are at the
bottom of the plan and actually read from a relation. CustomScanState
nodes which are not at the bottom don't have ss_currentRelation set. I
believe for such nodes, instead search_plan_tree() should recurse into
CustomScanState::custom_ps.

Hm.  I agree that we shouldn't simply assume that ss_currentRelation
isn't null.  However, we cannot make search_plan_tree() descend
through non-leaf CustomScan nodes, because we don't know what processing
is involved there.  We need to find a scan that is guaranteed to return
rows that are one-to-one with the cursor output.  This is why the function
doesn't descend through join or aggregation nodes, and I see no argument
by which we should assume we know more about what a customscan node will
do than we know about those.

That makes sense. Thanks for the explanation.


So I'm inclined to think a suitable fix is just

-   if (RelationGetRelid(sstate->ss_currentRelation) == table_oid)
+   if (sstate->ss_currentRelation &&
+   RelationGetRelid(sstate->ss_currentRelation) == table_oid)
 result = sstate;

regards, tom lane



I updated the patch to match your proposal.

Best regards,
David
Swarm64
diff --git a/src/backend/executor/execCurrent.c 
b/src/backend/executor/execCurrent.c
index c7f909241b..fb3cf9537c 100644
--- a/src/backend/executor/execCurrent.c
+++ b/src/backend/executor/execCurrent.c
@@ -330,7 +330,8 @@ search_plan_tree(PlanState *node, Oid table_oid,
{
ScanState  *sstate = (ScanState *) node;
 
-   if 
(RelationGetRelid(sstate->ss_currentRelation) == table_oid)
+   if (sstate->ss_currentRelation &&
+   
RelationGetRelid(sstate->ss_currentRelation) == table_oid)
result = sstate;
break;
}


Re: search_plan_tree(): handling of non-leaf CustomScanState nodes causes segfault

2021-01-18 Thread Tom Lane
Zhihong Yu  writes:
> I was thinking that, if sstate->ss_currentRelation is null for the other
> cases, that would be a bug.
> An assertion can be added for the cases ending with T_TidScanState.

Maybe, but there are surely a lot of other places that would crash
in such a case --- places far more often traversed than search_plan_tree.
I do not see any value in complicating search_plan_tree for that.

regards, tom lane




Re: Deleting older versions in unique indexes to avoid page splits

2021-01-18 Thread Victor Yegorov
Mon, 18 Jan 2021 at 21:43, Peter Geoghegan :

> In the end, I couldn't justify imposing this cost on anything other
> than a non-HOT updater, which is what the flag proposal would require
> me to do -- then it's not 100% clear that the relative cost of each
> "bet" placed in heapam.c really is extremely low (we can no longer say
> for sure that we have very little to lose, given the certain
> alternative that is a version churn page split).


I must admit that it's a bit difficult to understand you here (at least
for me).

I assume that by "bet" you mean a flagged tuple that we marked as such
(should we implement the suggested case).
As heapam will give up early in case there are no deletable tuples, why do
you say that the "bet" is no longer low? In the end, we still decide between
a page split (and index bloat) vs a beneficial space cleanup.

> The fact is that
> "version chains in indexes" still cannot get very long. Plus there are
> other subtle ways in which it's unlikely to be a problem (e.g. the
> LP_DEAD deletion stuff also got much better, deduplication can combine
> with bottom-up deletion so that we'll trigger a useful bottom-up
> deletion pass sooner or later without much downside, and possibly
> other things I haven't even thought of).
>

I agree with this, except for "version chains" not being long. It all
really depends on the data distribution. It's perfectly common to find
indexes supporting FK constraints on highly skewed sets, with 80% of the
index belonging to a single value (say, huge business customer vs several
thousands of one-time buyers).


-- 
Victor Yegorov


Re: search_plan_tree(): handling of non-leaf CustomScanState nodes causes segfault

2021-01-18 Thread Zhihong Yu
Hi, Tom:
I was thinking that, if sstate->ss_currentRelation is null for the other
cases, that would be a bug.
An assertion can be added for the cases ending with T_TidScanState.
Though, the null sstate->ss_currentRelation would surface immediately
(apart from assertion). So I omitted the assertion in the diff.

Cheers

On Mon, Jan 18, 2021 at 12:16 PM Tom Lane  wrote:

> Zhihong Yu  writes:
> > It seems sstate->ss_currentRelation being null can only
> > occur for T_ForeignScanState and T_CustomScanState.
> > What about the following change ?
>
> Seems like more code for no very good reason.
>
> regards, tom lane
>


Re: Narrow the scope of the variable outputstr in logicalrep_write_tuple

2021-01-18 Thread Tom Lane
japin  writes:
> I find that the outputstr variable in logicalrep_write_tuple() is only used
> in the `else` branch. I think we can narrow the scope, just like the variable
> outputbytes in the `if` branch (for better readability).

Agreed, done.

For context, I'm not usually in favor of making one-off stylistic
improvements: the benefit seldom outweighs the risk of creating
merge hazards for future back-patching.  But in this case, the
code involved is mostly new in v14, so improving it now doesn't
cost anything in that way.

regards, tom lane




Re: Add primary keys to system catalogs

2021-01-18 Thread Joel Jacobson
On Mon, Jan 18, 2021, at 19:33, Tom Lane wrote:
> On second thought, a catalog is overkill; it'd only be useful if the data
> could change after initdb, which this data surely cannot.  The right way
> to expose such info to SQL is with a set-returning function reading a
> constant table in the C code, a la pg_get_keywords().  That way takes a
> whole lot less infrastructure and is just as useful to SQL code.

+1
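For anyone who hasn't written one, here is a minimal sketch of such a
constant-table SRF, loosely modeled on pg_get_keywords() -- the function
name, columns, and table contents below are made up for illustration:

#include "postgres.h"

#include "access/htup_details.h"
#include "catalog/pg_type.h"
#include "funcapi.h"
#include "utils/builtins.h"

typedef struct CatalogKey
{
	const char *relname;
	const char *keycols;
} CatalogKey;

/* compile-time constant table; cannot change after initdb anyway */
static const CatalogKey catalog_keys[] =
{
	{"pg_class", "oid"},
	{"pg_attribute", "attrelid, attnum"}
};

PG_FUNCTION_INFO_V1(pg_get_catalog_keys);

Datum
pg_get_catalog_keys(PG_FUNCTION_ARGS)
{
	FuncCallContext *funcctx;

	if (SRF_IS_FIRSTCALL())
	{
		MemoryContext oldcontext;
		TupleDesc	tupdesc;

		funcctx = SRF_FIRSTCALL_INIT();
		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);

		tupdesc = CreateTemplateTupleDesc(2);
		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "relname",
						   TEXTOID, -1, 0);
		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "keycols",
						   TEXTOID, -1, 0);
		funcctx->tuple_desc = BlessTupleDesc(tupdesc);

		MemoryContextSwitchTo(oldcontext);
	}

	funcctx = SRF_PERCALL_SETUP();

	if (funcctx->call_cntr < lengthof(catalog_keys))
	{
		const CatalogKey *key = &catalog_keys[funcctx->call_cntr];
		Datum		values[2];
		bool		nulls[2] = {false, false};
		HeapTuple	tuple;

		values[0] = CStringGetTextDatum(key->relname);
		values[1] = CStringGetTextDatum(key->keycols);
		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);

		SRF_RETURN_NEXT(funcctx, HeapTupleGetDatum(tuple));
	}

	SRF_RETURN_DONE(funcctx);
}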

Re: truncating timestamps on arbitrary intervals

2021-01-18 Thread John Naylor
On Mon, Nov 23, 2020 at 1:44 PM John Naylor 
wrote:
>
> On Thu, Nov 12, 2020 at 9:56 AM Peter Eisentraut <
peter.eisentr...@enterprisedb.com> wrote:
> > - After reading the discussion a few times, I'm not so sure anymore
> > whether making this a cousin of date_trunc is the right way to go.  As
> > you mentioned, there are some behaviors specific to date_trunc that
> > don't really make sense in date_trunc_interval, and maybe we'll have
> > more of those.

For v10, I simplified things by decoupling the behavior from date_trunc()
and putting in some restrictions as discussed earlier. It's much simpler
now. It could be argued that it goes too far in that direction, but it's
easy to reason about and we can put back some features as we see fit.

> > Also, date_trunc_interval isn't exactly a handy name.
> > Maybe something to think about.  It's obviously fairly straightforward
> > to change it.
>
> Effectively, it puts timestamps into bins, so maybe date_bin() or
> something like that?

For v10 I went with date_bin() so we can see how that looks.

> > - There were various issues with the stride interval having months and
> > years.  I'm not sure we even need that.  It could be omitted unless you
> > are confident that your implementation is now sufficient.
>
> Months and years were a bit tricky, so I'd be happy to leave that out if
> there is not much demand for it. date_trunc() already has quarters,
> decades, centuries, and millennia.

I removed months and years for this version, but that can be reconsidered
of course. The logic is really simple now.
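To illustrate, here is my reading of the core binning arithmetic for
day-or-smaller strides (a sketch, not the patch's code; Timestamp is
PostgreSQL's int64 microsecond count):

static Timestamp
timestamp_bin_sketch(Timestamp ts, int64 stride_usecs, Timestamp origin)
{
	int64		delta = ts - origin;

	/*
	 * C integer division truncates toward zero, so timestamps earlier
	 * than the origin need an adjustment to keep truncating toward
	 * -infinity (callers must reject stride_usecs <= 0).
	 */
	if (delta < 0 && delta % stride_usecs != 0)
		delta -= stride_usecs;

	return origin + (delta / stride_usecs) * stride_usecs;
}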

> > - Also, negative intervals could be prohibited, but I suppose that
> > matters less.

I didn't go this far, but probably should before long.

> > - Then again, I'm thinking that maybe we should make the origin
> > mandatory.  Otherwise, the default answers when having strides larger
> > than a day are entirely arbitrary, e.g.,

I've tried this and like the resulting simplification.

> > - I'm wondering whether we need the date_trunc_interval(interval,
> > timestamptz, timezone) variant.  Isn't that the same as
> > date_trunc_interval(foo AT ZONE xyz, value)?
>
> I based this on 600b04d6b5ef6 for date_trunc(), whose message states:
>
> date_trunc(field, timestamptz, zone_name)
>
> is the same as
>
> date_trunc(field, timestamptz at time zone zone_name) at time zone
> zone_name
>
> so without the shorthand, you need to specify the timezone twice, once
> for the calculation, and once for the output.

In light of making the origin mandatory, it no longer makes sense to have a
time zone parameter, since we can specify the time zone on the origin and,
if desired, on the output as well.

--
John Naylor
EDB: http://www.enterprisedb.com


v10-datetrunc-interval.patch


Re: Deleting older versions in unique indexes to avoid page splits

2021-01-18 Thread Peter Geoghegan
On Mon, Jan 18, 2021 at 6:11 AM Victor Yegorov  wrote:
> If I understand you correctly, you suggest to track _all_ the hints that came
> from the executor for possible non-HOT logical duplicates somewhere on
> the page. And when we hit the no-space case, we'll check not only the last
> item being hinted, but all items on the page, which makes it more probable
> to kick in and do something.

> Also, I'm not sure where to put it. We've deprecated the BTP_HAS_GARBAGE
> flag, maybe it can be reused for this purpose?

There actually was a similar flag (named BTP_UNCHANGED_UPDATE and
later BTP_HAS_DUPS) that appeared in earlier versions of the patch
(several versions in total, up to and including v6). This was never
discussed on the list because I assumed that that wouldn't be helpful
(I was already writing too many overlong e-mails). It was unsettled in
my mind at the time, so it didn't make sense to start discussing. I
changed my mind at some point, and so it never came up until now, when
Amit raised the question.

Looking back on my personal notes, I am reminded that I debated this
exact question with myself at length. The argument for keeping the
nbtree flag (i.e. what Amit is arguing for now) involved an isolated
example that seems very similar to the much more recent example from
Amit (that he arrived at independently). I am at least sympathetic to
this view of things, even now. Let me go into why I changed my mind
after a couple of months of on and off deliberation.

In general, the highly speculative nature of the extra work that
heapam.c does for index deleters in the bottom-up caller case can only
be justified because the cost is paid by non-HOT updaters that are
definitely about to split the page just to fit another version, and
because we only risk wasting one heap page access at any given point
of the entire process (the latter point about risk/cost is not 100%
true, because you have additional fixed CPU costs and stuff like that,
but it's at least true in spirit). We can justify "gambling" like this
only because the game is rigged in our favor to an *absurd* degree:
there are many ways to win big (relative to the likely version churn
page split baseline case), and we only have to put down relatively
small "bets" at any given point -- heapam.c will give up everything
when it encounters one whole heap page that lacks a single deletable
entry, no matter what the reason is.
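A loose sketch of that give-up rule (paraphrased control flow with made-up
helper names, not the actual heapam.c code):

/* hypothetical: returns the number of index entries freed via this block */
static int	delete_dead_tuples_on_block(BlockNumber blkno);

static void
bottomup_pass_sketch(BlockNumber *blocks, int nblocks)
{
	/* blocks[] is assumed sorted by expected payoff, best first */
	for (int i = 0; i < nblocks; i++)
	{
		if (delete_dead_tuples_on_block(blocks[i]) == 0)
			break;		/* one fruitless heap page ends the whole pass */
	}
}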

The algorithm *appears* to behave very intelligently when seen from
afar, but is in fact stupid and opportunistic when you look at it up
close. It's possible to be so permissive about the upside benefit by
also being very conservative about the downside cost. Almost all of
our individual inferences can be wrong, and yet we still win in the
end. And the effect is robust over time. You could say that it is an
organic approach: it adapts to the workload, rather than trying to
make the workload adapt to it. This is less radical than you'd think
-- in some sense this is how B-Trees have always worked.

In the end, I couldn't justify imposing this cost on anything other
than a non-HOT updater, which is what the flag proposal would require
me to do -- then it's not 100% clear that the relative cost of each
"bet" placed in heapam.c really is extremely low (we can no longer say
for sure that we have very little to lose, given the certain
alternative that is a version churn page split). The fact is that
"version chains in indexes" still cannot get very long. Plus there are
other subtle ways in which it's unlikely to be a problem (e.g. the
LP_DEAD deletion stuff also got much better, deduplication can combine
with bottom-up deletion so that we'll trigger a useful bottom-up
deletion pass sooner or later without much downside, and possibly
other things I haven't even thought of).

It's possible that I have been too conservative. I admit that my
decision on this point is at least a little arbitrary, but I stand by
it for now. I cannot feel too bad about theoretically leaving some
gains on the table, given that we're only talking about deleting a
group of related versions a little later than we would otherwise, and
only in some circumstances, and in a way that seems likely to be
imperceptible to users. I reserve the right to change my mind about
it, but for now it doesn't look like an absurdly good deal, and those
are the kind of deals that it makes sense to focus on here. I am very
happy about the fact that it is relatively easy to understand why the
worst case for bottom-up index deletion cannot be that bad.

--
Peter Geoghegan




[PATCH] Full support for index LP_DEAD hint bits on standby

2021-01-18 Thread Michail Nikolaev
Hello, hackers.

[ABSTRACT]

Executing queries on a hot standby is one of the most popular ways to
scale application workload. Most modern Postgres installations have two
standby nodes for high-availability support. So, utilizing the replicas'
CPUs seems a reasonable idea.
At the same time, some queries (index scans) can be much slower on a hot
standby than on the primary. That is because the LP_DEAD index hint bit
mechanism is ignored in index scans during recovery. This is done for a
reason, of course [1]:

 * We do this because the xmin on the primary node could easily be
 * later than the xmin on the standby node, so that what the primary
 * thinks is killed is supposed to be visible on standby. So for correct
 * MVCC for queries during recovery we must ignore these hints and check
 * all tuples.

Also, according to [2] and cases like [3], it seems to be a good idea
to support "ignore_killed_tuples" on standby.

The goal of this patch is to provide full support for index hint bits
on hot standby. The mechanism should be based on well-tested
functionality and not cause a lot of recovery conflicts.

This thread is the continuation (and partly copy-paste) of a previous
one [4].

[PROBLEM]

The standby itself can set and read hint bits during recovery. Such
bits are even correct according to standby visibility rules. But the
problem is full-page-write WAL records coming from the primary: such
records could bring hint bits that are invalid according to the
standby's xmin.

So, if we could be sure a scan never sees an invalid hint bit from the
primary, the problem would be solved. And we would even be able to allow
the standby to set its own LP_DEAD bits.

The idea is simple: WAL-log hint bits before the FPW somehow. That could
generate a lot of additional WAL, however...

But there are ways to avoid it:
1) Send only one `latestRemovedXid` of all tuples marked as dead
during page scan.
2) Remember the latest sent `latestRemovedXid` in shared memory. And
optimistically skip WAL records with older xid values [5].
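A sketch of how idea 2 could look -- the function name reappears below in
this mail, but the body here is my guess, not the patch itself:

/* hypothetical shared state; conceptually one slot per database */
typedef struct IndexHintHorizonShmem
{
	pg_atomic_uint32 lastLoggedHorizonXid;
} IndexHintHorizonShmem;

static IndexHintHorizonShmem *Shared;

static void
LogIndexHintBitsHorizonIfNeeded(TransactionId latestRemovedXid)
{
	TransactionId lastLogged =
		pg_atomic_read_u32(&Shared->lastLoggedHorizonXid);

	/* skip if an equal or newer horizon has already been WAL-logged */
	if (TransactionIdIsValid(lastLogged) &&
		!TransactionIdFollows(latestRemovedXid, lastLogged))
		return;

	/* ... emit an XLOG_INDEX_HINT_BITS_HORIZON record here ... */

	pg_atomic_write_u32(&Shared->lastLoggedHorizonXid, latestRemovedXid);
}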

Such WAL records would cause a lot of recovery conflicts on standbys.
But we can be tricky here: use hint bits only if hot_standby_feedback is
enabled and effective on the standby. If HSF is effective, conflicts are
not possible. If HSF is off, the standby ignores both the hint bits and
the additional conflict resolution. The major point here is that HSF is
just an optimization and has nothing to do with MVCC correctness.

[DETAILS]

The patch introduces a new WAL record (named
XLOG_INDEX_HINT_BITS_HORIZON) to define the xmin horizon required for a
standby's snapshot to use LP_DEAD bits in an index scan.

`table_index_fetch_tuple` now returns a `latest_removed_xid` value in
addition to `all_dead`. This value is used to advance
`killedLatestRemovedXid` at time of updating `killedItems` (see
`IndexHintBitAdvanceLatestRemovedXid`).

The primary sends the value of `killedLatestRemovedXid` in
XLOG_INDEX_HINT_BITS_HORIZON before it marks the page dirty after
setting LP_DEAD bits on the index page (by calling
`MarkBufferDirtyIndexHint`).

The new WAL record is always sent before a possible FPW. Such a record
needs to be sent only if its `latestRemovedXid` is newer than the one
previously sent for the current database (see
`LogIndexHintBitsHorizonIfNeeded`).

There is a new flag in the PGPROC structure -
`indexIgnoreKilledTuples`. If the flag is set to true, standby queries
will use LP_DEAD bits in index scans. In that case the snapshot is
required to satisfy the new horizon pushed by
XLOG_INDEX_HINT_BITS_HORIZON records.

It is safe to set `indexIgnoreKilledTuples` to either value from the
perspective of correctness, but a `true` value could cause recovery
conflicts. It is a compromise: use LP_DEAD bits but honor
XLOG_INDEX_HINT_BITS_HORIZON, or vice versa.

What is the way to make the right decision about this compromise? It
is pretty simple: if `hot_standby_feedback` is on and the primary has
confirmed that the feedback was received, then set
`indexIgnoreKilledTuples` (see `GetSnapshotIndexIgnoreKilledTuples`).

As long as feedback is working as expected, the query will never be
canceled by XLOG_INDEX_HINT_BITS_HORIZON.
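
The decision itself is essentially this (names follow the description;
the body is my shorthand for it, not the patch):

    static bool
    GetSnapshotIndexIgnoreKilledTuples(void)
    {
        /* Use LP_DEAD bits only while feedback provably reaches the primary. */
        return hot_standby_feedback &&
            sender_propagates_feedback_to_primary();
    }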

To support cascading standby setups (where the feedback chain could be
broken in the middle), an additional byte was added to the keep-alive
message of the feedback protocol. This byte is used to make sure our
xmin is honored by the primary (see
`sender_propagates_feedback_to_primary`). Also, the WAL sender now
always sends a keep-alive after receiving a feedback message.

So, this way, it is safe to use LP_DEAD bits received from the primary
when we want to.

And, as a result, it is safe to set LP_DEAD bits on the standby.
Even if:
* the primary changes vacuum_defer_cleanup_age
* the standby is restarted
* the standby is promoted to primary
* a base backup is taken from the standby
* the standby is serving queries during recovery
– nothing can go wrong here.

Because `HeapTupleIsSurelyDead` (and 

Re: PG vs LLVM 12 on seawasp, next round

2021-01-18 Thread Fabien COELHO



Hello Alvaro,


The "no such file" error seems more like a machine local issue to me.


I'll look into it when I have time, which may take some time. Hopefully
over the holidays.


This is still happening ... Any chance you can have a look at it?


Indeed. I'll try to look into it again soon. I had a look earlier but did 
not find anything obvious in the short time frame I had. The last two 
months were a bit overloaded for me, so I let quite a few things slip. If 
you want to disable the animal, as Tom suggests, go ahead.


--
Fabien.




Re: {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2021-01-18 Thread Álvaro Herrera
On 2021-Jan-18, Matthias van de Meent wrote:

> On Fri, 15 Jan 2021 at 15:29, Álvaro Herrera  wrote:

> Would this not need to be the following? Right now, it resets
> potentially older h->catalog_oldest_nonremovable (which is set in the
> PROC_IN_SAFE_IC branch).
> 
> > +        if (statusFlags & PROC_IN_SAFE_IC)
> > +            h->catalog_oldest_nonremovable =
> > +                TransactionIdOlder(h->catalog_oldest_nonremovable, xmin);
> > +        else
> > +        {
> > +            h->data_oldest_nonremovable =
> > +                TransactionIdOlder(h->data_oldest_nonremovable, xmin);
> > +            h->catalog_oldest_nonremovable =
> > +                TransactionIdOlder(h->catalog_oldest_nonremovable, xmin);
> > +        }

Oops, you're right.

-- 
Álvaro Herrera   Valdivia, Chile




Re: {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2021-01-18 Thread Álvaro Herrera
On 2021-Jan-18, Matthias van de Meent wrote:

> Example:
> 
> 1.) RI starts
> 2.) PHASE 2: filling the index:
> 2.1.) scanning the heap (live tuple is cached)
> < tuple is deleted
> < last transaction other than RI commits, only snapshot of RI exists
> < vacuum drops the tuple, and cannot remove it from the new index
> because this new index is not yet populated.
> 2.2.) sorting tuples
> 2.3.) index filled with tuples, incl. deleted tuple
> 3.) PHASE 3: wait for transactions
> 4.) PHASE 4: validate does not remove the tuple from the index,
> because it is not built to do so: it will only insert new tuples.
> Tuples that are marked for deletion are removed from the index only
> through VACUUM (and optimistic ALL_DEAD detection).
> 
> According to my limited knowledge of RI, it requires VACUUM to not run
> on the table during the initial index build process (which is
> currently guaranteed through the use of a snapshot).

VACUUM cannot run concurrently with CIC or RI in a table -- both acquire
ShareUpdateExclusiveLock, which conflicts with itself, so this cannot
occur.

I do wonder if the problem you suggest (or something similar) can occur
via HOT pruning, though.

-- 
Álvaro Herrera   Valdivia, Chile




Re: CheckpointLock needed in CreateCheckPoint()?

2021-01-18 Thread Tom Lane
Robert Haas  writes:
> Here's a patch to remove CheckpointLock completely.  ...
> So I don't see any problem with just ripping this out entirely, but
> I'd like to know if anybody else does.

If memory serves, the reason for the lock was that the CHECKPOINT
command used to run the checkpointing code directly in the calling
backend, so we needed it to keep more than one process from doing
that at once.  AFAICS, it's no longer possible for more than one
process to try to run that code concurrently, so we shouldn't need
the lock anymore.

regards, tom lane




Re: search_plan_tree(): handling of non-leaf CustomScanState nodes causes segfault

2021-01-18 Thread Tom Lane
Zhihong Yu  writes:
> It seems sstate->ss_currentRelation being null can only
> occur for T_ForeignScanState and T_CustomScanState.
> What about the following change ?

Seems like more code for no very good reason.

regards, tom lane




Re: {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2021-01-18 Thread Matthias van de Meent
On Fri, 15 Jan 2021 at 15:29, Álvaro Herrera  wrote:
>
> So one last remaining improvement was to have VACUUM ignore processes
> doing CIC and RC to compute the Xid horizon of tuples to remove.  I
> think we can do something simple like the attached patch.

Regarding the patch:

> +        if (statusFlags & PROC_IN_SAFE_IC)
> +            h->catalog_oldest_nonremovable =
> +                TransactionIdOlder(h->catalog_oldest_nonremovable, xmin);
> +        else
> +            h->data_oldest_nonremovable = h->catalog_oldest_nonremovable =
> +                TransactionIdOlder(h->data_oldest_nonremovable, xmin);

Would this not need to be the following? Right now, it resets
potentially older h->catalog_oldest_nonremovable (which is set in the
PROC_IN_SAFE_IC branch).

> +        if (statusFlags & PROC_IN_SAFE_IC)
> +            h->catalog_oldest_nonremovable =
> +                TransactionIdOlder(h->catalog_oldest_nonremovable, xmin);
> +        else
> +        {
> +            h->data_oldest_nonremovable =
> +                TransactionIdOlder(h->data_oldest_nonremovable, xmin);
> +            h->catalog_oldest_nonremovable =
> +                TransactionIdOlder(h->catalog_oldest_nonremovable, xmin);
> +        }


Prior to reading the rest of my response: I do not understand the
intricacies of the VACUUM process, nor the systems of db snapshots, so
it's reasonably possible I misunderstand this.
But would this patch not allow for tuples to be created, deleted and
vacuumed away from a table whilst rebuilding an index on that table,
resulting in invalid pointers in the index?

Example:

1.) RI starts
2.) PHASE 2: filling the index:
2.1.) scanning the heap (live tuple is cached)
< tuple is deleted
< last transaction other than RI commits, only snapshot of RI exists
< vacuum drops the tuple, and cannot remove it from the new index
because this new index is not yet populated.
2.2.) sorting tuples
2.3.) index filled with tuples, incl. deleted tuple
3.) PHASE 3: wait for transactions
4.) PHASE 4: validate does not remove the tuple from the index,
because it is not built to do so: it will only insert new tuples.
Tuples that are marked for deletion are removed from the index only
through VACUUM (and optimistic ALL_DEAD detection).

According to my limited knowledge of RI, it requires VACUUM to not run
on the table during the initial index build process (which is
currently guaranteed through the use of a snapshot).


Regards,

Matthias van de Meent




Re: [PATCH] Add support for leading/trailing bytea trim()ing

2021-01-18 Thread Tom Lane
"Joel Jacobson"  writes:
> On Fri, Dec 4, 2020, at 22:02, Tom Lane wrote:
>> (Maybe the existing ltrim/rtrim descrs are also like this, but if so
>> I'd change them too.)

> They weren't, but I think the descriptions for the bytea functions
> can be improved to be more precise
> if we take inspiration from the text functions.

Yeah, I agree with making the bytea descriptions look like the
text ones.  Pushed with minor additional doc fixes.

regards, tom lane




Re: Deleting older versions in unique indexes to avoid page splits

2021-01-18 Thread Peter Geoghegan
On Sun, Jan 17, 2021 at 10:44 PM Amit Kapila  wrote:
> With the above example, it seems like it would also help when this is not 
> true.

I'll respond to your remarks here separately, in a later email.

> I am not sure what I proposed fits here, but "bottom-up" sounds like
> we are starting from the leaf level and moving upwards to the root
> level, which I think is not true here.

I guess that that's understandable, because it is true that B-Trees
are maintained in a bottom-up fashion. However, it's also true that
you can have top-down and bottom-up approaches in query optimizers,
and many other things (it's even something that is used to describe
governance models). The whole point of the term "bottom-up" is to
suggest that bottom-up deletion complements top-down cleanup by
VACUUM. I think that this design embodies certain principles that can
be generalized to other areas, such as heap pruning.

I recall that Robert Haas hated the name deduplication. I'm not about
to argue that my choice of "deduplication" was objectively better than
whatever he would have preferred. Same thing applies here - I more or
less chose a novel name because the concept is itself novel (unlike
deduplication). Reasonable people can disagree about what exact name
might have been better, but it's not like we're going to arrive at
something that everybody can be happy with. And whatever we arrive at
probably won't matter that much - the vast majority of users will
never need to know what either thing is. They may be important things,
but that doesn't mean that many people will ever think about them (in
fact I specifically hope that they don't ever need to think about
them).

> How is that working? Can heapam.c somehow indicate additional
> tuples (extra to what has been marked/sent by the index AM as promising) as
> deletable? I see in heap_index_delete_tuples that we mark the status
> of the passed tuples as deletable (by setting the knowndeletable flag for
> them).

The bottom-up nbtree caller to
heap_index_delete_tuples()/table_index_delete_tuple() (not to be
confused with the simple/LP_DEAD heap_index_delete_tuples() caller)
always provides heapam.c with a complete picture of the index page, in
the sense that it exhaustively has a delstate.deltids entry for each
and every TID on the page, no matter what. This is the case even
though in practice there is usually no possible way to check even 20%
of the deltids entries within heapam.c.

In general, the goal during a bottom-up pass is *not* to maximize
expected utility (i.e. the number of deleted index tuples/space
freed/whatever), exactly. The goal is to lower the variance across
related calls, so that we'll typically manage to free a fair number of
index tuples when we need to. In general it is much better for
heapam.c to make its decisions based on 2 or 3 good reasons rather
than just 1 excellent reason. And so heapam.c applies a power-of-two
bucketing scheme, never truly giving too much weight to what nbtree
tells it about duplicates. See comments above
bottomup_nblocksfavorable(), and bottomup_sort_and_shrink() comments
(both are from heapam.c).
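
As a toy illustration of the bucketing idea (this sketch is mine, not
the actual bottomup_sort_and_shrink() code):

    /*
     * Round a per-block TID count down to a power of two before sorting,
     * so small differences in what nbtree reports about duplicates never
     * dominate the ordering of candidate heap blocks.  Assumes ntids >= 1.
     */
    static int
    bottomup_bucketed_ntids(int ntids)
    {
        int     bucket = 1;

        while (bucket * 2 <= ntids)
            bucket *= 2;

        return bucket;      /* 1, 2, 4, 8, ... */
    }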

-- 
Peter Geoghegan




Re: CheckpointLock needed in CreateCheckPoint()?

2021-01-18 Thread Robert Haas
On Thu, Jan 14, 2021 at 10:21 AM Robert Haas  wrote:
> Yeah, I think this lock is useless. In fact, I think it's harmful,
> because it makes large sections of the checkpointer code, basically
> all of it really, run with interrupts held off for no reason.
>
> It's not impossible that removing the lock could reveal bugs
> elsewhere: suppose we have segments of code that actually aren't safe
> to interrupt, but because of this LWLock, it never happens. But, if
> that happens to be true, I think we should just view those cases as
> bugs to be fixed.
>
> One thing that blunts the impact of this quite a bit is that the
> checkpointer doesn't rely much on CHECK_FOR_INTERRUPTS() in the
> first place. It has its own machinery: HandleCheckpointerInterrupts().
> Possibly that's something we should be looking to refactor somehow as
> well.

Here's a patch to remove CheckpointLock completely. For
ProcessInterrupts() to do anything, one of the following things would
have to be true:

1. ProcDiePending = true. Normally this is set by die(), but the
checkpointer does not use die(). Perhaps someday it should, or maybe
not, but that's an issue for another day.

2. ClientConnectionLost = true. This gets set in pqcomm.c's
internal_flush(), but the checkpointer has no client connection.

3. DoingCommandRead = true. Can't happen; only set in PostgresMain().

4. QueryCancelPending = true. Can be set by recovery conflicts, but
those shouldn't happen in the checkpointer. Normally set by
StatementCancelHandler, which the checkpointer doesn't use.

5. IdleInTransactionSessionTimeoutPending = true. Set up by
InitPostgres(), which is not used by auxiliary processes.

6. ProcSignalBarrierPending = true. This is the case we actually care
about allowing for the ASRO patch, so if this occurs, it's good.

7. ParallelMessagePending = true. The checkpointer definitely never
starts parallel workers.

So I don't see any problem with just ripping this out entirely, but
I'd like to know if anybody else does.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v1-0001-Remove-CheckpointLock.patch
Description: Binary data


Re: search_plan_tree(): handling of non-leaf CustomScanState nodes causes segfault

2021-01-18 Thread Zhihong Yu
Hi,
It seems sstate->ss_currentRelation being null can only
occur for T_ForeignScanState and T_CustomScanState.

What about the following change ?

Cheers

diff --git a/src/backend/executor/execCurrent.c b/src/backend/executor/execCurrent.c
index 0852bb9cec..56e31951d1 100644
--- a/src/backend/executor/execCurrent.c
+++ b/src/backend/executor/execCurrent.c
@@ -325,12 +325,21 @@ search_plan_tree(PlanState *node, Oid table_oid,
         case T_IndexOnlyScanState:
         case T_BitmapHeapScanState:
         case T_TidScanState:
+            {
+                ScanState  *sstate = (ScanState *) node;
+
+                if (RelationGetRelid(sstate->ss_currentRelation) == table_oid)
+                    result = sstate;
+                break;
+            }
+
         case T_ForeignScanState:
         case T_CustomScanState:
             {
                 ScanState  *sstate = (ScanState *) node;

-                if (RelationGetRelid(sstate->ss_currentRelation) == table_oid)
+                if (sstate->ss_currentRelation &&
+                    RelationGetRelid(sstate->ss_currentRelation) == table_oid)
                     result = sstate;
                 break;
             }

On Mon, Jan 18, 2021 at 10:46 AM Tom Lane  wrote:

> David Geier  writes:
> > search_plan_tree() assumes that
> > CustomScanState::ScanState::ss_currentRelation is never NULL. In my
> > understanding that only holds for CustomScanState nodes which are at the
> > bottom of the plan and actually read from a relation. CustomScanState
> > nodes which are not at the bottom don't have ss_currentRelation set. I
> > believe for such nodes, instead search_plan_tree() should recurse into
> > CustomScanState::custom_ps.
>
> Hm.  I agree that we shouldn't simply assume that ss_currentRelation
> isn't null.  However, we cannot make search_plan_tree() descend
> through non-leaf CustomScan nodes, because we don't know what processing
> is involved there.  We need to find a scan that is guaranteed to return
> rows that are one-to-one with the cursor output.  This is why the function
> doesn't descend through join or aggregation nodes, and I see no argument
> by which we should assume we know more about what a customscan node will
> do than we know about those.
>
> So I'm inclined to think a suitable fix is just
>
> -   if (RelationGetRelid(sstate->ss_currentRelation) == table_oid)
> +   if (sstate->ss_currentRelation &&
> +       RelationGetRelid(sstate->ss_currentRelation) == table_oid)
>         result = sstate;
>
> regards, tom lane
>
>
>


Re: Key management with tests

2021-01-18 Thread Andres Freund
On 2021-01-18 13:58:20 -0500, Bruce Momjian wrote:
> On Mon, Jan 18, 2021 at 09:42:54AM -0800, Andres Freund wrote:
> > Personally, but I admit that there's legitimate reasons to differ on
> > that note, I don't think it's reasonable for a feature this invasive to
> > commit preliminary patches without the major subsequent patches being in
> > a shape that allows reviewing the whole picture.
> 
> OK, if that is a requirement, I can't help anymore since there are
> already complaints that the patch is too large to review, even if broken
> into pieces.  Please let me know what the community decides.

Those aren't conflicting demands. Having later patches around to
validate the design of earlier patches doesn't necessitate that the
later patches be reviewed at the same time.




Re: Key management with tests

2021-01-18 Thread Tom Kincaid
> > I have to admit I was kind of baffled that the wiki page wasn't
> > sufficient, because it is one of the longest Postgres feature
> > explanations I have seen, but I now think the missing part is tying
> > the wiki contents to the code implementation.  If that is it, please
> > confirm.  If it is something else, also explain.
>
> I don't think the wiki right now covers what's needed. The "Overview",
> "Threat model" and "Scope of TDE" are a start, but beyond that it's
> missing a bunch of things. And it's not in the source tree (we'll soon
> have multiple versions of postgres with increasing levels of TDE
> features, the wiki doesn't help with that)
>

Thanks, the versioning issue makes it clear why the design document
needs to be part of the source tree.


As I was reading the README for the patch Amit referenced and as I am
going through this patch, I feel the desire to incorporate diagrams.
Are design diagrams ever incorporated in the source tree as a part of
the design description of a feature? If not, any concerns about doing
that? I think that is likely where I can contribute the most.


> Missing:
> - talks about cluster wide encyrption being simpler, without mentioning
>   what it's being compared to, and what makes it simpler
> - no differentiation from file system / block level encryption
> - there's no explanation of which/why specific crypto primitives were
>   chosen, what the tradeoffs are
> - no explanation which keys exists, stored where
> - the key management patch introduces new files, not documented
> - there's new types of lock files, possibility of interrupted
>   operations, ... - no documentation of what that means
> - there's no documentation what "key wrapping" actually precisely is,
>   what the danger of the two-tier model is, ...
> - are there dangers in not encrypting zero pages etc?
> - ...
>

Some of the missing things you mention above are about the design of
the TDE feature in general. However, this patch is about key
management, which is going to be part of the larger TDE feature. So it
feels as though there is a need for a general design document about
the overall vision/approach for TDE and a specific design document for
key management. Is it appropriate to include both of those in the same
patch?

Something along the lines of: here is the overall design of TDE, and
here is how the key management portion is designed and implemented. In
that case, follow-on patches for TDE could refer to the overall design
described in this patch.




>
>
> Personally, but I admit that there's legitimate reasons to differ on
> that note, I don't think it's reasonable for a feature this invasive to
> commit preliminary patches without the major subsequent patches being in
> a shape that allows reviewing the whole picture.
>
> Greetings,
>
> Andres Freund



-- 
Thomas John Kincaid




Re: Key management with tests

2021-01-18 Thread Bruce Momjian
On Mon, Jan 18, 2021 at 09:42:54AM -0800, Andres Freund wrote:
> Personally, but I admit that there's legitimate reasons to differ on
> that note, I don't think it's reasonable for a feature this invasive to
> commit preliminary patches without the major subsequent patches being in
> a shape that allows reviewing the whole picture.

OK, if that is a requirement, I can't help anymore since there are
already complaints that the patch is too large to review, even if broken
into pieces.  Please let me know what the community decides.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: WIP: System Versioned Temporal Table

2021-01-18 Thread Surafel Temesgen
On Mon, Jan 18, 2021 at 1:43 AM Vik Fearing  wrote:

>
> This is not good, and I see that DROP SYSTEM VERSIONING also removes
> these columns which is even worse.  Please read the standard that you
> are trying to implement!
>
>
The standard states the function of ALTER TABLE ADD SYSTEM VERSIONING
as "Alter a regular persistent base table to a system-versioned table",
and a system-versioned table is described in the standard in terms of
two generated stored constraint columns, so it is implemented as such.


> I will do a more thorough review of the functionalities in this patch
> (not necessarily the code) this week.
>
>
Please do

regards
Surafel


Re: search_plan_tree(): handling of non-leaf CustomScanState nodes causes segfault

2021-01-18 Thread Tom Lane
David Geier  writes:
> search_plan_tree() assumes that 
> CustomScanState::ScanState::ss_currentRelation is never NULL. In my 
> understanding that only holds for CustomScanState nodes which are at the 
> bottom of the plan and actually read from a relation. CustomScanState 
> nodes which are not at the bottom don't have ss_currentRelation set. I 
> believe for such nodes, instead search_plan_tree() should recurse into 
> CustomScanState::custom_ps.

Hm.  I agree that we shouldn't simply assume that ss_currentRelation
isn't null.  However, we cannot make search_plan_tree() descend
through non-leaf CustomScan nodes, because we don't know what processing
is involved there.  We need to find a scan that is guaranteed to return
rows that are one-to-one with the cursor output.  This is why the function
doesn't descend through join or aggregation nodes, and I see no argument
by which we should assume we know more about what a customscan node will
do than we know about those.

So I'm inclined to think a suitable fix is just

-   if (RelationGetRelid(sstate->ss_currentRelation) == table_oid)
+   if (sstate->ss_currentRelation &&
+   RelationGetRelid(sstate->ss_currentRelation) == table_oid)
result = sstate;

regards, tom lane



