Re: [HACKERS] logical decoding of two-phase transactions

2020-11-17 Thread Peter Smith
Hi.

Using a tablesync debugging technique described in another mail
thread [1][2], I have caused the tablesync worker to handle (e.g. via
apply_dispatch) a 2PC PREPARE.

This exposes a problem with the current 2PC logic: if/when the
PREPARE is processed by the tablesync worker, the txn ends up
being COMMITTED, even though the 2PC PREPARE has not yet been COMMIT
PREPARED by the publisher.

For example, below is some logging (using my patch [2]) which shows
this occurring:

---

[postgres@CentOS7-x64 ~]$ psql -d test_sub -p 54321 -c "CREATE
SUBSCRIPTION tap_sub CONNECTION 'host=localhost dbname=test_pub
application_name=tap_sub' PUBLICATION tap_pub;"
2020-11-18 17:00:37.394 AEDT [15885] LOG:  logical decoding found
consistent point at 0/16EF840
2020-11-18 17:00:37.394 AEDT [15885] DETAIL:  There are no running transactions.
2020-11-18 17:00:37.394 AEDT [15885] STATEMENT:
CREATE_REPLICATION_SLOT "tap_sub" LOGICAL pgoutput NOEXPORT_SNAPSHOT
NOTICE:  created replication slot "tap_sub" on publisher
CREATE SUBSCRIPTION
2020-11-18 17:00:37.407 AEDT [15886] LOG:  logical replication apply
worker for subscription "tap_sub" has started
2020-11-18 17:00:37.407 AEDT [15886] LOG:  !!>> The apply worker
process has PID = 15886
2020-11-18 17:00:37.415 AEDT [15887] LOG:  starting logical decoding
for slot "tap_sub"
2020-11-18 17:00:37.415 AEDT [15887] DETAIL:  Streaming transactions
committing after 0/16EF878, reading WAL from 0/16EF840.
2020-11-18 17:00:37.415 AEDT [15887] STATEMENT:  START_REPLICATION
SLOT "tap_sub" LOGICAL 0/0 (proto_version '2', publication_names
'"tap_pub"')
2020-11-18 17:00:37.415 AEDT [15887] LOG:  logical decoding found
consistent point at 0/16EF840
2020-11-18 17:00:37.415 AEDT [15887] DETAIL:  There are no running transactions.
2020-11-18 17:00:37.415 AEDT [15887] STATEMENT:  START_REPLICATION
SLOT "tap_sub" LOGICAL 0/0 (proto_version '2', publication_names
'"tap_pub"')
2020-11-18 17:00:37.415 AEDT [15886] LOG:  !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:00:37.415 AEDT [15886] LOG:  !!>> apply worker: called
process_syncing_tables
2020-11-18 17:00:37.421 AEDT [15889] LOG:  logical replication table
synchronization worker for subscription "tap_sub", table "test_tab"
has started
2020-11-18 17:00:37.421 AEDT [15889] LOG:  !!>> The tablesync worker
process has PID = 15889
2020-11-18 17:00:37.421 AEDT [15889] LOG:  !!>>

Sleeping 30 secs. For debugging, attach to process 15889 now!

[postgres@CentOS7-x64 ~]$ 2020-11-18 17:00:38.431 AEDT [15886] LOG:
!!>> apply worker: LogicalRepApplyLoop
2020-11-18 17:00:38.431 AEDT [15886] LOG:  !!>> apply worker: called
process_syncing_tables
2020-11-18 17:00:39.433 AEDT [15886] LOG:  !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:00:39.433 AEDT [15886] LOG:  !!>> apply worker: called
process_syncing_tables
2020-11-18 17:00:40.437 AEDT [15886] LOG:  !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:00:40.437 AEDT [15886] LOG:  !!>> apply worker: called
process_syncing_tables
2020-11-18 17:00:41.439 AEDT [15886] LOG:  !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:00:41.439 AEDT [15886] LOG:  !!>> apply worker: called
process_syncing_tables
2020-11-18 17:00:42.441 AEDT [15886] LOG:  !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:00:42.441 AEDT [15886] LOG:  !!>> apply worker: called
process_syncing_tables
2020-11-18 17:00:43.442 AEDT [15886] LOG:  !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:00:43.442 AEDT [15886] LOG:  !!>> apply worker: called
process_syncing_tables
-- etc.
2020-11-18 17:01:03.520 AEDT [15886] LOG:  !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:01:03.520 AEDT [15886] LOG:  !!>> apply worker: called
process_syncing_tables
2020-11-18 17:01:04.521 AEDT [15886] LOG:  !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:01:04.521 AEDT [15886] LOG:  !!>> apply worker: called
process_syncing_tables
2020-11-18 17:01:05.523 AEDT [15886] LOG:  !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:01:05.523 AEDT [15886] LOG:  !!>> apply worker: called
process_syncing_tables
2020-11-18 17:01:06.532 AEDT [15886] LOG:  !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:01:06.532 AEDT [15886] LOG:  !!>> apply worker: called
process_syncing_tables
2020-11-18 17:01:07.426 AEDT [15889] LOG:  !!>> tablesync worker:
About to call LogicalRepSyncTableStart to do initial syncing
2020-11-18 17:01:07.536 AEDT [15886] LOG:  !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:01:07.536 AEDT [15886] LOG:  !!>> apply worker: called
process_syncing_tables
2020-11-18 17:01:07.536 AEDT [15886] LOG:  !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:01:07.536 AEDT [15886] LOG:  !!>> apply worker: called
process_syncing_tables
2020-11-18 17:01:08.539 AEDT [15886] LOG:  !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:01:08.539 AEDT [15886] LOG:  !!>> apply worker: called
process_syncing_tables
2020-11-18 17:01:09.541 AEDT [15886] LOG:  !!>> apply worker:
LogicalRepApplyLoop
2020-11-18 17:01:09.541 AEDT [15886] LOG: 

Re: Terminate the idle sessions

2020-11-17 Thread Li Japin

On Nov 18, 2020, at 2:22 PM, 
kuroda.hay...@fujitsu.com wrote:

Oops... I forgot to include my suggestion. Sorry.
How about setting sigalrm_delivered to true in reschedule_timeouts()?
This processing may look strange, so a comment should be added too.
Here is an example:

```diff
@@ -423,7 +423,14 @@ reschedule_timeouts(void)

   /* Reschedule the interrupt, if any timeouts remain active. */
   if (num_active_timeouts > 0)
+   {
+   /*
+    * Set sigalrm_delivered to true here, because an
+    * interrupt might have occurred.
+    */
+   sigalrm_delivered = true;
   schedule_alarm(GetCurrentTimestamp());
+   }
}
```

Thanks for your suggestion.  Attached!

--
Best regards
Japin Li



v7-0001-Allow-terminating-the-idle-sessions.patch
Description: v7-0001-Allow-terminating-the-idle-sessions.patch


v7-0002-Optimize-setitimer-usage.patch
Description: v7-0002-Optimize-setitimer-usage.patch


Re: Detecting File Damage & Inconsistencies

2020-11-17 Thread Craig Ringer
On Fri, Nov 13, 2020 at 7:24 PM Simon Riggs  wrote:

>
> What I'm proposing is an option to add 16 bytes onto each COMMIT
> record
>

Would it make sense to write this at the time we write a topxid assignment
to WAL instead?

Otherwise it won't be accessible to streaming-mode logical decoding.


RE: Terminate the idle sessions

2020-11-17 Thread kuroda.hay...@fujitsu.com
Dear Li, 

> Yeah, that might happen. Any suggestions to fix it?

Oops... I forgot to include my suggestion. Sorry.
How about setting sigalrm_delivered to true in reschedule_timeouts()?
This processing may look strange, so a comment should be added too.
Here is an example:

```diff
@@ -423,7 +423,14 @@ reschedule_timeouts(void)
 
/* Reschedule the interrupt, if any timeouts remain active. */
if (num_active_timeouts > 0)
+   {
+   /*
+    * Set sigalrm_delivered to true here, because an
+    * interrupt might have occurred.
+    */
+   sigalrm_delivered = true;
schedule_alarm(GetCurrentTimestamp());
+   }
 }
```


Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Split copy.

2020-11-17 Thread Soumyadeep Chakraborty
On Tue, Nov 17, 2020 at 2:38 AM Heikki Linnakangas  wrote:
>
> Thanks for feedback, attached is a new patch version.
>
> On 11/11/2020 21:49, Soumyadeep Chakraborty wrote:
> > On Tue, Nov 3, 2020 at 1:30 AM Heikki Linnakangas  wrote:
> >> I also split/duplicated the CopyStateData struct into CopyFromStateData
> >> and CopyToStateData. Many of the fields were common, but many were not,
> >> and I think some duplication is nicer than a struct where you use some
> >> fields and others are unused. I put the common formatting options into a
> >> new CopyFormatOptions struct.
> >
> > Would we be better off if we sub-struct CopyState <- Copy{From,To}State?
> > Like this:
> > typedef struct Copy{From|To}StateData
> > {
> > CopyState cs;
> > // Fields specific to COPY FROM/TO follow..
> > }
>
> Hmm. I don't think that would be better. There isn't actually that much
> in common between CopyFromStateData and CopyToStateData, and a little
> bit of duplication seems better.
>

Fair.

> > 6.
> >
> >> /* create workspace for CopyReadAttributes results */
> >> if (!cstate->opts.binary)
> >
> > Can we replace this if with an else?
>
> Seems better as it is IMO, the if- and else-branches are not really
> related to each other, even though they both happen to be conditioned on
> cstate->opts.binary.

Fair.

> Fixed all the other things you listed, fixed a bug in setting
> 'file_encoding', and trimmed down the #includes.
>

Thanks! LGTM! Marking as Ready for Committer.

Regards,
Soumyadeep (VMware)




Re: Parallel copy

2020-11-17 Thread Bharath Rupireddy
On Thu, Oct 29, 2020 at 2:54 PM Amit Kapila  wrote:
>
> 4) Worker has to hop through all the processed chunks before getting
> the chunk which it can process.
>
> One more point, I have noticed that some time back [1], I have given
> one suggestion related to the way workers process the set of lines
> (aka chunk). I think you can try by increasing the chunk size to say
> 100, 500, 1000 and use some shared counter to remember the number of
> chunks processed.
>

Hi, I did some analysis comparing a spinlock-protected worker write position,
i.e. each worker acquires a spinlock on a shared write position to choose the
next available chunk, versus each worker hopping over processed chunks to find
the next available chunk position:

Use Case: 10 million rows, 5.6GB data, 2 indexes on integer columns, 1 index on
a text column. Results are of the form (no. of workers, total exec time in
sec, index insertion time in sec, worker write pos get time in sec, buffer
contention event count):

With spinlock:
(1,1126.443,1060.067,0.478,*0*), (2,669.343,630.769,0.306,*26*),
(4,346.297,326.950,0.161,*89*), (8,209.600,196.417,0.088,*291*),
(16,166.113,157.086,0.065,*1468*), (20,173.884,166.013,0.067,*2700*),
(30,173.087,1166.565,0.0065,*5346*)
Without spinlock:
(1,1119.695,1054.586,0.496,*0*), (2,645.733,608.313,1.5,*8*),
(4,340.620,320.344,1.6,*58*), (8,203.985,189.644,1.3,*222*),
(16,142.997,133.045,1,*813*), (20,132.621,122.527,1.1,*1215*),
(30,135.737,126.716,1.5,*2901*)

With the spinlock, each worker gets the required write position quickly
and proceeds to the index insertion (which becomes a single point of
contention), where we observed more buffer lock contention. The reason
is that all the workers reach the index insertion point at around the
same time.

Without the spinlock, each worker spends some time hopping to get the
write position, during which the other workers are inserting into the
indexes. So the workers do not all reach the index insertion point at the
same time, and hence there is less buffer lock contention.

The same behaviour (explained above) is observed with different worker chunk
counts (default 64, 128, 512 and 1024), i.e. the number of tuples each worker
caches into its local memory before inserting into the table.

In summary: with the spinlock, it looks like we are able to avoid workers
waiting to get the next chunk, which also means that we are not creating
any contention point inside the parallel copy code. However, this causes
another choke point, i.e. index insertion when indexes exist on the
table, which is out of the scope of the parallel copy code. We think that it
would be good to use a spinlock-protected worker write position, or an atomic
variable for the worker write position (as it performs equal to the spinlock
or a little better on some platforms). Thoughts?
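
For illustration, here is a minimal, self-contained sketch of the atomic
write-position idea, using plain C11 atomics and POSIX threads rather than
PostgreSQL's spinlock/pg_atomic primitives (names and chunk handling are
hypothetical, not taken from the patch):

```c
/*
 * Sketch: workers claim chunk numbers from a shared write position with a
 * single atomic fetch-add, instead of hopping over already-processed chunks.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_CHUNKS  1000
#define NUM_WORKERS 4

static atomic_ulong next_chunk;     /* shared "worker write position" */

static void *
worker(void *arg)
{
    unsigned long processed = 0;

    for (;;)
    {
        /* Claim the next unprocessed chunk in one atomic step. */
        unsigned long chunk = atomic_fetch_add(&next_chunk, 1);

        if (chunk >= NUM_CHUNKS)
            break;              /* no more chunks to process */

        /* ... parse the lines of "chunk" and insert the tuples here ... */
        processed++;
    }

    printf("worker %ld processed %lu chunks\n", (long) (intptr_t) arg, processed);
    return NULL;
}

int
main(void)
{
    pthread_t   threads[NUM_WORKERS];

    for (long i = 0; i < NUM_WORKERS; i++)
        pthread_create(&threads[i], NULL, worker, (void *) (intptr_t) i);
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(threads[i], NULL);
    return 0;
}
```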

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


RE: Tab complete for CREATE OR REPLACE TRIGGER statement

2020-11-17 Thread Shinoda, Noriyoshi (PN Japan FSI)
> Okay.  I have tweaked a couple of things in the patch and applied it.

Thank you so much.
The next time I post a patch, I will be careful with git diff --check and indentation.

Regards,
Noriyoshi Shinoda

-Original Message-
From: Michael Paquier [mailto:mich...@paquier.xyz] 
Sent: Wednesday, November 18, 2020 2:07 PM
To: Tom Lane 
Cc: Shinoda, Noriyoshi (PN Japan FSI) ; 
pgsql-hackers@lists.postgresql.org
Subject: Re: Tab complete for CREATE OR REPLACE TRIGGER statement

On Mon, Nov 16, 2020 at 10:14:10PM -0500, Tom Lane wrote:
> Michael Paquier  writes:
>> I don't think that this is a requirement for this thread, though.
> 
> Agreed, I'm not trying to block this patch.  Just wishing there were a 
> better way.

Okay.  I have tweaked a couple of things in the patch and applied it.
I am wondering if libreadline gives the possibility to implement an optional 
grouping of words to complete, but diving into its documentation I have not 
found something that we could use.
--
Michael




Re: Any objections to implementing LogicalDecodeMessageCB for pgoutput?

2020-11-17 Thread David Pirotte
On Fri, Nov 6, 2020 at 7:05 AM Ashutosh Bapat 
wrote:

> +/*
> + * Write MESSAGE to stream
> + */
> +void
> +logicalrep_write_message(StringInfo out, ReorderBufferTXN *txn,
> XLogRecPtr lsn,
> +bool transactional, const char *prefix, Size sz,
> +const char *message)
> +{
> +   uint8   flags = 0;
> +
> +   pq_sendbyte(out, LOGICAL_REP_MSG_MESSAGE);
> +
>
> Similar to the UPDATE/DELETE/INSERT records decoded when streaming is being
> used, we need to add transaction id for transactional messages. May be we
> add
> that even in case of non-streaming case and use it to decide whether it's a
> transactional message or not. That might save us a byte when we are adding
> a
> transaction id.
>

My preference is to add in the xid when streaming is enabled. (1) It is a
more consistent implementation with the other message types, and (2) it
saves 3 bytes when streaming is disabled. I've attached an updated patch.
It is not a strong preference, though, if you suggest a different approach.


> +   /* encode and send message flags */
> +   if (transactional)
> +   flags |= MESSAGE_TRANSACTIONAL;
> +
> +   pq_sendint8(out, flags);
>
> Is 8 bits enough considering future improvements? What if we need to use
> more
> than 8 bit flags?
>

8 possible flags already sounds like a lot, here, so I suspect that a byte
will be sufficient for the foreseeable future. If we needed to go beyond
that, it'd be a protocol version bump. (Well, it might first warrant
reflection as to why we had so many flags...)


> @@ -1936,6 +1936,9 @@ apply_dispatch(StringInfo s)
> apply_handle_origin(s);
> return;
>
> +   case LOGICAL_REP_MSG_MESSAGE:
>
> Should we add the logical message to the WAL downstream so that it flows
> further down to a cascaded logical replica. Should that be controlled
> by an option?
>

Hmm, I can't think of a use case for this, but perhaps someone could. Do
you, or does anyone, have something in mind? I think we provide a lot of
value with logical messages in pgoutput without supporting consumption from
a downstream replica, so perhaps this is better considered separately.

If we want this, I think we would add a "messages" option on the
subscription. If present, the subscriber will receive messages and pass
them to any downstream subscribers. I started working on this and it does
expand the change's footprint. As is, a developer would consume messages by
connecting to a pgoutput slot on the message's origin (e.g. via Debezium
or a custom client). The subscription and logical worker infrastructure
don't know about messages, but they would need to in order to support
consuming an origin's messages on a downstream logical replica. In
any case, I'll keep working on it so we can see what it looks like.
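
For context, a message of this kind is emitted on the origin server with
pg_logical_emit_message() and then handed to the output plugin's message
callback. A minimal libpq sketch of the emitting side (the connection string
and prefix are placeholders, not anything from the patch):

```c
/*
 * Sketch: emit a transactional logical decoding message on the publisher.
 * A pgoutput consumer (e.g. Debezium or a custom client) would then see it
 * on the replication stream once the patch discussed here is applied.
 */
#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn     *conn = PQconnectdb("dbname=test_pub");  /* placeholder DSN */
    PGresult   *res;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        exit(1);
    }

    /* arguments: transactional flag, prefix, content */
    res = PQexec(conn,
                 "SELECT pg_logical_emit_message(true, 'my_prefix', 'payload')");
    if (PQresultStatus(res) != PGRES_TUPLES_OK)
        fprintf(stderr, "emit failed: %s", PQerrorMessage(conn));

    PQclear(res);
    PQfinish(conn);
    return 0;
}
```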

Cheers,
Dave


v2-0002-Add-xid-to-messages-when-streaming.patch.gz
Description: GNU Zip compressed data


v2-0001-Add-logical-decoding-messages-to-pgoutput.patch.gz
Description: GNU Zip compressed data


Re: Terminate the idle sessions

2020-11-17 Thread Li Japin

On Nov 18, 2020, at 10:40 AM, 
kuroda.hay...@fujitsu.com wrote:


I’m not familiar with the system interrupt handling; however,
sigalrm_due_at is assigned between HOLD_INTERRUPTS()
and RESUME_INTERRUPTS(), so I think it cannot be interrupted.
The following comments come from miscadmin.h.

Right, but what about before HOLD_INTERRUPTS()?
In that case only handle_sig_alarm() is called, and
the timer will not be re-armed via setitimer(), I think.

Yeah, that might happen. Any suggestions to fix it?

--
Best regards
Japin Li


Re: pl/pgsql feature request: shorthand for argument and local variable references

2020-11-17 Thread Pavel Stehule
On Tue, 17 Nov 2020 at 21:45, Chapman Flack wrote:

> On 11/17/20 15:18, Jack Christensen wrote:
> >> CREATE OR REPLACE FUNCTION very_long_name(par1 int)
> >> RETURNS int AS $$
> >> #routine_label lnm
> >> BEGIN
> >>   RAISE NOTICE '%', lnm.par1;
>
> Could that be somehow shoehorned into the existing ALIAS syntax, maybe as
>
> DECLARE
>   lnm ALIAS FOR ALL very_long_name.*;
>
> or something?
>

I thought about it, but I don't think that syntax is correct - in your
case it should be

  lnm.* ALIAS FOR ALL very_long_name.*;

but that looks a little bit scary in an ADA-based language.

Maybe

DECLARE lnm LABEL ALIAS FOR very_long_name;

or

DECLARE lnm ALIAS FOR LABEL very_long_name;

I think the implementation should not be hard. But there are advantages and
disadvantages: 1. Label aliasing increases the complexity of variable
searching (instead of comparing a string with the name of a namespace, you
have to compare a list of names). Probably not by too much. 2. The syntax is a
little bit heavier than #option. An implementation based on #option can just
rename the top namespace, so there is no overhead, and parsing the #option
syntax is pretty simple (and the syntax is shorter). So for implementation
reasons I prefer the #option-based syntax. There is clearly zero impact on
performance.

Regards

Pavel


>
> (And would it be cool if Table C.1 [1] had some sort of javascript-y
> filtering on reservedness categories, for just such kinds of bikeshedding?)
>
> Regards,
> -Chap
>
>
> [1]
> https://www.postgresql.org/docs/13/sql-keywords-appendix.html#KEYWORDS-TABLE
>


Re: Tab complete for CREATE OR REPLACE TRIGGER statement

2020-11-17 Thread Michael Paquier
On Mon, Nov 16, 2020 at 10:14:10PM -0500, Tom Lane wrote:
> Michael Paquier  writes:
>> I don't think that this is a requirement for this thread, though.
> 
> Agreed, I'm not trying to block this patch.  Just wishing
> there were a better way.

Okay.  I have tweaked a couple of things in the patch and applied it.
I am wondering if libreadline gives the possibility to implement an
optional grouping of words to complete, but diving into its
documentation I have not found something that we could use.
--
Michael




Re: [HACKERS] logical decoding of two-phase transactions

2020-11-17 Thread Amit Kapila
On Wed, Nov 18, 2020 at 7:54 AM Masahiko Sawada  wrote:
>
> On Tue, Nov 17, 2020 at 9:05 PM Amit Kapila  wrote:
> >
> > On Tue, Nov 17, 2020 at 5:02 PM Ajin Cherian  wrote:
> > >
> > > On Tue, Nov 17, 2020 at 10:14 PM Amit Kapila  
> > > wrote:
> > >
> > > >
> > > > Doesn't this happen only if you set replication origins? Because
> > > > otherwise both PrepareTransaction() and
> > > > RecordTransactionCommitPrepared() used the current timestamp.
> > > >
> > >
> > > I was also checking this, even if you set replicating origins, the
> > > preparedTransaction will reflect the local prepare time in
> > > pg_prepared_xacts. pg_prepared_xacts fetches this information
> > > from GlobalTransaction data which does not store the origin_timestamp;
> > > it only stores the prepared_at which is the local timestamp.
> > >
> >
> > Sure, but my question was does this difference in behavior happens
> > without replication origins in any way? The reason is that if it
> > occurs only with replication origins, I don't think we need to bother
> > about the same because that feature is not properly implemented and
> > not used as-is. See the discussion [1] [2]. OTOH, if this behavior can
> > happen without replication origins then we might want to consider
> > changing it.
>
> Logical replication workers always have replication origins, right? Is
> that what you meant by 'with replication origins'?
>

I was thinking with respect to the publisher side, but you are right
that logical apply workers always have replication origins, so the
effect will be visible. However, I think the same should be true on the
publisher without this patch as well: say the user has set up a
replication origin via pg_replication_origin_xact_setup() and provided a
timestamp value, then the same behavior will be there too.

> IIUC logical replication workers always set the origin's commit
> timestamp as the commit timestamp of the replicated transaction. OTOH,
> the timestamp of PREPARE, ‘prepare’ of pg_prepared_xacts, always uses
> the local timestamp even if the caller of PrepareTransaction() sets
> replorigin_session_origin_timestamp. In terms of user-visible
> timestamps of transaction operations, I think users might expect these
> timestamps are matched between the origin and its subscribers. But the
> pg_xact_commit_timestamp() is a function of the commit timestamp
> feature whereas ‘prepare’ is a pure timestamp when the transaction is
> prepared. So I’m not sure these timestamps really need to be matched,
> though.
>

Yeah, I am not sure if it is a good idea for users to rely on this,
especially if the same behavior is visible on the publisher as well.
We might want to think separately about whether there is value in making
the prepare time also rely on replorigin_session_origin_timestamp, and
if so, that can be done as a separate patch. What do you think?

-- 
With Regards,
Amit Kapila.




Re: abstract Unix-domain sockets

2020-11-17 Thread David G. Johnston
On Tue, Nov 17, 2020 at 7:00 PM Michael Paquier  wrote:

> On Tue, Nov 17, 2020 at 11:18:12PM +0100, Peter Eisentraut wrote:
> > So the mention of the "port" doesn't really add any information here and
> > just introduces new terminology that isn't really relevant.
> >
> > My idea is to change the message to:
> >
> > ERROR:  could not bind Unix address "/tmp/.s.PGSQL.5432": Address
> already in
> > use
> > HINT:  Is another postmaster already running at this address?
>
> Are you saying that you would remove the hint telling to remove the
> socket file even for the case of non-abstract files?  For abstract
> paths, this makes sense.  For both, removing the "port" part is indeed
> a good idea as long as you keep around the full socket file name.
>
>
(resending to the list)

Given that "port" is a postgresql.conf setting, its use here (and elsewhere)
should be taken to mean the value of that specific variable.  To that end,
I find the current description of port to be lacking - it should mention
its usage as a qualifier when dealing with unix socket files (in addition
to the existing wording under unix_socket_directories).

If we are going to debate semantics here, "bind unix address" doesn't seem
correct.  How about: could not create Unix socket file /tmp/.s.PGSQL.5432,
it already exists.

The hint would be better written: Is another postmaster running with
unix_socket_directories = /tmp and port = 5432?  If not, remove the unix
socket file /tmp/.s.PGSQL.5432 and retry.

I don't see much benefit in trying to share logic/wording between the
various messages and hints for the different ways the server can establish
communication points.

I agree that there isn't a useful hint for the abstract case as it
shouldn't happen unless there is indeed another running instance with the
same configuration.  Though a hint similar to the above, but without the
"remove and retry" bit, probably wouldn't hurt.

David J.


Re: abstract Unix-domain sockets

2020-11-17 Thread Craig Ringer
On Fri, Oct 9, 2020 at 3:28 PM Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> During the discussion on Unix-domain sockets on Windows, someone pointed
> out[0] abstract Unix-domain sockets.
>

This reminds me on a somewhat random note that SSPI mode authentication
should work out of the box for unix domain sockets on Windows.

The main reason we probably can't use it as a default is that SSPI isn't
easy to implement for pure language drivers, it requires Windows API calls
to interact with the windows auth services. It's a pain in JDBC for example.


RE: Terminate the idle sessions

2020-11-17 Thread kuroda.hay...@fujitsu.com
Dear Li,

> I’m not familiar with the system interrupt handling; however,
> sigalrm_due_at is assigned between HOLD_INTERRUPTS()
> and RESUME_INTERRUPTS(), so I think it cannot be interrupted.
> The following comments come from miscadmin.h.

Right, but what about before HOLD_INTERRUPTS()?
In that case only handle_sig_alarm() is called, and
the timer will not be re-armed via setitimer(), I think.

> Not sure! I find that Win32 does not support setitimer();
> PostgreSQL emulates setitimer() by creating a persistent thread to handle
> the timer setting and notification upon timeout.

Yes, set/getitimer() are system calls, and they are implemented only on
Unix-like systems. But on reflection, such a fix is really refactoring and
should be a separate topic, so we can ignore it here.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

From: Li Japin  
Sent: Tuesday, November 17, 2020 6:02 PM
To: Kuroda, Hayato/黒田 隼人 
Cc: David G. Johnston ; Kyotaro Horiguchi 
; Thomas Munro ; 
bharath.rupireddyforpostg...@gmail.com; pgsql-hackers@lists.postgresql.org
Subject: Re: Terminate the idle sessions


On Nov 17, 2020, at 2:07 PM, kuroda.hay...@fujitsu.com wrote:

Dear Li, David,
 
> Additionally, using postgres_fdw within the server doesn't cause issues;
> it's using postgres_fdw and the remote server having this setting set to zero
> that causes a problem.
 
I didn't know that postgres_fdw could be used within the server... Thanks.

I read the optimize-setitimer patch, and it looks basically good. I have written
down my understanding below, so please confirm whether your implementation
matches it.
(Maybe I missed some race conditions, so reviews from anyone are welcome...)
 
[basic concept]

sigalrm_due_at means the time at which the interval timer will ring, and
sigalrm_delivered records whether the alarm has been delivered, i.e. whether
the next schedule_alarm() must call setitimer() again.
If the fin_time of active_timeouts[0] is larger than or equal to sigalrm_due_at,
we can stop calling setitimer() because handle_sig_alarm() will be called sooner anyway.
 

Agreed. sigalrm_delivered means a timer has been handled by
handle_sig_alarm(), so we should call setitimer() in the next schedule_alarm().


[when setitimer is called]

In the attached patch, setitimer() will only be called in the following scenarios:

* when handle_sig_alarm() is called via the handler installed with pqsignal()
* when a timeout is registered and its fin_time is later than active_timeouts[0]
* when a timeout is disabled
* when handle_sig_alarm() is interrupted and rescheduled(?)

According to the comments, handle_sig_alarm() may be interrupted because of
ereport. I think that if handle_sig_alarm() is interrupted before setting
sigalrm_due_at to true, the interval timer will never be set. Is that correct,
or is my assumption wrong?
 

I’m not familiar with the system interrupt handling; however, sigalrm_due_at is
assigned between HOLD_INTERRUPTS() and RESUME_INTERRUPTS(), so I think it
cannot be interrupted.  The following comments come from miscadmin.h.

> The HOLD_INTERRUPTS() and RESUME_INTERRUPTS() macros
> allow code to ensure that no cancel or die interrupt will be accepted,
> even if CHECK_FOR_INTERRUPTS() gets called in a subroutine.  The interrupt
> will be held off until CHECK_FOR_INTERRUPTS() is done outside any
> HOLD_INTERRUPTS() ... RESUME_INTERRUPTS() section.



Lastly, I found that setitimer() is obsolete and should be changed to something
else. According to my man page:

```
POSIX.1-2001, SVr4, 4.4BSD (this call first appeared in 4.2BSD).
POSIX.1-2008 marks getitimer() and setitimer() obsolete,
recommending the use of the POSIX timers API (timer_gettime(2),
timer_settime(2), etc.) instead.
```

Do you have an opinion on this? I think it should be changed if all platforms
support the timer_settime() system call, but such a fix affects all timeouts,
so more consideration might be needed.
 

Not sure! I find that Win32 does not support setitimer(); PostgreSQL emulates
setitimer() by creating a persistent thread to handle the timer setting and
notification upon timeout.

So if we want to replace it, I think we should open a new thread to achieve
this.
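
For reference, here is a minimal sketch of the POSIX timers API mentioned in
the man page excerpt above (timer_create() delivering SIGALRM, then
timer_settime() to arm it). This is only an illustration of the replacement
API, not code from the patch:

```c
/* Sketch: arm a one-shot SIGALRM with the POSIX timers API instead of setitimer(). */
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static volatile sig_atomic_t fired = 0;

static void
handle_sigalrm(int signo)
{
    fired = 1;
}

int
main(void)
{
    struct sigaction sa = {0};
    struct sigevent sev = {0};
    struct itimerspec its = {0};
    timer_t     timerid;

    sa.sa_handler = handle_sigalrm;
    sigaction(SIGALRM, &sa, NULL);

    /* Create a timer that sends SIGALRM when it expires. */
    sev.sigev_notify = SIGEV_SIGNAL;
    sev.sigev_signo = SIGALRM;
    if (timer_create(CLOCK_MONOTONIC, &sev, &timerid) != 0)
    {
        perror("timer_create");
        return 1;
    }

    /* Arm it to fire once, 100 ms from now (it_interval stays zero). */
    its.it_value.tv_nsec = 100 * 1000 * 1000;
    if (timer_settime(timerid, 0, &its, NULL) != 0)
    {
        perror("timer_settime");
        return 1;
    }

    while (!fired)
        pause();                /* wait for the signal */

    printf("timer fired\n");
    timer_delete(timerid);
    return 0;
}
```

(On older glibc this needs -lrt at link time.)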

--
Best regards
Japin Li



Re: [doc] plan invalidation when statistics are update

2020-11-17 Thread Fujii Masao




On 2020/11/18 11:04, torikoshia wrote:

Hi,

AFAIU, when the planner statistics are updated, generic plans are invalidated
and PostgreSQL recreates them. However, the manual doesn't seem to explain it
explicitly.

   https://www.postgresql.org/docs/devel/sql-prepare.html

I guess this case is included in 'whenever database objects used in the
statement have undergone definitional (DDL) changes', but I feel it's hard to
infer.

Since updates of the statistics can often happen, how about describing this 
case explicitly like an attached patch?


+1 to add that note.

-   statement.  Also, if the value of  changes
+   statement. For example, when the planner statistics of the statement
+   are updated, PostgreSQL re-analyzes and
+   re-plans the statement.

I don't think "For example," is necessary.

"planner statistics of the statement" sounds vague. Is the statement
re-analyzed and re-planned only when the planner statistics of the database
objects used in the statement are updated? If yes, we should describe
that to make the note a bit more explicit.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: [HACKERS] logical decoding of two-phase transactions

2020-11-17 Thread Masahiko Sawada
On Tue, Nov 17, 2020 at 9:05 PM Amit Kapila  wrote:
>
> On Tue, Nov 17, 2020 at 5:02 PM Ajin Cherian  wrote:
> >
> > On Tue, Nov 17, 2020 at 10:14 PM Amit Kapila  
> > wrote:
> >
> > >
> > > Doesn't this happen only if you set replication origins? Because
> > > otherwise both PrepareTransaction() and
> > > RecordTransactionCommitPrepared() used the current timestamp.
> > >
> >
> > I was also checking this, even if you set replicating origins, the
> > preparedTransaction will reflect the local prepare time in
> > pg_prepared_xacts. pg_prepared_xacts fetches this information
> > from GlobalTransaction data which does not store the origin_timestamp;
> > it only stores the prepared_at which is the local timestamp.
> >
>
> Sure, but my question was does this difference in behavior happens
> without replication origins in any way? The reason is that if it
> occurs only with replication origins, I don't think we need to bother
> about the same because that feature is not properly implemented and
> not used as-is. See the discussion [1] [2]. OTOH, if this behavior can
> happen without replication origins then we might want to consider
> changing it.

Logical replication workers always have replication origins, right? Is
that what you meant 'with replication origins'?

IIUC logical replication workers always set the origin's commit
timestamp as the commit timestamp of the replicated transaction. OTOH,
the timestamp of PREPARE, ‘prepare’ of pg_prepared_xacts, always uses
the local timestamp even if the caller of PrepareTransaction() sets
replorigin_session_origin_timestamp. In terms of user-visible
timestamps of transaction operations, I think users might expect these
timestamps are matched between the origin and its subscribers. But the
pg_xact_commit_timestamp() is a function of the commit timestamp
feature whereas ‘prepare’ is a pure timestamp when the transaction is
prepared. So I’m not sure these timestamps really need to be matched,
though.

Regards,

-- 
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/




[doc] plan invalidation when statistics are update

2020-11-17 Thread torikoshia

Hi,

AFAIU, when the planner statistics are updated, generic plans are
invalidated and PostgreSQL recreates them. However, the manual doesn't seem
to explain it explicitly.


  https://www.postgresql.org/docs/devel/sql-prepare.html

I guess this case is included in 'whenever database objects used in the
statement have undergone definitional (DDL) changes', but I feel it's
hard to infer.


Since updates of the statistics can often happen, how about describing 
this case explicitly like an attached patch?



Regards,

--
Atsushi Torikoshi

From d71dbb0b100f706f19d92175b72f9e1833a8a442 Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi 
Date: Thu, 12 Nov 2020 17:18:29 +0900
Subject: [PATCH v1] When the planner statistics are updated, generic plans are
 invalidated and PostgreSQL recreates them. However, the manual didn't explain
 this explicitly. This patch adds this case as an example.

---
 doc/src/sgml/ref/prepare.sgml | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/ref/prepare.sgml b/doc/src/sgml/ref/prepare.sgml
index 57a34ff83c..4075de5689 100644
--- a/doc/src/sgml/ref/prepare.sgml
+++ b/doc/src/sgml/ref/prepare.sgml
@@ -185,7 +185,10 @@ EXPLAIN EXECUTE name(parameter_values changes
+   statement. For example, when the planner statistics of the statement
+   are updated, PostgreSQL re-analyzes and
+   re-plans the statement.
+   Also, if the value of  changes
from one use to the next, the statement will be re-parsed using the new
search_path.  (This latter behavior is new as of
PostgreSQL 9.3.)  These rules make use of a
-- 
2.18.1



[doc] adding way to examine the plan type of prepared statements

2020-11-17 Thread torikoshia

Hi,


Currently, EXPLAIN is the only way to know whether the plan is generic 
or custom according to the manual of PREPARE.


  https://www.postgresql.org/docs/devel/sql-prepare.html

After commit d05b172, we can also use pg_prepared_statements view to 
examine the plan types.


How about adding this explanation like the attached patch?


Regards,

--
Atsushi Torikoshi

From 2c8f66637075fcb2f802a2b9cfd354f2ef18 Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi 
Date: Thu, 12 Nov 2020 17:00:19 +0900
Subject: [PATCH v1] After commit d05b172, we can use pg_prepared_statements
 view to examine whether the plan is generic or custom. This patch adds this
 explanation in the manual of PREPARE.

---
 doc/src/sgml/ref/prepare.sgml | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/doc/src/sgml/ref/prepare.sgml b/doc/src/sgml/ref/prepare.sgml
index 57a34ff83c..2268c222a9 100644
--- a/doc/src/sgml/ref/prepare.sgml
+++ b/doc/src/sgml/ref/prepare.sgml
@@ -179,6 +179,13 @@ EXPLAIN EXECUTE name(parameter_values
 
+  
+   To examine how many times each prepared statement chose generic and
+   custom plan cumulatively in the current session, refer
+   pg_prepared_statements
+   system view.
+  
+
   
Although the main point of a prepared statement is to avoid repeated parse
analysis and planning of the statement, PostgreSQL will
-- 
2.18.1



Re: VACUUM (DISABLE_PAGE_SKIPPING on)

2020-11-17 Thread Alvaro Herrera
On 2020-Nov-17, Simon Riggs wrote:

> As an additional optimization, if we do find a row that needs freezing
> on a data block, we should simply freeze *all* row versions on the
> page, not just the ones below the selected cutoff. This is justified
> since writing the block is the biggest cost and it doesn't make much
> sense to leave a few rows unfrozen on a block that we are dirtying.

Yeah.  We've had earlier proposals to use high and low watermarks: if any
tuple is past the high watermark, then freeze all tuples that are past
the low watermark.  However this is ancient thinking (prior to
HEAP_XMIN_FROZEN) and we don't need the low watermark to be different
from zero, since the original xid is retained anyway.

So +1 for this idea.




Re: abstract Unix-domain sockets

2020-11-17 Thread Michael Paquier
On Tue, Nov 17, 2020 at 11:18:12PM +0100, Peter Eisentraut wrote:
> So the mention of the "port" doesn't really add any information here and
> just introduces new terminology that isn't really relevant.
> 
> My idea is to change the message to:
> 
> ERROR:  could not bind Unix address "/tmp/.s.PGSQL.5432": Address already in
> use
> HINT:  Is another postmaster already running at this address?

Are you saying that you would remove the hint telling to remove the
socket file even for the case of non-abstract files?  For abstract
paths, this makes sense.  For both, removing the "port" part is indeed
a good idea as long as you keep around the full socket file name.
--
Michael




Re: remove spurious CREATE INDEX CONCURRENTLY wait

2020-11-17 Thread Michael Paquier
On Tue, Nov 17, 2020 at 02:14:53PM -0300, Alvaro Herrera wrote:
> diff --git a/src/backend/replication/logical/logical.c 
> b/src/backend/replication/logical/logical.c
> index f1f4df7d70..4324e32656 100644
> --- a/src/backend/replication/logical/logical.c
> +++ b/src/backend/replication/logical/logical.c
> @@ -181,7 +181,7 @@ StartupDecodingContext(List *output_plugin_options,
>*/
>   if (!IsTransactionOrTransactionBlock())
>   {
> - LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
> + LWLockAcquire(ProcArrayLock, LW_SHARED);
>   MyProc->statusFlags |= PROC_IN_LOGICAL_DECODING;
>   ProcGlobal->statusFlags[MyProc->pgxactoff] = 
> MyProc->statusFlags;
>   LWLockRelease(ProcArrayLock);

We don't really document that it is safe to update statusFlags while
holding a shared lock on ProcArrayLock, right?  Could this be
clarified at least in proc.h?

> + /* Determine whether we can call set_safe_index_flag */
> + safe_index = indexInfo->ii_Expressions == NIL &&
> + indexInfo->ii_Predicate == NIL;

This should explain why we check for expressions and predicates; in short, it
should give an explanation of the snapshot issues with build and validation
when executing those expressions and/or predicates.

> + * Set the PROC_IN_SAFE_IC flag in my PGPROC entry.
> + *
> + * When doing concurrent index builds, we can set this flag
> + * to tell other processes concurrently running CREATE
> + * INDEX CONCURRENTLY to ignore us when
> + * doing their waits for concurrent snapshots.  On one hand it
> + * avoids pointlessly waiting for a process that's not interesting
> + * anyway, but more importantly it avoids deadlocks in some cases.
> + *
> + * This can only be done for indexes that don't execute any expressions.
> + * Caller is responsible for only calling this routine when that
> + * assumption holds true.
> + *
> + * (The flag is reset automatically at transaction end, so it must be
> + * set for each transaction.)

Would it be better to document that this function had better be called
after beginning a transaction?  I am wondering if we should not
enforce some conditions here, say this one or similar:
Assert(GetTopTransactionIdIfAny() == InvalidTransactionId);

> + */
> +static inline void
> +set_safe_index_flag(void)

This routine name is rather close to index_set_state_flags(), could it
be possible to finish with a better name?
--
Michael




Re: Move OpenSSL random under USE_OPENSSL_RANDOM

2020-11-17 Thread Michael Paquier
On Tue, Nov 17, 2020 at 09:24:30PM +0100, Daniel Gustafsson wrote:
> I tend to agree, randomness is complicated enough without adding a compile 
> time
> extensibility which few (if anyone) will ever use.  Attached is an attempt at
> this.

Going down to that, it seems to me that we could just remove
USE_WIN32_RANDOM (as this is implied by WIN32), as well as
USE_DEV_URANDOM because configure.ac checks for the existence of
/dev/urandom, no?  In short, configure.ac could be changed to check for
/dev/urandom when not using OpenSSL and not on Windows.

-elif test x"$USE_WIN32_RANDOM" = x"1" ; then
+elif test x"$PORTANME" = x"win32" ; then
Typo here, s/PORTANME/PORTNAME.
--
Michael




Re: Zedstore - compressed in-core columnar storage

2020-11-17 Thread Jacob Champion
On Nov 13, 2020, at 2:00 PM, Tomas Vondra  wrote:
> 
> Fedora 32, nothing special. I'm not sure if I ran the tests with pglz or
> lz4, maybe there's some dependence on that, but it does fail for me
> quite reliably with this:
> 
> ./configure --enable-debug  --enable-cassert --enable-tap-tests
> --with-lz4 && make -s clean && make -s -j4 && make check-world

I'm not sure what I messed up the first time, but I am able to reproduce
reliably now, with and without lz4. It looks like we have a workaround
in place that significantly increases the number of simultaneous locks
acquired during indexing:

#define XLR_MAX_BLOCK_ID    199

So that's in need of resolution. I'd expect gin and gist to be pretty
flaky until we fix that.

--Jacob



Re: abstract Unix-domain sockets

2020-11-17 Thread Peter Eisentraut

On 2020-11-12 08:12, Michael Paquier wrote:

On Wed, Nov 11, 2020 at 01:39:17PM +0100, Peter Eisentraut wrote:

Thinking about it further, I think the hint in the Unix-domain socket case
is bogus.  A socket in the file-system namespace never reports EADDRINUSE
anyway, it just overwrites the file.  For sockets in the abstract namespace,
you can get this error, but of course there is no file to remove.

Perhaps we should change the hint in both the Unix and the IP cases to:

"Is another postmaster already running at this address?"
(This also resolves the confusing reference to "port" in the Unix case.)

Er, it is perfectly possible for two postmasters to use the same unix
socket path, abstract or not, as long as they listen to different
ports (all nodes in a single TAP test do that for example).  So we
should keep a reference to the port used in the log message, no?


"Port" is not a real thing for Unix-domain sockets, it's just something 
we use internally and append to the socket file.  The error message is 
currently something like


ERROR:  could not bind Unix address "/tmp/.s.PGSQL.5432": Address 
already in use
HINT:  Is another postmaster already running on port 5432? If not, 
remove socket file "/tmp/.s.PGSQL.5432" and retry.


So the mention of the "port" doesn't really add any information here and 
just introduces new terminology that isn't really relevant.


My idea is to change the message to:

ERROR:  could not bind Unix address "/tmp/.s.PGSQL.5432": Address 
already in use

HINT:  Is another postmaster already running at this address?

--
Peter Eisentraut
2ndQuadrant, an EDB company
https://www.2ndquadrant.com/




Re: ERROR: too many dynamic shared memory segment

2020-11-17 Thread Thomas Munro
On Wed, Nov 18, 2020 at 10:55 AM Thomas Munro  wrote:
> a few times those ways, also due to an ancient DSM leak that was fixed
> in 93745f1e (fix present in 12.4).

I take that bit back -- I misremembered -- that was a leak of the
actual memory once slots were exhausted, not a leak of the slot.




Re: ERROR: too many dynamic shared memory segment

2020-11-17 Thread Thomas Munro
On Wed, Nov 18, 2020 at 4:21 AM Ben Kanouse  wrote:
> My database is experiencing a 'ERROR: too many dynamic shared memory
> segment' error from time to time. It seems to happen most when traffic
> is high, and it happens with semi-simple select statements that run a
> parallel query plan with a parallel bitmap heap scan. It is currently
> on version 12.4.
>
> Here is an example query: https://explain.depesz.com/s/aatA

Hmm, not sure how this plan could cause the problem, even if running
many copies of it.  In the past, this problem has been reported from
single queries that had a very high number of separate Gather nodes,
or very very large parallel hash joins, or, once you've hit the error
a few times those ways, also due to an ancient DSM leak that was fixed
in 93745f1e (fix present in 12.4).

> I noticed in version 13.1 there is a commit that changes the
> multiplier from 2 to 5:
> https://github.com/postgres/postgres/commit/d061ea21fc1cc1c657bb5c742f5c4a1564e82ee2
>
> maxitems  = 64 + 5 * MaxBackends
>
> Should this commit be back-ported to earlier versions of postgres to
> prevent this error in other versions?

Yeah, that seems like a good idea anyway.  I will do that tomorrow,
barring objections.
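
For a rough sense of scale (assuming MaxBackends works out to roughly 120 with
default settings, i.e. max_connections plus autovacuum workers, background
workers and WAL senders): the old formula 64 + 2 * MaxBackends allows around
300 DSM slots, while the back-patched 64 + 5 * MaxBackends allows around 660,
giving concurrent parallel queries considerably more headroom before hitting
the error.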




Re: proposal: possibility to read dumped table's name from file

2020-11-17 Thread Justin Pryzby
On Wed, Nov 11, 2020 at 06:49:43AM +0100, Pavel Stehule wrote:
> Perhaps this feature could co-exist with a full blown configuration for
> >> pg_dump, but even then there's certainly issues with what's proposed-
> >> how would you handle explicitly asking for a table which is named
> >> "  mytable" to be included or excluded?  Or a table which has a newline
> >> in it?  Using a standardized format which supports the full range of
> >> what we do in a table name, explicitly and clearly, would address these
> >> issues and also give us the flexibility to extend the options which
> >> could be used through the configuration file beyond just the filters in
> >> the future.

I think it's a reasonable question - why would a new configuration file option
include support for only a handful of existing arguments but not the rest.

> > This is the correct argument - I will check a possibility to use strange
> > names, but there is the same possibility and functionality like we allow
> > from the command line. So you can use double quoted names. I'll check it.
> 
> I checked
> echo "+t \"bad Name\"" | /usr/local/pgsql/master/bin/pg_dump 
> --filter=/dev/stdin
> It is working without any problem

I think it couldn't possibly work with newlines, since you call pg_get_line().
I realize that entering a newline into the shell would also be a PITA, but that
could be one *more* reason to support a config file - to allow terrible table
names to be in a file and avoid writing dash tee quote something enter else
quote in a pg_dump command, or shell script.

I fooled with argument parsing to handle reading from a file in the quickest
way.  As written, this fails to handle multiple config files, and special table
names, which need to support arbitrary, logical lines, with quotes surrounding
newlines or other special chars.  As written, the --config file is parsed
*after* all other arguments, so it could override previous args (like
--no-blobs --no-blogs, --file, --format, --compress, --lock-wait), which I
guess is bad, so the config file should be processed *during* argument parsing.
Unfortunately, I think that suggests duplicating all/most of the argument
parsing for config file support - I'd be happy if someone suggested a
better way.

BTW, in your most recent patch:
s/empty rows/empty lines/
unbalanced parens: "invalid option type (use [+-]"

@cfbot: I renamed the patch so please ignore it.

-- 
Justin
From 8bff976d3d787898e71d3c8c893d5f9093268569 Mon Sep 17 00:00:00 2001
From: Pavel Stehule 
Date: Wed, 11 Nov 2020 06:54:22 +0100
Subject: [PATCH 1/2] proposal: possibility to read dumped table's name from
 file
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit


Hi

On Thu, 24 Sep 2020 at 19:47, Pavel Stehule wrote:

> Hi
>
> rebase + minor change - using pg_get_line_buf instead pg_get_line_append
>
>
fresh rebase

Regards

Pavel


---
 doc/src/sgml/ref/pg_dump.sgml   |  50 +++
 src/bin/pg_dump/pg_dump.c   | 168 
 src/bin/pg_dump/t/004_pg_dump_filter.pl | 129 ++
 3 files changed, 347 insertions(+)
 create mode 100644 src/bin/pg_dump/t/004_pg_dump_filter.pl

diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 0aa35cf0c3..068821583f 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -751,6 +751,56 @@ PostgreSQL documentation
   
  
 
+ 
+  --filter=filename
+  
+   
+Read objects filters from the specified file.
+If you use "-" as a filename, the filters are read from stdin.
+The lines of this file must have the following format:
+
+(+|-)[tnfd] objectname
+
+   
+
+   
+The first character specifies whether the object is to be included
+(+) or excluded (-), and the
+second character specifies the type of object to be filtered:
+t (table),
+n (schema),
+f (foreign server),
+d (table data).
+   
+
+   
+With the following filter file, the dump would include table
+mytable1 and data from foreign tables of
+some_foreign_server foreign server, but exclude data
+from table mytable2.
+
++t mytable1
++f some_foreign_server
+-d mytable2
+
+   
+
+   
+The lines starting with symbol # are ignored.
+Previous white chars (spaces, tabs) are not allowed. These
+lines can be used for comments, notes. Empty lines are ignored too.
+   
+
+   
+The --filter option works just like the other
+options to include or exclude tables, schemas, table data, or 

Re: Protect syscache from bloating with negative cache entries

2020-11-17 Thread Robert Haas
On Tue, Nov 17, 2020 at 10:46 AM Heikki Linnakangas  wrote:
> 0.7% degradation is probably acceptable.

I haven't looked at this patch in a while and I'm pleased with the way
it seems to have been redesigned. It seems relatively simple and
unlikely to cause big headaches. I would say that 0.7% is probably not
acceptable on a general workload, but it seems fine on a benchmark
that is specifically designed to be a worst-case for this patch, which
I gather is what's happening here. I think it would be nice if we
could enable this feature by default. Does it cause a measurable
regression on realistic workloads when enabled? I bet a default of 5
or 10 minutes would help many users.

One idea for improving things might be to move the "return
immediately" tests in CatCacheCleanupOldEntries() to the caller, and
only call this function if they indicate that there is some purpose.
This would avoid the function call overhead when nothing can be done.
Perhaps the two tests could be combined into one and simplified. Like,
suppose the code looks (roughly) like this:

if (catcacheclock >= time_at_which_we_can_prune)
CatCacheCleanupOldEntries(...);

To make it that simple, we want catcacheclock and
time_at_which_we_can_prune to be stored as bare uint64 quantities so
we don't need TimestampDifference(). And we want
time_at_which_we_can_prune to be set to PG_UINT64_MAX when the feature
is disabled. But those both seem like pretty achievable things... and
it seems like the result would probably be faster than what you have
now.
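
A minimal sketch of that shape, just to make the suggestion concrete (the
variable and function names here are hypothetical, not taken from the patch):

```c
/*
 * Sketch: keep the clock and the prune deadline as bare uint64 values so the
 * hot path is one comparison; UINT64_MAX means "pruning disabled", so the
 * branch is simply never taken.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static uint64_t catcacheclock;                          /* advanced periodically */
static uint64_t time_at_which_we_can_prune = UINT64_MAX;

static void
update_prune_deadline(uint64_t last_prune_time, uint64_t prune_min_age, bool enabled)
{
    time_at_which_we_can_prune =
        enabled ? last_prune_time + prune_min_age : UINT64_MAX;
}

static void
maybe_prune(void)
{
    /* Cheap test on the hot path; the real cleanup work lives elsewhere. */
    if (catcacheclock >= time_at_which_we_can_prune)
        printf("would call CatCacheCleanupOldEntries() here\n");
}

int
main(void)
{
    update_prune_deadline(0, 300, true);    /* e.g. prune entries older than 300 "ticks" */
    catcacheclock = 500;
    maybe_prune();
    return 0;
}
```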

+ * per-statement basis and additionaly udpated periodically

two words spelled wrong

+void
+assign_catalog_cache_prune_min_age(int newval, void *extra)
+{
+ catalog_cache_prune_min_age = newval;
+}

hmm, do we need this?

+ /*
+ * Entries that are not accessed after the last pruning
+ * are removed in that seconds, and their lives are
+ * prolonged according to how many times they are accessed
+ * up to three times of the duration. We don't try shrink
+ * buckets since pruning effectively caps catcache
+ * expansion in the long term.
+ */
+ ct->naccess = Min(2, ct->naccess);

The code doesn't match the comment, it seems, because the limit here
is 2, not 3. I wonder if this does anything anyway. My intuition is
that when a catcache entry gets accessed at all it's probably likely
to get accessed a bunch of times. If there are any meaningful
thresholds here I'd expect us to be trying to distinguish things like
1000+ accesses vs. 100-1000 vs. 10-100 vs. 1-10. Or maybe we don't
need to distinguish at all and can just have a single mark bit rather
than a counter.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Deleting older versions in unique indexes to avoid page splits

2020-11-17 Thread Peter Geoghegan
On Tue, Nov 17, 2020 at 7:24 AM Victor Yegorov  wrote:
> I've looked through the code and it looks very good from my end:
> - plenty comments, good description of what's going on
> - I found no loose ends in terms of AM integration
> - magic constants replaced with defines
> Code looks good. Still, it'd be good if somebody with more experience could 
> look into this patch.

Great, thank you.

> Question: why in the comments you're using double spaces after dots?
> Is this a convention of the project?

Not really. It's based on my habit of trying to be as consistent as
possible with existing code.

There seems to be a weak consensus among English speakers on this
question, which is: the two space convention is antiquated, and only
ever made sense in the era of mechanical typewriters. I don't really
care either way, and I doubt that any other committer pays much
attention to these things. You may have noticed that I use only one
space in my e-mails.

Actually, I probably shouldn't care about it myself. It's just what I
decided to do at some point. I find it useful to decide that this or
that practice is now a best practice, and then stick to it without
thinking about it very much (this frees up space in my head to think
about more important things). But this particular habit of mine around
spaces is definitely not something I'd insist on from other
contributors. It's just that: a habit.

> I am thinking of two more scenarios that require testing:
> - queue in the table, with a high rate of INSERTs+DELETEs and a long 
> transaction.

I see your point. This is going to be hard to make work outside of
unique indexes, though. Unique indexes are already not dependent on
the executor hint -- they can just use the "uniquedup" hint. The code
for unique indexes is prepared to notice duplicates in
_bt_check_unique() in passing, and apply the optimization for that
reason.

Maybe there is some argument to forgetting about the hint entirely,
and always assuming that we should try to find tuples to delete at the
point that a page is about to be split. I think that that argument is
a lot harder to make, though. And it can be revisited in the future.
It would be nice to do better with INSERTs+DELETEs, but that's surely
not the big problem for us right now.

I realize that this unique indexes/_bt_check_unique() thing is not
even really a partial fix to the problem you describe. The indexes
that have real problems with such an INSERTs+DELETEs workload will
naturally not be unique indexes -- _bt_check_unique() already does a
fairly good job of controlling bloat without bottom-up deletion.

> - upgraded cluster with !heapkeyspace indexes.

I do have a patch that makes that easy to test, that I used for the
Postgres 13 deduplication work -- I can rebase it and post it if you
like. You will be able to apply the patch, and run the regression
tests with a !heapkeyspace index. This works with only one or two
tweaks to the tests (IIRC the amcheck tests need to be tweaked in one
place for this to work). I don't anticipate that !heapkeyspace indexes
will be a problem, because they won't use any of the new stuff anyway,
and because nothing about the on-disk format is changed by bottom-up
index deletion.

-- 
Peter Geoghegan




Re: pl/pgsql feature request: shorthand for argument and local variable references

2020-11-17 Thread Chapman Flack
On 11/17/20 15:18, Jack Christensen wrote:
>> CREATE OR REPLACE FUNCTION very_long_name(par1 int)
>> RETURNS int AS $$
>> #routine_label lnm
>> BEGIN
>>   RAISE NOTICE '%', lnm.par1;

Could that be somehow shoehorned into the existing ALIAS syntax, maybe as

DECLARE
  lnm ALIAS FOR ALL very_long_name.*;

or something?

(And would it be cool if Table C.1 [1] had some sort of javascript-y
filtering on reservedness categories, for just such kinds of bikeshedding?)

Regards,
-Chap


[1] https://www.postgresql.org/docs/13/sql-keywords-appendix.html#KEYWORDS-TABLE




Re: Move OpenSSL random under USE_OPENSSL_RANDOM

2020-11-17 Thread Daniel Gustafsson
> On 16 Nov 2020, at 16:06, Tom Lane  wrote:
> 
> Magnus Hagander  writes:
>> I agree with those -- either we remove the ability to choose random source
>> independently of the SSL library (and then only use the windows crypto
>> provider or /dev/urandom as platform-specific choices when *no* SSL library
>> is used), and in that case we should not have separate #ifdef's for them.
>> Or we fix the includes. Which is obviously easier, but we should take the
>> time to do what we think is right long-term of course.
> 
> FWIW, I'd vote for the former.  I think the presumption that OpenSSL's
> random-number machinery can be used without any other initialization is
> shaky as heck.

I tend to agree, randomness is complicated enough without adding a compile time
extensibility which few (if anyone) will ever use.  Attached is an attempt at
this.

cheers ./daniel



0001-Remove-ability-to-choose-randomness-source.patch
Description: Binary data


Re: pl/pgsql feature request: shorthand for argument and local variable references

2020-11-17 Thread Jack Christensen
On Tue, Nov 17, 2020 at 1:36 PM Pavel Stehule 
wrote:

>
> Personally, in your example I very much like the notation "update_item.id",
> because there is a clear signal that "id" is the function's argument. When
> you use "$id", it is not clear whether "id" is a local variable or the
> function's argument. So your proposal decreases safety :-/. Plus this
> syntax reduces collisions only on one side; you still have to use aliases
> for SQL identifiers, so it is not balanced - in MS SQL I can write the
> predicate id = @id, but that is not possible in your proposal (and never
> will be, for compatibility reasons).
>
> Moreover, we already have the possibility to ALIAS any variable:
> https://www.postgresql.org/docs/current/plpgsql-declarations.html#PLPGSQL-DECLARATION-ALIAS
>
> I understand that there can be problems with functions with very long
> names.
>

Right. The problem is most pronounced when the function name is long, and
in particular when there are a lot of optional arguments. In that case the
caller will be using named arguments, so I want to avoid prefixing the
parameter names. And while ALIAS is an option, it can also get quite
verbose when there are a large number of parameters.
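
To make the verbosity concrete, here is a purely hypothetical sketch (the
function, table, and parameter names are made up for illustration) of how
much qualification is needed today with a long function name and optional,
named arguments:

create function update_customer_shipping_preferences(
  customer_id int,
  carrier text default null,
  signature_required bool default null
) returns void
language plpgsql as $$
begin
  update customer_shipping_preferences csp
  set carrier = coalesce(update_customer_shipping_preferences.carrier, csp.carrier),
      signature_required = coalesce(update_customer_shipping_preferences.signature_required,
                                    csp.signature_required)
  where csp.customer_id = update_customer_shipping_preferences.customer_id;
end;
$$;

-- callers use named arguments:
select update_customer_shipping_preferences(customer_id => 42, carrier => 'DHL');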


> So I think introducing new syntax is not necessary. The open question is
> whether aliasing can be made more comfortable. The Ada language has the
> ability to rename a function or procedure, but that is much stronger than
> what can be implemented in plpgsql. Probably the easiest implementation
> would be the possibility to specify a new label for the arguments using
> the already supported #option syntax.
>
> CREATE OR REPLACE FUNCTION very_long_name(par1 int)
> RETURNS int AS $$
> #routine_label lnm
> BEGIN
>   RAISE NOTICE '%', lnm.par1;
>
>
I concur. This would be an improvement.

Jack


Re: pg_proc.dat "proargmodes is not a 1-D char array"

2020-11-17 Thread Tom Lane
Robert Haas  writes:
> On Tue, Nov 17, 2020 at 10:32 AM Tom Lane  wrote:
>> Adding the expected length to the error message might be OK though.

> Certainly seems like we should do at least that much. The current
> message is just wrong, right?

It's incomplete, for sure.  Doesn't mention nulls either.

regards, tom lane




Re: [patch] [doc] Clarify that signal functions have no feedback

2020-11-17 Thread David G. Johnston
On Tue, Nov 17, 2020 at 6:13 AM Heikki Linnakangas  wrote:

> On 02/11/2020 18:02, David G. Johnston wrote:
> > diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
> > index bf6004f321..43bc2cf086 100644
> > --- a/doc/src/sgml/func.sgml
> > +++ b/doc/src/sgml/func.sgml
> > @@ -23892,7 +23892,8 @@ SELECT collation for ('foo' COLLATE "de_DE");
> >
> > 
> >  The functions shown in <xref
> > -linkend="functions-admin-signal-table"/> send control signals to
> > +linkend="functions-admin-signal-table"/> send uni-directional
> > +control signals to
> >  other server processes.  Use of these functions is restricted to
> >  superusers by default but access may be granted to others using
> >  GRANT, with noted exceptions.
>
> The "uni-directional" sounds a bit redundant, "send" implies that it's
> uni-directional I think.
>

Agreed, the other two changes sufficiently address the original complaint.

> You can use the pg_file_settings and pg_hba_file_rules views to check
> the configuration files for possible errors, before reloading.
>

I agree with adding "why" you want to check those links, and added a bit
about why doing so before reloading works, but the phrasing using "You"
seems out-of-place in this region of the documentation.

v3 attached

David J.


v3-doc-pg-reload-conf-signaling.patch
Description: Binary data


Re: pg_proc.dat "proargmodes is not a 1-D char array"

2020-11-17 Thread Robert Haas
On Tue, Nov 17, 2020 at 10:32 AM Tom Lane  wrote:
> Adding the expected length to the error message might be OK though.

Certainly seems like we should do at least that much. The current
message is just wrong, right?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2020-11-17 Thread Alexey Kondratov

Hi,

On 2020-11-06 18:56, Anastasia Lubennikova wrote:

Status update for a commitfest entry.

This thread was inactive for a while and from the latest messages, I
see that the patch needs some further work.
So I move it to "Waiting on Author".

The new status of this patch is: Waiting on Author


I had a look at the initial patch and discussed options [1] for how to proceed 
with this issue. I agree with Bruce about idle_session_timeout; it would be a 
nice-to-have in-core feature on its own. However, it would be a cluster-wide 
option and it would start dropping all idle connections, not only foreign 
ones. So it may not be an option in some cases, when the same foreign server 
is also used for other workloads.


Regarding the initial issue, I prefer point #3, i.e. a foreign server 
option. It has a couple of benefits IMO: 1) it may be set separately on a 
per-foreign-server basis, 2) it lives only in the postgres_fdw contrib 
module without any need to touch core. I would only supplement this 
postgres_fdw foreign server option with a GUC, e.g. 
postgres_fdw.keep_connections, so one could easily define such behavior 
for all foreign servers at once, or override the server-level option by 
setting this GUC on a per-session basis.


Attached is a small POC patch, which implements this contrib-level 
postgres_fdw.keep_connections GUC. What do you think?
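
For illustration, with the POC patch applied the GUC could be used like this
(the role name is made up; since it is PGC_USERSET it can be set per session
or per role):

-- keep the default caching behavior globally, but drop remote connections
-- at transaction end for the current session only:
SET postgres_fdw.keep_connections = off;

-- or make it the default for a particular role:
ALTER ROLE reporting_user SET postgres_fdw.keep_connections = off;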


[1] 
https://www.postgresql.org/message-id/CALj2ACUFNydy0uo0JL9A1isHQ9pFe1Fgqa_HVanfG6F8g21nSQ%40mail.gmail.com



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index ab3226287d..64f0e96635 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -28,6 +28,8 @@
 #include "utils/memutils.h"
 #include "utils/syscache.h"
 
+#include "postgres_fdw.h"
+
 /*
  * Connection cache hash table entry
  *
@@ -948,6 +950,7 @@ pgfdw_xact_callback(XactEvent event, void *arg)
 		 */
 		if (PQstatus(entry->conn) != CONNECTION_OK ||
 			PQtransactionStatus(entry->conn) != PQTRANS_IDLE ||
+			!keep_connections ||
 			entry->changing_xact_state)
 		{
 			elog(DEBUG3, "discarding connection %p", entry->conn);
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 9c5aaacc51..4cd5f71223 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -45,6 +45,8 @@
 #include "utils/sampling.h"
 #include "utils/selfuncs.h"
 
+#include "postgres_fdw.h"
+
 PG_MODULE_MAGIC;
 
 /* Default CPU cost to start up a foreign query. */
@@ -301,6 +303,8 @@ typedef struct
 	List	   *already_used;	/* expressions already dealt with */
 } ec_member_foreign_arg;
 
+bool keep_connections = true;
+
 /*
  * SQL functions
  */
@@ -505,6 +509,15 @@ static void merge_fdw_options(PgFdwRelationInfo *fpinfo,
 			  const PgFdwRelationInfo *fpinfo_o,
 			  const PgFdwRelationInfo *fpinfo_i);
 
+void
+_PG_init(void)
+{
+	DefineCustomBoolVariable("postgres_fdw.keep_connections",
+			 "Enables postgres_fdw connection caching.",
+			 "When off postgres_fdw will close connections at the end of transaction.",
			 &keep_connections, true, PGC_USERSET, 0, NULL,
+			 NULL, NULL);
+}
 
 /*
  * Foreign-data wrapper handler function: return a struct with pointers
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index eef410db39..7f1bdb96d6 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -124,9 +124,12 @@ typedef struct PgFdwRelationInfo
 	int			relation_index;
 } PgFdwRelationInfo;
 
+extern bool keep_connections;
+
 /* in postgres_fdw.c */
 extern int	set_transmission_modes(void);
 extern void reset_transmission_modes(int nestlevel);
+extern void _PG_init(void);
 
 /* in connection.c */
 extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);


Re: pl/pgsql feature request: shorthand for argument and local variable references

2020-11-17 Thread Pavel Stehule
Hi

On Tue, Nov 17, 2020 at 19:04, Jack Christensen wrote:

> When arguments and other local variables in pl/pgsql functions have the
> same name as columns referenced in queries it is necessary to disambiguate
> the names. This can be done by prefixing the function name (e.g.
> my_func.name), using the argument number in the case of an argument (e.g.
> $1), or renaming the variable (e.g. _name). It is also possible to use a
> GUC to always use the variable or the column but that seems dangerous to me.
>
> Prefixing names with an underscore works well enough for local variables,
> but when using named arguments I prefer the external name not require an
> underscore. I would like to suggest a standard prefix such as $ to
> reference a local variable or argument. $ followed by an integer already
> references an argument by ordinal. What if $ followed by a name meant a
> local reference (essentially it would expand to "my_func.")?
>
> For example, currently I have to do something like this:
>
> create function update_item(id int, foo int, bar text) returns void
> language plpgsql as $$
> begin
>   update items
>   set foo = update_item.foo,
> bar = update_item.bar
>   where items.id = update_item.id;
> end;
> $$;
>
> I would like to be able to do something like:
>
> create function update_item(id int, foo int, bar text) returns void
> language plpgsql as $$
> begin
>   update items
>   set foo = $foo,
> bar = $bar
>   where items.id = $id;
> end;
> $$;
>
> Any opinions on the desirability of this feature? My C skills are rather
> atrophied, but from the outside it seems like a small enough change I might
> be able to tackle it...
>

I don't like this proposal too much. Introducing yet another syntax
for writing local variables doesn't look like a good idea to me. Moreover,
this syntax is used by languages that are very different from PL/pgSQL, and
that can get messy. The behaviour of local variables in PHP or Perl or shell
is really very different.

Personally, in your example I very much like the notation "update_item.id",
because there is a clear signal that "id" is the function's argument. When
you use "$id", it is not clear whether "id" is a local variable or the
function's argument. So your proposal decreases safety :-/. Plus this
syntax reduces collisions only on one side; you still have to use aliases
for SQL identifiers, so it is not balanced - in MS SQL I can write the
predicate id = @id, but that is not possible in your proposal (and never
will be, for compatibility reasons).

Moreover, we already have the possibility to ALIAS any variable:
https://www.postgresql.org/docs/current/plpgsql-declarations.html#PLPGSQL-DECLARATION-ALIAS

I understand that there can be problems with functions with very long
names.

We already can do

CREATE OR REPLACE FUNCTION public.fx(par1 integer, par2 integer)
 RETURNS void
 LANGUAGE plpgsql
AS $function$
<<b>>
declare
  p1 alias for par1;
  p2 alias for par2;
begin
  raise notice '% % % %', par1, par2, b.p1, b.p2;
end;
$function$

or safer

CREATE OR REPLACE FUNCTION public.fx(par1 integer, par2 integer)
 RETURNS void
 LANGUAGE plpgsql
AS $function$
<<b>>
declare
  p1 alias for fx.par1;
  p2 alias for fx.par2;
begin
  raise notice '% % % %', par1, par2, b.p1, b.p2;
end;
$function$

So I think introducing new syntax is not necessary. The open question is
whether aliasing can be made more comfortable. The Ada language has the
ability to rename a function or procedure, but that is much stronger than
what can be implemented in plpgsql. Probably the easiest implementation
would be the possibility to specify a new label for the arguments using
the already supported #option syntax.

CREATE OR REPLACE FUNCTION very_long_name(par1 int)
RETURNS int AS $$
#routine_label lnm
BEGIN
  RAISE NOTICE '%', lnm.par1;

Regards

Pavel


> Jack
>


Re: Hash support for row types

2020-11-17 Thread Tom Lane
Peter Eisentraut  writes:
> I wrote a new patch to add a lot more tests around hash-based plans. 
> This is intended to apply separately from the other patch, and the other 
> patch would then "flip" some of the test cases.

OK, that addresses my concerns.

regards, tom lane




pl/pgsql feature request: shorthand for argument and local variable references

2020-11-17 Thread Jack Christensen
When arguments and other local variables in pl/pgsql functions have the
same name as columns referenced in queries it is necessary to disambiguate
the names. This can be done by prefixing the function name (e.g.
my_func.name), using the argument number in the case of an argument (e.g.
$1), or renaming the variable (e.g. _name). It is also possible to use a
GUC to always use the variable or the column but that seems dangerous to me.

Prefixing names with an underscore works well enough for local variables,
but when using named arguments I prefer the external name not require an
underscore. I would like to suggest a standard prefix such as $ to
reference a local variable or argument. $ followed by an integer already
references an argument by ordinal. What if $ followed by a name meant a
local reference (essentially it would expand to "my_func.")?

For example, currently I have to do something like this:

create function update_item(id int, foo int, bar text) returns void
language plpgsql as $$
begin
  update items
  set foo = update_item.foo,
bar = update_item.bar
  where items.id = update_item.id;
end;
$$;

I would like to be able to do something like:

create function update_item(id int, foo int, bar text) returns void
language plpgsql as $$
begin
  update items
  set foo = $foo,
bar = $bar
  where items.id = $id;
end;
$$;

Any opinions on the desirability of this feature? My C skills are rather
atrophied, but from the outside it seems like a small enough change I might
be able to tackle it...

Jack


Re: Deleting older versions in unique indexes to avoid page splits

2020-11-17 Thread Peter Geoghegan
On Tue, Nov 17, 2020 at 9:19 AM Victor Yegorov  wrote:
> OK. Can you explain what deprecation means here?
> If this functionality is left as is, it is not really deprecation?..

It just means that we only keep it around for compatibility purposes.
We would like to remove it, but can't right now. If we ever stop
supporting version 3 indexes, then we can probably remove it entirely.
I would like to avoid special cases across B-Tree index versions.
Simply maintaining the page flag in the same way as we always have is
the simplest approach.

Pushed the BTP_HAS_GARBAGE patch just now. I'll post a rebased version
of the patch series later on today.

Thanks
-- 
Peter Geoghegan




Re: Deleting older versions in unique indexes to avoid page splits

2020-11-17 Thread Victor Yegorov
On Tue, Nov 17, 2020 at 17:24, Peter Geoghegan wrote:

> I prefer to continue to maintain the flag in the same way, regardless
> of which B-Tree version is in use (i.e. if it's heapkeyspace or not).
> Maintaining the flag is not expensive, may have some small value for
> forensic or debugging purposes, and saves callers the trouble of
> telling  _bt_delitems_delete() (and code like it) whether or not this
> is a heapkeyspace index.
>

OK. Can you explain what deprecation means here?
If this functionality is left as is, it is not really deprecation?..

-- 
Victor Yegorov


Re: [PATCH] Covering SPGiST index

2020-11-17 Thread Pavel Borisov
On Tue, Nov 17, 2020 at 11:36, Pavel Borisov wrote:

>> I've started to review this, and I've got to say that I really disagree
>> with this choice:
>>
>> + * If there are INCLUDE columns, they are stored after a key value, each
>> + * starting from its own typalign boundary. Unlike IndexTuple, first
>> INCLUDE
>> + * value does not need to start from MAXALIGN boundary, so SPGiST uses
>> private
>> + * routines to access them.
>>
Tom, I took a stab at making the code for tuple creation/decomposition
more optimal. Now I see several options for this:
1. Included values can be added after the key value as a whole index tuple.
Pro: it reuses existing code perfectly. Con: it introduces an extra (empty)
index tuple header.
2. Existing state: the pro is that, in my opinion, it has the least possible
gaps. The con is the need to duplicate much of the existing code with some
modification. Frankly, I don't like this duplication very much even if it is
only private spgist code.
2A. The existing state can be shifted into fewer changes in index_form_tuple
and index_deform_tuple if I move the null mask after the tuple header and
before the key value (the SpGistTupleHeader+nullmask chunk will be maxaligned).
This is what I proposed in the previous answer. I tried to work on this
variant, but it would need to duplicate the index_form_tuple and
index_deform_tuple code into a private version. The reason is that the spgist
tuple has its own header of a different size and, in my understanding, it
cannot be incorporated using index_form_tuple.
3. I can split index_form_tuple into two parts: a) header adding and size
calculation, b) filling attributes. The external part (a) could be constructed
differently for SpGiST, with the internal part (b) being universal.
3A. I can make index_form_tuple accept a pointer as an argument to create the
tuple in an already palloc'd memory area (with a shift to its start). Then the
external caller would be able to incorporate headers after calling the
index_form_tuple routine.

Maybe there is some other way I haven't imagined yet. Which way do you think
is better for me to follow? What is your advice?


> I also find it unacceptable that you stuck a tuple descriptor pointer into
>> the rd_amcache structure.  The relcache only supports that being a flat
>> blob of memory.  I see that you tried to hack around that by having
>> spgGetCache reconstruct a new tupdesc every time through, but (a) that's
>> actually worse than having no cache at all, and (b) spgGetCache doesn't
>> really know much about the longevity of the memory context it's called in.
>> This could easily lead to dangling tuple pointers, serious memory bloat
>> from repeated tupdesc construction, or quite possibly both.  Safer would
>> be to build the tupdesc during initSpGistState(), or maybe just make it
>> on-demand.  In view of the previous point, I'm also wondering if there's
>> any way to get the relcache's regular rd_att tupdesc to be useful here,
>> so we don't have to build one during scans at all.
>>
I see that the FormData_pg_attribute entries inside TupleDescData are located
in a single memory chunk. If I add it at the end of the allocated SpGistCache
as a copy of this chunk (using memcpy), rather than a pointer to it as it is
now, will it be safe to use?
Or maybe it would still be better to initialize the tuple descriptor every
time initSpGistState is called, without trying to cache it?

What would you advise?

-- 
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com 


Re: remove spurious CREATE INDEX CONCURRENTLY wait

2020-11-17 Thread Alvaro Herrera
So I made the change to set the statusFlags with only LW_SHARED -- both
in existing uses (see 0002) and in the new CIC code (0003).  I don't
think 0002 is going to have a tremendous performance impact, because it
changes operations that are very infrequent.  But even so, it would be
weird to leave code around that uses exclusive lock when we're going to
add new code that uses shared lock for the same thing.

I still left the REINDEX CONCURRENTLY support split out in 0004; I
intend to get the first three patches pushed now, and look into 0004
again later.
>From 5e2af4e082df8f839bf257d933d70bb9645f6cfe Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 17 Nov 2020 11:37:36 -0300
Subject: [PATCH v6 1/4] indexcmds.c: reorder function prototypes

... out of an overabundance of neatnikism, perhaps.
---
 src/backend/commands/indexcmds.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 35696f9f75..02c7a0c7e1 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -68,6 +68,7 @@
 
 
 /* non-export function prototypes */
+static bool CompareOpclassOptions(Datum *opts1, Datum *opts2, int natts);
 static void CheckPredicate(Expr *predicate);
 static void ComputeIndexAttrs(IndexInfo *indexInfo,
 			  Oid *typeOidP,
@@ -87,13 +88,11 @@ static char *ChooseIndexNameAddition(List *colnames);
 static List *ChooseIndexColumnNames(List *indexElems);
 static void RangeVarCallbackForReindexIndex(const RangeVar *relation,
 			Oid relId, Oid oldRelId, void *arg);
-static bool ReindexRelationConcurrently(Oid relationOid, int options);
-
+static void reindex_error_callback(void *args);
 static void ReindexPartitions(Oid relid, int options, bool isTopLevel);
 static void ReindexMultipleInternal(List *relids, int options);
-static void reindex_error_callback(void *args);
+static bool ReindexRelationConcurrently(Oid relationOid, int options);
 static void update_relispartition(Oid relationId, bool newval);
-static bool CompareOpclassOptions(Datum *opts1, Datum *opts2, int natts);
 
 /*
  * callback argument type for RangeVarCallbackForReindexIndex()
-- 
2.20.1

>From 3e2778d553448c67e956bd385c4a94039933d3c6 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 17 Nov 2020 13:48:16 -0300
Subject: [PATCH v6 2/4] relax lock level for setting statusFlags

---
 src/backend/commands/vacuum.c | 2 +-
 src/backend/replication/logical/logical.c | 2 +-
 src/backend/replication/slot.c| 2 +-
 src/backend/storage/ipc/procarray.c   | 2 ++
 4 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index d89ba3031f..395e75f768 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1741,7 +1741,7 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 		 * MyProc->xid/xmin, otherwise GetOldestNonRemovableTransactionId()
 		 * might appear to go backwards, which is probably Not Good.
 		 */
-		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+		LWLockAcquire(ProcArrayLock, LW_SHARED);
 		MyProc->statusFlags |= PROC_IN_VACUUM;
 		if (params->is_wraparound)
 			MyProc->statusFlags |= PROC_VACUUM_FOR_WRAPAROUND;
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index f1f4df7d70..4324e32656 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -181,7 +181,7 @@ StartupDecodingContext(List *output_plugin_options,
 	 */
 	if (!IsTransactionOrTransactionBlock())
 	{
-		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+		LWLockAcquire(ProcArrayLock, LW_SHARED);
 		MyProc->statusFlags |= PROC_IN_LOGICAL_DECODING;
 		ProcGlobal->statusFlags[MyProc->pgxactoff] = MyProc->statusFlags;
 		LWLockRelease(ProcArrayLock);
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 5ed955ba0b..9c7cf13d4d 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -527,7 +527,7 @@ ReplicationSlotRelease(void)
 	MyReplicationSlot = NULL;
 
 	/* might not have been set when we've been a plain slot */
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	LWLockAcquire(ProcArrayLock, LW_SHARED);
 	MyProc->statusFlags &= ~PROC_IN_LOGICAL_DECODING;
 	ProcGlobal->statusFlags[MyProc->pgxactoff] = MyProc->statusFlags;
 	LWLockRelease(ProcArrayLock);
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index e2cb6e990d..ebe91907d6 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -662,6 +662,8 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 		/* avoid unnecessarily dirtying shared cachelines */
 		if (proc->statusFlags & PROC_VACUUM_STATE_MASK)
 		{
+			/* Only safe to change my own flags with only share lock */
+			Assert(proc == MyProc);
 			

Re: Detecting File Damage & Inconsistencies

2020-11-17 Thread Simon Riggs
On Fri, 13 Nov 2020 at 11:24, Simon Riggs  wrote:
>
> On Fri, 13 Nov 2020 at 00:50, tsunakawa.ta...@fujitsu.com
>  wrote:
> >
> > From: Simon Riggs 
> > > If a rogue user/process is suspected, this would allow you to identify
> > > more easily the changes made by specific sessions/users.
> >
> > Isn't that kind of auditing a job of pgAudit or log_statement = mod?  Or, 
> > does "more easily" mean that you find pgAudit complex to use and/or 
> > log_statement's overhead is big?
>
> Well, I designed pgaudit, so yes, I think pgaudit is useful.
>
> However, pgaudit works at the statement level, not the data level. So
> using pgaudit to locate data rows that have changed is fairly hard.
>
> What I'm proposing is an option to add 16 bytes onto each COMMIT
> record, which is considerably less than turning on full auditing in
> pgaudit. This option would allow identifying data at the row level, so
> you could for example find all rows changed by specific sessions.
> Also, because it is stored in WAL it will show updates that might no
> longer exist in the database because the changed row versions might
> have been vacuumed away. So pgaudit will tell you that happened, but
> having extra info in WAL is important also.
>
> So thank you for the question because it has allowed me to explain why
> it is useful and important.

Patch attached to implement "wal_sessioninfo" option.

-- 
Simon Riggs    http://www.EnterpriseDB.com/


wal_sessioninfo.v2.patch
Description: Binary data


Re: planner support functions: handle GROUP BY estimates ?

2020-11-17 Thread Tomas Vondra



On 11/17/20 5:18 PM, Justin Pryzby wrote:
> On Mon, Nov 16, 2020 at 06:24:41PM +0100, Tomas Vondra wrote:
>> On 1/15/20 12:44 AM, Tom Lane wrote:
>>> Tomas Vondra  writes:
 On Tue, Jan 14, 2020 at 05:37:53PM -0500, Tom Lane wrote:
> I wonder just how messy it would be to add a column to pg_statistic_ext
> whose type is the composite type "pg_statistic", and drop the required
> data into that.  We've not yet used any composite types in the system
> catalogs, AFAIR, but since pg_statistic_ext isn't a bootstrap catalog
> it seems like we might be able to get away with it.
>>>
>>> [ I meant pg_statistic_ext_data, obviously ]
>>>
 I don't know, but feels a bit awkward to store this type of stats into
 pg_statistic_ext, which was meant for multi-column stats. Maybe it'd
 work fine, not sure.
>>
>> I've started looking at statistics on expressions too, mostly because it
>> seems the extended stats improvements (as discussed in [1]) need that.
>>
>> The "stash pg_statistic records into pg_statistics_ext_data" approach
>> seems simple, but it's not clear to me how to make it work, so I'd
>> appreciate some guidance.
>>
>>
>> 1) Considering we don't have any composite types in any catalog yet, and
>> naive attempts to just use something like
>>
>> pg_statistic stxdexprs[1];
>>
>> did not work. So I suppose this will require changes to genbki.pl, but
>> honestly, my Perl-fu is non-existent :-(
> 
> In the attached, I didn't need to mess with perl.
> 
>> 2) Won't it be an issue that pg_statistic contains pseudo-types? That
>> is, this does not work, for example:
>>
>> test=# create table t (a pg_statistic[]);
>> ERROR:  column "stavalues1" has pseudo-type anyarray
> 
> It works during initdb for the reasons that it's allowed for pg_statistic.
> 

Oh, wow! I hadn't expected a patch implementing this, that's great. I
owe you a beer or a drink of your choice.

Thanks!

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Add session statistics to pg_stat_database

2020-11-17 Thread Magnus Hagander
On Tue, Nov 17, 2020 at 4:22 PM Laurenz Albe 
wrote:

> On Fri, 2020-10-16 at 16:24 +0500, Ahsan Hadi wrote:
> > I have applied the latest patch on master, all the regression test cases
> are passing
> >  and the implemented functionality is also looking fine. The point that
> I raised about
> >  idle connection not included is also addressed.
>
> If you think that the patch is ready to go, you could mark it as
> "ready for committer" in the commitfest app.
>

I've taken a look as well, and here are a few short notes:

* It talks about "number of connections" but "number of aborted sessions".
We should probably be consistent about talking either about connections or
sessions? In particular, connections seems wrong in this case, because it
only starts counting after authentication is complete (since otherwise we
send no stats)? (This goes for both docs and actual function names)

* Is there a reason we're counting active and idle in transaction
(including aborted), but not fastpath? In particular, we seem to ignore
fastpath -- if we don't want to single it out specifically, it should
probably be included in active?

* pgstat_send_connstat() but pgstat_recv_connection(). Let's call both
connstat or both connection (I'd vote connstat)?

* Is this actually a fix that's independent of the new stats? It seems in
general to be changing the behaviour of "force", which is more generic?
-   !have_function_stats)
+   !have_function_stats && !force)

* In pgstat_send_connstat() you pass the parameter "force" in as
"disconnect". That behaviour at least requires a comment saying why, I
think. My understanding is that it relies on "force" meaning "the backend
is shutting down", but that is not actually documented anywhere. Maybe the
"force" parameter should actually be renamed to indicate that this is what
it really means, to avoid a future mistake in the area? But even then, how
does that turn into a disconnect?

* Maybe rename pgStatSessionDisconnected
to pgStatSessionNormalDisconnected? To avoid having to go back to the
setting point and look it up in a comment.

I wonder if there would also be a way to count "sessions that crashed" as
well. That is, the ones that failed in a way that caused the postmaster to
restart the system. That's information we'd have to send from the
postmaster, though, and I'm actually unsure if we're "allowed" to send
things to the stats collector from the postmaster. But I think it could be
quite useful information to have. Maybe we can find some way to piggyback
on the fact that we're restarting the stats collector as a result?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Commitfest 2020-11

2020-11-17 Thread Anastasia Lubennikova
Now that we are halfway through the commitfest, the status breakdown 
looks like this:


Needs review: 116
Waiting on Author: 45
Ready for Committer: 22
Committed: 51
Returned with Feedback: 1
Withdrawn: 8
Rejected: 1
Total: 244

which means we have reached closure on a quarter of the patches. And 
many discussions have moved forward significantly. Keep it up, and don't 
wait for the commitfest deadline. :)


Authors of the most inactive discussions, and of patches which no longer 
apply, have been notified on their respective threads. If there is no 
response by the last days of the CF, those entries will be considered 
stalled and returned with feedback.


--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: Deleting older versions in unique indexes to avoid page splits

2020-11-17 Thread Peter Geoghegan
On Tue, Nov 17, 2020 at 7:05 AM Victor Yegorov  wrote:
> I've looked over the BTP_HAS_GARBAGE modifications, they look sane.
> I've double checked that heapkeyspace indexes don't use this flag (don't rely 
> on it),
> while pre-v4 ones still use it.

Cool.

> I have a question. This flag is raised in the _bt_check_unique() and in 
> _bt_killitems().
> If we're deprecating this flag, perhaps it'd be good to either avoid raising 
> it at least for
> _bt_check_unique(), as it seems to me that conditions are dealing with 
> postings, therefore
> we are speaking of heapkeyspace indexes here.

Well, we still want to mark LP_DEAD bits set in all cases, just as
before. The difference is that heapkeyspace indexes won't rely on the
page-level flag later on.

> If we'll conditionally raise this flag in the functions above, we can get rid 
> of blocks that drop it
> in _bt_delitems_delete(), I think.

I prefer to continue to maintain the flag in the same way, regardless
of which B-Tree version is in use (i.e. if it's heapkeyspace or not).
Maintaining the flag is not expensive, may have some small value for
forensic or debugging purposes, and saves callers the trouble of
telling  _bt_delitems_delete() (and code like it) whether or not this
is a heapkeyspace index.

-- 
Peter Geoghegan




Re: enable_incremental_sort changes query behavior

2020-11-17 Thread Tomas Vondra



On 11/17/20 3:03 PM, James Coleman wrote:
> On Mon, Nov 16, 2020 at 11:23 PM Tomas Vondra
>  wrote:
>>
>> Hmm, I missed that other thread. That indeed seems like a bug in the
>> same area already tweaked by ebb7ae839d033d0f2 for similar cases.
> 
> I meant to bring it up in this thread before we got the other patch
> committed, but I just ended up not having time to look into it.
> 
>> The attached patch fixes this simply by adding is_parallel_safe to
>> get_useful_pathkeys_for_relation - that does fix the reproducer, but I'm
>> not entirely sure that's the right place. Maybe it should be done in
>> find_em_expr_usable_for_sorting_rel (which might make a difference if an
>> EC clause can contain a mix of parallel safe and unsafe expressions). Or
>> maybe we should do it in the caller (which would allow using
>> get_useful_pathkeys_for_relation in contexts not requiring parallel
>> safety in the future).
>>
>> Anyway, while this is not an "incremental sort" bug, it seems like the
>> root cause is the same as for ebb7ae839d033d0f2 - one of the incremental
>> sort patches started considering sorting below gather nodes, not
>> realizing these possible consequences.
> 
> Yeah. I'd like to investigate a bit if really we should also be adding
> it to prepare_sort_from_pathkeys (which
> find_em_expr_usable_for_sorting_rel parallels) or similar, since this
> seems to be a broader concern that's been previously missed (even if
> the bug was otherwise not yet exposed). In particular, as I understand
> it, we did do sorts below gather nodes before, just in fewer places,
> which meant that "in theory" the code was already broken (or at least
> not complete) for parallel queries.
> 

True. It's possible the bug was there before, but the affected plans
were just very unlikely for some reason, and the new plans introduced by
incremental sort just made it easier to trigger.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: planner support functions: handle GROUP BY estimates ?

2020-11-17 Thread Justin Pryzby
On Mon, Nov 16, 2020 at 06:24:41PM +0100, Tomas Vondra wrote:
> On 1/15/20 12:44 AM, Tom Lane wrote:
> > Tomas Vondra  writes:
> >> On Tue, Jan 14, 2020 at 05:37:53PM -0500, Tom Lane wrote:
> >>> I wonder just how messy it would be to add a column to pg_statistic_ext
> >>> whose type is the composite type "pg_statistic", and drop the required
> >>> data into that.  We've not yet used any composite types in the system
> >>> catalogs, AFAIR, but since pg_statistic_ext isn't a bootstrap catalog
> >>> it seems like we might be able to get away with it.
> > 
> > [ I meant pg_statistic_ext_data, obviously ]
> > 
> >> I don't know, but feels a bit awkward to store this type of stats into
> >> pg_statistic_ext, which was meant for multi-column stats. Maybe it'd
> >> work fine, not sure.
> 
> I've started looking at statistics on expressions too, mostly because it
> seems the extended stats improvements (as discussed in [1]) need that.
> 
> The "stash pg_statistic records into pg_statistics_ext_data" approach
> seems simple, but it's not clear to me how to make it work, so I'd
> appreciate some guidance.
> 
> 
> 1) Considering we don't have any composite types in any catalog yet, and
> naive attempts to just use something like
> 
> pg_statistic stxdexprs[1];
> 
> did not work. So I suppose this will require changes to genbki.pl, but
> honestly, my Perl-fu is non-existent :-(

In the attached, I didn't need to mess with perl.

> 2) Won't it be an issue that pg_statistic contains pseudo-types? That
> is, this does not work, for example:
> 
> test=# create table t (a pg_statistic[]);
> ERROR:  column "stavalues1" has pseudo-type anyarray

It works during initdb for the reasons that it's allowed for pg_statistic.

-- 
Justin
>From b9dabd3b773b077e55bb5ea23b89eb3d650029ee Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Tue, 17 Nov 2020 09:28:33 -0600
Subject: [PATCH 1/2] Allow composite types in bootstrap

---
 src/backend/bootstrap/bootstrap.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index a7ed93fdc1..ed6b3906ee 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -937,6 +937,21 @@ gettype(char *type)
 return (*app)->am_oid;
 			}
 		}
+
+		/* The type wasn't known; check again to handle composite
+		 * types, added since first populating the array. */
+		Typ = NULL;
+		populate_typ_array();
+
+		/* Need to avoid infinite recursion... */
+		for (app = Typ; *app != NULL; app++)
+		{
+			if (strncmp(NameStr((*app)->am_typ.typname), type, NAMEDATALEN) == 0)
+			{
+Ap = *app;
+return (*app)->am_oid;
+			}
+		}
 	}
 	else
 	{
-- 
2.17.0

>From d7246706640d9a7e40805f1861d269cccb3f05ef Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Tue, 17 Nov 2020 09:47:32 -0600
Subject: [PATCH 2/2] Add pg_statistic_ext_data.stxdexpr

---
 src/backend/catalog/Makefile| 8 
 src/backend/commands/statscmds.c| 2 ++
 src/include/catalog/pg_statistic_ext_data.h | 1 +
 3 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/src/backend/catalog/Makefile b/src/backend/catalog/Makefile
index 2519771210..91db93f64d 100644
--- a/src/backend/catalog/Makefile
+++ b/src/backend/catalog/Makefile
@@ -49,15 +49,15 @@ include $(top_srcdir)/src/backend/common.mk
 
 # Note: the order of this list determines the order in which the catalog
 # header files are assembled into postgres.bki.  BKI_BOOTSTRAP catalogs
-# must appear first, and there are reputedly other, undocumented ordering
-# dependencies.
+# must appear first, and pg_statistic before pg_statistic_ext_data, and there
+# are reputedly other, undocumented ordering dependencies.
 CATALOG_HEADERS := \
 	pg_proc.h pg_type.h pg_attribute.h pg_class.h \
 	pg_attrdef.h pg_constraint.h pg_inherits.h pg_index.h pg_operator.h \
 	pg_opfamily.h pg_opclass.h pg_am.h pg_amop.h pg_amproc.h \
 	pg_language.h pg_largeobject_metadata.h pg_largeobject.h pg_aggregate.h \
-	pg_statistic_ext.h pg_statistic_ext_data.h \
-	pg_statistic.h pg_rewrite.h pg_trigger.h pg_event_trigger.h pg_description.h \
+	pg_statistic.h pg_statistic_ext.h pg_statistic_ext_data.h \
+	pg_rewrite.h pg_trigger.h pg_event_trigger.h pg_description.h \
 	pg_cast.h pg_enum.h pg_namespace.h pg_conversion.h pg_depend.h \
 	pg_database.h pg_db_role_setting.h pg_tablespace.h \
 	pg_authid.h pg_auth_members.h pg_shdepend.h pg_shdescription.h \
diff --git a/src/backend/commands/statscmds.c b/src/backend/commands/statscmds.c
index 3057d89d50..476a6314a6 100644
--- a/src/backend/commands/statscmds.c
+++ b/src/backend/commands/statscmds.c
@@ -366,6 +366,7 @@ CreateStatistics(CreateStatsStmt *stmt)
 	datanulls[Anum_pg_statistic_ext_data_stxdndistinct - 1] = true;
 	datanulls[Anum_pg_statistic_ext_data_stxddependencies - 1] = true;
 	datanulls[Anum_pg_statistic_ext_data_stxdmcv - 1] = true;
+	datanulls[Anum_pg_statistic_ext_data_stxdexpr - 1] = true;
 
 	

Re: Is postgres ready for 2038?

2020-11-17 Thread Tom Lane
方徳輝  writes:
> Is there any road map for 2038 problems in Postgres?

Postgres has no problem with post-2038 dates as long as you are using a
system with 64-bit time_t.  In other words, the road map is "get off
Windows, or press Microsoft to fix their problems".
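
A quick way to check what a given platform provides (a trivial standalone
sketch, nothing Postgres-specific) is to look at the width of time_t:

#include <stdio.h>
#include <time.h>

int main(void)
{
	/* 8 bytes means a 64-bit time_t; 4 bytes means the clock wraps in 2038. */
	printf("sizeof(time_t) = %zu\n", sizeof(time_t));
	return 0;
}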

regards, tom lane




Re: remove spurious CREATE INDEX CONCURRENTLY wait

2020-11-17 Thread Alvaro Herrera
On 2020-Nov-16, Alvaro Herrera wrote:

> On 2020-Nov-09, Tom Lane wrote:
> 
> > Michael Paquier  writes:
> > >> +LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
> > >> +MyProc->vacuumFlags |= PROC_IN_SAFE_IC;
> > >> +ProcGlobal->vacuumFlags[MyProc->pgxactoff] = 
> > >> MyProc->vacuumFlags;
> > >> +LWLockRelease(ProcArrayLock);
> > 
> > > I can't help noticing that you are repeating the same code pattern
> > > eight times.  I think that this should be in its own routine, and that
> > > we had better document that this should be called just after starting
> > > a transaction, with an assertion enforcing that.
> > 
> > Do we really need exclusive lock on the ProcArray to make this flag
> > change?  That seems pretty bad from a concurrency standpoint.
> 
> BTW I now know that the reason for taking ProcArrayLock is not the
> vacuumFlags itself, but rather MyProc->pgxactoff, which can move.

... ah, but I realize now that this means that we can use shared lock
here, not exclusive, which is already an enormous improvement.  That's
because ->pgxactoff can only be changed with exclusive lock held; so as
long as we hold shared, the array item cannot move.




Re: Protect syscache from bloating with negative cache entries

2020-11-17 Thread Heikki Linnakangas

On 09/11/2020 11:34, Kyotaro Horiguchi wrote:

At Fri, 6 Nov 2020 10:42:15 +0200, Heikki Linnakangas  wrote in

Do you need the "ntaccess == 2" test? You could always increment the
counter, and in the code that uses ntaccess to decide what to evict,
treat all values >= 2 the same.

Need to handle integer overflow somehow. Or maybe not: integer
overflow is so infrequent that even if a hot syscache entry gets
evicted prematurely because its ntaccess count wrapped around to 0, it
will happen so rarely that it won't make any difference in practice.


That relaxing simplifies the code significantly, but a significant
degradation by about 5% still exists.

(SearchCatCacheInternal())
  + ct->naccess++;
!+ ct->lastaccess = catcacheclock;

If I removed the second line above, the degradation disappears
(-0.7%).


0.7% degradation is probably acceptable.


However, I don't find the corresponding numbers in the output
of perf. The sum of the numbers for the removed instructions is (0.02
+ 0.28 = 0.3%).  I don't think the degradation as the whole doesn't
always reflect to the instruction level profiling, but I'm stuck here,
anyway.


Hmm. Some kind of cache miss effect, perhaps? offsetof(CatCTup, tuple) 
is exactly 64 bytes currently, so any fields that you add after 'tuple' 
will go on a different cache line. Maybe it would help if you just move 
the new fields before 'tuple'.


Making CatCTup smaller might help. Some ideas/observations:

- The 'ct_magic' field is only used for assertion checks. Could remove it.

- 4 Datums (32 bytes) are allocated for the keys, even though most 
catcaches have fewer key columns.


- In the current syscaches, keys[2] and keys[3] are only used to store 
32-bit oids or some other smaller fields. Allocating a full 64-bit Datum 
for them wastes memory.


- You could move the dead flag to the end of the struct, or remove it 
altogether with the change I mentioned earlier to not keep dead items 
in the buckets.


- You could steal a few bits for the dead/negative flags from some other 
field. Use special values of tuple.t_len for them, or something.


With some of these tricks, you could shrink CatCTup so that the new 
lastaccess and naccess fields would fit in the same cacheline.
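
As a generic illustration of that kind of layout reasoning (the struct below 
is entirely made up -- it is not the real CatCTup definition), one can check 
at compile time that the hot fields stay within the first 64-byte cache line:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef struct DemoCacheEntry
{
	uint32_t	hash_value;		/* hot: checked on every probe */
	uint16_t	naccess;		/* hot: bumped on every cache hit */
	uint16_t	flags;			/* hot: dead/negative packed into bits */
	int64_t		lastaccess;		/* hot: eviction timestamp */
	/* everything below is colder / variable-size payload */
	char		payload[128];
} DemoCacheEntry;

/* Fail the build if the hot fields spill past the first 64-byte cache line. */
_Static_assert(offsetof(DemoCacheEntry, payload) <= 64,
			   "hot fields must fit in one cache line");

int main(void)
{
	printf("hot fields end at offset %zu\n", offsetof(DemoCacheEntry, payload));
	return 0;
}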


That said, I think this is good enough performance-wise as it is. So if 
we want to improve performance in general, that can be a separate patch.


- Heikki




Re: Add Information during standby recovery conflicts

2020-11-17 Thread Fujii Masao




On 2020/11/17 17:23, Drouvot, Bertrand wrote:


On 11/17/20 2:09 AM, Masahiko Sawada wrote:




On Mon, Nov 16, 2020 at 4:55 PM Drouvot, Bertrand  wrote:

Hi,

On 11/16/20 6:44 AM, Masahiko Sawada wrote:

Thank you for updating the patch.

Here are review comments.

+   if (report_waiting && (!logged_recovery_conflict ||
new_status == NULL))
+   ts = GetCurrentTimestamp();

The condition will always be true if log_recovery_conflict_wait is
false and report_waiting is true, leading to unnecessary calling of
GetCurrentTimestamp().

---
+   
+    You can control whether a log message is produced when the startup process
+    is waiting longer than deadlock_timeout for recovery
+    conflicts. This is controled by the 
+    parameter.
+   

s/controled/controlled/

---
  if (report_waiting)
  waitStart = GetCurrentTimestamp();

Similarly, we have the above code but we don't need to call
GetCurrentTimestamp() if update_process_title is false, even if
report_waiting is true.

I've attached the patch that fixes the above comments. It can be
applied on top of your v8 patch.

Thanks for the review and the associated fixes!

I've attached a new version that contains your fixes.


Thank you for updating the patch.

I have other comments:

+   
+    You can control whether a log message is produced when the startup process
+    is waiting longer than deadlock_timeout for recovery
+    conflicts. This is controlled by the
+     parameter.
+   

It would be better to use 'WAL replay' instead of 'the startup
process' for consistency with circumjacent descriptions. What do you
think?


Agree that the wording is more appropriate.



---
@@ -1260,6 +1262,8 @@ ProcSleep(LOCALLOCK *locallock, LockMethod
lockMethodTable)
   else
   enable_timeout_after(DEADLOCK_TIMEOUT, DeadlockTimeout);
   }
+ else
+   standbyWaitStart = GetCurrentTimestamp();

I think we can add a check of log_recovery_conflict_waits to avoid
unnecessary calling of GetCurrentTimestamp().

I've attached the updated version patch including the above comments
as well as adding some assertions. Please review it.


That looks all good to me.

Thanks a lot for your precious help!


Thanks for updating the patch! Here are review comments.

+Controls whether a log message is produced when the startup process
+is waiting longer than deadlock_timeout
+for recovery conflicts.

But a log message can be produced also when the backend is waiting
for recovery conflict. Right? If yes, this description needs to be corrected.


+for recovery conflicts.  This is useful in determining if recovery
+conflicts prevents the recovery from applying WAL.

"prevents" should be "prevent"?


+   TimestampDifference(waitStart, GetCurrentTimestamp(), &secs, &usecs);
+   msecs = secs * 1000 + usecs / 1000;

GetCurrentTimestamp() is basically called before LogRecoveryConflict()
is called. So isn't it better to avoid calling GetCurrentTimestamp() newly in
LogRecoveryConflict() and to reuse the timestamp that we got?
It's helpful to avoid the waste of cycles.


+   while (VirtualTransactionIdIsValid(*vxids))
+   {
+   PGPROC *proc = BackendIdGetProc(vxids->backendId);

BackendIdGetProc() can return NULL if the backend is not active
at that moment. This case should be handled.


+   case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+   reasonDesc = gettext_noop("recovery is still waiting 
recovery conflict on buffer pin");

It's natural to use "waiting for recovery" rather than "waiting recovery"?


+   /* Also, set deadlock timeout for logging purpose if necessary 
*/
+   if (log_recovery_conflict_waits)
+   {
+   timeouts[cnt].id = STANDBY_TIMEOUT;
+   timeouts[cnt].type = TMPARAM_AFTER;
+   timeouts[cnt].delay_ms = DeadlockTimeout;
+   cnt++;
+   }

This needs to be executed only when the message has not been logged yet.
Right?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Is postgres ready for 2038?

2020-11-17 Thread 方徳輝
Hello dear pgsql hackers


I think Postgres has had the 2038 problem fixed since 8.4, as I read on
Stack Exchange:


https://dba.stackexchange.com/a/233084/102852


So I tested on my PC by simply changing the system date to `2040-09-25`:



   - Windows 10 Home edition
   - CPU: AMD 64 bit
   - Postgres version 13.1   64-bit
   - Postgres version 10.14 64-bit


Both Postgres versions seem to basically work:

   - Postgres can start the process, and psql/pgAdmin can connect
   successfully (but it does not show as “started” in services.msc).
   - A table with a timestamp column works with basic CRUD SQL.


But there are many strange logs, like:


cat postgresql-2040-09-25_123738.log

> 1904-08-20 06:09:23.083 JST [7520] LOG:  stats_timestamp 2000-01-01
09:00:00+09 is later than collector's time 1904-08-20 06:09:23.083102+09
for database 0

> 1904-08-20 06:10:23.226 JST [13904] LOG:  stats collector's time
2000-01-01 09:00:00+09 is later than backend local time 1904-08-20
06:10:23.22678+09

> 1904-08-20 06:10:23.227 JST [7520] LOG:  stats_timestamp 2000-01-01
09:00:00+09 is later than collector's time 1904-08-20 06:10:23.227854+09
for database 12938


Note that the log file name is OK, but the log content mixes many timestamps
(e.g. 1904-08-20, 2000-01-01).



And I also found that someone said in

https://www.postgresql-archive.org/Year-2038-Bug-td1990821.html


> PostgreSQL 8.4 uses 64bit data type for time. But if you use system
timezone

> then you can get in trouble if system does not support 64bit zic files.


But IMO the output content in the logs is not an OS problem; it should be
Postgres’s 2038 bug.


As our products have a 15-year support period, we need to think about 2038
seriously from now on.

Is there any road map for 2038 problems in Postgres?


Best regards,

Fang


Re: Additional improvements to extended statistics

2020-11-17 Thread Dean Rasheed
On Thu, 12 Nov 2020 at 14:18, Tomas Vondra  wrote:
>
> Here is an improved WIP version of the patch series, modified to address
> the issue with repeatedly applying the extended statistics, as discussed
> with Dean in this thread. It's a bit rough and not committable, but I
> need some feedback so I'm posting it in this state.

As it stands, it doesn't compile if 0003 is applied, because it missed
one of the callers of clauselist_selectivity_simple(), but that's
easily fixed.

> 0001 is the original patch improving estimates of OR clauses
>
> 0002 adds thin wrappers for clause[list]_selectivity, with "internal"
> functions allowing to specify whether to keep considering extended stats
>
> 0003 does the same for the "simple" functions
>
>
> I've kept it like this to demonstrate that 0002 is not sufficient. In my
> response from March 24 I wrote this:
>
> > Isn't it the case that clauselist_selectivity_simple (and the OR
> > variant) should ignore extended stats entirely? That is, we'd need
> > to add a flag (or _simple variant) to clause_selectivity, so that it
> > calls causelist_selectivity_simple_or.
> But that's actually wrong, as 0002 shows (as it breaks a couple of
> regression tests), because of the way we handle OR clauses. At the top
> level, an OR-clause is actually just a single clause and it may get
> passed to clauselist_selectivity_simple. So entirely disabling extended
> stats for the "simple" functions would also mean disabling extended
> stats for a large number of OR clauses. Which is clearly wrong.
>
> So 0003 addresses that, by adding a flag to the two "simple" functions.
> Ultimately, this should probably do the same thing as 0002 and add thin
> wrappers, because the existing functions are part of the public API.

I agree that, taken together, these patches fix the
multiple-extended-stats-evaluation issue. However:

I think this has ended up with too many variants of these functions,
since we now have "_internal" and "_simple" variants, and you're
proposing adding more. The original purpose of the "_simple" variants
was to compute selectivities without looking at extended stats, and
now the "_internal" variants compute selectivities with an additional
"use_extended_stats" flag to control whether or not to look at
extended stats. Thus they're basically the same, and could be rolled
together.

Additionally, it's worth noting that the "_simple" variants expose the
"estimatedclauses" bitmap as an argument, which IMO is a bit messy as
an API. All callers of the "_simple" functions outside of clausesel.c
actually pass in estimatedclauses=NULL, so it's possible to refactor
and get rid of that, turning estimatedclauses into a purely internal
variable.

Also, it's quite messy that clauselist_selectivity_simple_or() needs
to be passed a Selectivity input (the final argument) that is the
selectivity of any already-estimated clauses, or the value to return
if no not-already-estimated clauses are found, and must be 0.0 when
called from the extended stats code.

Attached is the kind of thing I had in mind (as a single patch, since
I don't think it's worth splitting up). This replaces the "_simple"
and "_internal" variants of these functions with "_opt_ext_stats"
variants whose signatures match the originals except for having the
single extra "use_extended_stats" boolean parameter. Additionally, the
"_simple" functions are merged into the originals (making them more
like they were in PG11) so that the "estimatedclauses" bitmap and
partial-OR-list Selectivity become internal details, no longer exposed
in the API.

Regards,
Dean
diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
new file mode 100644
index 37a735b..ab16a4c
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -44,6 +44,12 @@ static void addRangeClause(RangeQueryCla
 		   bool varonleft, bool isLTsel, Selectivity s2);
 static RelOptInfo *find_single_rel_for_clauses(PlannerInfo *root,
 			   List *clauses);
+static Selectivity clause_selectivity_opt_ext_stats(PlannerInfo *root,
+	Node *clause,
+	int varRelid,
+	JoinType jointype,
+	SpecialJoinInfo *sjinfo,
+	bool use_extended_stats);
 
 /
  *		ROUTINES TO COMPUTE SELECTIVITIES
@@ -61,64 +67,8 @@ static RelOptInfo *find_single_rel_for_c
  *
  * The basic approach is to apply extended statistics first, on as many
  * clauses as possible, in order to capture cross-column dependencies etc.
- * The remaining clauses are then estimated using regular statistics tracked
- * for individual columns.  This is done by simply passing the clauses to
- * clauselist_selectivity_simple.
- */
-Selectivity
-clauselist_selectivity(PlannerInfo *root,
-	   List *clauses,
-	   int varRelid,
-	   JoinType jointype,
-	   SpecialJoinInfo *sjinfo)
-{
-	Selectivity s1 = 1.0;

Re: pg_proc.dat "proargmodes is not a 1-D char array"

2020-11-17 Thread Tom Lane
Alvaro Herrera  writes:
> Perhaps we can improve these error messages like below.  (Or maybe just
> keep it one message "proargmodes is not a 1-D char array of %d
> elements"?)  There are about 5 places to change I think.

I doubt that it's really worth expending more code on this.
Certainly I see no reason why that particular test is more likely
to fail than the others, in the presence of corrupt data --- and
I don't want to add individual elog's for each one.

Adding the expected length to the error message might be OK though.

regards, tom lane




Re: Deleting older versions in unique indexes to avoid page splits

2020-11-17 Thread Victor Yegorov
On Fri, Nov 13, 2020 at 00:01, Peter Geoghegan wrote:

> Attached is v8, which has the enhancements for low cardinality data
> that I mentioned earlier today. It also simplifies the logic for
> dealing with posting lists that we need to delete some TIDs from.
> These posting list simplifications also make the code a bit more
> efficient, which might be noticeable during benchmarking.
>

I've looked through the code and it looks very good from my end:
- plenty of comments, good description of what's going on
- I found no loose ends in terms of AM integration
- magic constants replaced with defines
Code looks good. Still, it'd be good if somebody with more experience could
look into this patch.


Question: why are you using double spaces after full stops in the comments?
Is this a project convention?

I am thinking of two more scenarios that require testing:
- queue in the table, with a high rate of INSERTs+DELETEs and a long
transaction (see the sketch after this list).
  Currently I've seen such conditions yield indexes of several GB in size
  while holding less than a thousand live records.
- upgraded cluster with !heapkeyspace indexes.
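
A minimal sketch of the first scenario (the table and column names are made
up; the point is just to hold an old snapshot open while the queue churns):

-- session 1: hold an old snapshot open for the duration of the test
begin;
select txid_current();
-- ...leave this transaction open...

-- session 2: churn the queue
create table job_queue (id bigserial primary key, payload text not null);
create index on job_queue (payload);

-- repeat many times, e.g. from a driver script or a custom pgbench script:
insert into job_queue (payload)
select md5(random()::text) from generate_series(1, 1000);
delete from job_queue where id in
  (select id from job_queue order by id limit 1000);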

-- 
Victor Yegorov


Re: Add session statistics to pg_stat_database

2020-11-17 Thread Laurenz Albe
On Fri, 2020-10-16 at 16:24 +0500, Ahsan Hadi wrote:
> I have applied the latest patch on master, all the regression test cases are 
> passing
>  and the implemented functionality is also looking fine. The point that I 
> raised about
>  idle connection not included is also addressed.

If you think that the patch is ready to go, you could mark it as
"ready for committer" in the commitfest app.

Yours,
Laurenz Albe





ERROR: too many dynamic shared memory segment

2020-11-17 Thread Ben Kanouse
Hello,

My database is experiencing an 'ERROR: too many dynamic shared memory
segments' error from time to time. It seems to happen most when traffic
is high, and it happens with semi-simple select statements that run a
parallel query plan with a parallel bitmap heap scan. It is currently
on version 12.4.

Here is an example query: https://explain.depesz.com/s/aatA

I did find a message in the archive on this topic, and it seems the
resolution was to increase the max_connections configuration. Since I
am running my database on Heroku I do not have access to this
configuration. It also seems like a rather unrelated configuration
change to make just to avoid running into this issue.

Link to archived discussion:
https://www.postgresql.org/message-id/CAEepm%3D2RcEWgES-f%2BHyg4931bOa0mbJ2AwrmTrabz6BKiAp%3DsQ%40mail.gmail.com

I think this block of code is determining the size for the control segment:
https://github.com/postgres/postgres/blob/REL_13_1/src/backend/storage/ipc/dsm.c#L157-L159

maxitems  = 64 + 2 * MaxBackends

It seems like the reason increasing the max connections helps is
because that number is part of the equation for determining the max
segment size.
I noticed in version 13.1 there is a commit that changes the
multiplier from 2 to 5:
https://github.com/postgres/postgres/commit/d061ea21fc1cc1c657bb5c742f5c4a1564e82ee2

maxitems  = 64 + 5 * MaxBackends
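
For a rough sense of scale (my numbers, not from the report above; MaxBackends
is derived from max_connections plus autovacuum and background workers), the
two formulas work out like this:

#include <stdio.h>

int
main(void)
{
	int		MaxBackends = 100;	/* assumed value, roughly max_connections plus workers */

	/* Mirrors the dsm.c formula quoted above, before and after the commit. */
	printf("v12 control segment slots: %d\n", 64 + 2 * MaxBackends);	/* 264 */
	printf("v13 control segment slots: %d\n", 64 + 5 * MaxBackends);	/* 564 */

	/*
	 * Each parallel query needs at least one DSM segment for its parallel
	 * context, and executor nodes such as a parallel bitmap heap scan can
	 * create more via DSA, so a burst of concurrent parallel queries can
	 * exhaust the smaller limit.
	 */
	return 0;
}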

Should this commit be back-ported to earlier versions of postgres to
prevent this error in other versions?

Thank you,
Ben




Re: Deleting older versions in unique indexes to avoid page splits

2020-11-17 Thread Victor Yegorov
Thu, 12 Nov 2020 at 23:00, Peter Geoghegan :

> Another thing that I'll probably add to v8: Prefetching. This is
> probably necessary just so I can have parity with the existing
> heapam.c function that the new code is based on,
> heap_compute_xid_horizon_for_tuples(). That will probably help here,
> too.
>

I don't quite see this part. Do you mean top_block_groups_favorable() here?


-- 
Victor Yegorov


Re: Deleting older versions in unique indexes to avoid page splits

2020-11-17 Thread Victor Yegorov
Thu, 12 Nov 2020 at 23:00, Peter Geoghegan :

> It would be helpful if you could take a look at the nbtree patch --
> particularly the changes related to deprecating the page-level
> BTP_HAS_GARBAGE flag. I would like to break those parts out into a
> separate patch, and commit it in the next week or two. It's just
> refactoring, really. (This commit can also make nbtree only use the
> word VACUUM for things that strictly involve VACUUM. For example,
> it'll rename _bt_vacuum_one_page() to _bt_delete_or_dedup_one_page().)
>
> We almost don't care about the flag already, so there is almost no
> behavioral change from deprecated BTP_HAS_GARBAGE in this way.
>
> Indexes that use deduplication already don't rely on BTP_HAS_GARBAGE
> being set ever since deduplication was added to Postgres 13 (the
> deduplication code doesn't want to deal with LP_DEAD bits, and cannot
> trust that no LP_DEAD bits can be set just because BTP_HAS_GARBAGE
> isn't set in the special area). Trusting the BTP_HAS_GARBAGE flag can
> cause us to miss out on deleting items with their LP_DEAD bits set --
> we're better off "assuming that BTP_HAS_GARBAGE is always set", and
> finding out if there really are LP_DEAD bits set for ourselves each
> time.
>
> Missing LP_DEAD bits like this can happen when VACUUM unsets the
> page-level flag without actually deleting the items at the same time,
> which is expected when the items became garbage (and actually had
> their LP_DEAD bits set) after VACUUM began, but before VACUUM reached
> the leaf pages. That's really wasteful, and doesn't actually have any
> upside -- we're scanning all of the line pointers only when we're
> about to split (or dedup) the same page anyway, so the extra work is
> practically free.
>

I've looked over the BTP_HAS_GARBAGE modifications; they look sane.
I've double-checked that heapkeyspace indexes don't use this flag (don't
rely on it), while pre-v4 ones still do.

I have a question. This flag is raised in _bt_check_unique() and in
_bt_killitems(). If we're deprecating this flag, perhaps it'd be good to
avoid raising it, at least in _bt_check_unique(), as it seems to me that
the conditions there deal with postings and therefore with heapkeyspace
indexes.

If we raise this flag conditionally in the functions above, I think we can
get rid of the blocks that clear it in _bt_delitems_delete().

-- 
Victor Yegorov


Re: Is it useful to record whether plans are generic or custom?

2020-11-17 Thread Pavel Stehule
Tue, 17 Nov 2020 at 15:21, torikoshia wrote:

> On 2020-11-12 14:23, Pavel Stehule wrote:
>
> > yes, the plan itself is very interesting information - and whether the
> > plan was generic or not is interesting too. It is another dimension of
> > the query - maybe there can be a rule - for any query, store at most the
> > 100 slowest plans with all attributes. The next issue is the fact that
> > the first 5 executions of "generic" plans are not really generic. This
> > fact should be visible too.
>
> Thanks!
> However, AFAIU, we can know whether the plan type is generic or custom
> from the plan information as described in the manual.
>
> -- https://www.postgresql.org/docs/devel/sql-prepare.html
> > If a generic plan is in use, it will contain parameter symbols $n,
> > while a custom plan will have the supplied parameter values substituted
> > into it.
>
> If we can get the plan information, the case like 'the first 5 executions
> of generic plans are not really generic' does not arise, does it?
>

yes

postgres=# create table foo(a int);
CREATE TABLE
postgres=# prepare x as select * from foo where a = $1;
PREPARE
postgres=# explain execute x(10);
┌─┐
│ QUERY PLAN  │
╞═╡
│ Seq Scan on foo  (cost=0.00..41.88 rows=13 width=4) │
│   Filter: (a = 10)  │
└─┘
(2 rows)

postgres=# explain execute x(10);
┌─┐
│ QUERY PLAN  │
╞═╡
│ Seq Scan on foo  (cost=0.00..41.88 rows=13 width=4) │
│   Filter: (a = 10)  │
└─┘
(2 rows)

postgres=# explain execute x(10);
┌─┐
│ QUERY PLAN  │
╞═╡
│ Seq Scan on foo  (cost=0.00..41.88 rows=13 width=4) │
│   Filter: (a = 10)  │
└─┘
(2 rows)

postgres=# explain execute x(10);
┌─┐
│ QUERY PLAN  │
╞═╡
│ Seq Scan on foo  (cost=0.00..41.88 rows=13 width=4) │
│   Filter: (a = 10)  │
└─┘
(2 rows)

postgres=# explain execute x(10);
┌─┐
│ QUERY PLAN  │
╞═╡
│ Seq Scan on foo  (cost=0.00..41.88 rows=13 width=4) │
│   Filter: (a = 10)  │
└─┘
(2 rows)

postgres=# explain execute x(10);
┌─┐
│ QUERY PLAN  │
╞═╡
│ Seq Scan on foo  (cost=0.00..41.88 rows=13 width=4) │
│   Filter: (a = $1)  │
└─┘
(2 rows)

postgres=# explain execute x(10);
┌─┐
│ QUERY PLAN  │
╞═╡
│ Seq Scan on foo  (cost=0.00..41.88 rows=13 width=4) │
│   Filter: (a = $1)  │
└─┘
(2 rows)


>
> Regards,
>
> --
> Atsushi Torikoshi
>


Re: Is it useful to record whether plans are generic or custom?

2020-11-17 Thread torikoshia

On 2020-11-12 14:23, Pavel Stehule wrote:


yes, the plan itself is very interesting information - and whether the
plan was generic or not is interesting too. It is another dimension of
the query - maybe there can be a rule - for any query, store at most the
100 slowest plans with all attributes. The next issue is the fact that
the first 5 executions of "generic" plans are not really generic. This
fact should be visible too.


Thanks!
However, AFAIU, we can know whether the plan type is generic or custom
from the plan information as described in the manual.

-- https://www.postgresql.org/docs/devel/sql-prepare.html
If a generic plan is in use, it will contain parameter symbols $n, 
while a custom plan will have the supplied parameter values substituted 
into it.


If we can get the plan information, the case like 'the first 5 executions
of generic plans are not really generic' does not arise, does it?


Regards,

--
Atsushi Torikoshi




ResourceOwner refactoring

2020-11-17 Thread Heikki Linnakangas
Two recent patches that I have reviewed have reminded me that the 
ResourceOwner interface is a bit clunky. There are two issues:


1. Performance. It's fast enough in most use, but when I was testing 
Kyotaro's catcache patches [1], the Resowner functions to track catcache 
references nevertheless showed up, accounting for about 20% of the CPU 
time used by catcache lookups.


2. It's difficult for extensions to use. There is a callback mechanism 
for extensions, but it's much less convenient to use than the built-in 
functions. The pgcrypto code uses the callbacks currently, and Michael's 
patch [2] would move that support for tracking OpenSSL contexts to the 
core, which makes it a lot more convenient for pgcrypto. Wouldn't it be 
nice if extensions could have the same ergonomics as built-in code, if 
they need to track external resources?


Attached patch refactors the ResourceOwner internals to do that.

The current code in HEAD has a separate array for each "kind" of object 
that can be tracked. The number of different kinds of objects has grown 
over the years; currently there is support for tracking buffers, files, 
catcache, relcache and plancache references, tupledescs, snapshots, DSM 
segments and LLVM JIT contexts. And locks, which are a bit special.


In the patch, I'm using the same array to hold all kinds of objects, and 
carry another field with each tracked object, to tell what kind of an 
object it is. An extension can define a new object kind, by defining 
Release and PrintLeakWarning callback functions for it. The code in 
resowner.c is now much more generic, as it doesn't need to know about 
all the different object kinds anymore (with a couple of exceptions). In 
the new scheme, there is one small fixed-size array, searched linearly, 
that holds the few most recently added references; older entries 
are moved to a hash table.
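
To illustrate the ergonomics, here is a minimal sketch (not part of the 
attached patch) of what an extension-defined resource kind could look like 
with this API; the "MyThing" type and MyThingClose() are made-up placeholders:

/* Sketch only: an extension tracking its own "MyThing" handles. */
#include "postgres.h"
#include "utils/resowner.h"

typedef struct MyThing MyThing;				/* some external resource handle */
extern void MyThingClose(MyThing *thing);	/* hypothetical cleanup function */

static void
ResOwnerReleaseMyThing(Datum res)
{
	MyThingClose((MyThing *) DatumGetPointer(res));
}

static void
ResOwnerPrintMyThingLeakWarning(Datum res)
{
	elog(WARNING, "MyThing reference leak: %p still referenced",
		 (void *) DatumGetPointer(res));
}

static ResourceOwnerFuncs mything_resowner_funcs =
{
	.name = "MyThing reference",
	.phase = RESOURCE_RELEASE_AFTER_LOCKS,
	.ReleaseResource = ResOwnerReleaseMyThing,
	.PrintLeakWarning = ResOwnerPrintMyThingLeakWarning
};

/* remember/forget, mirroring what tupdesc.c does in the patch below */
static void
RememberMyThing(MyThing *thing)
{
	ResourceOwnerEnlarge(CurrentResourceOwner);
	ResourceOwnerRemember(CurrentResourceOwner, PointerGetDatum(thing),
						  &mything_resowner_funcs);
}

static void
ForgetMyThing(MyThing *thing)
{
	ResourceOwnerForget(CurrentResourceOwner, PointerGetDatum(thing),
						&mything_resowner_funcs);
}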


I haven't done thorough performance testing of this, but with some quick 
testing with Kyotaro's "catcachebench" to stress-test syscache lookups, 
this performs a little better. In that test, all the activity happens in 
the small array portion, but I believe that's true for most use cases.


Thoughts? Can anyone suggest test scenarios to verify the performance of 
this?


[1] 
https://www.postgresql.org/message-id/20201106.172958.1103727352904717607.horikyota.ntt%40gmail.com


[2] https://www.postgresql.org/message-id/20201113031429.gb1...@paquier.xyz

- Heikki
diff --git a/src/backend/access/common/tupdesc.c b/src/backend/access/common/tupdesc.c
index 30c30cf3a2e..c099f04f551 100644
--- a/src/backend/access/common/tupdesc.c
+++ b/src/backend/access/common/tupdesc.c
@@ -29,9 +29,21 @@
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/datum.h"
-#include "utils/resowner_private.h"
+#include "utils/resowner.h"
 #include "utils/syscache.h"
 
+/* ResourceOwner callbacks to hold tupledesc references  */
+static void ResOwnerReleaseTupleDesc(Datum res);
+static void ResOwnerPrintTupleDescLeakWarning(Datum res);
+
+static ResourceOwnerFuncs tupdesc_resowner_funcs =
+{
+	/* tupdesc references */
+	.name = "tupdesc reference",
+	.phase = RESOURCE_RELEASE_AFTER_LOCKS,
+	.ReleaseResource = ResOwnerReleaseTupleDesc,
+	.PrintLeakWarning = ResOwnerPrintTupleDescLeakWarning
+};
 
 /*
  * CreateTemplateTupleDesc
@@ -376,9 +388,10 @@ IncrTupleDescRefCount(TupleDesc tupdesc)
 {
 	Assert(tupdesc->tdrefcount >= 0);
 
-	ResourceOwnerEnlargeTupleDescs(CurrentResourceOwner);
+	ResourceOwnerEnlarge(CurrentResourceOwner);
 	tupdesc->tdrefcount++;
-	ResourceOwnerRememberTupleDesc(CurrentResourceOwner, tupdesc);
+	ResourceOwnerRemember(CurrentResourceOwner, PointerGetDatum(tupdesc),
+		  &tupdesc_resowner_funcs);
 }
 
 /*
@@ -394,7 +407,8 @@ DecrTupleDescRefCount(TupleDesc tupdesc)
 {
 	Assert(tupdesc->tdrefcount > 0);
 
-	ResourceOwnerForgetTupleDesc(CurrentResourceOwner, tupdesc);
+	ResourceOwnerForget(CurrentResourceOwner, PointerGetDatum(tupdesc),
+		&tupdesc_resowner_funcs);
 	if (--tupdesc->tdrefcount == 0)
 		FreeTupleDesc(tupdesc);
 }
@@ -925,3 +939,23 @@ BuildDescFromLists(List *names, List *types, List *typmods, List *collations)
 
 	return desc;
 }
+
+
+/*
+ * ResourceOwner callbacks
+ */
+static void
+ResOwnerReleaseTupleDesc(Datum res)
+{
+	DecrTupleDescRefCount((TupleDesc) DatumGetPointer(res));
+}
+
+static void
+ResOwnerPrintTupleDescLeakWarning(Datum res)
+{
+	TupleDesc tupdesc = (TupleDesc) DatumGetPointer(res);
+
+	elog(WARNING,
+		 "TupleDesc reference leak: TupleDesc %p (%u,%d) still referenced",
+		 tupdesc, tupdesc->tdtypeid, tupdesc->tdtypmod);
+}
diff --git a/src/backend/jit/jit.c b/src/backend/jit/jit.c
index 5ca3f922fed..c44080b9742 100644
--- a/src/backend/jit/jit.c
+++ b/src/backend/jit/jit.c
@@ -26,7 +26,6 @@
 #include "jit/jit.h"
 #include "miscadmin.h"
 #include "utils/fmgrprotos.h"
-#include "utils/resowner_private.h"
 
 /* GUCs */
 bool		jit_enabled = true;
@@ -140,7 +139,6 @@ jit_release_context(JitContext *context)
 	if 

Re: logical streaming of xacts via test_decoding is broken

2020-11-17 Thread Amit Kapila
On Sun, Nov 15, 2020 at 11:34 AM Dilip Kumar  wrote:
>

Pushed after minor changes.

-- 
With Regards,
Amit Kapila.




Re: enable_incremental_sort changes query behavior

2020-11-17 Thread James Coleman
On Mon, Nov 16, 2020 at 11:23 PM Tomas Vondra
 wrote:
>
> Hmm, I missed that other thread. That indeed seems like a bug in the
> same area already tweaked by ebb7ae839d033d0f2 for similar cases.

I meant to bring it up in this thread before we got the other patch
committed, but I just ended up not having time to look into it.

> The attached patch fixes this simply by adding is_parallel_safe to
> get_useful_pathkeys_for_relation - that does fix the reproducer, but I'm
> not entirely sure that's the right place. Maybe it should be done in
> find_em_expr_usable_for_sorting_rel (which might make a difference if an
> EC clause can contain a mix of parallel safe and unsafe expressions). Or
> maybe we should do it in the caller (which would allow using
> get_useful_pathkeys_for_relation in contexts not requiring parallel
> safety in the future).
>
> Anyway, while this is not an "incremental sort" bug, it seems like the
> root cause is the same as for ebb7ae839d033d0f2 - one of the incremental
> sort patches started considering sorting below gather nodes, not
> realizing these possible consequences.

Yeah. I'd like to investigate a bit whether we really should also be adding
it to prepare_sort_from_pathkeys (which
find_em_expr_usable_for_sorting_rel parallels) or similar, since this
seems to be a broader concern that's been previously missed (even if
the bug was otherwise not yet exposed). In particular, as I understand
it, we did do sorts below gather nodes before, just in fewer places,
which meant that "in theory" the code was already broken (or at least
not complete) for parallel queries.

James




Re: Hash support for row types

2020-11-17 Thread Peter Eisentraut
I wrote a new patch to add a lot more tests around hash-based plans. 
This is intended to apply separately from the other patch, and the other 
patch would then "flip" some of the test cases.


On 2020-11-13 20:51, Tom Lane wrote:

* The new test in with.sql claims to be testing row hashing, but
it's not very apparent that any such thing actually happens.  Maybe
EXPLAIN the query, as well as execute it, to confirm that a
hash-based plan is used.


The recursive union requires hashing, but this is not visible in the 
plan.  You only get an error if there is no hashing support for a type. 
I have added a test for this.


For the non-recursive union, I have added more tests that show this in 
the plans.



* Is it worth devising a test case in which hashing is not possible
because one of the columns isn't hashable?  I have mixed feelings
about this because the set of suitable column types may decrease
to empty over time, making it hard to maintain the test case.


I used the money type for now.  If someone adds hash support for that, 
we'll change it.  I don't think this will change too rapidly, though.


--
Peter Eisentraut
2ndQuadrant, an EDB company
https://www.2ndquadrant.com/
From dface8b1a8a5714001b2f6cd7f949d2f78047306 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 17 Nov 2020 13:54:52 +0100
Subject: [PATCH] Add more tests for hashing and hash-based plans

- Test hashing of an array of a non-hashable element type.

- Test UNION [DISTINCT] with hash- and sort-based plans.  (Previously,
  only INTERSECT and EXCEPT were tested there.)

- Test UNION [DISTINCT] with a non-hashable column type.  This
  currently reverts to a sort-based plan even if enable_hashagg is on.

- Test UNION/INTERSECT/EXCEPT hash- and sort-based plans with arrays
  as column types.  Also test an array with a non-hashable element
  type.

- Test UNION/INTERSECT/EXCEPT similarly with row types as column
  types.  Currently, this uses only sort-based plans because there is
  no hashing support for row types.

- Add a test case that shows that recursive queries using UNION
  [DISTINCT] require hashable column types.

- Add a currently failing test that uses UNION DISTINCT in a
  cycle-detection use case using row types as column types.

Discussion: 
https://www.postgresql.org/message-id/flat/38eccd35-4e2d-6767-1b3c-dada1eac3124%402ndquadrant.com
---
 src/test/regress/expected/hash_func.out |   7 +
 src/test/regress/expected/union.out | 357 +++-
 src/test/regress/expected/with.out  |  20 ++
 src/test/regress/sql/hash_func.sql  |   6 +
 src/test/regress/sql/union.sql  |  92 +-
 src/test/regress/sql/with.sql   |  18 ++
 6 files changed, 498 insertions(+), 2 deletions(-)

diff --git a/src/test/regress/expected/hash_func.out b/src/test/regress/expected/hash_func.out
index e6e3410aaa..e7d615fde5 100644
--- a/src/test/regress/expected/hash_func.out
+++ b/src/test/regress/expected/hash_func.out
@@ -177,6 +177,13 @@ WHERE  hash_array(v)::bit(32) != hash_array_extended(v, 0)::bit(32)
 ---+--+---+---
 (0 rows)
 
+-- array hashing with non-hashable element type
+SELECT v as value, hash_array(v)::bit(32) as standard
+FROM   (VALUES ('{0}'::money[])) x(v);
+ERROR:  could not identify a hash function for type money
+SELECT v as value, hash_array_extended(v, 0)::bit(32) as extended0
+FROM   (VALUES ('{0}'::money[])) x(v);
+ERROR:  could not identify an extended hash function for type money
 SELECT v as value, hashbpchar(v)::bit(32) as standard,
hashbpcharextended(v, 0)::bit(32) as extended0,
hashbpcharextended(v, 1)::bit(32) as extended1
diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out
index 6e72e92d80..22e1ff5c42 100644
--- a/src/test/regress/expected/union.out
+++ b/src/test/regress/expected/union.out
@@ -345,8 +345,28 @@ ERROR:  FOR NO KEY UPDATE is not allowed with UNION/INTERSECT/EXCEPT
 1 |2 |3
 (1 row)
 
--- exercise both hashed and sorted implementations of INTERSECT/EXCEPT
+-- exercise both hashed and sorted implementations of UNION/INTERSECT/EXCEPT
 set enable_hashagg to on;
+explain (costs off)
+select count(*) from
+  ( select unique1 from tenk1 union select fivethous from tenk1 ) ss;
+   QUERY PLAN   
+
+ Aggregate
+   ->  HashAggregate
+ Group Key: tenk1.unique1
+ ->  Append
+   ->  Index Only Scan using tenk1_unique1 on tenk1
+   ->  Seq Scan on tenk1 tenk1_1
+(6 rows)
+
+select count(*) from
+  ( select unique1 from tenk1 union select fivethous from tenk1 ) ss;
+ count 
+---
+ 1
+(1 row)
+
 explain (costs off)
 select count(*) from
   ( select unique1 from tenk1 intersect select fivethous from tenk1 ) ss;
@@ -389,6 +409,27 @@ select unique1 from tenk1 except select unique2 from tenk1 where unique2 != 10;
 

Re: [patch] [doc] Clarify that signal functions have no feedback

2020-11-17 Thread Heikki Linnakangas

On 02/11/2020 18:02, David G. Johnston wrote:

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index bf6004f321..43bc2cf086 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -23892,7 +23892,8 @@ SELECT collation for ('foo' COLLATE "de_DE");
 


 The functions shown in <xref
-linkend="functions-admin-signal-table"/> send control signals to
+linkend="functions-admin-signal-table"/> send uni-directional
+control signals to
 other server processes.  Use of these functions is restricted to
 superusers by default but access may be granted to others using
 GRANT, with noted exceptions.


The "uni-directional" sounds a bit redundant; "send" already implies that
it's uni-directional, I think.



@@ -23900,7 +23901,8 @@ SELECT collation for ('foo' COLLATE "de_DE");
 


 Each of these functions returns true if
-successful and false otherwise.
+the signal was successfully sent and false
+if the sending of the signal failed.



This is a good clarification.



@@ -23948,7 +23950,11 @@ SELECT collation for ('foo' COLLATE "de_DE");
 server to reload their configuration files.  (This is initiated by
 sending a SIGHUP signal to the postmaster
 process, which in turn sends SIGHUP to each
-of its children.)
+of its children.)  Inspection of the relevant
+pg_file_settings
+or
+pg_hba_file_rules views
+is recommended after making changes but before signaling the server.

   


I don't understand this recommendation. What is the user supposed to 
look for in those views? And why before signaling the server?


[me reads what those views do]. Oh, I see, the idea is that you can use 
those views to check the configuration for errors, before applying the 
changes. How about this:


You can use the pg_file_settings and pg_hba_file_rules views to check
the configuration files for possible errors, before reloading.
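
For example, something along these lines (my example, not part of the patch):

-- Look for settings the server could not parse ...
SELECT sourcefile, sourceline, name, error
FROM pg_file_settings
WHERE error IS NOT NULL;

-- ... and only then reload.
SELECT pg_reload_conf();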


- Heikki




Re: pg_proc.dat "proargmodes is not a 1-D char array"

2020-11-17 Thread Alvaro Herrera
On 2020-Sep-30, Craig Ringer wrote:

> Hi all
> 
> Random tip for future searchers. If you've modified pg_proc.dat and  initdb
> fails with "proargmodes is not a 1-D char array" - it could well actually
> be that the length of proargmodes does not match the length of
> proallargtypes given the test
> 
> ARR_DIMS(arr)[0] != numargs ||
> 
> in funcapi.c.

Perhaps we can improve these error messages like below.  (Or maybe just
keep it one message "proargmodes is not a 1-D char array of %d
elements"?)  There are about 5 places to change I think.

diff --git a/src/backend/utils/fmgr/funcapi.c b/src/backend/utils/fmgr/funcapi.c
index b9efa77291..c76c16f799 100644
--- a/src/backend/utils/fmgr/funcapi.c
+++ b/src/backend/utils/fmgr/funcapi.c
@@ -1167,10 +1167,11 @@ get_func_arg_info(HeapTuple procTup,
{
arr = DatumGetArrayTypeP(proargmodes);  /* ensure not toasted */
if (ARR_NDIM(arr) != 1 ||
-   ARR_DIMS(arr)[0] != numargs ||
ARR_HASNULL(arr) ||
ARR_ELEMTYPE(arr) != CHAROID)
elog(ERROR, "proargmodes is not a 1-D char array");
+   if (ARR_DIMS(arr)[0] != numargs)
+   elog(ERROR, "proargmodes is not %d elements long", 
numargs);
*p_argmodes = (char *) palloc(numargs * sizeof(char));
memcpy(*p_argmodes, ARR_DATA_PTR(arr),
   numargs * sizeof(char));
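
The single-message variant mentioned above would just fold the length check
into the existing test, something like this (a sketch, untested):

	arr = DatumGetArrayTypeP(proargmodes);	/* ensure not toasted */
	if (ARR_NDIM(arr) != 1 ||
		ARR_DIMS(arr)[0] != numargs ||
		ARR_HASNULL(arr) ||
		ARR_ELEMTYPE(arr) != CHAROID)
		elog(ERROR, "proargmodes is not a 1-D char array of %d elements",
			 numargs);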




Re: Use standard SIGHUP and SIGTERM handlers in autoprewarm module

2020-11-17 Thread Bharath Rupireddy
Thanks Craig!

On Fri, Oct 23, 2020 at 9:37 AM Craig Ringer
 wrote:
>
> src/test/modules/test_shm_mq/worker.c appears to do the right thing the wrong 
> way - it has its own custom handler instead of using die() or 
> SignalHandlerForShutdownRequest().
>

handle_sigterm() and die() are similar, except that die() has extra
handling (below) for exiting immediately when waiting for input/commands
from the client.
/*
 * If we're in single user mode, we want to quit immediately - we can't
 * rely on latches as they wouldn't work when stdin/stdout is a file.
 * Rather ugly, but it's unlikely to be worthwhile to invest much more
 * effort just for the benefit of single user mode.
 */
if (DoingCommandRead && whereToSendOutput != DestRemote)
ProcessInterrupts();

Having this extra handling is correct for normal backends: since they can
be connected directly to clients and read input commands, the above if
condition may become true and ProcessInterrupts() may be called to
handle the SIGTERM immediately.

Note that DoingCommandRead can never be true in bg workers, as it is
set to true only in a normal backend's PostgresMain(). If we use
die() (instead of a custom SIGTERM handler such as handle_sigterm()) in
a bg worker/non-backend process, there is no chance that the above
part of the code gets hit and the interrupts get processed
immediately. Here are the bg worker processes that already use die()
instead of their own custom handlers: parallel workers, the autoprewarm
worker, autovacuum workers, the logical replication launcher and apply
worker, and the walsender process.

I think we can also use die() instead of handle_sigterm() in
test_shm_mq/worker.c.

I attached a patch for this change.

>
> In contrast  src/test/modules/worker_spi/worker_spi.c looks plain wrong. 
> Especially since it's quoted as an example of how to do things right. It 
> won't respond to SIGTERM at all while it's executing a query from its queue, 
> no matter how long that query takes or whether it blocks. It can inhibit even 
> postmaster shutdown as a result.
>

The postmaster sends SIGTERM to all children (backends and bgworkers) for
both smart shutdown (wait for children to end their work, then shut
down) and fast shutdown (roll back active transactions and shut down
when they are gone). For immediate shutdown, SIGQUIT is sent to the
children (see pmdie()).

For each bg worker (so also for worker_spi.c),
SignalHandlerForCrashExit() is set as the SIGQUIT handler in
InitPostmasterChild(); it does nothing but exit immediately. We
cannot use quickdie() here because, as a bg worker, we don't have
to/cannot send anything to the client. We are fine with
SignalHandlerForCrashExit(), as with all other bg workers.

Yes, with worker_spi_sigterm()/SignalHandlerForShutdownRequest(), the
worker_spi worker may sometimes (as explained above) not respond to
SIGTERM. I think we should have die() as the SIGTERM handler here (as
with normal backends and parallel workers).

Attaching another patch that has replaced custom SIGTERM handler
worker_spi_sigterm() with die() and custom SIGHUP handler
worker_spi_sighup() with standard SignalHandlerForConfigReload().
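
The essence of that change is just the handler registration in the worker's
main function, roughly like this (a sketch of the idea, not the attached
patch; it assumes the usual worker_spi includes plus postmaster/interrupt.h
for SignalHandlerForConfigReload/ConfigReloadPending, tcop/tcopprot.h for
die(), and utils/guc.h for ProcessConfigFile()):

void
worker_spi_main(Datum main_arg)
{
	/* Standard handlers instead of worker_spi_sighup()/worker_spi_sigterm() */
	pqsignal(SIGHUP, SignalHandlerForConfigReload);
	pqsignal(SIGTERM, die);
	BackgroundWorkerUnblockSignals();

	for (;;)
	{
		/* ... wait on the latch and run the queued query as before ... */

		CHECK_FOR_INTERRUPTS();		/* with die(), SIGTERM takes effect here */

		if (ConfigReloadPending)
		{
			ConfigReloadPending = false;
			ProcessConfigFile(PGC_SIGHUP);
		}
	}
}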

>
> I was going to lob off a quick patch to fix this by making both use 
> quickdie() for SIGQUIT and die() for SIGTERM, but after reading for a bit I'm 
> no longer sure what the right choice even is. I'd welcome some opinions.
>

We cannot use quickdie() here because, as a bg worker, we don't have
to/cannot send anything to the client. We are fine with
SignalHandlerForCrashExit(), as with all other bg workers.

Thoughts?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


v1-0001-Use-die-instead-of-custom-signal-handler-in-test_shm_mq-worker.patch
Description: Binary data


v1-0001-Use-standard-SIGHUP-handler-and-SIGTERM-handler-die-in-worker_spi.patch
Description: Binary data


Re: [HACKERS] logical decoding of two-phase transactions

2020-11-17 Thread Amit Kapila
On Tue, Nov 17, 2020 at 5:02 PM Ajin Cherian  wrote:
>
> On Tue, Nov 17, 2020 at 10:14 PM Amit Kapila  wrote:
>
> >
> > Doesn't this happen only if you set replication origins? Because
> > otherwise both PrepareTransaction() and
> > RecordTransactionCommitPrepared() used the current timestamp.
> >
>
> I was also checking this; even if you set replication origins, the
> prepared transaction will reflect the local prepare time in
> pg_prepared_xacts. pg_prepared_xacts fetches this information
> from the GlobalTransaction data, which does not store the origin_timestamp;
> it only stores prepared_at, which is the local timestamp.
>

Sure, but my question was does this difference in behavior happens
without replication origins in any way? The reason is that if it
occurs only with replication origins, I don't think we need to bother
about the same because that feature is not properly implemented and
not used as-is. See the discussion [1] [2]. OTOH, if this behavior can
happen without replication origins then we might want to consider
changing it.


[1] - 
https://www.postgresql.org/message-id/064fab0c-915e-aede-c02e-bd4ec1f59732%402ndquadrant.com
[2] - 
https://www.postgresql.org/message-id/188d15be-8699-c045-486a-f0439c9c2b7d%402ndquadrant.com

-- 
With Regards,
Amit Kapila.




Re: [HACKERS] logical decoding of two-phase transactions

2020-11-17 Thread Ajin Cherian
On Tue, Nov 17, 2020 at 10:14 PM Amit Kapila  wrote:

>
> Doesn't this happen only if you set replication origins? Because
> otherwise both PrepareTransaction() and
> RecordTransactionCommitPrepared() used the current timestamp.
>

I was also checking this; even if you set replication origins, the
prepared transaction will reflect the local prepare time in
pg_prepared_xacts. pg_prepared_xacts fetches this information
from the GlobalTransaction data, which does not store the origin_timestamp;
it only stores prepared_at, which is the local timestamp.
The WAL record does have the origin_timestamp, but that is not copied
into the GlobalTransaction data structure:

typedef struct xl_xact_prepare
{
	uint32		magic;			/* format identifier */
	uint32		total_len;		/* actual file length */
	TransactionId xid;			/* original transaction XID */
	Oid			database;		/* OID of database it was in */
	TimestampTz prepared_at;	/* time of preparation */
								/* <=== this is local time and updated in
								 * GlobalTransaction */
	Oid			owner;			/* user running the transaction */
	int32		nsubxacts;		/* number of following subxact XIDs */
	int32		ncommitrels;	/* number of delete-on-commit rels */
	int32		nabortrels;		/* number of delete-on-abort rels */
	int32		ninvalmsgs;		/* number of cache invalidation messages */
	bool		initfileinval;	/* does relcache init file need invalidation? */
	uint16		gidlen;			/* length of the GID - GID follows the header */
	XLogRecPtr	origin_lsn;		/* lsn of this record at origin node */
	TimestampTz origin_timestamp;	/* time of prepare at origin node */
								/* <=== this is the time at origin which is
								 * not updated in GlobalTransaction */
} xl_xact_prepare;

regards,
Ajin Cherian
Fujitsu Australia




Re: VACUUM (DISABLE_PAGE_SKIPPING on)

2020-11-17 Thread Simon Riggs
On Mon, 16 Nov 2020 at 22:53, Masahiko Sawada  wrote:
> On Tue, Nov 17, 2020 at 5:52 AM Simon Riggs  wrote:

> I don't think the doc is wrong. If DISABLE_PAGE_SKIPPING is specified,
> we not only set aggressive = true but also skip checking visibility
> map. For instance, see line 905 and line 963, lazy_scan_heap().

OK, so you're saying that the docs illustrate the true intention of
the patch, which I immediately accept since I know you were the
author. Forgive me for not discussing it with you first, I thought
this was a clear cut case.

But that then highlights another area where the docs are wrong...

> On Tue, Nov 17, 2020 at 5:52 AM Simon Riggs  wrote:
> > The docs do correctly say "Pages where all tuples are known to be
> > frozen can always be skipped". Checking the code, lazy_scan_heap()
> > comments say
> > "we can still skip pages that are all-frozen, since such pages do not
> > need freezing".

The docs say this:
"Pages where all tuples are known to be frozen can always be skipped."
Why bother to say that if the feature then ignores that point and
scans them anyway?
May I submit a patch to remove that sentence?

Anyway, we're back to where I started: looking for a user-initiated
command option that allows a table to be scanned aggressively so that
relfrozenxid can be moved forward as quickly as possible. This is what
I thought you were trying to achieve with the DISABLE_PAGE_SKIPPING
option, my bad.

Now David J, above, says this would be VACUUM FREEZE, but I don't
think that is right. Setting VACUUM FREEZE has these effects: 1) makes
a vacuum aggressive, but it also 2) moves the freeze limit so high
that it freezes mostly everything. (1) allows the vacuum to reset
relfrozenxid, but (2) actually slows down the scan by making it freeze
more blocks than it would do normally.

So we have 3 ways to reset relfrozenxid by a user action:
VACUUM (DISABLE_PAGE_SKIPPING ON) - scans all blocks, deliberately
ignoring the ones it could have skipped. This certainly slows it down.
VACUUM (FREEZE ON) - freezes everything in its path, slowing down the
scan by writing too many blocks.
VACUUM (FULL on) - rewrites table and rebuilds index, so very slow

What I think we need is a 4th option which aims to move relfrozenxid
forwards as quickly as possible; it would:
* initiate an aggressive scan, so it does not skip blocks because of
busy buffer pins
* skip pages that are all-frozen, as we are allowed to do
* use normal freeze limits, so we avoid writing to blocks if possible

If we do all 3 of those things, the scan will complete as quickly as
possible and reset relfrozenxid quickly. It would make sense to use
that in conjunction with index_cleanup=off
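
In other words, something like the following hypothetical invocation (the
AGGRESSIVE option name is made up here purely to illustrate the idea):

-- hypothetical: aggressive scan, skip all-frozen pages, normal freeze limits
VACUUM (AGGRESSIVE ON, INDEX_CLEANUP OFF) mytable;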

As an additional optimization, if we do find a row that needs freezing
on a data block, we should simply freeze *all* row versions on the
page, not just the ones below the selected cutoff. This is justified
since writing the block is the biggest cost and it doesn't make much
sense to leave a few rows unfrozen on a block that we are dirtying.

Thoughts?

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: Re: parallel distinct union and aggregate support patch

2020-11-17 Thread Dilip Kumar
On Sun, Nov 8, 2020 at 11:54 AM Dilip Kumar  wrote:
>
> On Tue, Nov 3, 2020 at 6:06 PM Dilip Kumar  wrote:
> >
> > On Thu, Oct 29, 2020 at 12:53 PM bu...@sohu.com  wrote:
> > >
> > > > 1) It's better to always include the whole patch series - including the
> > > > parts that have not changed. Otherwise people have to scavenge the
> > > > thread and search for all the pieces, which may be a source of issues.
> > > > Also, it confuses the patch tester [1] which tries to apply patches from
> > > > a single message, so it will fail for this one.
> > >  Patches 3 and 4 do not rely on 1 and 2 in code.
> > >  But it will fail when you apply patches 3 and 4 directly, because
> > >  they were written after 1 and 2.
> > >  I can generate a new single patch if you need.
> > >
> > > > 2) I suggest you try to describe the goal of these patches, using some
> > > > example queries, explain output etc. Right now the reviewers have to
> > > > reverse engineer the patches and deduce what the intention was, which
> > > > may be causing unnecessary confusion etc. If this was my patch, I'd try
> > > > to create a couple examples (CREATE TABLE + SELECT + EXPLAIN) showing
> > > > how the patch changes the query plan, showing speedup etc.
> > >  I wrote some example queries into the regress tests, including "unique",
> > >  "union", "group by" and "group by grouping sets".
> > >  Here are my tests; they are not in regress:
> > > ```sql
> > > begin;
> > > create table gtest(id integer, txt text);
> > > insert into gtest select t1.id,'txt'||t1.id from (select 
> > > generate_series(1,1000*1000) id) t1,(select generate_series(1,10) id) t2;
> > > analyze gtest;
> > > commit;
> > > set jit = off;
> > > \timing on
> > > ```
> > > normal aggregate times
> > > ```
> > > set enable_batch_hashagg = off;
> > > explain (costs off,analyze,verbose)
> > > select sum(id),txt from gtest group by txt;
> > >  QUERY PLAN
> > > -
> > >  Finalize GroupAggregate (actual time=6469.279..8947.024 rows=100 
> > > loops=1)
> > >Output: sum(id), txt
> > >Group Key: gtest.txt
> > >->  Gather Merge (actual time=6469.245..8165.930 rows=158 loops=1)
> > >  Output: txt, (PARTIAL sum(id))
> > >  Workers Planned: 2
> > >  Workers Launched: 2
> > >  ->  Sort (actual time=6356.471..7133.832 rows=53 loops=3)
> > >Output: txt, (PARTIAL sum(id))
> > >Sort Key: gtest.txt
> > >Sort Method: external merge  Disk: 11608kB
> > >Worker 0:  actual time=6447.665..7349.431 rows=317512 
> > > loops=1
> > >  Sort Method: external merge  Disk: 10576kB
> > >Worker 1:  actual time=6302.882..7061.157 rows=01 
> > > loops=1
> > >  Sort Method: external merge  Disk: 2kB
> > >->  Partial HashAggregate (actual time=2591.487..4430.437 
> > > rows=53 loops=3)
> > >  Output: txt, PARTIAL sum(id)
> > >  Group Key: gtest.txt
> > >  Batches: 17  Memory Usage: 4241kB  Disk Usage: 
> > > 113152kB
> > >  Worker 0:  actual time=2584.345..4486.407 
> > > rows=317512 loops=1
> > >Batches: 17  Memory Usage: 4241kB  Disk Usage: 
> > > 101392kB
> > >  Worker 1:  actual time=2584.369..4393.244 
> > > rows=01 loops=1
> > >Batches: 17  Memory Usage: 4241kB  Disk Usage: 
> > > 112832kB
> > >  ->  Parallel Seq Scan on public.gtest (actual 
> > > time=0.691..603.990 rows=333 loops=3)
> > >Output: id, txt
> > >Worker 0:  actual time=0.104..607.146 
> > > rows=3174970 loops=1
> > >Worker 1:  actual time=0.100..603.951 
> > > rows=3332785 loops=1
> > >  Planning Time: 0.226 ms
> > >  Execution Time: 9021.058 ms
> > > (29 rows)
> > >
> > > Time: 9022.251 ms (00:09.022)
> > >
> > > set enable_batch_hashagg = on;
> > > explain (costs off,analyze,verbose)
> > > select sum(id),txt from gtest group by txt;
> > >QUERY PLAN
> > > -
> > >  Gather (actual time=3116.666..5740.826 rows=100 loops=1)
> > >Output: (sum(id)), txt
> > >Workers Planned: 2
> > >Workers Launched: 2
> > >->  Parallel BatchHashAggregate (actual time=3103.181..5464.948 
> > > rows=33 loops=3)
> > >  Output: sum(id), txt
> > >  Group Key: gtest.txt
> > >  Worker 0:  actual time=3094.550..5486.992 rows=326082 loops=1
> > >  Worker 1:  actual time=3099.562..5480.111 rows=324729 loops=1
> > >  ->  Parallel Seq Scan on public.gtest (actual 
> > > 

Re: [HACKERS] logical decoding of two-phase transactions

2020-11-17 Thread Amit Kapila
On Mon, Nov 16, 2020 at 3:20 PM Masahiko Sawada  wrote:
>
> On Mon, Nov 16, 2020 at 4:25 PM Ajin Cherian  wrote:
> >
> > Updated with a new test case
> > (contrib/test_decoding/t/002_twophase-streaming.pl) that tests
> > concurrent aborts during streaming prepare. Had to make a few changes
> > to the test_decoding stream_start callbacks to handle
> > "check-xid-aborted"
> > the same way it was handled in the non stream callbacks. Merged
> > Peter's v20-0006 as well.
> >
>
> Thank you for updating the patch.
>
> I have a question about the timestamp of PREPARE on a subscriber node,
> although this may have already been discussed.
>
> With the current patch, the timestamps of PREPARE are different
> between the publisher and the subscriber but the timestamp of their
> commits are the same. For example,
>
> -- There is 1 prepared transaction on a publisher node.
> =# select * from pg_prepared_xact;
>
>  transaction | gid |   prepared|  owner   | database
> -+-+---+--+--
>  510 | h1  | 2020-11-16 16:57:13.438633+09 | masahiko | postgres
> (1 row)
>
> -- This prepared transaction is replicated to a subscriber.
> =# select * from pg_prepared_xact;
>
>  transaction | gid |   prepared|  owner   | database
> -+-+---+--+--
>  514 | h1  | 2020-11-16 16:57:13.440593+09 | masahiko | postgres
> (1 row)
>
> These timestamps are different. Let's commit the prepared transaction
> 'h1' on the publisher and check the commit timestamps on both nodes.
>
> -- On the publisher node.
> =# select pg_xact_commit_timestamp('510'::xid);
>
>pg_xact_commit_timestamp
> ---
>  2020-11-16 16:57:13.474275+09
> (1 row)
>
> -- Commit prepared is also replicated to the subscriber node.
> =# select pg_xact_commit_timestamp('514'::xid);
>
>pg_xact_commit_timestamp
> ---
>  2020-11-16 16:57:13.474275+09
> (1 row)
>
> The commit timestamps are the same. At PREPARE we use the local
> timestamp when PREPARE is executed as 'prepared' time while at COMMIT
> PREPARED we use the origin's commit timestamp as the commit timestamp
> if the commit WAL has.
>

Doesn't this happen only if you set replication origins? Because
otherwise both PrepareTransaction() and
RecordTransactionCommitPrepared() used the current timestamp.

-- 
With Regards,
Amit Kapila.




Re: Heads-up: macOS Big Sur upgrade breaks EDB PostgreSQL installations

2020-11-17 Thread Dave Page
FYI, both Jonathan and I have now tested this on additional machines and
have been unable to reproduce the issue, so it seems like something odd
happened on my original upgrade rather than a general issue.

Apologies for the noise.

On Mon, Nov 16, 2020 at 9:27 AM Dave Page  wrote:

> Hi,
>
> This is more of a head-ups than anything else, as I suspect this may come
> up in various forums.
>
> The PostgreSQL installers for macOS (from EDB, possibly others too) create
> the data directory in /Library/PostgreSQL//data. This has been
> the case since the first release, 10+ years ago.
>
> It looks like the Big Sur upgrade has taken it upon itself to "fix" any
> filesystem permissions it doesn't like. On my system, this resulted in the
> data directory having 0755 permissions, which meant that PostgreSQL refused
> to start. Manually changing the permissions back to 0700 (0750 should also
> work) fixes the issue.
>
> I'm not sure there's much we can do about this - systems that are likely
> to be affected are already out there, and we obviously don't want to relax
> the permissions Postgres requires.
>
> --
> Dave Page
> Blog: http://pgsnake.blogspot.com
> Twitter: @pgsnake
>
> EDB: http://www.enterprisedb.com
>
>

-- 
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EDB: http://www.enterprisedb.com


Re: Asynchronous Append on postgres_fdw nodes.

2020-11-17 Thread Etsuro Fujita
On Mon, Oct 5, 2020 at 3:35 PM Etsuro Fujita  wrote:
> Yes, if there are no objections from you or Thomas or Robert or anyone
> else, I'll update Robert's patch as such.

Here is a new version of the patch (as promised in the developer
unconference in PostgresConf.CN & PGConf.Asia 2020):

* In Robert's patch [1] (and Horiguchi-san's, which was created based
on Robert's), ExecAppend() was modified to retrieve tuples from
async-aware children *before* the tuples will be needed, but I don't
think that's really a good idea, because the query might complete
before returning the tuples.  So I modified that function so that a
tuple is retrieved from an async-aware child *when* it is needed, like
Thomas' patch.  I used FDW callback functions proposed by Robert, but
introduced another FDW callback function ForeignAsyncBegin() for each
async-aware child to start an asynchronous data fetch at the first
call to ExecAppend() after ExecInitAppend() or ExecReScanAppend().

* For EvalPlanQual, I modified the patch so that async-aware children
are treated as if they were synchronous when executing EvalPlanQual.

* In Robert's patch, all async-aware children below Append nodes in
the query waiting for events to occur were managed by a single EState,
but I modified the patch so that such children are managed by each
Append node, like Horiguchi-san's patch and Thomas'.

* In Robert's patch, the FDW callback function
ForeignAsyncConfigureWait() allowed multiple events to be configured,
but I limited that function to only allow a single event to be
configured, just for simplicity.

* I haven't yet added some planner/resowner changes from Horiguchi-san's patch.

* I haven't yet done anything about the issue on postgres_fdw's
handling of concurrent data fetches by multiple ForeignScan nodes
(below *different* Append nodes in the query) using the same
connection discussed in [2].  I modified the patch to just disable
applying this feature to problematic test cases in the postgres_fdw
regression tests, by a new GUC enable_async_append.

Comments welcome!  The attached is still WIP and maybe I'm missing
something, though.

Best regards,
Etsuro Fujita

[1] 
https://www.postgresql.org/message-id/CA%2BTgmoaXQEt4tZ03FtQhnzeDEMzBck%2BLrni0UWHVVgOTnA6C1w%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/CAPmGK16E1erFV9STg8yokoewY6E-zEJtLzHUJcQx%2B3dyivCT%3DA%40mail.gmail.com


async-wip-2020-11-17.patch
Description: Binary data


Re: Cache relation sizes?

2020-11-17 Thread Amit Kapila
On Tue, Nov 17, 2020 at 4:13 AM Thomas Munro  wrote:
>
> On Mon, Nov 16, 2020 at 11:01 PM Konstantin Knizhnik
>  wrote:
> > I will look at your implementation more precisely latter.
>
> Thanks!  Warning:  I thought about making a thing like this for a
> while, but the patch itself is only a one-day prototype, so I am sure
> you can find many bugs...  Hmm, I guess the smgrnblocks_fast() code
> really needs a nice big comment to explain the theory about why this
> value is fresh enough, based on earlier ideas about implicit memory
> barriers.  (Something like: if you're extending, you must have
> acquired the extension lock; if you're scanning you must have acquired
> a snapshot that only needs to see blocks that existed at that time and
> concurrent truncations are impossible; I'm wondering about special
> snapshots like VACUUM, and whether this is too relaxed for that, or if
> it's covered by existing horizon-based interlocking...).
>

Yeah, it is good to verify the VACUUM stuff, but I have another question
here. What about queries having functions that access the same
relation (SELECT c1 FROM t1 WHERE c1 <= func(); assuming here that the
function accesses the relation t1)? I think that because the relation 't1'
is already opened, it might use the same block count from the
shared cache even though the snapshot for relation 't1' when accessed
from func() might be different. Am I missing something, or is this
dealt with in some way?
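
For concreteness, the case I have in mind looks something like this (my own
illustration):

CREATE TABLE t1 (c1 int);

-- func() scans the same relation that the outer query is reading
CREATE FUNCTION func() RETURNS int LANGUAGE sql STABLE AS
$$ SELECT max(c1) FROM t1 $$;

SELECT c1 FROM t1 WHERE c1 <= func();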

-- 
With Regards,
Amit Kapila.




RE: POC: postgres_fdw insert batching

2020-11-17 Thread tsunakawa.ta...@fujitsu.com
Hello,


I modified the patch as discussed with Tomas-san.  The performance results of
loading one million records into a hash-partitioned table with 8 partitions are
as follows:

unpatched, local: 8.6 seconds
unpatched, fdw: 113.7 seconds
patched, fdw: 12.5 seconds  (9x improvement)

The test scripts are also attached.  Run prepare.sql once to set up tables and 
source data.  Run local_part.sql and fdw_part.sql to load source data into a 
partitioned table with local partitions and a partitioned table with foreign 
tables respectively.


Regards
Takayuki Tsunakawa



fdw.sql
Description: fdw.sql


fdw_part.sql
Description: fdw_part.sql


local.sql
Description: local.sql


local_part.sql
Description: local_part.sql


prepare.sql
Description: prepare.sql


v2-0001-Add-bulk-insert-for-foreign-tables.patch
Description: v2-0001-Add-bulk-insert-for-foreign-tables.patch


Re: Terminate the idle sessions

2020-11-17 Thread Li Japin

On Nov 17, 2020, at 2:07 PM, 
kuroda.hay...@fujitsu.com wrote:

Dear Li, David,

> Additionally, using postgres_fdw within the server doesn't cause issues,
> its using postgres_fdw and the remote server having this setting set to zero 
> that causes a problem.

I didn't know the fact that postgres_fdw can use within the server... Thanks.

I read the optimize-setitimer patch, and it looks basically good. I've put
down my understanding below, so please confirm whether it matches your
implementation.
(Maybe I missed some concurrency issues, so please review, anyone...)

[basic concept]

sigalrm_due_at means the time at which the interval timer will ring, and
sigalrm_delivered records who called schedule_alarm().
If the fin_time of active_timeouts[0] is later than or equal to sigalrm_due_at,
we stop calling setitimer because handle_sig_alarm() will be called sooner.


Agreed. sigalrm_delivered means a timer has been handled by
handle_sig_alarm(), so we should call setitimer() in the next schedule_alarm().

[when setitimer is called]

In the attached patch, setitimer() will only be called in the following scenarios:

* when handle_sig_alarm() is invoked via pqsignal
* when a timeout is registered and its fin_time is later than active_timeouts[0]
* when a timeout is disabled
* when handle_sig_alarm() is interrupted and rescheduled(?)

According to the comments, handle_sig_alarm() may be interrupted because of
ereport. I think that if handle_sig_alarm() is interrupted before
sigalrm_due_at is assigned, the interval timer will never be set.
Is that correct, or is my assumption wrong?


I'm not familiar with system interrupts; however, sigalrm_due_at is assigned
between HOLD_INTERRUPTS() and RESUME_INTERRUPTS(), so I think it cannot be
interrupted.  The following comment comes from miscadmin.h.

> The HOLD_INTERRUPTS() and RESUME_INTERRUPTS() macros
> allow code to ensure that no cancel or die interrupt will be accepted,
> even if CHECK_FOR_INTERRUPTS() gets called in a subroutine.  The interrupt
> will be held off until CHECK_FOR_INTERRUPTS() is done outside any
> HOLD_INTERRUPTS() ... RESUME_INTERRUPTS() section.


Lastly, I found that setitimer is marked obsolete and should be changed to
another API.  According to my man page:

```
POSIX.1-2001, SVr4, 4.4BSD (this call first appeared in 4.2BSD).
POSIX.1-2008 marks getitimer() and setitimer() obsolete,
recommending the use of the POSIX timers API (timer_gettime(2), 
timer_settime(2), etc.) instead.
```

Do you have an opinion on this? I think it should be changed if all platforms
support the timer_settime system call, but this fix affects all timeouts,
so more consideration might be needed.


Not sure! I find that Win32 does not support setitimer(); PostgreSQL emulates
setitimer() there by creating a persistent thread to handle
the timer setting and notification upon timeout.

So if we want to replace it, I think we should open a new thread to achieve 
this.
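
For reference, here is a bare-bones sketch of the POSIX timers API that the
man page recommends (illustration only, not tied to PostgreSQL's timeout code;
on some platforms this needs -lrt):

#include <signal.h>
#include <time.h>

static timer_t alarm_timer;

static void
create_alarm_timer(void)
{
	struct sigevent sev = {0};

	sev.sigev_notify = SIGEV_SIGNAL;	/* deliver SIGALRM, like setitimer */
	sev.sigev_signo = SIGALRM;
	timer_create(CLOCK_MONOTONIC, &sev, &alarm_timer);
}

static void
arm_alarm_timer_ms(long ms)
{
	struct itimerspec its = {0};	/* it_interval stays zero: one-shot */

	its.it_value.tv_sec = ms / 1000;
	its.it_value.tv_nsec = (ms % 1000) * 1000000L;
	timer_settime(alarm_timer, 0, &its, NULL);	/* relative to now */
}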

--
Best regards
Japin Li



Re: Cache relation sizes?

2020-11-17 Thread Kyotaro Horiguchi
At Mon, 16 Nov 2020 20:11:52 +1300, Thomas Munro  wrote 
in 
> After recent discussions about the limitations of relying on SEEK_END
> in a nearby thread[1], I decided to try to prototype a system for
> tracking relation sizes properly in shared memory.  Earlier in this
> thread I was talking about invalidation schemes for backend-local
> caches, because I only cared about performance.  In contrast, this new
> system has SMgrRelation objects that point to SMgrSharedRelation
> objects (better names welcome) that live in a pool in shared memory,
> so that all backends agree on the size.  The scheme is described in
> the commit message and comments.  The short version is that smgr.c
> tracks the "authoritative" size of any relation that has recently been
> extended or truncated, until it has been fsync'd.  By authoritative, I
> mean that there may be dirty buffers in that range in our buffer pool,
> even if the filesystem has vaporised the allocation of disk blocks and
> shrunk the file.
> 
> That is, it's not really a "cache".  It's also not like a shared
> catalog, which Konstantin was talking about... it's more like the pool
> of inodes in a kernel's memory.  It holds all currently dirty SRs
> (SMgrSharedRelations), plus as many clean ones as it can fit, with
> some kind of reclamation scheme, much like buffers.  Here, "dirty"
> means the size changed.
> 
> Attached is an early sketch, not debugged much yet (check undir
> contrib/postgres_fdw fails right now for a reason I didn't look into),
> and there are clearly many architectural choices one could make
> differently, and more things to be done... but it seemed like enough
> of a prototype to demonstrate the concept and fuel some discussion
> about this and whatever better ideas people might have...
> 
> Thoughts?
> 
> [1] 
> https://www.postgresql.org/message-id/flat/OSBPR01MB3207DCA7EC725FDD661B3EDAEF660%40OSBPR01MB3207.jpnprd01.prod.outlook.com

I was naively thinking that we could just remember the size of all
database files in a shared array, but I realized that that needs a hash
table to translate rnodes into array indexes, which could grow huge.

The proposed way tries to make sure that the next fseek call tells the
truth before forgetting cached values.  On the other hand, a
shared-smgrrel allows us to defer fsyncing of a (local) smgrrel until
someone evicts the entry.  That seems to me to be the minimal
mechanism that keeps us able to know the right file size at all times,
getting rid of the need to remember the size of all database files in
shared memory.  I'm afraid that it might cause fsync storms on very
large systems, though.

However, if we try to do a similar thing in any other way, it seems
that it's going to grow into something more like a pg_smgr catalog.  But
that seems likely to eat more memory and cause invalidation storms.

Sorry for the rambling thought, but I think this is basically in the
right direction.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Use standard SIGHUP and SIGTERM handlers in autoprewarm module

2020-11-17 Thread Fujii Masao




On 2020/11/13 20:24, Bharath Rupireddy wrote:

On Thu, Nov 12, 2020 at 10:06 AM Fujii Masao
 wrote:


Thanks for the analysis! I pushed the patch.



Thanks! Since we are replacing custom SIGHUP and SIGTERM handlers with
standard ones, how about doing the same thing in worker_spi.c? I
posted a patch previously [1] in this mail thread. If it makes sense,
please review it.


I agree to simplify the worker_spi code by making it use the standard
signal handlers. But as far as I read Craig Ringer's comments upthread
about worker_spi, it's not enough to just replace the dedicated SIGTERM
handler with the standard one. ISTM that probably worker_spi should
use the signal handler handling InterruptPending and ProcDiePending
like die() does. What do you think about Craig Ringer's comments?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION