date:20201027

Re: Support for NSS as a libpq TLS backend

2020-10-27 Thread Andres Freund

Hi,

On 2020-10-27 21:07:01 +0100, Daniel Gustafsson wrote:
> > On 2020-10-20 14:24:24 +0200, Daniel Gustafsson wrote:
> >> From 0cb0e6a0ce9adb18bc9d212bd03e4e09fa452972 Mon Sep 17 00:00:00 2001
> >> From: Daniel Gustafsson 
> >> Date: Thu, 8 Oct 2020 18:44:28 +0200
> >> Subject: [PATCH] Support for NSS as a TLS backend v12
> >> ---
> >> configure |  223 +++-
> >> configure.ac  |   39 +-
> >> contrib/Makefile  |2 +-
> >> contrib/pgcrypto/Makefile |5 +
> >> contrib/pgcrypto/nss.c|  773 +++
> >> contrib/pgcrypto/openssl.c|2 +-
> >> contrib/pgcrypto/px.c |1 +
> >> contrib/pgcrypto/px.h |1 +
> > 
> > Personally I'd like to see this patch broken up a bit - it's quite
> > large. Several of the changes could easily be committed separately, no?
> 
> Not sure how much of this makes sense committed separately (unless separately
> means in quick succession), but it could certainly be broken up for the sake 
> of
> making review easier.

Committing e.g. the pgcrypto pieces separately from the backend code
seems unproblematic. But yes, I would expect them to go in close to each
other. I'm mainly concerned with smaller review-able units.

Have you done testing to ensure that NSS PG cooperates correctly with
openssl PG? Is there a way we can make that easier to do? E.g. allowing
to build frontend with NSS and backend with openssl and vice versa?


> >> if test "$with_openssl" = yes ; then
> >> +  if test x"$with_nss" = x"yes" ; then
> >> +AC_MSG_ERROR([multiple SSL backends cannot be enabled 
> >> simultaneously"])
> >> +  fi
> > 
> > Based on a quick look there's no similar error check for the msvc
> > build. Should there be?
> 
> Thats a good question.  When embarking on this is seemed quite natural to me
> that it should be, but now I'm not so sure.  Maybe there should be a
> --with-openssl-preferred like how we handle readline/libedit or just allow
> multiple and let the last one win?  Do you have any input on what would make
> sense?
>
> The only thing I think makes no sense is to allow multiple ones at the same
> time given the current autoconf switches, even if it would just be to pick say
> pg_strong_random from one and libpq TLS from another.

Maybe we should just have --with-ssl={openssl,nss}? That'd avoid needing
to check for errors.

Even better, of course, would be to allow switching of the SSL backend
based on config options (PGC_POSTMASTER GUC for backend, connection
string for frontend). Mainly because that would make testing of
interoperability so much easier.  Obviously still a few places like
pgcrypto, randomness, etc, where only a compile time decision seems to
make sense.


> >> +  CLEANLDFLAGS="$LDFLAGS"
> >> +  # TODO: document this set of LDFLAGS
> >> +  LDFLAGS="-lssl3 -lsmime3 -lnss3 -lplds4 -lplc4 -lnspr4 $LDFLAGS"
> > 
> > Shouldn't this use nss-config or such?
> 
> Indeed it should, where available.  I've added rudimentary support for that
> without a fallback as of now.

When would we need a fallback?


> > I think it'd also be better if we could include these files as nss/ssl.h
> > etc - ssl.h is a name way too likely to conflict imo.
> 
> I've changed this to be nss/ssl.h and nspr/nspr.h etc, but the include path
> will still need the direct path to the headers (from autoconf) since nss.h
> includes NSPR headers as #include  and so on.

Hm. Then it's probably not worth going there...


> >> +static SECStatus
> >> +pg_cert_auth_handler(void *arg, PRFileDesc * fd, PRBool checksig, PRBool 
> >> isServer)
> >> +{
> >> +  SECStatus   status;
> >> +  Port   *port = (Port *) arg;
> >> +  CERTCertificate *cert;
> >> +  char   *peer_cn;
> >> +  int len;
> >> +
> >> +  status = SSL_AuthCertificate(CERT_GetDefaultCertDB(), port->pr_fd, 
> >> checksig, PR_TRUE);
> >> +  if (status == SECSuccess)
> >> +  {
> >> +  cert = SSL_PeerCertificate(port->pr_fd);
> >> +  len = strlen(cert->subjectName);
> >> +  peer_cn = MemoryContextAllocZero(TopMemoryContext, len + 1);
> >> +  if (strncmp(cert->subjectName, "CN=", 3) == 0)
> >> +  strlcpy(peer_cn, cert->subjectName + strlen("CN="), len 
> >> + 1);
> >> +  else
> >> +  strlcpy(peer_cn, cert->subjectName, len + 1);
> >> +  CERT_DestroyCertificate(cert);
> >> +
> >> +  port->peer_cn = peer_cn;
> >> +  port->peer_cert_valid = true;
> > 
> > Hm. We either should have something similar to
> > 
> > /*
> >  * Reject embedded NULLs in certificate common name to 
> > prevent
> >  * attacks like CVE-2009-4034.
> >  */
> > if (len != strlen(peer_cn))
> > {
> > erep

Re: Get memory contexts of an arbitrary backend process

2020-10-27 Thread torikoshia

On 2020-10-23 13:46, Kyotaro Horiguchi wrote:

Wait...

Attachments: 
0003-Enabled-pg_get_backend_memory_contexts-to-collect.patch

For a moment I thought that the number is patch number but the
predecessors are 0002-Enabled..collect.patch and 0001-(same
name). It's not mandatory but we usually do as the follows and it's
the way of git.

v1-0001-Enabled...collect.patch
v2-0001-Enabled...collect.patch

The vn is added by -v option for git-format-patch.

Sorry for the confusion. I'll follow that way next time.

At Thu, 22 Oct 2020 21:32:00 +0900, torikoshia
 wrote in

> > I added a shared hash table consisted of minimal members
> > mainly for managing whether the file is dumped or not.
> > Some members like 'loc' seem useful in the future, but I
> > haven't added them since it's not essential at this point.
> Yes, that would be good.
> +        /*
> +         * Since we allow only one session can request to dump
> memory context at
> +         * the same time, check whether the dump files already exist.
> +         */
> +        while (stat(dumpfile, &stat_tmp) == 0 || stat(tmpfile,
> &stat_tmp) == 0)
> +        {
> +            pg_usleep(100L);
> +        }
> If pg_get_backend_memory_contexts() is executed by two or more
> sessions at the same time, it cannot be run exclusively in this way.
> Currently it seems to cause a crash when do it so.
> This is easy to reproduce and can be done as follows.
> [session-1]
> BEGIN;
> LOCK TABKE t1;
>   [Session-2]
>   BEGIN;
>   LOCK TABLE t1; <- waiting
>     [Session-3]
>     select * FROM pg_get_backend_memory_contexts();
>       [Session-4]
>       select * FROM pg_get_backend_memory_contexts( session-2>);
> If you issue commit or abort at session-1, you will get SEGV.
> Instead of checking for the existence of the file, it might be better
> to use a hash (mcxtdumpHash) entry with LWLock.

Thanks!
Added a LWLock and changed the way from checking the file existence
to finding the hash entry.

> +        if (proc == NULL)
> +        {
> +            ereport(WARNING,
> +                    (errmsg("PID %d is not a PostgreSQL server
> process", dst_pid)));
> +            return (Datum) 1;
> +        }
> Shouldn't it clear the hash entry before return?

Yeah. added codes for removing the entry.

+   entry = AddEntryToMcxtdumpHash(dst_pid);
+
+		/* Check whether the target process is PostgreSQL backend process. 
*/

+   /* TODO: Check also whether backend or not. */
+   proc = BackendPidGetProc(dst_pid);
+
+   if (proc == NULL)
+   {
+   ereport(WARNING,
+   (errmsg("PID %d is not a PostgreSQL server 
process", dst_pid)));
+
+   LWLockAcquire(McxtDumpHashLock, LW_EXCLUSIVE);
+
+   if (hash_search(mcxtdumpHash, &dst_pid, HASH_REMOVE, 
NULL) == NULL)
+   elog(WARNING, "hash table corrupted");
+
+   LWLockRelease(McxtDumpHashLock);
+
+   return (Datum) 1;
+   }

Why do you enter a useles entry then remove it immedately?

Do you mean I should check the process existence first
since it enables us to skip entering hash entries?

+		PG_ENSURE_ERROR_CLEANUP(McxtReqKill, (Datum) 
Int32GetDatum(dst_pid));

+   {
+   SendProcSignal(dst_pid, PROCSIG_DUMP_MEMORY, 
InvalidBackendId);

"PROCSIG_DUMP_MEMORY" is somewhat misleading. Hwo about
"PROCSIG_DUMP_MEMCXT" or "PROCSIG_DUMP_MEMORY_CONTEXT"?

I'll go with "PROCSIG_DUMP_MEMCXT".

I thought that the hash table would prevent multiple reqestors from
making a request at once, but the patch doesn't seem to do that.

+   /* Wait until target process finished dumping file. */
+   while (entry->dump_status == MCXTDUMPSTATUS_NOTYET)

This needs LWLock. And this could read the entry after reused by
another backend if the dumper process is gone. That isn't likely to
happen, but theoretically the other backend may set it to
MCXTDUMPSTATUS_NOTYET inbetween two successive check on the member.

Thanks for your notification.
I'll use an LWLock.

+   /*
+* Make dump file ends with 'D'.
+* This is checked by the caller when reading the file.
+*/
+   fputc('E', fpout);

Which is right?

Sorry, the comment was wrong..

+   fputc('E', fpout);
+
+   CHECK_FOR_INTERRUPTS();

This means that the process accepts another request and rewrite the
file even while the first requester is reading it. And, the file can
be removed by the first requestor before the second requestor can read
it.

I added CHECK_FOR_INTERRUPTS() here to make the dump cancellation
possible, however, considering your indication, it needs to think
about a way to handle only the dump cancellation.

+   mcxtdumpHash = ShmemInitHash("mcxtdump hash",
+  
SHMEM_MEMCONT

Re: list of extended statistics on psql

2020-10-27 Thread Tatsuro Yamada


Hi Michael-san and Hackers,

On 2020/09/30 15:19, Michael Paquier wrote:

On Thu, Sep 17, 2020 at 02:55:31PM +0900, Michael Paquier wrote:

Could you provide at least a rebased version of the patch?  The CF bot
is complaning here.


Not seeing this answered after two weeks, I have marked the patch as
RwF for now.
--
Michael



Sorry for the delayed reply.

I re-based the patch on the current head and did some
refactoring.
I think the size of extended stats are not useful for DBA.
Should I remove it?

Changes:

   - Use a keyword "defined" instead of "not built"
   - Use COALESCE function for size for extended stats

Results of \dX and \dX+:

postgres=# \dX
List of extended statistics
Schema|   Name|   Definition| N_distinct | Dependencies |   Mcv
-+---+-++--+-
  public  | hoge1_ext | a, b FROM hoge1 | defined| defined  | 
defined
  hoge1schema | hoge1_ext | a, b FROM hoge1 | built  | built| built
(2 rows)

postgres=# \dX+
 List of extended statistics
Schema|   Name|   Definition| N_distinct | Dependencies |   Mcv 
  | N_size | D_size | M_size
-+---+-++--+-+++
  public  | hoge1_ext | a, b FROM hoge1 | defined| defined  | 
defined |  0 |  0 |  0
  hoge1schema | hoge1_ext | a, b FROM hoge1 | built  | built| built 
  | 13 | 40 |   6126
(2 rows)

Query of \dX+:
==
 SELECT
 stxnamespace::pg_catalog.regnamespace AS "Schema",
 stxname AS "Name",
 pg_catalog.format('%s FROM %s',
   (SELECT pg_catalog.string_agg(pg_catalog.quote_ident(a.attname),', ')
FROM pg_catalog.unnest(es.stxkeys) s(attnum)
JOIN pg_catalog.pg_attribute a
ON (es.stxrelid = a.attrelid
AND a.attnum = s.attnum
AND NOT a.attisdropped)),
 es.stxrelid::regclass) AS "Definition",
 CASE WHEN esd.stxdndistinct IS NOT NULL THEN 'built'
  WHEN 'd' = any(stxkind) THEN 'defined'
 END AS "N_distinct",
 CASE WHEN esd.stxddependencies IS NOT NULL THEN 'built'
  WHEN 'f' = any(stxkind) THEN 'defined'
 END AS "Dependencies",
 CASE WHEN esd.stxdmcv IS NOT NULL THEN 'built'
  WHEN 'm' = any(stxkind) THEN 'defined'
 END AS "Mcv",
 COALESCE(pg_catalog.length(stxdndistinct), 0) AS "N_size",
 COALESCE(pg_catalog.length(stxddependencies), 0) AS "D_size",
 COALESCE(pg_catalog.length(stxdmcv), 0) AS "M_size"
 FROM pg_catalog.pg_statistic_ext es
 LEFT JOIN pg_catalog.pg_statistic_ext_data esd
 ON es.oid = esd.stxoid
 INNER JOIN pg_catalog.pg_class c
 ON es.stxrelid = c.oid
 ORDER BY 1, 2;


Regards,
Tatsuro Yamada

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index 221a967bfe..fd860776af 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -1918,6 +1918,20 @@ testdb=>
 
 
   
+  
+  
+\dX [ pattern ]
+
+
+Lists extended statistics.
+If pattern
+is specified, only those extended statistics whose names match the 
pattern
+are listed.
+If + is appended to the command name, each extended 
statistics
+is listed with its size.
+
+
+  
 
   
 \dy[+] [ pattern ]
diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index c7a83d5dfc..c6f1653cb7 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -930,6 +930,9 @@ exec_command_d(PsqlScanState scan_state, bool 
active_branch, const char *cmd)
else
success = listExtensions(pattern);
break;
+   case 'X':   /* Extended Statistics 
*/
+   success = listExtendedStats(pattern, 
show_verbose);
+   break;
case 'y':   /* Event Triggers */
success = listEventTriggers(pattern, 
show_verbose);
break;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 07d640021c..e0ca32d34b 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -4397,6 +4397,98 @@ listEventTriggers(const char *pattern, bool verbose)
return true;
 }
 
+/*
+ * \dX
+ *
+ * Briefly describes extended statistics.
+ */
+bool
+listExtendedStats(const char *pattern, bool verbose)
+{
+   PQExpBufferData buf;
+   PGresult   *res;
+   printQueryOpt myopt = pset.popt;
+
+   if (pset.sversion < 10

Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits

2020-10-27 Thread Tatsuo Ishii

> Status update for a commitfest entry.
> 
> This patch is ReadyForCommitter. It applies and passes the CI. There are no 
> unanswered questions in the discussion. 
> 
> The discussion started in 2015 with a patch by Jeff Janes. Later it was 
> revived by Pavan Deolasee.  After it was picked up by Ibrar Ahmed and 
> finally, it was rewritten by me, so I moved myself from reviewers to authors 
> as well.
> 
> The latest version was reviewed and tested by Ibrar Ahmed. The patch doesn't 
> affect COPY FREEZE performance and significantly decreases the time of the 
> following VACUUM.

I have tested the patch on my laptop (mem 16GB, SSD 512GB) using the
data introduced in up thread and saw that VACCUM after COPY FREEZE is
nearly 60 times faster than current master branch. Quite impressive.

By the way, I noticed following comment:
+   /*
+* vmbuffer should be already pinned by 
RelationGetBufferForTuple,
+* Though, it's fine if is not. all_frozen is just an 
optimization.
+*/

could be enhanced like below. What do you think?
+   /*
+* vmbuffer should be already pinned by 
RelationGetBufferForTuple.
+* Though, it's fine if it is not. all_frozen is just 
an optimization.
+*/


Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

Re: [spam] Re: list of extended statistics on psql

2020-10-27 Thread Tatsuro Yamada


Hi Michael-san and Hackers,

On 2020/09/30 15:19, Michael Paquier wrote:

On Thu, Sep 17, 2020 at 02:55:31PM +0900, Michael Paquier wrote:

Could you provide at least a rebased version of the patch?  The CF bot
is complaning here.


Not seeing this answered after two weeks, I have marked the patch as
RwF for now.
--
Michael



Sorry for the delayed reply.

I re-based the patch on the current head and did some
refactoring.
I think the size of extended stats are not useful for DBA.
Should I remove it?

Changes:

  - Use a keyword "defined" instead of "not built"
  - Use COALESCE function for size for extended stats

Results of \dX and \dX+:

postgres=# \dX
   List of extended statistics
   Schema|   Name|   Definition| N_distinct | Dependencies |   Mcv
-+---+-++--+-
 public  | hoge1_ext | a, b FROM hoge1 | defined| defined  | defined
 hoge1schema | hoge1_ext | a, b FROM hoge1 | built  | built| built
(2 rows)

postgres=# \dX+
List of extended statistics
   Schema|   Name|   Definition| N_distinct | Dependencies |   Mcv  
 | N_size | D_size | M_size
-+---+-++--+-+++
 public  | hoge1_ext | a, b FROM hoge1 | defined| defined  | 
defined |  0 |  0 |  0
 hoge1schema | hoge1_ext | a, b FROM hoge1 | built  | built| built  
 | 13 | 40 |   6126
(2 rows)

Query of \dX+:
==
SELECT
stxnamespace::pg_catalog.regnamespace AS "Schema",
stxname AS "Name",
pg_catalog.format('%s FROM %s',
  (SELECT pg_catalog.string_agg(pg_catalog.quote_ident(a.attname),', ')
   FROM pg_catalog.unnest(es.stxkeys) s(attnum)
   JOIN pg_catalog.pg_attribute a
   ON (es.stxrelid = a.attrelid
   AND a.attnum = s.attnum
   AND NOT a.attisdropped)),
es.stxrelid::regclass) AS "Definition",
CASE WHEN esd.stxdndistinct IS NOT NULL THEN 'built'
 WHEN 'd' = any(stxkind) THEN 'defined'
END AS "N_distinct",
CASE WHEN esd.stxddependencies IS NOT NULL THEN 'built'
 WHEN 'f' = any(stxkind) THEN 'defined'
END AS "Dependencies",
CASE WHEN esd.stxdmcv IS NOT NULL THEN 'built'
 WHEN 'm' = any(stxkind) THEN 'defined'
END AS "Mcv",
COALESCE(pg_catalog.length(stxdndistinct), 0) AS "N_size",
COALESCE(pg_catalog.length(stxddependencies), 0) AS "D_size",
COALESCE(pg_catalog.length(stxdmcv), 0) AS "M_size"
FROM pg_catalog.pg_statistic_ext es
LEFT JOIN pg_catalog.pg_statistic_ext_data esd
ON es.oid = esd.stxoid
INNER JOIN pg_catalog.pg_class c
ON es.stxrelid = c.oid
ORDER BY 1, 2;


Regards,
Tatsuro Yamada
diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index 221a967bfe..fd860776af 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -1918,6 +1918,20 @@ testdb=>
 
 
   
+  
+  
+\dX [ pattern ]
+
+
+Lists extended statistics.
+If pattern
+is specified, only those extended statistics whose names match the 
pattern
+are listed.
+If + is appended to the command name, each extended 
statistics
+is listed with its size.
+
+
+  
 
   
 \dy[+] [ pattern ]
diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index c7a83d5dfc..c6f1653cb7 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -930,6 +930,9 @@ exec_command_d(PsqlScanState scan_state, bool 
active_branch, const char *cmd)
else
success = listExtensions(pattern);
break;
+   case 'X':   /* Extended Statistics 
*/
+   success = listExtendedStats(pattern, 
show_verbose);
+   break;
case 'y':   /* Event Triggers */
success = listEventTriggers(pattern, 
show_verbose);
break;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 07d640021c..e0ca32d34b 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -4397,6 +4397,98 @@ listEventTriggers(const char *pattern, bool verbose)
return true;
 }
 
+/*
+ * \dX
+ *
+ * Briefly describes extended statistics.
+ */
+bool
+listExtendedStats(const char *pattern, bool verbose)
+{
+   PQExpBufferData buf;
+   PGresult   *res;
+   printQueryOpt myopt = pset.popt;
+
+   if (pset.sversion < 10)
+   {
+   char

Re: duplicate function oid symbols

2020-10-27 Thread Michael Paquier

On Tue, Oct 27, 2020 at 05:40:09PM -0400, John Naylor wrote:
> Ok, here is a patch to fix that, and also throw an error if pg_proc.dat has
> an explicitly defined symbol.

I am not seeing any uses of HEAP_TABLE_AM_HANDLER_OID in the wild, so
+1 for just removing it, and not back-patch.
--
Michael


signature.asc
Description: PGP signature

Re: [HACKERS] logical decoding of two-phase transactions

2020-10-27 Thread Peter Smith

Hi Ajin.

I have re-checked the v13 patches for how my remaining review comments
have been addressed.

On Tue, Oct 27, 2020 at 8:55 PM Ajin Cherian  wrote:
>
> > 
> > v12-0002. File: src/backend/replication/logical/reorderbuffer.c
> > 
> >
> > COMMENT
> > Line 2401
> > /*
> >  * We are here due to one of the 3 scenarios:
> >  * 1. As part of streaming in-progress transactions
> >  * 2. Prepare of a two-phase commit
> >  * 3. Commit of a transaction.
> >  *
> >  * If we are streaming the in-progress transaction then discard the
> >  * changes that we just streamed, and mark the transactions as
> >  * streamed (if they contained changes), set prepared flag as false.
> >  * If part of a prepare of a two-phase commit set the prepared flag
> >  * as true so that we can discard changes and cleanup tuplecids.
> >  * Otherwise, remove all the
> >  * changes and deallocate the ReorderBufferTXN.
> >  */
> > ~
> > The above comment is beyond my understanding. Anything you could do to
> > simplify it would be good.
> >
> > For example, when viewing this function in isolation I have never
> > understood why the streaming flag and rbtxn_prepared(txn) flag are not
> > possible to be set at the same time?
> >
> > Perhaps the code is relying on just internal knowledge of how this
> > helper function gets called? And if it is just that, then IMO there
> > really should be some Asserts in the code to give more assurance about
> > that. (Or maybe use completely different flags to represent those 3
> > scenarios instead of bending the meanings of the existing flags)
> >
>
> Left this for now, probably re-look at this at a later review.
> But just to explain; this function is what does the main decoding of
> changes of a transaction.
>  At what point this decoding happens is what this feature and the
> streaming in-progress feature is about. As of PG13, this decoding only
> happens at commit time. With the streaming of in-progress txn feature,
> this began to happen (if streaming enabled) at the time when the
> memory limit for decoding transactions was crossed. This 2PC feature
> is supporting decoding at the time of a PREPARE transaction.
> Now, if streaming is enabled and streaming has started as a result of
> crossing the memory threshold, then there is no need to
> again begin streaming at a PREPARE transaction as the transaction that
> is being prepared has already been streamed. Which is why this
> function will not be called when a streaming transaction is prepared
> as part of a two-phase commit.

AFAIK the last remaining issue now is only about the complexity of the
aforementioned code/comment. If you want to defer changing that until
we can come up with something better, then that is OK by me.

Apart from that I have no other pending review comments at this time.

Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Add important info about ANALYZE after create Functional Index

2020-10-27 Thread Michael Paquier

On Tue, Oct 27, 2020 at 11:06:22AM -0300, Fabrízio de Royes Mello wrote:
> When we create a new table or index they will not have statistics until an
> ANALYZE happens. This is the default behaviour and I think is not a big
> problem here, but we need to add some note on docs about the need of
> statistics for indexes on expressions.
> 
> But IMHO there is a misbehaviour with the implementation of CONCURRENTLY on
> REINDEX because running it will lose the statistics. Have a look the
> example below:
>
> [...] 
> 
> So IMHO here is the place we should rework a bit to execute ANALYZE as a
> last step.

I agree that this is not user-friendly, and I suspect that we will
need to do something within index_concurrently_swap() to fill in the
stats of the new index from the data of the old one (not looked at
that in details yet).
--
Michael


signature.asc
Description: PGP signature

Re: Online checksums verification in the backend

2020-10-27 Thread Michael Paquier

On Tue, Oct 27, 2020 at 07:47:19PM +0800, Julien Rouhaud wrote:
> I think it's also worth noting that the IOLock is now acquired just
> before getting the buffer state, and released after the read (or after
> finding that the buffer is dirty).  This is consistent with how it's
> done elsewhere, so I'm fine.

Consistency is the point.  This API should be safe to use by design.
I have done some extra performance tests similar to what I did
upthread, and this version showed similar numbers.

> Other than that I'm quite happy with the changes you made, thanks a lot!

Thanks for confirming.  I have gone through the whole set today,
splitted the thing into two commits and applied them.  We had
buildfarm member florican complain about a mistake in one of the
GetDatum() calls that I took care of already, and there is nothing
else on my radar.
--
Michael


signature.asc
Description: PGP signature

Re: Implementing Incremental View Maintenance

2020-10-27 Thread Yugo NAGATA

Hi Anastasia Lubennikova,

I am writing this to you because I would like to ask the commitfest
manager something. 

The status of the patch was changed to "Waiting on Author" from
"Ready for Committer" at the beginning of this montfor the reason
that rebase was necessary. Now I updated the patch, so can I change
the status back to "Ready for Committer"?

Regards,
Yugo Nagata

On Mon, 5 Oct 2020 18:16:18 +0900
Yugo NAGATA  wrote:

> Hi,
> 
> Attached is the rebased patch (v18) to add support for Incremental
> Materialized View Maintenance (IVM). It is able to be applied to
> current latest master branch.

-- 
Yugo NAGATA

Re: Log message for GSS connection is missing once connection authorization is successful.

2020-10-27 Thread Bharath Rupireddy

On Wed, Oct 28, 2020 at 8:29 AM vignesh C  wrote:
>
> Log message for GSS connection is missing once connection
> authorization is successful. We have similar log messages for SSL
> connections once the connection authorization is successful. This
> message will help the user to identify the connection that was
> selected from the logfile. I'm not sure if this log message was
> intentionally left out due to some reason for GSS.
> If the above analysis looks correct, then please find a patch that
> adds log for gss connections.
>
> Thoughts?
>

+1 for the idea. This is useful in knowing whether or not the user is
authenticated using GSS APIs.

Here are few comments on the patch:

1. How about using(like below) #ifdef, #elif ... #endif directives
instead of #ifdef, #endif, #ifdef, #endif?

#ifdef USE_SSL
   blah,blah,blah...
#elif defined(ENABLE_GSS)
   blah,blah,blah...
#else
   blah,blah,blah...
#endif

2. I think we must use be_gssapi_get_auth(port) instead of
be_gssapi_get_enc(port) in the if condition, because we log for gss
authentications irrespective of encoding is enabled or not. Put it
another way, maybe gss authentications are possible without
encoding[1]. We can have the information whether the encryption is
enabled or not in the log message, be_gssapi_get_enc(port) ? _("on") :
_("off"),.
#ifdef ENABLE_GSS
if (be_gssapi_get_enc(port))
ereport(LOG,

We do not need be_gssapi_get_auth(port) ? _("on") : _("off") this in
the log message, only in the if condition we need this check.

[1] By looking at the below code it seems that gss authentication
without encryption is possible.
#ifdef ENABLE_GSS
port->gss->auth = true;
if (port->gss->enc)
status = pg_GSS_checkauth(port);
else
{
sendAuthRequest(port, AUTH_REQ_GSS, NULL, 0);
status = pg_GSS_recvauth(port);
}

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

RE: Disable WAL logging to speed up data loading

2020-10-27 Thread osumi.takami...@fujitsu.com

Hi,


I wrote and attached the first patch to disable WAL logging.
This patch passes the regression test of check-world already
and is formatted by pgindent.

Also, I conducted 3 types of performance tests to
clarify the rough benefits of the patch.

I compared two wal_levels both 'minimal' and 'none'.
For both levels, I measured
(1) cluster's restore from pg_dumpall output,
(2) COPY initial data to a defined table as initial data loading, and
(3) COPY data to a defined table with tuples, 3 times each for all cases.
After that, calculated the average and the ratio of the loading speed.
The conclusion is that wal_level=none cuts about 20% of the loading speed
compared to 'minimal' in the three cases above. For sure, we could tune the 
configuration of
postgresql.conf further but roughly it's meaningful to share the result, I 
think.
'shared_buffers' was set to 40% of RAM while 'maintenance_work_mem'
was set to 20%. I set max_wal_senders = 0 this time.

The input data was generated from pgbench with 1000 scale factor.
It's about 9.3GB. For the table definition or
the initial data for appended data loading test case,
I used pgbench to set up the schema as well. 
Sharing other scenario to measure is welcome.

I need to say that the current patch doesn't take the change of wal_level='none'
from/to other ones into account fully or other commands like ones for two phase 
commit yet.
Sorry for that.

Also, to return the value of last shut down LSN in XLogInsert(),
it acquires lock of control file every time, which was not good clearly.
I tried to replace that code like having a LSN cache as one variable
but I was not sure where I should put the function to set the initial value of 
the LSN.
After LocalProcessControlFile() in PostmasterMain() was not the right place.
I'd be happy if someone gives me an advice about it.


Best,
Takamichi Osumi


disable_WAL_logging_v01.patch
Description: disable_WAL_logging_v01.patch

Re: Internal key management system

2020-10-27 Thread Craig Ringer

On Wed, Oct 28, 2020 at 9:43 AM Bruce Momjian  wrote:
>

> I don't know much about how to hook into that stuff so if you have an
> idea, I am all ears.

Yeah, I have a reasonable idea. The main thing will be to re-read the patch
and put it into more concrete terms, which I'll try to find time for soon.
I need to find time to craft a proper demo that uses a virtual hsm, and can
also demonstrate how to use the host TPM or a Yubikey using the simple
openssl engine interfaces or a URI.

 I have used OpenSSL with Yubikey via pksc11.  You
> can see the use of it on slide 57 and following:
>
> https://momjian.us/main/writings/crypto_hw_config.pdf#page=57
>
> Interestingly, that still needed the user to type in a key to unlock the
> Yubikey, so we might need PKCS11 and a password for the same server
> start.
>

Yes, that's possible. But in that case the passphrase will be asked for by
openssl only when required, and we'll need to supply an openssl askpass
hook.

Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-10-27 Thread Andres Freund

Hi,

On 2020-10-15 01:37:35 -0700, Andres Freund wrote:
> Attached is a *prototype* implemention of this concept, which clearly is
> lacking some comment work (and is intentionally lacking some
> re-indentation).
> 
> I described my thoughts about how to limit the horizons for temp tables in
> https://www.postgresql.org/message-id/20201014203103.72oke6hqywcyhx7s%40alap3.anarazel.de

Attached is an updated version of this patch. Quite a bit of polish,
added removal of the isTopLevel arguments added a7212be8b9e that are now
unnecessary, and changed the initialization of the temp table horizons
to be latestCompletedXid + 1 instead of just latestCompletedXid when no
xid is assigned.


> Besides comments this probably mainly needs a bit more tests around temp
> table vacuuming. Should have at least an isolation test that verifies
> that temp table rows can be a) vacuumed b) pruned away in the presence
> of other sessions with xids.

I added an isolationtester test for this. It verifies that dead rows in
temp tables get vacuumed and pruned despite concurrent sessions having
older snapshots. It does so by forcing an IOS and checking the number of
heap fetches reported by EXPLAIN. I also added a companion test for
permanent relations, ensuring that such rows do not get removed.


Any comments? Otherwise I'll push that patch tomorrow.

Greetings,

Andres Freund
>From 1ba3c5eeb250d3e755ad5c044aede1816d040ae4 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Tue, 27 Oct 2020 20:39:57 -0700
Subject: [PATCH v2] Centralize horizon determination for temp tables, fix bug
 due to skew.

Also add tests to verify that temp tables can be pruned despite other
sessions having older snapshots.

Author: Andres Freund
Discussion: https://postgr.es/m/20201014203103.72oke6hqywcyh...@alap3.anarazel.de
Discussion: https://postgr.es/m/20201015083735.derdzysdtqdvx...@alap3.anarazel.de
---
 src/include/commands/cluster.h   |   3 +-
 src/include/commands/vacuum.h|   1 -
 src/backend/access/heap/vacuumlazy.c |   1 -
 src/backend/commands/cluster.c   |  28 +--
 src/backend/commands/vacuum.c|  76 +++---
 src/backend/storage/ipc/procarray.c  |  59 -
 src/test/isolation/expected/horizons.out | 281 +++
 src/test/isolation/isolation_schedule|   1 +
 src/test/isolation/specs/horizons.spec   | 165 +
 9 files changed, 541 insertions(+), 74 deletions(-)
 create mode 100644 src/test/isolation/expected/horizons.out
 create mode 100644 src/test/isolation/specs/horizons.spec

diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 1eb144204b6..e05884781b9 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -19,8 +19,7 @@
 
 
 extern void cluster(ClusterStmt *stmt, bool isTopLevel);
-extern void cluster_rel(Oid tableOid, Oid indexOid, int options,
-		bool isTopLevel);
+extern void cluster_rel(Oid tableOid, Oid indexOid, int options);
 extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
 	   bool recheck, LOCKMODE lockmode);
 extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index d9475c99890..a4cd7214009 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -267,7 +267,6 @@ extern void vacuum_set_xid_limits(Relation rel,
   int freeze_min_age, int freeze_table_age,
   int multixact_freeze_min_age,
   int multixact_freeze_table_age,
-  bool isTopLevel,
   TransactionId *oldestXmin,
   TransactionId *freezeLimit,
   TransactionId *xidFullScanLimit,
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 4f2f38168dc..be5439dd9d8 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -471,7 +471,6 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 		  params->freeze_table_age,
 		  params->multixact_freeze_min_age,
 		  params->multixact_freeze_table_age,
-		  true, /* we must be a top-level command */
 		  &OldestXmin, &FreezeLimit, &xidFullScanLimit,
 		  &MultiXactCutoff, &mxactFullScanLimit);
 
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 0d647e912c4..04d12a7ece6 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -67,13 +67,10 @@ typedef struct
 } RelToCluster;
 
 
-static void rebuild_relation(Relation OldHeap, Oid indexOid,
-			 bool isTopLevel, bool verbose);
+static void rebuild_relation(Relation OldHeap, Oid indexOid, bool verbose);
 static void copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex,
-			bool isTopLevel, bool verbose,
-			bool *pSwapToastByContent,
-			TransactionId *pFreezeXid,
-			MultiXactId *pCutoffMulti);
+			bool verbose, bool *pSwapToastByContent,
+

Re: Track statistics for streaming of in-progress transactions

2020-10-27 Thread Amit Kapila

On Fri, Oct 23, 2020 at 10:24 AM Amit Kapila  wrote:
>
> On Thu, Oct 22, 2020 at 2:09 PM Amit Kapila  wrote:
> >
>
> I have fixed the above comment and rebased the patch. I have changed
> the docs a bit to add more explanation about the counters. Let me know
> if you have any more comments. Thanks Dilip and Sawada-San for
> reviewing this patch.
>

Attached is an updated patch with minor changes in docs and cosmetic
changes. I am planning to push this patch tomorrow unless there are
any more comments/suggestions.

-- 
With Regards,
Amit Kapila.


v4-0001-Track-statistics-for-streaming-of-changes-from-Re.patch
Description: Binary data

Re: partition routing layering in nodeModifyTable.c

2020-10-27 Thread Amit Langote

On Tue, Oct 27, 2020 at 10:23 PM Heikki Linnakangas  wrote:
> On 23/10/2020 12:37, Amit Langote wrote:
> > To explain these numbers a bit, "overheaul update/delete processing"
> > patch improves the performance of that benchmark by allowing the
> > updates to use run-time pruning when executing generic plans, which
> > they can't today.
> >
> > However without "lazy-ResultRelInfo-initialization" patch,
> > ExecInitModifyTable() (or InitPlan() when I ran those benchmarks) can
> > be seen to be spending time initializing all of those result
> > relations, whereas only one of those will actually be used.
> >
> > As mentioned further in that email, it's really the locking of all
> > relations by AcquireExecutorLocks() that occurs even before we enter
> > the executor that's a much thornier bottleneck for this benchmark.
> > But the ResultRelInfo initialization bottleneck sounded like it could
> > get alleviated in a relatively straightforward manner.  The patches
> > that were developed for attacking the locking bottleneck would require
> > further reflection on whether they are correct.
> >
> > (Note: I've just copy pasted the numbers I reported in that email.  To
> > reproduce, I'll have to rebase the "overhaul update/delete processing"
> > patch on this one, which I haven't yet done.)
>
> Ok, thanks for the explanation, now I understand.
>
> But since this applies on top of the "overhaul update/delete processing"
> patch, let's tackle that patch set next. Could you rebase that, please?

Actually, I made lazy-ResultRelInfo-initialization apply on HEAD
directly at one point because of its separate CF entry, that is, to
appease the CF app's automatic patch tester that wouldn't know to
apply the other patch first.  Because both of these patch sets want to
change thow ModifyTable works, there are conflicts.

The "overhaul update/delete processing" patch is somewhat complex and
I expect some amount of back and forth on its design points.  OTOH,
the lazy-ResultRelInfo-initialization patch is straightforward enough
that I hoped it would be easier to bring it into a committable state
than the other.  But I can see why one may find it hard to justify
committing the latter without the former already in, because the
bottleneck it purports to alleviate (that of eager ResultRelInfo
initialization) is not apparent until update/delete can use run-time
pruning.

Anyway, I will post the rebased patch on the "overhaul update/delete
processing" thread.

-- 
Amit Langote
EDB: http://www.enterprisedb.com

Log message for GSS connection is missing once connection authorization is successful.

2020-10-27 Thread vignesh C

Hi,

Log message for GSS connection is missing once connection
authorization is successful. We have similar log messages for SSL
connections once the connection authorization is successful. This
message will help the user to identify the connection that was
selected from the logfile. I'm not sure if this log message was
intentionally left out due to some reason for GSS.
If the above analysis looks correct, then please find a patch that
adds log for gss connections.

Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
From 95c906b9eaf1493ad10ac65d6cf7b27a7dd6acb9 Mon Sep 17 00:00:00 2001
From: Vignesh C 
Date: Wed, 28 Oct 2020 08:19:06 +0530
Subject: [PATCH v1] Log message for GSS connection is missing once connection
 authorization is successful.

Log message for GSS connection is missing once connection authorization is
successful. We have similar log message for SSL connections once the connection
authorization is successful. This message will help the user to identify the
connection that was selected from the logfile.
---
 src/backend/utils/init/postinit.c | 29 +
 1 file changed, 29 insertions(+)

diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index d4ab4c7..0fd38b7 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -267,6 +267,21 @@ PerformAuthentication(Port *port)
   be_tls_get_compression(port) ? _("on") : _("off";
 			else
 #endif
+#ifdef ENABLE_GSS
+			if (be_gssapi_get_enc(port))
+ereport(LOG,
+		(port->application_name != NULL
+		 ? errmsg("replication connection authorized: user=%s application_name=%s GSS enabled (gssapi autorization=%s, principal=%s)",
+  port->user_name,
+  port->application_name,
+  be_gssapi_get_auth(port) ? _("on") : _("off"),
+  be_gssapi_get_princ(port))
+		 : errmsg("replication connection authorized: user=%s GSS enabled (gssapi autorization=%s, principal=%s)",
+  port->user_name,
+  be_gssapi_get_auth(port) ? _("on") : _("off"),
+  be_gssapi_get_princ(port;
+			else
+#endif
 ereport(LOG,
 		(port->application_name != NULL
 		 ? errmsg("replication connection authorized: user=%s application_name=%s",
@@ -295,6 +310,20 @@ PerformAuthentication(Port *port)
   be_tls_get_compression(port) ? _("on") : _("off";
 			else
 #endif
+#ifdef ENABLE_GSS
+			if (be_gssapi_get_enc(port))
+ereport(LOG,
+		(port->application_name != NULL
+		 ? errmsg("connection authorized: user=%s database=%s application_name=%s GSS enabled (gssapi autorization=%s, principal=%s)",
+  port->user_name, port->database_name, port->application_name,
+  be_gssapi_get_auth(port) ? _("on") : _("off"),
+  be_gssapi_get_princ(port))
+		 : errmsg("connection authorized: user=%s database=%s GSS enabled (gssapi autorization=%s, principal=%s)",
+  port->user_name, port->database_name,
+  be_gssapi_get_auth(port) ? _("on") : _("off"),
+  be_gssapi_get_princ(port;
+			else
+#endif
 ereport(LOG,
 		(port->application_name != NULL
 		 ? errmsg("connection authorized: user=%s database=%s application_name=%s",
-- 
1.8.3.1

Re: Re: parallel distinct union and aggregate support patch

2020-10-27 Thread bu...@sohu.com

> On Tue, Oct 27, 2020 at 3:27 PM Dilip Kumar  wrote:
> >
> > On Fri, Oct 23, 2020 at 11:58 AM bu...@sohu.com  wrote:
> > >
> > > > Interesting idea.  So IIUC, whenever a worker is scanning the tuple it
> > > > will directly put it into the respective batch(shared tuple store),
> > > > based on the hash on grouping column and once all the workers are
> > > > doing preparing the batch then each worker will pick those baches one
> > > > by one, perform sort and finish the aggregation.  I think there is a
> > > > scope of improvement that instead of directly putting the tuple to the
> > > > batch what if the worker does the partial aggregations and then it
> > > > places the partially aggregated rows in the shared tuple store based
> > > > on the hash value and then the worker can pick the batch by batch.  By
> > > > doing this way, we can avoid doing large sorts.  And then this
> > > > approach can also be used with the hash aggregate, I mean the
> > > > partially aggregated data by the hash aggregate can be put into the
> > > > respective batch.
> > >
> > > Good idea. Batch sort suitable for large aggregate result rows,
> > > in large aggregate result using partial aggregation maybe out of memory,
> > > and all aggregate functions must support partial(using batch sort this is 
> > > unnecessary).
> > >
> > > Actually i written a batch hash store for hash aggregate(for pg11) like 
> > > this idea,
> > > but not write partial aggregations to shared tuple store, it's write 
> > > origin tuple and hash value
> > > to shared tuple store, But it's not support parallel grouping sets.
> > > I'am trying to write parallel hash aggregate support using batch shared 
> > > tuple store for PG14,
> > > and need support parallel grouping sets hash aggregate.
> >
> > I was trying to look into this patch to understand the logic in more
> > detail.  Actually, there are no comments at all so it's really hard to
> > understand what the code is trying to do.
> >
> > I was reading the below functions, which is the main entry point for
> > the batch sort.
> >
> > +static TupleTableSlot *ExecBatchSortPrepare(PlanState *pstate)
> > +{
> > ...
> > + for (;;)
> > + {
> > ...
> > + tuplesort_puttupleslot(state->batches[hash%node->numBatches], slot);
> > + }
> > +
> > + for (i=node->numBatches;i>0;)
> > + tuplesort_performsort(state->batches[--i]);
> > +build_already_done_:
> > + if (parallel)
> > + {
> > + for (i=node->numBatches;i>0;)
> > + {
> > + --i;
> > + if (state->batches[i])
> > + {
> > + tuplesort_end(state->batches[i]);
> > + state->batches[i] = NULL;
> > + }
> > + }
> >
> > I did not understand this part, that once each worker has performed
> > their local batch-wise sort why we are clearing the baches?  I mean
> > individual workers have their on batches so eventually they supposed
> > to get merged.  Can you explain this part and also it will be better
> > if you can add the comments.
>  
> I think I got this,  IIUC, each worker is initializing the shared
> short and performing the batch-wise sorting and we will wait on a
> barrier so that all the workers can finish with their sorting.  Once
> that is done the workers will coordinate and pick the batch by batch
> and perform the final merge for the batch.

Yes, it is. Each worker open the shared sort as "worker" (nodeBatchSort.c:134),
end of all worker performing, pick one batch and open it as 
"leader"(nodeBatchSort.c:54).

Re: Resetting spilled txn statistics in pg_stat_replication

2020-10-27 Thread Amit Kapila

On Wed, Oct 28, 2020 at 5:26 AM Tomas Vondra
 wrote:
>
> On Tue, Oct 27, 2020 at 06:19:42PM -0500, Justin Pryzby wrote:
> >On Tue, Oct 27, 2020 at 09:17:43AM +0530, Amit Kapila wrote:
> >> On Tue, Oct 27, 2020 at 8:51 AM Justin Pryzby  wrote:
> >> >
> >> > On Fri, Oct 23, 2020 at 10:45:34AM +0530, Amit Kapila wrote:
> >> > > On Fri, Oct 23, 2020 at 8:59 AM Amit Kapila  
> >> > > wrote:
> >> > >
> >> > > While updating the streaming stats patch, it occurred to me that we
> >> > > can write a better description spill_bytes as well. Attached contains
> >> > > the update to spill_bytes description.
> >> >
> >> > +This and other spill
> >> > +counters can be used to gauge the I/O occurred during logical 
> >> > decoding
> >> > +and accordingly can tune 
> >> > logical_decoding_work_mem.
> >> >
> >> > "gauge the IO occurred" is wrong.
> >> > Either: I/O *which* occured, or I/O occurring, or occurs.
> >> >
> >> > "can tune" should say "allow tuning".
> >> >
> >> > Like:
> >> > +This and other spill
> >> > +counters can be used to gauge the I/O which occurs during 
> >> > logical decoding
> >> > +and accordingly allow tuning of 
> >> > logical_decoding_work_mem.
> >> >
> >> > -Number of times transactions were spilled to disk. Transactions
> >> > +Number of times transactions were spilled to disk while 
> >> > performing
> >> > +decoding of changes from WAL for this slot. Transactions
> >> >
> >> > What about: "..while decoding changes.." (remove "performing" and "of").
> >> >
> >>
> >> All of your suggestions sound good to me. Find the patch attached to
> >> update the docs accordingly.
> >
> >@@ -2628,8 +2627,8 @@ SELECT pid, wait_event_type, wait_event FROM 
> >pg_stat_activity WHERE wait_event i
> >
> > Amount of decoded transaction data spilled to disk while performing
> > decoding of changes from WAL for this slot. This and other spill
> >-counters can be used to gauge the I/O occurred during logical 
> >decoding
> >-and accordingly can tune 
> >logical_decoding_work_mem.
> >+counters can be used to gauge the I/O which occurred during logical
> >+decoding and accordingly allow tuning  
> >logical_decoding_work_mem.
> >
> >Now that I look again, maybe remove "accordingly" ?
> >
>
> Yeah, the 'accordingly' seems rather unnecessary here. Let's remove it.
>

Removed and pushed.

-- 
With Regards,
Amit Kapila.

Re: Patch to fix FK-related selectivity estimates with constants

2020-10-27 Thread Tomas Vondra


On Tue, Oct 27, 2020 at 09:27:06PM -0400, Tom Lane wrote:

Tomas Vondra  writes:

On Tue, Oct 27, 2020 at 01:58:56PM -0400, Tom Lane wrote:

Attached is a patch series that attacks it that way.



The patch sems fine to me, thanks for investigating and fixing this.


Thanks for looking at it!


I find it a bit strange that generate_base_implied_equalities_const adds
the rinfo to ec_derives, while generate_base_implied_equalities_no_const
does not. I understand it's correct as we don't lookup the non-const
clauses, and we want to keep the list as short as possible, but it seems
like a bit surprising/unexpected difference in behavior.


Yeah, perhaps.  I considered replacing ec_derives with two lists, one
for base-level derived clauses and one for join-level derived clauses,
but it didn't really seem worth the trouble.  This is something we
could change later if a need arises to be able to look back at non-const
base-level derived clauses.


I think casting the 'clause' to (Node*) in process_implied_equality is
unnecessary - it was probably needed when it was declared as Expr* but
the patch changes that.


Hm, thought I got rid of the unnecessary casts ... I'll look again.



Apologies, the casts are fine. I got it mixed up somehow.


As for the backpatching, I don't feel very strongly about it. It's
clearly a bug/thinko in the code, and I don't see any obvious risks in
backpatching it (no ABI breaks etc.).


I had two concerns about possible extension breakage from a back-patch:

* Changing the set of fields in ForeignKeyOptInfo is an ABI break.
We could minimize the risk by adding the new fields at the end in
the back branches, but it still wouldn't be zero-risk.

* Changing the expectation about whether process_implied_equality()
will fill left_ec/right_ec is an API break.  It's somewhat doubtful
whether there exist any callers outside equivclass.c, but again it's
not zero risk.

The other issue, entirely unrelated to code hazards, is whether this
is too big a change in planner behavior to be back-patched.  We've
often felt that destabilizing stable-branch plan choices is something
to be avoided.

Not to mention the whole issue of whether this patch has any bugs of
its own.

So on the whole I wouldn't want to back-patch, or at least not do so
very far.  Maybe there's an argument that v13 is still new enough to
deflect the concern about plan stability.



OK, understood.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Internal key management system

2020-10-27 Thread Bruce Momjian

On Tue, Oct 27, 2020 at 10:20:35AM -0400, Bruce Momjian wrote:
> I don't know much about how to hook into that stuff so if you have an
> idea, I am all ears.  I have used OpenSSL with Yubikey via pksc11.  You
> can see the use of it on slide 57 and following:
> 
>   https://momjian.us/main/writings/crypto_hw_config.pdf#page=57
> 
> Interestingly, that still needed the user to type in a key to unlock the
> Yubikey, so we might need PKCS11 and a password for the same server
> start.
> 
> I would like to get this moving forward so I will work on the idea of
> passing an open /dev/tty file descriptor from pg_ctl to the postmaster
> on start.

Here is an updated patch that uses an argument to pass an open /dev/tty
file descriptor to the postmaster.  It uses -R for initdb/pg_ctl, -R ###
for postmaster/postgres, and %R for cluster_passphrase_command.  Here is
a sample session:

--> $ initdb -R --cluster-passphrase-command '/tmp/pass_fd.sh "%p" %R'
The files belonging to this database system will be owned by user 
"postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.UTF-8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.
Key management system is enabled.

fixing permissions on existing directory /u/pgsql/data ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default time zone ... America/New_York
creating configuration files ... ok
running bootstrap script ...
--> Enter database encryption pass phrase: 
B1D7B405EDCD97B7351DD3B7AE0637775FFBC6A2C2EEADAEC009A75A58A79F50
ok
performing post-bootstrap initialization ...
--> Enter database encryption pass phrase: 
B1D7B405EDCD97B7351DD3B7AE0637775FFBC6A2C2EEADAEC009A75A58A79F50
ok
syncing data to disk ... ok

initdb: warning: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.

Success. You can now start the database server using:

pg_ctl -D /u/pgsql/data -l logfile start

$ pg_ctl stop
pg_ctl: PID file "/u/pgsql/data/postmaster.pid" does not exist
Is server running?
--> $ pg_ctl -l /u/pg/server.log -R start
waiting for server to start...
--> Enter database encryption pass phrase: 
B1D7B405EDCD97B7351DD3B7AE0637775FFBC6A2C2EEADAEC009A75A58A79F50
 done
server started

Attached is my updated patch, based on Masahiko Sawada's patch, and my
pass_fd.sh script.  

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
new file mode 100644
index f043433..65422d5
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
*** COPY postgres_log FROM '/full/path/to/lo
*** 7793,7798 
--- 7793,7836 
  
 
  
+
+ Encryption Key Management
+ 
+ 
+  
+   cluster_passphrase_command (string)
+   
+cluster_passphrase_command configuration parameter
+   
+   
+   
+
+ This option specifies an external command to be invoked when a
+ passphrase for key management system needs to be obtained.
+
+
+ The command must print the passphrase to the standard
+ output and have a zero exit code.  In the parameter value,
+ %p is replaced by a prompt string, and
+ %R represents the file descriptor for
+ reading/writing from the terminal; if specified at server start,
+ it is blank.  Since it can be blank, it is wise for it to be the
+ last command-line argument specified.  (Write %%
+ for a literal %.)  Note that the prompt string
+ will probably contain whitespace, so be sure to quote its use
+ adequately.  A single newline is stripped from the end of the
+ output if present.  The passphrase must be at least 64 bytes.
+
+
+ This parameter can only be set in the
+ postgresql.conf file or on the server
+ command line.
+
+   
+  
+ 
+
+ 
 
  Client Connection Defaults
  
*** dynamic_library_path = 'C:\tools\postgre
*** 9636,9641 
--- 9674,9695 
 

   
+ 
+   
+   key_management_enabled (boolean)
+   
+Key management configuration param

Re: MultiXact\SLRU buffers configuration

2020-10-27 Thread Tomas Vondra

On Tue, Oct 27, 2020 at 08:23:26PM +0300, Alexander Korotkov wrote:

On Tue, Oct 27, 2020 at 8:02 PM Alexander Korotkov  wrote:

On Mon, Oct 26, 2020 at 6:45 PM Andrey Borodin  wrote:
> Thanks for your review, Alexander!
> +1 for avoiding double locking in SimpleLruReadPage_ReadOnly().
> Other changes seem correct to me too.
>
>
> I've tried to find optimal value for cache size and it seems to me that it 
affects multixact scalability much less than sizes of offsets\members buffers. I 
concur that patch 2 of the patchset does not seem documented enough.

Thank you.  I've made a few more minor adjustments to the patchset.
I'm going to push 0001 and 0003 if there are no objections.

I get that patchset v5 doesn't pass the tests due to typo in assert.
The fixes version is attached.

I did a quick review on this patch series. A couple comments:

0001

This looks quite suspicious to me - SimpleLruReadPage_ReadOnly is
changed to return information about what lock was used, merely to allow
the callers to do an Assert() that the value is not LW_NONE.

IMO we could achieve exactly the same thing by passing a simple flag
that would say 'make sure we got a lock' or something like that. In
fact, aren't all callers doing the assert? That'd mean we can just do
the check always, without the flag. (I see GetMultiXactIdMembers does
two calls and only checks the second result, but I wonder if that's
intended or omission.)

In any case, it'd make the lwlock.c changes unnecessary, I think.

0002

Specifies the number cached MultiXact by backend. Any SLRU lookup ...

should be 'number of cached ...'

0003

* Conditional variable for waiting till the filling of the next multixact
* will be finished.  See GetMultiXactIdMembers() and RecordNewMultiXact()
* for details.

Perhaps 'till the next multixact is filled' or 'gets full' would be
better. Not sure.

This thread started with a discussion about making the SLRU sizes
configurable, but this patch version only adds a local cache. Does this
achieve the same goal, or would we still gain something by having GUCs
for the SLRUs?

If we're claiming this improves performance, it'd be good to have some
workload demonstrating that and measurements. I don't see anything like
that in this thread, so it's a bit hand-wavy. Can someone share details
of such workload (even synthetic one) and some basic measurements?

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [patch] ENUM errdetail should mention bytes, not chars

2020-10-27 Thread Ian Lawrence Barwick

2020年10月27日(火) 20:00 Peter Eisentraut :
>
> On 2020-10-19 06:34, Julien Rouhaud wrote:
> >>  ERROR:  invalid enum label "ああ"
> >>  DETAIL:  Labels must be 63 characters or less.
> >>
> >> Attached trivial patch changes the message to:
> >>
> >>  DETAIL:  Labels must be 63 bytes or less.
> >>
> >> This matches the documentation, which states:
> >>
> >>  The length of an enum value's textual label is limited by the 
> >> NAMEDATALEN
> >>  setting compiled into PostgreSQL; in standard builds this means at 
> >> most
> >>  63 bytes.
> >>
> >>  https://www.postgresql.org/docs/current/datatype-enum.html
> >>
> >> I don't see any particular need to backpatch this.
> >
> > Indeed the message is wrong, and patch LGTM.
>
> Committed.

Thanks!

> Btw., the patch didn't update the regression test output.

Whoops... /me hangs head in shame and slinks away...

Regards

Ian Barwick

-- 
EnterpriseDB: https://www.enterprisedb.com

Re: Patch to fix FK-related selectivity estimates with constants

2020-10-27 Thread Tom Lane

Tomas Vondra  writes:
> On Tue, Oct 27, 2020 at 01:58:56PM -0400, Tom Lane wrote:
>>> Attached is a patch series that attacks it that way.

> The patch sems fine to me, thanks for investigating and fixing this.

Thanks for looking at it!

> I find it a bit strange that generate_base_implied_equalities_const adds
> the rinfo to ec_derives, while generate_base_implied_equalities_no_const
> does not. I understand it's correct as we don't lookup the non-const
> clauses, and we want to keep the list as short as possible, but it seems
> like a bit surprising/unexpected difference in behavior.

Yeah, perhaps.  I considered replacing ec_derives with two lists, one
for base-level derived clauses and one for join-level derived clauses,
but it didn't really seem worth the trouble.  This is something we
could change later if a need arises to be able to look back at non-const
base-level derived clauses.

> I think casting the 'clause' to (Node*) in process_implied_equality is
> unnecessary - it was probably needed when it was declared as Expr* but
> the patch changes that.

Hm, thought I got rid of the unnecessary casts ... I'll look again.

> As for the backpatching, I don't feel very strongly about it. It's
> clearly a bug/thinko in the code, and I don't see any obvious risks in
> backpatching it (no ABI breaks etc.).

I had two concerns about possible extension breakage from a back-patch:

* Changing the set of fields in ForeignKeyOptInfo is an ABI break.
We could minimize the risk by adding the new fields at the end in
the back branches, but it still wouldn't be zero-risk.

* Changing the expectation about whether process_implied_equality()
will fill left_ec/right_ec is an API break.  It's somewhat doubtful
whether there exist any callers outside equivclass.c, but again it's
not zero risk.

The other issue, entirely unrelated to code hazards, is whether this
is too big a change in planner behavior to be back-patched.  We've
often felt that destabilizing stable-branch plan choices is something
to be avoided.

Not to mention the whole issue of whether this patch has any bugs of
its own.

So on the whole I wouldn't want to back-patch, or at least not do so
very far.  Maybe there's an argument that v13 is still new enough to
deflect the concern about plan stability.

regards, tom lane

DROP INDEX CONCURRENTLY on partitioned index

2020-10-27 Thread Justin Pryzby

Forking this thread, since the existing CFs have been closed.
https://www.postgresql.org/message-id/flat/20200914143102.GX18552%40telsasoft.com#58b1056488451f8594b0f0ba40996afd

On Mon, Sep 14, 2020 at 09:31:03AM -0500, Justin Pryzby wrote:
> On Sat, Sep 12, 2020 at 10:35:34AM +0900, Michael Paquier wrote:
> > On Fri, Sep 11, 2020 at 07:13:01PM -0500, Justin Pryzby wrote:
> > > On Tue, Sep 08, 2020 at 01:31:05PM +0900, Michael Paquier wrote:
> > >> - CIC on partitioned relations.  (Should we also care about DROP INDEX
> > >> CONCURRENTLY as well?)
> > > 
> > > Do you have any idea what you think that should look like for DROP INDEX
> > > CONCURRENTLY ?
> > 
> > Making the maintenance of the partition tree consistent to the user is
> > the critical part here, so my guess on this matter is:
> > 1) Remove each index from the partition tree and mark the indexes as
> > invalid in the same transaction.  This makes sure that after commit no
> > indexes would get used for scans, and the partition dependency tree
> > pis completely removed with the parent table.  That's close to what we
> > do in index_concurrently_swap() except that we just want to remove the
> > dependencies with the partitions, and not just swap them of course.
> > 2) Switch each index to INDEX_DROP_SET_DEAD, one per transaction
> > should be fine as that prevents inserts.
> > 3) Finish the index drop.
> > 
> > Step 2) and 3) could be completely done for each index as part of
> > index_drop().  The tricky part is to integrate 1) cleanly within the
> > existing dependency machinery while still knowing about the list of
> > partitions that can be removed.  I think that this part is not that
> > straight-forward, but perhaps we could just make this logic part of
> > RemoveRelations() when listing all the objects to remove.
> 
> Thanks.
> 
> I see three implementation ideas..
> 
> 1. I think your way has an issue that the dependencies are lost.  If there's 
> an
> interruption, the user is maybe left with hundreds or thousands of detached
> indexes to clean up.  This is strange since there's actually no detach command
> for indexes (but they're implicitly "attached" when a matching parent index is
> created).  A 2nd issue is that DETACH currently requires an exclusive lock 
> (but
> see Alvaro's WIP patch).

I think this is a critical problem with this approach.  It's not ok if a
failure leaves behind N partition indexes not attached to any parent.
They may have pretty different names, which is a mess to clean up.

> 2. Maybe the easiest way is to mark all indexes invalid and then drop all
> partitions (concurrently) and then the partitioned parent.  If interrupted,
> this would leave a parent index marked "invalid", and some child tables with 
> no
> indexes.  I think this may be "ok".  The same thing is possible if a 
> concurrent
> build is interrupted, right ?

I think adding the "invalid" mark in the simple/naive way isn't enough - it has
to do everything DROP INDEX CONCURRENTLY does (of course).

> 3. I have a patch which changes index_drop() to "expand" a partitioned index 
> into
> its list of children.  Each of these becomes a List:
> | indexId, heapId, userIndexRelation, userHeapRelation, heaplocktag, 
> heaprelid, indexrelid
> The same process is followed as for a single index, but handling all 
> partitions
> at once in two transactions total.  Arguably, this is bad since that function
> currently takes a single Oid but would now ends up operating on a list of 
> indexes.

This is what's implemented in the attached.  It's very possible I've missed
opportunities for better simplification/integration.

-- 
Justin
>From 990db2a4016be3e92abe53f0600368258d2c8744 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Tue, 8 Sep 2020 12:12:44 -0500
Subject: [PATCH v1] WIP: implement DROP INDEX CONCURRENTLY on partitioned
 index

It's necessary to drop each of the partitions with the same concurrent logic as
for a normal index.  But, they each depend on the parent, partitioned index,
and the dependency can't be removed first, since DROP CONCURRENTLY must be the
first command in a txn, and it would leave a mess if it weren't transactional.

Also, it's desired if dropping N partitions concurrently requires only two
transaction waits, and not 2*N.

So drop the parent and all its children in a single loop, handling the entire
list of indexes at each stage.
---
 doc/src/sgml/ref/drop_index.sgml   |   2 -
 src/backend/catalog/dependency.c   |  85 +-
 src/backend/catalog/index.c| 384 -
 src/backend/commands/tablecmds.c   |  17 +-
 src/include/catalog/dependency.h   |   6 +
 src/test/regress/expected/indexing.out |   4 +-
 src/test/regress/sql/indexing.sql  |   3 +-
 7 files changed, 331 insertions(+), 170 deletions(-)

diff --git a/doc/src/sgml/ref/drop_index.sgml b/doc/src/sgml/ref/drop_index.sgml
index 85cf23bca2..0aedd71bd6 100644
--- a/doc/src/sgml/ref/drop_index.sgml
+++ b/doc/src/sgml/ref/d

CLUSTER on partitioned index

2020-10-27 Thread Justin Pryzby

Forking this thread, since the existing CFs have been closed.
https://www.postgresql.org/message-id/flat/20200914143102.GX18552%40telsasoft.com#58b1056488451f8594b0f0ba40996afd

On Tue, Oct 06, 2020 at 01:38:23PM +0900, Michael Paquier wrote:
> On Mon, Oct 05, 2020 at 10:07:33PM -0500, Justin Pryzby wrote:
> > Honestly, I think you're over-thinking and over-engineering indisclustered.
> > 
> > If "clusteredness" was something we offered to maintain across DML, I think
> > that might be important to provide stronger guarantees.  As it is now, I 
> > don't
> > think this patch is worth changing the catalog definition.
> 
> Well, this use case is new because we are discussing the relationship
> of indisclustered across multiple transactions for multiple indexes,
> so I'd rather have this discussion than not, and I have learnt
> the hard way with REINDEX that we should care a lot about the
> consistency of partition trees at any step of the operation.

indisclustered is only used as a default for "CLUSTER" (without USING).  The
worst thing that can happen if it's "inconsistent" is that "CLUSTER;" clusters
a table on the "old" clustered index (that it was already clustered on), which
is what would've happened before running some command which was interrupted.

> Let's
> imagine a simple example here, take this partition tree: p (parent),
> and two partitions p1 and p2.  p has two partitioned indexes i and j,
> indexes also present in p1 and p2 as i1, i2, j1 and j2.  Let's assume
> that the user has done a CLUSTER on p USING i that completes, meaning
> that i, i1 and i2 have indisclustered set.  Now let's assume that the
> user does a CLUSTER on p USING j this time, and that this command
> fails while processing p2, meaning that indisclustered is set for j1,
> i2, and perhaps i or j depending on what the patch does.

I think the state of "indisclustered" at that point is not critical.
The command failed, and the user can re-run it, or ALTER..SET CLUSTER.
Actually, I think the only inconsistent state is if two indexes are both marked
indisclustered.

I'm attaching a counter-proposal to your catalog change, which preserves
indisclustered on children of clustered, partitioned indexes, and invalidates
indisclustered when attaching unclustered indexes.

Also, I noticed that CREATE TABLE (LIKE.. INCLUDING INDEXES) doesn't preserve
indisclustered, but I can't say that's an issue.

-- 
Justin
>From dd4588352f99186f28fc666c497f85a87ac11da2 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sun, 7 Jun 2020 16:58:42 -0500
Subject: [PATCH v1 1/3] Implement CLUSTER of partitioned table..

This requires either specification of a partitioned index on which to cluster,
or that an partitioned index was previously set clustered.

TODO: handle DB-WIDE "CLUSTER;" for partitioned tables
new partitions need to inherit indisclustered ?
---
 doc/src/sgml/ref/cluster.sgml |   6 +
 src/backend/commands/cluster.c| 169 +++---
 src/bin/psql/tab-complete.c   |   1 +
 src/include/nodes/parsenodes.h|   5 +-
 src/test/regress/expected/cluster.out |  58 -
 src/test/regress/sql/cluster.sql  |  24 +++-
 6 files changed, 209 insertions(+), 54 deletions(-)

diff --git a/doc/src/sgml/ref/cluster.sgml b/doc/src/sgml/ref/cluster.sgml
index b9450e7366..0476cfff72 100644
--- a/doc/src/sgml/ref/cluster.sgml
+++ b/doc/src/sgml/ref/cluster.sgml
@@ -172,6 +172,12 @@ CLUSTER [VERBOSE]
 are periodically reclustered.

 
+   
+Clustering a partitioned table clusters each of its partitions using the
+index partition of the given partitioned index or (if not specified) the
+partitioned index marked as clustered.
+   
+
  
 
  
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 0d647e912c..1db8382a27 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -32,7 +32,9 @@
 #include "catalog/index.h"
 #include "catalog/namespace.h"
 #include "catalog/objectaccess.h"
+#include "catalog/partition.h"
 #include "catalog/pg_am.h"
+#include "catalog/pg_inherits.h"
 #include "catalog/toasting.h"
 #include "commands/cluster.h"
 #include "commands/progress.h"
@@ -75,6 +77,9 @@ static void copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex,
 			TransactionId *pFreezeXid,
 			MultiXactId *pCutoffMulti);
 static List *get_tables_to_cluster(MemoryContext cluster_context);
+static List *get_tables_to_cluster_partitioned(MemoryContext cluster_context,
+		Oid indexOid);
+static void cluster_multiple_rels(List *rvs, bool isTopLevel, int options);
 
 
 /*---
@@ -116,7 +121,7 @@ cluster(ClusterStmt *stmt, bool isTopLevel)
 			AccessExclusiveLock,
 			0,
 			RangeVarCallbackOwnsTable, NULL);
-		rel = table_open(tableOid, NoLock);
+		rel = table_open(tableOid, ShareUpdateExclusiveLock);
 
 		/*
 		 * Reject clustering a remote temp

Re: Resetting spilled txn statistics in pg_stat_replication

2020-10-27 Thread Tomas Vondra


On Tue, Oct 27, 2020 at 06:19:42PM -0500, Justin Pryzby wrote:

On Tue, Oct 27, 2020 at 09:17:43AM +0530, Amit Kapila wrote:

On Tue, Oct 27, 2020 at 8:51 AM Justin Pryzby  wrote:
>
> On Fri, Oct 23, 2020 at 10:45:34AM +0530, Amit Kapila wrote:
> > On Fri, Oct 23, 2020 at 8:59 AM Amit Kapila  wrote:
> >
> > While updating the streaming stats patch, it occurred to me that we
> > can write a better description spill_bytes as well. Attached contains
> > the update to spill_bytes description.
>
> +This and other spill
> +counters can be used to gauge the I/O occurred during logical 
decoding
> +and accordingly can tune 
logical_decoding_work_mem.
>
> "gauge the IO occurred" is wrong.
> Either: I/O *which* occured, or I/O occurring, or occurs.
>
> "can tune" should say "allow tuning".
>
> Like:
> +This and other spill
> +counters can be used to gauge the I/O which occurs during logical 
decoding
> +and accordingly allow tuning of 
logical_decoding_work_mem.
>
> -Number of times transactions were spilled to disk. Transactions
> +Number of times transactions were spilled to disk while performing
> +decoding of changes from WAL for this slot. Transactions
>
> What about: "..while decoding changes.." (remove "performing" and "of").
>

All of your suggestions sound good to me. Find the patch attached to
update the docs accordingly.


@@ -2628,8 +2627,8 @@ SELECT pid, wait_event_type, wait_event FROM 
pg_stat_activity WHERE wait_event i
   
Amount of decoded transaction data spilled to disk while performing
decoding of changes from WAL for this slot. This and other spill
-counters can be used to gauge the I/O occurred during logical decoding
-and accordingly can tune logical_decoding_work_mem.
+counters can be used to gauge the I/O which occurred during logical
+decoding and accordingly allow tuning  
logical_decoding_work_mem.

Now that I look again, maybe remove "accordingly" ?



Yeah, the 'accordingly' seems rather unnecessary here. Let's remove it.


FWIW thanks to everyone working on this and getting the reworked version
of the 9290ad198b patch in. As an author of that patch I should have
paid more attention to this thread, and I appreciate the amount of work
spent on fixing it.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Resetting spilled txn statistics in pg_stat_replication

2020-10-27 Thread Justin Pryzby

On Tue, Oct 27, 2020 at 09:17:43AM +0530, Amit Kapila wrote:
> On Tue, Oct 27, 2020 at 8:51 AM Justin Pryzby  wrote:
> >
> > On Fri, Oct 23, 2020 at 10:45:34AM +0530, Amit Kapila wrote:
> > > On Fri, Oct 23, 2020 at 8:59 AM Amit Kapila  
> > > wrote:
> > >
> > > While updating the streaming stats patch, it occurred to me that we
> > > can write a better description spill_bytes as well. Attached contains
> > > the update to spill_bytes description.
> >
> > +This and other spill
> > +counters can be used to gauge the I/O occurred during logical 
> > decoding
> > +and accordingly can tune 
> > logical_decoding_work_mem.
> >
> > "gauge the IO occurred" is wrong.
> > Either: I/O *which* occured, or I/O occurring, or occurs.
> >
> > "can tune" should say "allow tuning".
> >
> > Like:
> > +This and other spill
> > +counters can be used to gauge the I/O which occurs during logical 
> > decoding
> > +and accordingly allow tuning of 
> > logical_decoding_work_mem.
> >
> > -Number of times transactions were spilled to disk. Transactions
> > +Number of times transactions were spilled to disk while performing
> > +decoding of changes from WAL for this slot. Transactions
> >
> > What about: "..while decoding changes.." (remove "performing" and "of").
> >
> 
> All of your suggestions sound good to me. Find the patch attached to
> update the docs accordingly.

@@ -2628,8 +2627,8 @@ SELECT pid, wait_event_type, wait_event FROM 
pg_stat_activity WHERE wait_event i

 Amount of decoded transaction data spilled to disk while performing
 decoding of changes from WAL for this slot. This and other spill
-counters can be used to gauge the I/O occurred during logical decoding
-and accordingly can tune logical_decoding_work_mem.
+counters can be used to gauge the I/O which occurred during logical
+decoding and accordingly allow tuning  
logical_decoding_work_mem.

Now that I look again, maybe remove "accordingly" ?

-- 
Justin

Re: Patch to fix FK-related selectivity estimates with constants

2020-10-27 Thread Tomas Vondra


On Tue, Oct 27, 2020 at 01:58:56PM -0400, Tom Lane wrote:

I wrote:

Over in the thread at [1] it's discussed how our code for making
selectivity estimates using knowledge about FOREIGN KEY constraints
is busted in the face of EquivalenceClasses including constants.
...
Attached is a patch series that attacks it that way.




The patch sems fine to me, thanks for investigating and fixing this.

Two minor comments:

I find it a bit strange that generate_base_implied_equalities_const adds
the rinfo to ec_derives, while generate_base_implied_equalities_no_const
does not. I understand it's correct as we don't lookup the non-const
clauses, and we want to keep the list as short as possible, but it seems
like a bit surprising/unexpected difference in behavior.

I think casting the 'clause' to (Node*) in process_implied_equality is
unnecessary - it was probably needed when it was declared as Expr* but
the patch changes that.


As for the backpatching, I don't feel very strongly about it. It's
clearly a bug/thinko in the code, and I don't see any obvious risks in
backpatching it (no ABI breaks etc.). OTOH multi-column foreign keys are
not very common, and the query pattern seems rather unusual too, so the
risk is pretty low I guess. We certainly did not get many reports, so.



I'd failed to generate a test case I liked yesterday, but perhaps
the attached will do.  (While the new code is exercised in the
core regression tests already, it doesn't produce any visible
plan changes.)  I'm a little nervous about whether the plan
shape will be stable in the buildfarm, but it works for me on
both 64-bit and 32-bit machines, so probably it's OK.



Works fine on raspberry pi 4 (i.e. armv7l, 32-bit arm) too.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Question about make coverage-html

2020-10-27 Thread Peter Smith

> > Creating a coverage report is a two-step process. First, you run the
> > test you're interested in, with "make check" or similar. Then you create
> > a report for the source files you're interested in, with "make
> > coverage-html". You can run these commands in different subdirectories.
>
> > In this case, you want to do "cd src/test/subscription; make check", to
> > run those TAP tests, and then run "make coverage-html" from the top
> > folder. Or if you wanted to create coverage report that covers only
> > replication-related source code, for example, you could run it in the
> > src/backend/replication directory ("cd src/backend/replication; make
> > coverage-html").
>
> I agree with the OP that the documentation is a bit vague here.
> I think (maybe I'm wrong) that it's clear enough that you can run
> whichever test case(s) you want, but this behavior of generating a
> partial coverage report is less clear.  Maybe instead of
>
> The "make" commands also work in subdirectories.
>
> we could say
>
> You can run the "make coverage-html" command in a subdirectory
> if you want a coverage report for only a portion of the code tree.

Thank you for the clarifications and the updated documentation.

Kind Regards,
Peter Smith
Fujitsu Australia

Re: cutting down the TODO list thread

2020-10-27 Thread Bruce Momjian

On Tue, Oct 27, 2020 at 04:54:24PM -0400, John Naylor wrote:
> 
> 
> On Tue, Oct 27, 2020 at 3:52 PM Bruce Momjian  wrote:
> 
> 
> Do any of these limitations need to be documented before removing them
> from the TODO list?
> 
> 
> I see two areas that might use a mention:
> 
> - pg_settings not displaying custom variables
> - SQL standard difference with REVOKE ROLE (I haven't looked further into 
> this)

OK, thanks.  Do you want to work on a doc patch or should I?  Having it
the docs at least warns our users.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee

Re: duplicate function oid symbols

2020-10-27 Thread John Naylor

On Tue, Oct 27, 2020 at 9:51 AM Tom Lane  wrote:

> John Naylor  writes:
> > I noticed that the table AM abstraction introduced the symbol
> > HEAP_TABLE_AM_HANDLER_OID, although we already have a convention for
> > defining symbols automatically for builtin functions, which in this case
> is
> > (currently unused) F_HEAP_TABLEAM_HANDLER.
>
> Yeah, that seems wrong.  I'd just remove HEAP_TABLE_AM_HANDLER_OID.
> As long as we're not back-patching the change, it seems like a very
> minor thing to fix, if anyone outside core is referencing the old name.
>

Ok, here is a patch to fix that, and also throw an error if pg_proc.dat has
an explicitly defined symbol.

-- 
John Naylor
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


v1-0001-Use-the-standard-symbol-for-the-builtin-function-.patch
Description: Binary data

Re: cutting down the TODO list thread

2020-10-27 Thread John Naylor

On Tue, Oct 27, 2020 at 3:52 PM Bruce Momjian  wrote:

>
> Do any of these limitations need to be documented before removing them
> from the TODO list?
>

I see two areas that might use a mention:

- pg_settings not displaying custom variables
- SQL standard difference with REVOKE ROLE (I haven't looked further into
this)

-- 
John Naylor
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: POC: GROUP BY optimization

2020-10-27 Thread Tomas Vondra

On Mon, Oct 26, 2020 at 11:40:40AM +0100, Dmitry Dolgov wrote:

On Mon, Oct 26, 2020 at 01:28:59PM +0400, Pavel Borisov wrote:
> Thanks for your interest! FYI there is a new thread about this topic [1]
> with the next version of the patch and more commentaries (I've created
> it for visibility purposes, but probably it also created some confusion,
> sorry for that).
>
> Thanks!

I made a very quick look at your updates and noticed that it is intended to
be simple and some parts of the code are removed as they have little test
coverage. I'd propose vice versa to increase test coverage to enjoy more
precise cost calculation and probably partial grouping.

Or maybe it's worth to benchmark both patches and then re-decide what we
want more to have a more complicated or a simpler version.

Good to know that this feature is not stuck anymore and we have more than
one proposal.
Thanks!

Just to clarify, the patch that I've posted in another thread mentioned
above is not an alternative proposal, but a development of the same
patch I had posted in this thread. As mentioned in [1], reduce of
functionality is an attempt to reduce the scope, and as soon as the base
functionality looks good enough it will be returned back.

I find it hard to follow two similar threads trying to do the same (or
very similar) things in different ways. Is there any chance to join
forces and produce a single patch series merging the changes? With the
"basic" functionality at the beginning, then patches with the more
complex stuff. That's the usual way, I think.

As I said in my response on the other thread [1], I think constructing
additional paths with alternative orderings of pathkeys is the right
approach. Otherwise we can't really deal with optimizations above the
place where we consider this optimization.

That's essentially what I was trying in explain May 16 response [2]
when I actually said this:

   So I don't think there will be a single "interesting" grouping
   pathkeys (i.e. root->group_pathkeys), but a collection of pathkeys.
   And we'll need to build grouping paths for all of those, and leave
   the planner to eventually pick the one giving us the cheapest plan.

I wouldn't go as far as saying the approach in this patch (i.e. picking
one particular ordering) is doomed, but it's going to be very hard to
make it work reliably. Even if we get the costing *at this node* right,
who knows how it'll affect costing of the nodes above us?

So if I can suggest something, I'd merge the two patches, adopting the
path-based approach. With the very basic functionality/costing in the
first patch, and the more advanced stuff in additional patches.

Does that make sense?

regards

[1] 
https://www.postgresql.org/message-id/20200901210743.lutgvnfzntvhuban%40development

[2] 
https://www.postgresql.org/message-id/20200516145609.vm7nrqy7frj4ha6r%40development

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: cutting down the TODO list thread

2020-10-27 Thread John Naylor

On Tue, Oct 27, 2020 at 4:00 PM Thomas Munro  wrote:

> On Wed, Oct 28, 2020 at 8:36 AM Andres Freund  wrote:
> > On 2020-10-27 15:24:35 -0400, John Naylor wrote:
> > > - Allow WAL replay of CREATE TABLESPACE to work when the directory
> > > structure on the recovery computer is different from the original
> > >   Thread quote: "part of the difficult, perhaps-not-worth doing
> impossible
> > > problems"
> >
> > I think we ought to do something here. Mostly because the current
> > situation makes it impossible to test many things on a single
> > system. And we have a partial solution with the tablespace mapping
> > files.
>
> +1, we need to get something like this working so that we can write
> decent replication tests.  FWIW there was another little thread on the
> topic, not listed there:
>
>
> https://www.postgresql.org/message-id/flat/CALfoeisEF92F5nJ-aAcuWTvF_Aogxq_1bHLem_kVfM_tHc2mfg%40mail.gmail.com
>

Thanks, I've added this thread to the entry.

-- 
John Naylor
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Support for NSS as a libpq TLS backend

2020-10-27 Thread Heikki Linnakangas


On 27/10/2020 22:07, Daniel Gustafsson wrote:

/*
 * Track whether the NSS database has a password set or not. There is no API
 * function for retrieving password status, so we simply flip this to true in
 * case NSS invoked the password callback - as that will only happen in case
 * there is a password. The reason for tracking this is that there are calls
 * which require a password parameter, but doesn't use the callbacks provided,
 * so we must call the callback on behalf of these.
 */
static bool has_password = false;


This is set in PQssl_passwd_cb function, but never reset. That seems 
wrong. The NSS database used in one connection might have a password, 
while another one might not. Or have I completely misunderstood this?


- Heikki

Re: cutting down the TODO list thread

2020-10-27 Thread Thomas Munro

On Wed, Oct 28, 2020 at 8:36 AM Andres Freund  wrote:
> On 2020-10-27 15:24:35 -0400, John Naylor wrote:
> > - Allow WAL replay of CREATE TABLESPACE to work when the directory
> > structure on the recovery computer is different from the original
> >   Thread quote: "part of the difficult, perhaps-not-worth doing impossible
> > problems"
>
> I think we ought to do something here. Mostly because the current
> situation makes it impossible to test many things on a single
> system. And we have a partial solution with the tablespace mapping
> files.

+1, we need to get something like this working so that we can write
decent replication tests.  FWIW there was another little thread on the
topic, not listed there:

https://www.postgresql.org/message-id/flat/CALfoeisEF92F5nJ-aAcuWTvF_Aogxq_1bHLem_kVfM_tHc2mfg%40mail.gmail.com

More aggressive vacuuming of unlogged relations?

2020-10-27 Thread Andres Freund

Hi,

The patch in [1] makes the horizon logic in procarray.c aware of temp
tables not needing to care about other session's snapshots (also
discussed in[2]). Extending a7212be8b9e, which did that for VACUUM, but
not HOT pruning etc.

While polishing that patch I was wondering whether there are other
classes of relations that we might want to treat differently. And
there's one more that we don't special case right now: unlogged tables.

As unlogged tables aren't replicated via physical rep, we don't need to
apply vacuum_defer_cleanup_age, hot standby feedback and slot based
horizons.

The obvious question is, is that worth doing? My intuition is that yes,
it probably is: Unlogged tables are often used for hotly updated
transient state, allowing that to be cleaned up more aggressively will
reduce bloat.

Comments?

Greetings,

Andres Freund

[1] 
https://www.postgresql.org/message-id/20201015083735.derdzysdtqdvxshp%40alap3.anarazel.de
[2] 
https://www.postgresql.org/message-id/20201014203103.72oke6hqywcyhx7s%40alap3.anarazel.de

Re: [patch] Fix checksum verification in base backups for zero page headers

2020-10-27 Thread Anastasia Lubennikova


On 26.10.2020 04:13, Michael Paquier wrote:

On Fri, Oct 23, 2020 at 08:00:08AM +0900, Michael Paquier wrote:

Yeah, we could try to make the logic a bit more complicated like
that.  However, for any code path relying on a page read without any
locking insurance, we cannot really have a lot of trust in any of the
fields assigned to the page as this could just be random corruption
garbage, and the only thing I am ready to trust here a checksum
mismatch check, because that's the only field on the page that's
linked to its full contents on the 8k page.  This also keeps the code
simpler.

A small update here.  I have extracted the refactored part for
PageIsVerified() and committed it as that's independently useful.
This makes the patch proposed here simpler on HEAD, leading to the
attached.
--
Michael


Thank you for committing the first part.

In case you need a second opinion on the remaining patch, it still looks 
good to me.


--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: cutting down the TODO list thread

2020-10-27 Thread Bruce Momjian

On Tue, Oct 27, 2020 at 03:46:14PM -0400, Bruce Momjian wrote:
> On Tue, Oct 27, 2020 at 03:24:35PM -0400, John Naylor wrote:
> > As I mentioned in [1], I've volunteered to clear out the TODO list of items
> > that appear to be too difficult, controversial, or otherwise not worth 
> > doing to
> > warrant being listed there. I'll be working a few sections at a time, and 
> > every
> > so often I'll have a list of proposed items for removal. If I don't hear
> > objections, I'll remove the items after a few days while going through the 
> > next
> > set.
> > 
> > Where there's an email thread, I've skimmed a few messages to get a sense of
> > the community's thoughts on it. Where easily determined, I've taken age into
> > account, insofar as something from 2017 is going to get much more benefit of
> > doubt than something from 2008. I've added after each item a phrase that 
> > sums
> > up the reason I believe it doesn't belong anymore. Feedback welcome, of 
> > course,
> > although I suspect there won't be much.
> 
> Thanks for working on this.  It certainly needs new eyes (not mine).  ;-)
> 
> I am fine reomving all the items below.  I am kind of disappointed we
> have these _stuck_ items, but I don't see a clear way forward, so let's
> just remove them and see what requests we get for them.

Do any of these limitations need to be documented before removing them
from the TODO list?

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee

Re: cutting down the TODO list thread

2020-10-27 Thread Bruce Momjian

On Tue, Oct 27, 2020 at 03:24:35PM -0400, John Naylor wrote:
> As I mentioned in [1], I've volunteered to clear out the TODO list of items
> that appear to be too difficult, controversial, or otherwise not worth doing 
> to
> warrant being listed there. I'll be working a few sections at a time, and 
> every
> so often I'll have a list of proposed items for removal. If I don't hear
> objections, I'll remove the items after a few days while going through the 
> next
> set.
> 
> Where there's an email thread, I've skimmed a few messages to get a sense of
> the community's thoughts on it. Where easily determined, I've taken age into
> account, insofar as something from 2017 is going to get much more benefit of
> doubt than something from 2008. I've added after each item a phrase that sums
> up the reason I believe it doesn't belong anymore. Feedback welcome, of 
> course,
> although I suspect there won't be much.

Thanks for working on this.  It certainly needs new eyes (not mine).  ;-)

I am fine reomving all the items below.  I am kind of disappointed we
have these _stuck_ items, but I don't see a clear way forward, so let's
just remove them and see what requests we get for them.

---

> **Administration
> 
> - Have custom variables be transaction-safe
>   Old and found to be difficult after attempting
> 
> - Allow custom variables to appear in pg_settings()
>   Old and controversial
> 
> - Implement the SQL-standard mechanism whereby REVOKE ROLE revokes only the
> privilege granted by the invoking role, and not those granted by other roles
>   Old and difficult
> 
> - Prevent query cancel packets from being replayed by an attacker, especially
> when using SSL
>   Old and difficult
> 
> *Configuration files
> 
> - Consider normalizing fractions in postgresql.conf, perhaps using '%'
>   At the time (2007), some gucs used an actual percentage.
> 
> - Add external tool to auto-tune some postgresql.conf parameters
>   There are already out-of-core tools that try to do this.
> 
> - Create utility to compute accurate random_page_cost value
>   Seems outdated: In the current age of SSDs and cloud environments, it's 
> often
> just set to 1.1, and there hasn't been a demand to be more accurate than that.
> 
> - Allow synchronous_standby_names to be disabled after communication failure
> with all synchronous standby servers exceeds some timeout
>   Controversial
> 
> - Adjust rounding behavior for numeric GUC values
>   Controversial
> 
> *Tablespaces
> 
> - Allow WAL replay of CREATE TABLESPACE to work when the directory structure 
> on
> the recovery computer is different from the original
>   Thread quote: "part of the difficult, perhaps-not-worth doing impossible
> problems"
> 
> - Allow per-tablespace quotas
>   This seems to point to the larger problem space of disk space monitoring, 
> and
> should probably be phrased thusly, and is a much bigger project or set of
> projects.
> 
> - Allow tablespaces on RAM-based partitions for temporary objects
>   In the thread, what's desired is the ability to have some amount of
> durability on a RAM-disk without WAL logging.
> 
> - Close race in DROP TABLESPACE on Windows
>   This refers to buildfarm failures from 2014.
> 
> *Statistics Collector
> 
> - Track number of WAL files ready to be archived in pg_stat_archiver
>   Thread quote: "pg_stat_archiver already has a column for last_archived_wal
> and last_failed_wal, so you can already work out how many files there must be
> between then and now"
> 
> *Point-In-Time Recovery
> 
> - Allow archive_mode to be changed without server restart
>   Controversial and old
> 
> *Standby server mode
> 
> - Allow pg_xlogfile_name() to be used in recovery mode
>   Controversial and old
> 
> - Change walsender so that it applies per-role settings
>   Old and possibly obsolete

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee

Re: cutting down the TODO list thread

2020-10-27 Thread Andres Freund

Hi,

On 2020-10-27 15:24:35 -0400, John Naylor wrote:
> As I mentioned in [1], I've volunteered to clear out the TODO list of items
> that appear to be too difficult, controversial, or otherwise not worth
> doing to warrant being listed there. I'll be working a few sections at a
> time, and every so often I'll have a list of proposed items for removal. If
> I don't hear objections, I'll remove the items after a few days while going
> through the next set.
> 
> Where there's an email thread, I've skimmed a few messages to get a sense
> of the community's thoughts on it. Where easily determined, I've taken age
> into account, insofar as something from 2017 is going to get much more
> benefit of doubt than something from 2008. I've added after each item a
> phrase that sums up the reason I believe it doesn't belong anymore.
> Feedback welcome, of course, although I suspect there won't be much.
>

> - Prevent query cancel packets from being replayed by an attacker,
> especially when using SSL
>   Old and difficult

FWIW, I don't think we should remove this. Our current solution has some
serious issues that we should address at some point.



> - Allow WAL replay of CREATE TABLESPACE to work when the directory
> structure on the recovery computer is different from the original
>   Thread quote: "part of the difficult, perhaps-not-worth doing impossible
> problems"

I think we ought to do something here. Mostly because the current
situation makes it impossible to test many things on a single
system. And we have a partial solution with the tablespace mapping
files.


Greetings,

Andres Freund

Re: SEARCH and CYCLE clauses

2020-10-27 Thread Anastasia Lubennikova

On 10.10.2020 08:25, Pavel Stehule wrote:

Hi

pá 9. 10. 2020 v 12:17 odesílatel Pavel Stehule 
mailto:pavel.steh...@gmail.com>> napsal:

pá 9. 10. 2020 v 11:40 odesílatel Peter Eisentraut
mailto:peter.eisentr...@2ndquadrant.com>> napsal:

On 2020-09-22 20:29, Pavel Stehule wrote:
> The result is correct. When I tried to use UNION instead
UNION ALL, the
> pg crash

I fixed the crash, but UNION [DISTINCT] won't actually work
here because
row/record types are not hashable.  I'm leaving the partial
support in,
but I'm documenting it as currently not supported.

 I think so UNION is a common solution against the cycles. So
missing support for this specific case is not a nice thing. How
much work is needed for hashing rows. It should not be too much code.

> looks so clause USING in cycle detection is unsupported for
DB2 and
> Oracle - the examples from these databases doesn't work on
PG without
> modifications

Yeah, the path clause is actually not necessary from a user's
perspective, but it's required for internal bookkeeping.  We
could
perhaps come up with a mechanism to make it invisible coming
out of the
CTE (maybe give the CTE a target list internally), but that
seems like a
separate project.

The attached patch fixes the issues you have reported (also
the view
issue from the other email).  I have also moved the whole rewrite
support to a new file to not blow up rewriteHandler.c so much.

-- 
Peter Eisentraut http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training &
Services

This patch is based on transformation CYCLE and SEARCH clauses to 
specific expressions - it is in agreement with ANSI SQL

There is not a problem with compilation
Nobody had objections in discussion
There are enough regress tests and documentation
check-world passed
doc build passed

I'll mark this patch as ready for committer

Possible enhancing for this feature (can be done in next steps)

1. support UNION DISTINCT
2. better compatibility with Oracle and DB2 (USING clause can be optional)

Regards

Pavel

Status update for a commitfest entry.
According to cfbot patch no longer applies. So I moved it to waiting on 
author.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

cutting down the TODO list thread

2020-10-27 Thread John Naylor

As I mentioned in [1], I've volunteered to clear out the TODO list of items
that appear to be too difficult, controversial, or otherwise not worth
doing to warrant being listed there. I'll be working a few sections at a
time, and every so often I'll have a list of proposed items for removal. If
I don't hear objections, I'll remove the items after a few days while going
through the next set.

Where there's an email thread, I've skimmed a few messages to get a sense
of the community's thoughts on it. Where easily determined, I've taken age
into account, insofar as something from 2017 is going to get much more
benefit of doubt than something from 2008. I've added after each item a
phrase that sums up the reason I believe it doesn't belong anymore.
Feedback welcome, of course, although I suspect there won't be much.

**Administration

- Have custom variables be transaction-safe
  Old and found to be difficult after attempting

- Allow custom variables to appear in pg_settings()
  Old and controversial

- Implement the SQL-standard mechanism whereby REVOKE ROLE revokes only the
privilege granted by the invoking role, and not those granted by other roles
  Old and difficult

- Prevent query cancel packets from being replayed by an attacker,
especially when using SSL
  Old and difficult

*Configuration files

- Consider normalizing fractions in postgresql.conf, perhaps using '%'
  At the time (2007), some gucs used an actual percentage.

- Add external tool to auto-tune some postgresql.conf parameters
  There are already out-of-core tools that try to do this.

- Create utility to compute accurate random_page_cost value
  Seems outdated: In the current age of SSDs and cloud environments, it's
often just set to 1.1, and there hasn't been a demand to be more accurate
than that.

- Allow synchronous_standby_names to be disabled after communication
failure with all synchronous standby servers exceeds some timeout
  Controversial

- Adjust rounding behavior for numeric GUC values
  Controversial

*Tablespaces

- Allow WAL replay of CREATE TABLESPACE to work when the directory
structure on the recovery computer is different from the original
  Thread quote: "part of the difficult, perhaps-not-worth doing impossible
problems"

- Allow per-tablespace quotas
  This seems to point to the larger problem space of disk space monitoring,
and should probably be phrased thusly, and is a much bigger project or set
of projects.

- Allow tablespaces on RAM-based partitions for temporary objects
  In the thread, what's desired is the ability to have some amount of
durability on a RAM-disk without WAL logging.

- Close race in DROP TABLESPACE on Windows
  This refers to buildfarm failures from 2014.

*Statistics Collector

- Track number of WAL files ready to be archived in pg_stat_archiver
  Thread quote: "pg_stat_archiver already has a column for
last_archived_wal and last_failed_wal, so you can already work out how many
files there must be between then and now"

*Point-In-Time Recovery

- Allow archive_mode to be changed without server restart
  Controversial and old

*Standby server mode

- Allow pg_xlogfile_name() to be used in recovery mode
  Controversial and old

- Change walsender so that it applies per-role settings
  Old and possibly obsolete

--
[1]
https://www.postgresql.org/message-id/CAFBsxsHbqMzDoGB3eAGmpcpB%2B7uae%2BLLi_G%2Bo8HMEECM9CbQcQ%40mail.gmail.com


-- 
John Naylor
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Use-after-free in 12- EventTriggerAlterTableEnd

2020-10-27 Thread Tom Lane

Arseny Sher  writes:
> Valgrind on our internal buildfarm complained about use-after-free
> during currentEventTriggerState->commandList manipulations, e.g. lappend
> in EventTriggerCollectSimpleCommand. I've discovered that the source of
> problem is EventTriggerAlterTableEnd not bothering to switch into its
> own context before appending to the list. ced138e8cba fixed this in
> master and 13 but wasn't backpatched further, so I see the problem in
> 12-.

Yeah, that clearly should have been back-patched --- the fact that it
accidentally didn't fail in the most common case wasn't a good reason
for leaving the bug in place.  I'm not excited about the test case
ced138e8cba added though, so I think your proposed patch is fine.
Will push shortly.

regards, tom lane

Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits

2020-10-27 Thread Anastasia Lubennikova

Status update for a commitfest entry.

This patch is ReadyForCommitter. It applies and passes the CI. There are no 
unanswered questions in the discussion. 

The discussion started in 2015 with a patch by Jeff Janes. Later it was revived 
by Pavan Deolasee.  After it was picked up by Ibrar Ahmed and finally, it was 
rewritten by me, so I moved myself from reviewers to authors as well.

The latest version was reviewed and tested by Ibrar Ahmed. The patch doesn't 
affect COPY FREEZE performance and significantly decreases the time of the 
following VACUUM.

Re: Deleting older versions in unique indexes to avoid page splits

2020-10-27 Thread Peter Geoghegan

On Tue, Oct 27, 2020 at 2:44 AM Simon Riggs  wrote:
> While it is important we investigate the worst cases, I don't see this
> is necessarily bad.

I looked at "perf top" a few times when the test from yesterday ran. I
saw that the proposed delete mechanism was the top consumer of CPU
cycles. It seemed as if the mechanism was very expensive. However,
that's definitely the wrong conclusion about what happens in the
general case, or even in slightly less extreme cases. It at least
needs to be put in context.

I reran exactly the same benchmark overnight, but added a 10k TPS rate
limit this time (so about a third of the TPS that's possible without a
limit). I also ran it for longer, and saw much improved latency. (More
on the latency aspect below, for now I want to talk about "perf top").

The picture with "perf top" changed significantly with a 10k TPS rate
limit, even though the workload itself is very similar. Certainly the
new mechanism/function is still quite close to the top consumer of CPU
cycles. But it no longer uses more cycles than the familiar super hot
functions that you expect to see right at the top with pgbench (e.g.
_bt_compare(), hash_search_with_hash_value()). It's now something like
the 4th or 5th hottest function (I think that that means that the cost
in cycles is more than an order of magnitude lower, but I didn't
check). Just adding this 10k TPS rate limit makes the number of CPU
cycles consumed by the new mechanism seem quite reasonable. The
benefits that the patch brings are not diminished at all compared to
the original no-rate-limit variant -- the master branch now only takes
slightly longer to completely bloat all its indexes with this 10k TPS
limit (while the patch avoids even a single page split -- no change
there).

Again, this is because the mechanism is a backstop. It only works as
hard as needed to avoid unnecessary page splits. When the system is
working as hard as possible to add version churn to indexes (which is
what the original/unthrottled test involved), then the mechanism also
works quite hard. In this artificial and contrived scenario, any
cycles we can save from cleaning up bloat (by micro optimizing the
code in the patch) go towards adding even more bloat instead...which
necessitates doing more cleanup. This is why optimizing the code in
the patch with this unrealistic scenario in mind is subject to sharp
diminishing returns. It's also why you can get a big benefit from the
patch when the new mechanism is barely ever used. I imagine that if I
ran the same test again but with a 1k TPS limit I would hardly see the
new mechanism in "perf top" at allbut in the end the bloat
situation would be very similar.

I think that you could model this probabilistically if you had the
inclination. Yes, the more you throttle throughput (by lowering the
pgbench rate limit further), the less chance you have of any given
leaf page splitting moment to moment (for the master branch). But in
the long run every original leaf page splits at least once anyway,
because each leaf page still only has to be unlucky once. It is still
inevitable that they'll all get split eventually (and probably not
before too long), unless and until you *drastically* throttle pgbench.

I believe that things like opportunistic HOT chain truncation (heap
pruning) and index tuple LP_DEAD bit setting are very effective much
of the time. The problem is that it's easy to utterly rely on them
without even realizing it, which creates hidden risk that may or may
not result in big blow ups down the line. There is nothing inherently
wrong with being lazy or opportunistic about cleaning up garbage
tuples -- I think that there are significant advantages, in fact. But
only if it isn't allowed to create systemic risk. More concretely,
bloat cannot be allowed to become concentrated in any one place -- no
individual query should have to deal with more than 2 or 3 versions
for any given logical row. If we're focussed on the *average* number
of physical versions per logical row then we may reach dramatically
wrong conclusions about what to do (which is a problem in a world
where autovacuum is supposed to do most garbage collection...unless
your app happens to look like standard pgbench!).

And now back to latency with this 10k TPS limited variant I ran last
night. After 16 hours we have performed 8 runs, each lasting 2 hours.
In the original chronological order, these runs are:

patch_1_run_16.out: "tps = 1.095914 (including connections establishing)"
master_1_run_16.out: "tps = 1.171945 (including connections establishing)"
patch_1_run_32.out: "tps = 1.082975 (including connections establishing)"
master_1_run_32.out: "tps = 1.533370 (including connections establishing)"
patch_2_run_16.out: "tps = 1.076865 (including connections establishing)"
master_2_run_16.out: "tps = 9997.933215 (including connections establishing)"
patch_2_run_32.out: "tps = .911988 (including connections establishing)"
master_2_r

Re: Patch to fix FK-related selectivity estimates with constants

2020-10-27 Thread Tom Lane

I wrote:
> Over in the thread at [1] it's discussed how our code for making
> selectivity estimates using knowledge about FOREIGN KEY constraints
> is busted in the face of EquivalenceClasses including constants.
> ...
> Attached is a patch series that attacks it that way.

I'd failed to generate a test case I liked yesterday, but perhaps
the attached will do.  (While the new code is exercised in the
core regression tests already, it doesn't produce any visible
plan changes.)  I'm a little nervous about whether the plan
shape will be stable in the buildfarm, but it works for me on
both 64-bit and 32-bit machines, so probably it's OK.

regards, tom lane

diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index a46b1573bd..6c9a5e26dd 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -5843,6 +5843,56 @@ select t1.b, ss.phv from join_ut1 t1 left join lateral
 drop table join_pt1;
 drop table join_ut1;
 --
+-- test estimation behavior with multi-column foreign key and constant qual
+--
+begin;
+create table fkest (x integer, x10 integer, x10b integer, x100 integer);
+insert into fkest select x, x/10, x/10, x/100 from generate_series(1,1000) x;
+create unique index on fkest(x, x10, x100);
+analyze fkest;
+explain (costs off)
+select * from fkest f1
+  join fkest f2 on (f1.x = f2.x and f1.x10 = f2.x10b and f1.x100 = f2.x100)
+  join fkest f3 on f1.x = f3.x
+  where f1.x100 = 2;
+QUERY PLAN 
+---
+ Nested Loop
+   ->  Hash Join
+ Hash Cond: ((f2.x = f1.x) AND (f2.x10b = f1.x10))
+ ->  Seq Scan on fkest f2
+   Filter: (x100 = 2)
+ ->  Hash
+   ->  Seq Scan on fkest f1
+ Filter: (x100 = 2)
+   ->  Index Scan using fkest_x_x10_x100_idx on fkest f3
+ Index Cond: (x = f1.x)
+(10 rows)
+
+alter table fkest add constraint fk
+  foreign key (x, x10b, x100) references fkest (x, x10, x100);
+explain (costs off)
+select * from fkest f1
+  join fkest f2 on (f1.x = f2.x and f1.x10 = f2.x10b and f1.x100 = f2.x100)
+  join fkest f3 on f1.x = f3.x
+  where f1.x100 = 2;
+ QUERY PLAN  
+-
+ Hash Join
+   Hash Cond: ((f2.x = f1.x) AND (f2.x10b = f1.x10))
+   ->  Hash Join
+ Hash Cond: (f3.x = f2.x)
+ ->  Seq Scan on fkest f3
+ ->  Hash
+   ->  Seq Scan on fkest f2
+ Filter: (x100 = 2)
+   ->  Hash
+ ->  Seq Scan on fkest f1
+   Filter: (x100 = 2)
+(11 rows)
+
+rollback;
+--
 -- test that foreign key join estimation performs sanely for outer joins
 --
 begin;
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index 1403e0ffe7..dd60d6a1f3 100644
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -1975,6 +1975,35 @@ select t1.b, ss.phv from join_ut1 t1 left join lateral
 
 drop table join_pt1;
 drop table join_ut1;
+
+--
+-- test estimation behavior with multi-column foreign key and constant qual
+--
+
+begin;
+
+create table fkest (x integer, x10 integer, x10b integer, x100 integer);
+insert into fkest select x, x/10, x/10, x/100 from generate_series(1,1000) x;
+create unique index on fkest(x, x10, x100);
+analyze fkest;
+
+explain (costs off)
+select * from fkest f1
+  join fkest f2 on (f1.x = f2.x and f1.x10 = f2.x10b and f1.x100 = f2.x100)
+  join fkest f3 on f1.x = f3.x
+  where f1.x100 = 2;
+
+alter table fkest add constraint fk
+  foreign key (x, x10b, x100) references fkest (x, x10, x100);
+
+explain (costs off)
+select * from fkest f1
+  join fkest f2 on (f1.x = f2.x and f1.x10 = f2.x10b and f1.x100 = f2.x100)
+  join fkest f3 on f1.x = f3.x
+  where f1.x100 = 2;
+
+rollback;
+
 --
 -- test that foreign key join estimation performs sanely for outer joins
 --

Re: CREATE TABLE .. PARTITION OF fails to preserve tgenabled for inherited row triggers

2020-10-27 Thread Justin Pryzby

On Wed, Oct 21, 2020 at 02:02:37PM -0300, Alvaro Herrera wrote:
> On 2020-Oct-21, Justin Pryzby wrote:
> 
> > I came up with this, which probably needs more than a little finesse.
> 
> Hmm, there are two important changes needed on this: 1) it must not emit
> CREATE lines for the child triggers; only the ALTER TABLE ONLY
>  lines to set disable state on the partition are needed.  2)
> tgparentid does not exist prior to pg13, so you need some additional
> trick to cover that case.

Thanks for looking.

> Also, I think the multipartition case is broken: if grandparent has
> trigger enabled, parent has trigger disabled and child trigger set to
> always, is that dumped correctly?  I think the right way to do this is
> change only the partitions that differ from the topmost partitioned
> table -- not their immediate parents; and use ONLY to ensure they don't
> affect downstream children.

I think either way could be ok - if you assume that the trigger was disabled
with ONLY, then it'd make sense to restore it with ONLY, but I think it's at
least as common to ALTER TABLE [*].  It might look weird to the user if they
used ALTER TABLE ONLY and the pg_dump output uses ALTER TABLE for that table,
and then again for all its children (or vice versa).  But it's fine as long as
the state is correctly restored.

There are serveral issues:
 - fail to preserve childs' tgenabled in CREATE TABLE PARTITION OF;
 - fail to preserve childs' tgenabled in pg_dump;
 - fail to preserve childs' comments in pg_dump;

I'm going step away from this patch at least for awhile, so I'm attaching my
latest in case it's useful.

-- 
Justin
>From 15cb95df4e3771f0bb3fe2f1fd8dfb94fe3f7a50 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Tue, 20 Oct 2020 23:19:07 -0500
Subject: [PATCH v2] pg_dump: output DISABLE/ENABLE for child triggers ..

..if their state does not match their parent
---
 src/bin/pg_dump/pg_dump.c| 62 +++-
 src/bin/pg_dump/pg_dump.h|  1 +
 src/bin/pg_dump/t/002_pg_dump.pl | 26 +++---
 3 files changed, 77 insertions(+), 12 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index ff45e3fb8c..40844fa1e7 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -7811,6 +7811,7 @@ getTriggers(Archive *fout, TableInfo tblinfo[], int numTables)
 i_tgconstrrelid,
 i_tgconstrrelname,
 i_tgenabled,
+i_tgisinternal,
 i_tgdeferrable,
 i_tginitdeferred,
 i_tgdef;
@@ -7829,21 +7830,46 @@ getTriggers(Archive *fout, TableInfo tblinfo[], int numTables)
 	tbinfo->dobj.name);
 
 		resetPQExpBuffer(query);
-		if (fout->remoteVersion >= 9)
+		if (fout->remoteVersion >= 13000)
 		{
 			/*
 			 * NB: think not to use pretty=true in pg_get_triggerdef.  It
 			 * could result in non-forward-compatible dumps of WHEN clauses
 			 * due to under-parenthesization.
+			 * Use tgparentid, which is available since v13.
 			 */
 			appendPQExpBuffer(query,
-			  "SELECT tgname, "
-			  "tgfoid::pg_catalog.regproc AS tgfname, "
-			  "pg_catalog.pg_get_triggerdef(oid, false) AS tgdef, "
-			  "tgenabled, tableoid, oid "
+			  "SELECT t.tgname, "
+			  "t.tgfoid::pg_catalog.regproc AS tgfname, "
+			  "pg_catalog.pg_get_triggerdef(t.oid, false) AS tgdef, "
+			  "t.tgenabled, t.tableoid, t.oid, t.tgisinternal "
 			  "FROM pg_catalog.pg_trigger t "
-			  "WHERE tgrelid = '%u'::pg_catalog.oid "
-			  "AND NOT tgisinternal",
+			  "LEFT JOIN pg_catalog.pg_trigger u ON u.oid = t.tgparentid "
+			  "WHERE t.tgrelid = '%u'::pg_catalog.oid "
+			  "AND (NOT t.tgisinternal OR t.tgenabled != u.tgenabled)",
+			  tbinfo->dobj.catId.oid);
+		}
+		else if (fout->remoteVersion >= 9)
+		{
+			/*
+			 * NB: think not to use pretty=true in pg_get_triggerdef.  It
+			 * could result in non-forward-compatible dumps of WHEN clauses
+			 * due to under-parenthesization.
+			 * The pg_depend joins handle v11-v12, which allow inherited row
+			 * triggers, but did not have tgparentid.  And do nothing between v9-v10.
+			 */
+			appendPQExpBuffer(query,
+			  "SELECT t.tgname, "
+			  "t.tgfoid::pg_catalog.regproc AS tgfname, "
+			  "pg_catalog.pg_get_triggerdef(t.oid, false) AS tgdef, "
+			  "t.tgenabled, t.tableoid, t.oid, t.tgisinternal "
+			  "FROM pg_catalog.pg_trigger t "
+			  "LEFT JOIN pg_catalog.pg_depend AS d ON "
+			  " d.classid = 'pg_trigger'::pg_catalog.regclass AND "
+			  " d.refclassid = 'pg_trigger'::pg_catalog.regclass AND d.objid = t.oid "
+			  "LEFT JOIN pg_catalog.pg_trigger AS u ON u.oid = refobjid "
+			  "WHERE t.tgrelid = '%u'::pg_catalog.oid "
+			  "AND (NOT t.tgisinternal OR t.tgenabled != u.tgenabled)",
 			  tbinfo->dobj.catId.oid);
 		}
 		else if (fout->remoteVersion >= 80300)
@@ -7903,6 +7929,7 @@ getTriggers(Archive *fout, TableInfo tblinfo[], int numTables)
 		i_tgconstrrelid = PQfnumber(res, "tgco

Re: MultiXact\SLRU buffers configuration

2020-10-27 Thread Alexander Korotkov

On Tue, Oct 27, 2020 at 8:02 PM Alexander Korotkov  wrote:
> On Mon, Oct 26, 2020 at 6:45 PM Andrey Borodin  wrote:
> > Thanks for your review, Alexander!
> > +1 for avoiding double locking in SimpleLruReadPage_ReadOnly().
> > Other changes seem correct to me too.
> >
> >
> > I've tried to find optimal value for cache size and it seems to me that it 
> > affects multixact scalability much less than sizes of offsets\members 
> > buffers. I concur that patch 2 of the patchset does not seem documented 
> > enough.
>
> Thank you.  I've made a few more minor adjustments to the patchset.
> I'm going to push 0001 and 0003 if there are no objections.

I get that patchset v5 doesn't pass the tests due to typo in assert.
The fixes version is attached.

--
Regards,
Alexander Korotkov


v6-0001-Use-shared-lock-in-GetMultiXactIdMembers-for-offs.patch
Description: Binary data


v6-0002-Make-MultiXact-local-cache-size-configurable.patch
Description: Binary data


v6-0003-Add-conditional-variable-to-wait-for-next-MultXact.patch
Description: Binary data

Re: select_common_typmod

2020-10-27 Thread Peter Eisentraut


On 2020-10-26 10:05, Heikki Linnakangas wrote:

There might have been a tiny bug in transformValuesClause() because
while consolidating the typmods it does not take into account whether
the types are actually the same (as more correctly done in
transformSetOperationTree() and buildMergedJoinVar()).


Hmm, it seems so, but I could not come up with a test case to reach that
codepath. I think you'd need to create two types in the same
typcategory, but with different and incompatible typmod formats.


Yeah, something like that.


The patch also adds typmod resolution for hypothetical ordered-set
aggregate arguments. I couldn't come up with a test case that would
tickle that codepath either, but it seems like a sensible change. You
might want to mention it in the commit message though.


OK, committed with that.

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: MultiXact\SLRU buffers configuration

2020-10-27 Thread Alexander Korotkov

On Mon, Oct 26, 2020 at 6:45 PM Andrey Borodin  wrote:
> Thanks for your review, Alexander!
> +1 for avoiding double locking in SimpleLruReadPage_ReadOnly().
> Other changes seem correct to me too.
>
>
> I've tried to find optimal value for cache size and it seems to me that it 
> affects multixact scalability much less than sizes of offsets\members 
> buffers. I concur that patch 2 of the patchset does not seem documented 
> enough.

Thank you.  I've made a few more minor adjustments to the patchset.
I'm going to push 0001 and 0003 if there are no objections.

--
Regards,
Alexander Korotkov


v5-0001-Use-shared-lock-in-GetMultiXactIdMembers-for-offs.patch
Description: Binary data


v5-0002-Make-MultiXact-local-cache-size-configurable.patch
Description: Binary data


v5-0003-Add-conditional-variable-to-wait-for-next-MultXact.patch
Description: Binary data

Re: SQL:2011 PERIODS vs Postgres Ranges?

2020-10-27 Thread Paul Jungwirth


On 10/27/20 7:11 AM, Ibrar Ahmed wrote:
I have spent some more time on the patch and did a lot of cleanup along 
with some fixes, compilation errors, and warnings.


Thank you for taking a look at this! I've been swamped with ordinary 
work and haven't had a chance to focus on it for a while, but I'm hoping 
to make some improvements over the coming holidays, especially based on 
feedback from my talk at PgCon. There are a handful of small specific 
things I'd like to do, and then one big thing: add support for PERIODs. 
Vik said I could include his old patch for PERIODs, so I'd like to get 
that working on the latest master, and then rebase my own work on top of 
it. Then we can accept either ranges or PERIODs in various places 
(marked by TODOs in the code).


Vik also pointed out a way to check foreign keys without using 
range_agg. He thinks it may even be more efficient. On the other hand 
it's a much more complicated SQL statement. I'd like to do a performance 
comparison to get concrete numbers, but if we did use his query, then 
this patch wouldn't depend on multiranges anymore---which seems like a 
big aid to moving it forward. Assuming multiranges gets committed, we 
can always swap in the range_agg query depending on the performance 
comparison results.


I apologize for the slow progress here, and thank you for your help!

Yours,

--
Paul  ~{:-)
p...@illuminatedcomputing.com

Re: "unix_socket_directories" should be GUC_LIST_INPUT?

2020-10-27 Thread Tom Lane

Robert Haas  writes:
> If we change this, is it going to be a compatibility break for the
> contents of postgresql.conf files?

I think not, at least not for the sorts of values you'd ordinarily
find in that variable, say '/tmp, /var/run/postgresql'.  Possibly
the behavior would change for pathnames containing spaces or the
like, but it is probably kinda broken for such cases today anyway.

In any case, Michael had promised to test this aspect before committing.

regards, tom lane

Re: "unix_socket_directories" should be GUC_LIST_INPUT?

2020-10-27 Thread Robert Haas

On Tue, Oct 27, 2020 at 9:45 AM Tom Lane  wrote:
> Peter Eisentraut  writes:
> > If we're going to change it I think we need an updated patch that covers
> > pg_dump.  (Even if we argue that pg_dump would not normally dump this
> > variable, keeping it up to date with respect to GUC_LIST_QUOTE seems
> > proper.)
>
> Right, I was definitely assuming that that would happen.

If we change this, is it going to be a compatibility break for the
contents of postgresql.conf files?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Implementing Incremental View Maintenance

2020-10-27 Thread Adam Brusselback

That was a good bit more work to get ready than I expected. It's broken
into two scripts, one to create the schema, the other to load data and
containing a couple check queries to ensure things are working properly
(checking the materialized tables against a regular view for accuracy).

The first test case is to give us a definitive result on what "agreed
pricing" is in effect at a point in time based on a product hierarchy
our customers setup, and allow pricing to be set on nodes in that
hierarchy, as well as specific products (with an order of precedence).
The second test case maintains some aggregated amounts / counts / boolean
logic at an "invoice" level for all the detail lines which make up that
invoice.

Both of these are real-world use cases which were simplified a bit to make
them easier to understand. We have other use cases as well, but with how
much time this took to prepare i'll keep it at this for now.
If you need anything clarified or have any issues, just let me know.

On Fri, Oct 23, 2020 at 3:58 AM Yugo NAGATA  wrote:

> Hi Adam,
>
> On Thu, 22 Oct 2020 10:07:29 -0400
> Adam Brusselback  wrote:
>
> > Hey there Yugo,
> > I've asked a coworker to prepare a self contained example that
> encapsulates
> > our multiple use cases.
>
> Thank you very much!
>
> > The immediate/eager approach is exactly what we need, as within the same
> > transaction we have statements that can cause one of those "materialized
> > tables" to be updated, and then sometimes have the need to query that
> > "materialized table" in a subsequent statement and need to see the
> changes
> > reflected.
>
> The proposed patch provides the exact this feature and I think this will
> meet
> your needs.
>
> > As soon as my coworker gets that example built up I'll send a followup
> with
> > it attached.
>
> Great! We are looking forward to it.
>
> Regards,
> Yugo Nagata
>
> --
> Yugo NAGATA 
>

02_materialized_test_data.sql
Description: Binary data

01_materialized_test_schema.sql
Description: Binary data

Re: Parallel copy

2020-10-27 Thread vignesh C

On Fri, Oct 23, 2020 at 6:58 PM Ashutosh Sharma  wrote:
>
> >
> > I think, if possible, all these if-else checks in CopyFrom() can be
> > moved to a single function which can probably be named as
> > IdentifyCopyInsertMethod() and this function can be called in
> > IsParallelCopyAllowed(). This will ensure that in case of Parallel
> > Copy when the leader has performed all these checks, the worker won't
> > do it again. I also feel that it will make the code look a bit
> > cleaner.
> >
>
> Just rewriting above comment to make it a bit more clear:
>
> I think, if possible, all these if-else checks in CopyFrom() should be
> moved to a separate function which can probably be named as
> IdentifyCopyInsertMethod() and this function called from
> IsParallelCopyAllowed() and CopyFrom() functions. It will only be
> called from CopyFrom() when IsParallelCopy() returns false. This will
> ensure that in case of Parallel Copy if the leader has performed all
> these checks, the worker won't do it again. I also feel that having a
> separate function containing all these checks will make the code look
> a bit cleaner.
>

In the recent patch posted we have changed it to simplify the check
for parallel copy, it is not an exact match. I feel this comment is
not applicable on the latest patch

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Re: automatic analyze: readahead - add "IO read time" log message

2020-10-27 Thread Stephen Frost

Greetings Jakub,

* Jakub Wartak (jakub.war...@tomtom.com) wrote:
> > The analyze is doing more-or-less random i/o since it's skipping through
> > the table picking out select blocks, not doing regular sequential i/o.
> VS
> >> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, 
> >> blockno=19890910, bstrategy=0x1102278) at heapam_handler.c:984
> >> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, 
> >> blockno=19890912, bstrategy=0x1102278) at heapam_handler.c:984
> >> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, 
> >> blockno=19890922, bstrategy=0x1102278) at heapam_handler.c:984
> >Not really sure what's interesting here, but it does look like we're
> >skipping through the table as expected.
> 
> Yes, but not randomly in this case. I wanted to point out that this is 
> incrementing block number, therefore I've included this debug output which 
> might trigger readahead heuristics.

Sure, it's incrementing, but it's skipping- this is very similar to what
we do with Bitmap Heap Scans, and that makes it different from a typical
sequential scan.

> Perhaps this depends on how the table was built / vacuumed ? (in this case, 
> pure-INSERT-only; I would expect the same in time series DBs and DWHs). 

No, this is how ANALYZE behaves and hasn't got much to do with how the
table was built or vacuumed.

> > With all those 'readahead' calls it certainly makes one wonder if the
> > Linux kernel is reading more than just the block we're looking for
> > because it thinks we're doing a sequential read and will therefore want
> > the next few blocks when, in reality, we're going to skip past them,
> > meaning that any readahead the kernel is doing is likely just wasted
> > I/O.
> 
> I've done some quick&dirty tests with blockdev --setra/setfra 0 after 
> spending time looking at the smgr/md/fd API changes required to find 
> shortcut, but I'm getting actually a little bit worse timings at least on 
> "laptop DB tests". One thing that I've noticed is that needs to be only for 
> automatic-analyze, but not for automatic-vacuum where apparently there is 
> some boost due to readahead.

Interesting that you weren't seeing any benefit to disabling readahead.
Were you able to see where the time in the kernel was going when
readahead was turned off for the ANALYZE?

Note that you shouldn't need to make changes to smgr/md/fd to leverage
posix_fadvise- what you would do is use PrefetchBuffer(), see
BitmapPrefetch().

The VACUUM case is going to be complicated by what's in the visibility
map.  A VACUUM that isn't able to skip any pages yet is certainly going
to benefit from the kernel's readahead, but a VACUUM that's able to skip
over pages likely wouldn't benefit as much.

The way to test this would look something like:

- COPY a bunch of data into a table
- make sure to commit that and any other ongoing transactions
- VACUUM FREEZE the table
- check the visibility map to make sure most of the pages are marked as
  all-frozen after the VACUUM FREEZE
- randomly UPDATE the table, to really get the effect, maybe update 20%
  of the pages while leaving the rest alone (and therefore 80% of the
  table should still have the all-frozen bit set in the visibility map)
- *then* do a VACUUM on the table and see what happens with different
  amounts of read-ahead (or, ideally, with posix_fadvise() being used to
  let the kernel know what pages we're going to actually want).

> > That definitely seems like a useful thing to include and thanks for the
> > patch!  Please be sure to register it in the commitfest app:
> > https://commitfest.postgresql.org
> 
> Thank you! Thread is now registered.

Great!

> > I would think that, ideally, we'd teach analyze.c to work in the same
> > way that bitmap heap scans do- that is, use posix_fadvise to let the
> > kernel know what pages we're going to want next instead of the kernel
> > guessing (incorrectly) or not doing any pre-fetching.  I didn't spend a
> > lot of time poking, but it doesn't look like analyze.c tries to do any
> > prefetching today.  In a similar vein, I wonder if VACUUM should be
> > doing prefetching too today, at least when it's skipping through the
> > heap based on the visibility map and jumping over all-frozen pages.
> 
> My only idea would be that a lot of those blocks could be read asynchronously 
> in batches (AIO) with POSIX_FADV_RANDOM issued on block-range before, so 
> maybe the the optimization is possible, but not until we'll have AIO ;)

Well, the idea is that posix_fadvise() usage through PrefetchBuffer()
gets us some of that by letting the kernel know what we're going to ask
for next.  AIO is a whole other animal that's been discussed off and on
around here but it's a much larger and more invasive change than just
calling posix_fadvise().

> > Haven't looked too closely at this but in general +1 on the idea and
> > this approach looks pretty reasonable to me.  Only thing I can think of
> > off-hand is to check how it compares

Re: Parallel copy

2020-10-27 Thread vignesh C

Thanks Ashutosh for reviewing and providing your comments.

On Fri, Oct 23, 2020 at 5:43 PM Ashutosh Sharma  wrote:
>
> Hi Vignesh,
>
> Thanks for the updated patches. Here are some more comments that I can
> find after reviewing your latest patches:
>
> +/*
> + * This structure helps in storing the common data from CopyStateData that 
> are
> + * required by the workers. This information will then be allocated and 
> stored
> + * into the DSM for the worker to retrieve and copy it to CopyStateData.
> + */
> +typedef struct SerializedParallelCopyState
> +{
> +   /* low-level state data */
> +   CopyDestcopy_dest;  /* type of copy source/destination */
> +   int file_encoding;  /* file or remote side's character encoding */
> +   boolneed_transcoding;   /* file encoding diff from server? */
> +   boolencoding_embeds_ascii;  /* ASCII can be non-first byte? */
> +
> ...
> ...
> +
> +   /* Working state for COPY FROM */
> +   AttrNumber  num_defaults;
> +   Oid relid;
> +} SerializedParallelCopyState;
>
> Can the above structure not be part of the CopyStateData structure? I
> am just asking this question because all the fields present in the
> above structure are also present in the CopyStateData structure. So,
> including it in the CopyStateData structure will reduce the code
> duplication and will also make CopyStateData a bit shorter.
>

I have removed the common members from the structure, now there are no
common members between CopyStateData & the new structure. I'm using
CopyStateData to copy to/from directly in the new patch.

> --
>
> +   pcxt = BeginParallelCopy(cstate->nworkers, cstate, stmt->attlist,
> +relid);
>
> Do we need to pass cstate->nworkers and relid to BeginParallelCopy()
> function when we are already passing cstate structure, using which
> both of these information can be retrieved ?
>

nworkers need not be passed as you have suggested but relid need to be
passed as we will be setting it to pcdata, modified nworkers as
suggested.

> --
>
> +/* DSM keys for parallel copy.  */
> +#define PARALLEL_COPY_KEY_SHARED_INFO  1
> +#define PARALLEL_COPY_KEY_CSTATE   2
> +#define PARALLEL_COPY_WAL_USAGE3
> +#define PARALLEL_COPY_BUFFER_USAGE 4
>
> DSM key names do not appear to be consistent. For shared info and
> cstate structures, the key name is prefixed with "PARALLEL_COPY_KEY",
> but for WalUsage and BufferUsage structures, it is prefixed with
> "PARALLEL_COPY". I think it would be better to make them consistent.
>

Modified as suggested

> --
>
> if (resultRelInfo->ri_TrigDesc != NULL &&
> (resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
>  resultRelInfo->ri_TrigDesc->trig_insert_instead_row))
> {
> /*
>  * Can't support multi-inserts when there are any BEFORE/INSTEAD OF
>  * triggers on the table. Such triggers might query the table we're
>  * inserting into and act differently if the tuples that have already
>  * been processed and prepared for insertion are not there.
>  */
> insertMethod = CIM_SINGLE;
> }
> else if (proute != NULL && resultRelInfo->ri_TrigDesc != NULL &&
>  resultRelInfo->ri_TrigDesc->trig_insert_new_table)
> {
> /*
>  * For partitioned tables we can't support multi-inserts when there
>  * are any statement level insert triggers. It might be possible to
>  * allow partitioned tables with such triggers in the future, but for
>  * now, CopyMultiInsertInfoFlush expects that any before row insert
>  * and statement level insert triggers are on the same relation.
>  */
> insertMethod = CIM_SINGLE;
> }
> else if (resultRelInfo->ri_FdwRoutine != NULL ||
>  cstate->volatile_defexprs)
> {
> ...
> ...
>
> I think, if possible, all these if-else checks in CopyFrom() can be
> moved to a single function which can probably be named as
> IdentifyCopyInsertMethod() and this function can be called in
> IsParallelCopyAllowed(). This will ensure that in case of Parallel
> Copy when the leader has performed all these checks, the worker won't
> do it again. I also feel that it will make the code look a bit
> cleaner.
>

In the recent patch posted we have changed it to simplify the check
for parallel copy, it is not an exact match. I feel this comment is
not applicable on the latest patch

> --
>
> +void
> +ParallelCopyMain(dsm_segment *seg, shm_toc *toc)
> +{
> ...
> ...
> +   InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
> + &walusage[ParallelWorkerNumber]);
> +
> +   MemoryContextSwitchTo(oldcontext);
> +   pfree(cstate);
> +   return;
> +}
>
> It seems like you also need to delete the memory context
> (cstate->copycontext) here.
>

Added it.

> --
>
> +void
> +ExecBeforeStmtTrigger(CopyState cstate)
> +{
> +   ESta

Re: Parallel copy

2020-10-27 Thread vignesh C

On Wed, Oct 21, 2020 at 3:50 PM Amit Kapila  wrote:
>
> On Wed, Oct 21, 2020 at 3:19 PM Bharath Rupireddy
>  wrote:
> >
> >
> > 9. Instead of calling CopyStringToSharedMemory() for each string
> > variable, can't we just create a linked list of all the strings that
> > need to be copied into shm and call CopyStringToSharedMemory() only
> > once? We could avoid 5 function calls?
> >
>
> If we want to avoid different function calls then can't we just store
> all these strings in a local structure and use it? That might improve
> the other parts of code as well where we are using these as individual
> parameters.
>

I have made one structure SerializedListToStrCState to store all the
variables. The rest of the common variables is directly copied from &
into cstate.

> > 10. Similar to above comment: can we fill all the required
> > cstate->variables inside the function CopyNodeFromSharedMemory() and
> > call it only once? In each worker we could save overhead of 5 function
> > calls.
> >
>
> Yeah, that makes sense.
>

I feel keeping it this way makes the code more readable, and also this
is not in a performance intensive tight loop. I'm  retaining the
change as is unless we feel this will make an impact.

This is addressed in v9 patch shared at [1].
[1] - 
https://www.postgresql.org/message-id/caldanm1caonkfdn6k72dsirpgqngvwxql7tjeihz58opnp9...@mail.gmail.com

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Re: Parallel copy

2020-10-27 Thread vignesh C

Thanks Heikki for reviewing and providing your comments. Please find
my thoughts below.

On Fri, Oct 23, 2020 at 2:01 PM Heikki Linnakangas  wrote:
>
> I had a brief look at at this patch. Important work! A couple of first
> impressions:
>
> 1. The split between patches
> 0002-Framework-for-leader-worker-in-parallel-copy.patch and
> 0003-Allow-copy-from-command-to-process-data-from-file.patch is quite
> artificial. All the stuff introduced in the first is unused until the
> second patch is applied. The first patch introduces a forward
> declaration for ParallelCopyData(), but the function only comes in the
> second patch. The comments in the first patch talk about
> LINE_LEADER_POPULATING and LINE_LEADER_POPULATED, but the enum only
> comes in the second patch. I think these have to merged into one. If you
> want to split it somehow, I'd suggest having a separate patch just to
> move CopyStateData from copy.c to copy.h. The subsequent patch would
> then be easier to read as you could see more easily what's being added
> to CopyStateData. Actually I think it would be better to have a new
> header file, copy_internal.h, to hold CopyStateData and the other
> structs, and keep copy.h as it is.
>

I have merged 0002 & 0003 patch, I have moved few things like creation
of copy_internal.h, moving of CopyStateData from copy.c into
copy_internal.h into 0001 patch.

> 2. This desperately needs some kind of a high-level overview of how it
> works. What is a leader, what is a worker? Which process does each step
> of COPY processing, like reading from the file/socket, splitting the
> input into lines, handling escapes, calling input functions, and
> updating the heap and indexes? What data structures are used for the
> communication? How does is the work synchronized between the processes?
> There are comments on those individual aspects scattered in the patch,
> but if you're not already familiar with it, you don't know where to
> start. There's some of that in the commit message, but it needs to be
> somewhere in the source code, maybe in a long comment at the top of
> copyparallel.c.
>

Added it in copyparallel.c

> 3. I'm surprised there's a separate ParallelCopyLineBoundary struct for
> every input line. Doesn't that incur a lot of synchronization overhead?
> I haven't done any testing, this is just my gut feeling, but I assumed
> you'd work in batches of, say, 100 or 1000 lines each.
>

Data read from the file will be stored in DSM which is of size 64k *
1024. Leader will parse and identify the line boundary like which line
starts from which data block, what is the starting offset in the data
block, what is the line size, this information will be present in
ParallelCopyLineBoundary. Like you said, each worker processes
WORKER_CHUNK_COUNT 64 lines at a time. Performance test results run
for parallel copy are available at [1]. This is addressed in v9 patch
shared at [2].

[1] 
https://www.postgresql.org/message-id/CALj2ACWeQVd-xoQZHGT01_33St4xPoZQibWz46o7jW1PE3XOqQ%40mail.gmail.com
[2] - 
https://www.postgresql.org/message-id/caldanm1caonkfdn6k72dsirpgqngvwxql7tjeihz58opnp9...@mail.gmail.com

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Re: Parallel copy

2020-10-27 Thread vignesh C

On Wed, Oct 21, 2020 at 4:20 PM Bharath Rupireddy
 wrote:
>
> On Wed, Oct 21, 2020 at 3:18 PM Bharath Rupireddy
>  wrote:
> >
> > 17. Remove extra lines after #define IsHeaderLine()
> > (cstate->header_line && cstate->cur_lineno == 1) in copy.h
> >
>
>  I missed one comment:
>
>  18. I think we need to treat the number of parallel workers as an
> integer similar to the parallel option in vacuum.
>
> postgres=# copy t1 from stdin with(parallel '1');  < - we
> should not allow this.
> Enter data to be copied followed by a newline.
>
> postgres=# vacuum (parallel '1') t1;
> ERROR:  parallel requires an integer value
>

I have made the behavior the same as vacuum.
This is addressed in v9 patch shared at [1].
[1] - 
https://www.postgresql.org/message-id/caldanm1caonkfdn6k72dsirpgqngvwxql7tjeihz58opnp9...@mail.gmail.com

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Re: automatic analyze: readahead - add "IO read time" log message

2020-10-27 Thread Jakub Wartak

Hi Stephen, hackers,

> The analyze is doing more-or-less random i/o since it's skipping through
> the table picking out select blocks, not doing regular sequential i/o.
VS
>> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, 
>> blockno=19890910, bstrategy=0x1102278) at heapam_handler.c:984
>> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, 
>> blockno=19890912, bstrategy=0x1102278) at heapam_handler.c:984
>> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, 
>> blockno=19890922, bstrategy=0x1102278) at heapam_handler.c:984
>Not really sure what's interesting here, but it does look like we're
>skipping through the table as expected.

Yes, but not randomly in this case. I wanted to point out that this is 
incrementing block number, therefore I've included this debug output which 
might trigger readahead heuristics.
Perhaps this depends on how the table was built / vacuumed ? (in this case, 
pure-INSERT-only; I would expect the same in time series DBs and DWHs). 

> With all those 'readahead' calls it certainly makes one wonder if the
> Linux kernel is reading more than just the block we're looking for
> because it thinks we're doing a sequential read and will therefore want
> the next few blocks when, in reality, we're going to skip past them,
> meaning that any readahead the kernel is doing is likely just wasted
> I/O.

I've done some quick&dirty tests with blockdev --setra/setfra 0 after spending 
time looking at the smgr/md/fd API changes required to find shortcut, but I'm 
getting actually a little bit worse timings at least on "laptop DB tests". One 
thing that I've noticed is that needs to be only for automatic-analyze, but not 
for automatic-vacuum where apparently there is some boost due to readahead.

> That definitely seems like a useful thing to include and thanks for the
> patch!  Please be sure to register it in the commitfest app:
> https://commitfest.postgresql.org

Thank you! Thread is now registered.

> I would think that, ideally, we'd teach analyze.c to work in the same
> way that bitmap heap scans do- that is, use posix_fadvise to let the
> kernel know what pages we're going to want next instead of the kernel
> guessing (incorrectly) or not doing any pre-fetching.  I didn't spend a
> lot of time poking, but it doesn't look like analyze.c tries to do any
> prefetching today.  In a similar vein, I wonder if VACUUM should be
> doing prefetching too today, at least when it's skipping through the
> heap based on the visibility map and jumping over all-frozen pages.

My only idea would be that a lot of those blocks could be read asynchronously 
in batches (AIO) with POSIX_FADV_RANDOM issued on block-range before, so maybe 
the the optimization is possible, but not until we'll have AIO ;)

> Haven't looked too closely at this but in general +1 on the idea and
> this approach looks pretty reasonable to me.  Only thing I can think of
> off-hand is to check how it compares to other places where we report IO
> read time and make sure that it looks similar.

Ok, I've changed the output in 002 version to include "avg read rate" just like 
in the autovacuum case but still maintaining single line output, e.g:  
automatic analyze of table "test.public.t1_default" avg read rate: 96.917 MB/s 
(read time: 2.52 s), system usage: CPU: user: 0.28 s, system: 0.26 s, elapsed: 
2.94 s

-J.diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 8af12b5c6b..38d69d887e 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -312,6 +312,8 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
Oid save_userid;
int save_sec_context;
int save_nestlevel;
+   PgStat_Counter startreadtime = 0;
+   longstartblksread = 0;
 
if (inh)
ereport(elevel,
@@ -347,6 +349,8 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
{
pg_rusage_init(&ru0);
+   startreadtime = pgStatBlockReadTime;
+   startblksread = pgBufferUsage.shared_blks_read;
if (params->log_min_duration > 0)
starttime = GetCurrentTimestamp();
}
@@ -685,12 +689,27 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
if (params->log_min_duration == 0 ||
TimestampDifferenceExceeds(starttime, 
GetCurrentTimestamp(),
   
params->log_min_duration))
-   ereport(LOG,
-   (errmsg("automatic analyze of table 
\"%s.%s.%s\" system usage: %s",
-   
get_database_name(MyDatabaseId),
-   
get_namespace_name(RelationGetNam

Re: parallel distinct union and aggregate support patch

2020-10-27 Thread Dilip Kumar

On Tue, Oct 27, 2020 at 5:43 PM Robert Haas  wrote:
>
> On Thu, Oct 22, 2020 at 5:08 AM Dilip Kumar  wrote:
> > Interesting idea.  So IIUC, whenever a worker is scanning the tuple it
> > will directly put it into the respective batch(shared tuple store),
> > based on the hash on grouping column and once all the workers are
> > doing preparing the batch then each worker will pick those baches one
> > by one, perform sort and finish the aggregation.  I think there is a
> > scope of improvement that instead of directly putting the tuple to the
> > batch what if the worker does the partial aggregations and then it
> > places the partially aggregated rows in the shared tuple store based
> > on the hash value and then the worker can pick the batch by batch.  By
> > doing this way, we can avoid doing large sorts.  And then this
> > approach can also be used with the hash aggregate, I mean the
> > partially aggregated data by the hash aggregate can be put into the
> > respective batch.
>
> I am not sure if this would be a win if the typical group size is
> small and the transition state has to be serialized/deserialized.
> Possibly we need multiple strategies, but I guess we'd have to test
> performance to be sure.

+1

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Re: parallel distinct union and aggregate support patch

2020-10-27 Thread Dilip Kumar

On Tue, Oct 27, 2020 at 3:27 PM Dilip Kumar  wrote:
>
> On Fri, Oct 23, 2020 at 11:58 AM bu...@sohu.com  wrote:
> >
> > > Interesting idea.  So IIUC, whenever a worker is scanning the tuple it
> > > will directly put it into the respective batch(shared tuple store),
> > > based on the hash on grouping column and once all the workers are
> > > doing preparing the batch then each worker will pick those baches one
> > > by one, perform sort and finish the aggregation.  I think there is a
> > > scope of improvement that instead of directly putting the tuple to the
> > > batch what if the worker does the partial aggregations and then it
> > > places the partially aggregated rows in the shared tuple store based
> > > on the hash value and then the worker can pick the batch by batch.  By
> > > doing this way, we can avoid doing large sorts.  And then this
> > > approach can also be used with the hash aggregate, I mean the
> > > partially aggregated data by the hash aggregate can be put into the
> > > respective batch.
> >
> > Good idea. Batch sort suitable for large aggregate result rows,
> > in large aggregate result using partial aggregation maybe out of memory,
> > and all aggregate functions must support partial(using batch sort this is 
> > unnecessary).
> >
> > Actually i written a batch hash store for hash aggregate(for pg11) like 
> > this idea,
> > but not write partial aggregations to shared tuple store, it's write origin 
> > tuple and hash value
> > to shared tuple store, But it's not support parallel grouping sets.
> > I'am trying to write parallel hash aggregate support using batch shared 
> > tuple store for PG14,
> > and need support parallel grouping sets hash aggregate.
>
> I was trying to look into this patch to understand the logic in more
> detail.  Actually, there are no comments at all so it's really hard to
> understand what the code is trying to do.
>
> I was reading the below functions, which is the main entry point for
> the batch sort.
>
> +static TupleTableSlot *ExecBatchSortPrepare(PlanState *pstate)
> +{
> ...
> + for (;;)
> + {
> ...
> + tuplesort_puttupleslot(state->batches[hash%node->numBatches], slot);
> + }
> +
> + for (i=node->numBatches;i>0;)
> + tuplesort_performsort(state->batches[--i]);
> +build_already_done_:
> + if (parallel)
> + {
> + for (i=node->numBatches;i>0;)
> + {
> + --i;
> + if (state->batches[i])
> + {
> + tuplesort_end(state->batches[i]);
> + state->batches[i] = NULL;
> + }
> + }
>
> I did not understand this part, that once each worker has performed
> their local batch-wise sort why we are clearing the baches?  I mean
> individual workers have their on batches so eventually they supposed
> to get merged.  Can you explain this part and also it will be better
> if you can add the comments.

I think I got this,  IIUC, each worker is initializing the shared
short and performing the batch-wise sorting and we will wait on a
barrier so that all the workers can finish with their sorting.  Once
that is done the workers will coordinate and pick the batch by batch
and perform the final merge for the batch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Internal key management system

2020-10-27 Thread Bruce Momjian

On Tue, Oct 27, 2020 at 10:02:53PM +0800, Craig Ringer wrote:
> On Tue, 27 Oct 2020, 19:15 Bruce Momjian,  wrote:
> We don't need to do anything except provide a way to tell OpenSSL where to get
> the KEK from, for situations where having Pg generate it internally
> undesirable. 
> 
> I proposed a simple GUC that we could supply to OpenSSL as a key path because
> it's simple. It's definitely not best.
> 
> In my prior mail I outlined what I think is a better way. Abstract key key
> initialisation -  passphrase fetching KEK/HMAC loading and all of it - behind 
> a
> pluggable interface. Looking at the patch, it's mostly there already. We just
> need a way to hook the key loading and setup so it can be overridden to use
> whatever method is required. Then KEK operations to encrypt and decrypt the
> heap and WAL keys happen via that abstraction.
> 
> That way Pg does not have to care about the details of hardware key 
> management,
> PKCS#11 or OpenSSL engines, etc.
> 
> A little thought is needed to make key rotation work well. Especially when you
> want to switch from cluster passphrase to a plugin that supports use of a HVM
> escrowed key, or vice versa.
> 
> But most of what's needed looks like it's there already. It's just down to
> making sure the key loading and initialisation is overrideable.

I don't know much about how to hook into that stuff so if you have an
idea, I am all ears.  I have used OpenSSL with Yubikey via pksc11.  You
can see the use of it on slide 57 and following:

https://momjian.us/main/writings/crypto_hw_config.pdf#page=57

Interestingly, that still needed the user to type in a key to unlock the
Yubikey, so we might need PKCS11 and a password for the same server
start.

I would like to get this moving forward so I will work on the idea of
passing an open /dev/tty file descriptor from pg_ctl to the postmaster
on start.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee

Re: Question about make coverage-html

2020-10-27 Thread Tom Lane

Heikki Linnakangas  writes:
> On 27/10/2020 10:09, Peter Smith wrote:
>> The documentation [1] also says "The make commands also work in
>> subdirectories." so I tried running them all in that folder.
>> However, when I run "make coverage-html" in that subdirectory
>> src/test/subscription it does not work:

> Creating a coverage report is a two-step process. First, you run the 
> test you're interested in, with "make check" or similar. Then you create 
> a report for the source files you're interested in, with "make 
> coverage-html". You can run these commands in different subdirectories.

> In this case, you want to do "cd src/test/subscription; make check", to 
> run those TAP tests, and then run "make coverage-html" from the top 
> folder. Or if you wanted to create coverage report that covers only 
> replication-related source code, for example, you could run it in the 
> src/backend/replication directory ("cd src/backend/replication; make 
> coverage-html").

I agree with the OP that the documentation is a bit vague here.
I think (maybe I'm wrong) that it's clear enough that you can run
whichever test case(s) you want, but this behavior of generating a
partial coverage report is less clear.  Maybe instead of

The "make" commands also work in subdirectories.

we could say

You can run the "make coverage-html" command in a subdirectory
if you want a coverage report for only a portion of the code tree.

regards, tom lane

Re: Add important info about ANALYZE after create Functional Index

2020-10-27 Thread Fabrízio de Royes Mello

On Tue, Oct 27, 2020 at 4:12 AM Nikolay Samokhvalov 
wrote:
>
> On Mon, Oct 26, 2020 at 3:08 PM Fabrízio de Royes Mello <
fabriziome...@gmail.com> wrote:
>>
>> Would be nice if add some information about it into our docs but not
sure where. I'm thinking about:
>> - doc/src/sgml/ref/create_index.sgml
>> - doc/src/sgml/maintenance.sgml (routine-reindex)
>
>
> Attaching the patches for the docs, one for 11 and older, and another for
12+ (which have REINDEX CONCURRENTLY not suffering from lack of ANALYZE).
>

Actually the REINDEX CONCURRENTLY suffers with the lack of ANALYZE. See my
previous message on this thread.

So just adding the note on the ANALYZE docs is enough.


> I still think that automating is the right thing to do but of course,
it's a much bigger topic that a quick fix dor the docs.

So what we need to do is see how to fix REINDEX CONCURRENTLY.

Regards,

-- 
   Fabrízio de Royes Mello
   PostgreSQL Developer at OnGres Inc. - https://ongres.com

Re: Add important info about ANALYZE after create Functional Index

2020-10-27 Thread Fabrízio de Royes Mello

On Mon, Oct 26, 2020 at 7:46 PM David G. Johnston <
david.g.johns...@gmail.com> wrote:
>
> It would seem preferable to call the lack of auto-analyzing after these
operations a bug and back-patch a fix that injects an analyze side-effect
just before their completion.  It doesn't have to be smart either,
analyzing things even if the created (or newly validated) index doesn't
have statistics of its own isn't a problem in my book.
>

When we create a new table or index they will not have statistics until an
ANALYZE happens. This is the default behaviour and I think is not a big
problem here, but we need to add some note on docs about the need of
statistics for indexes on expressions.

But IMHO there is a misbehaviour with the implementation of CONCURRENTLY on
REINDEX because running it will lose the statistics. Have a look the
example below:

fabrizio=# SELECT version();
 version

-
 PostgreSQL 14devel on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
9.3.0-17ubuntu1~20.04) 9.3.0, 64-bit
(1 row)

fabrizio=# CREATE TABLE t(f1 BIGSERIAL PRIMARY KEY, f2 TEXT) WITH
(autovacuum_enabled = false);
CREATE TABLE
fabrizio=# INSERT INTO t(f2) SELECT repeat(chr(65+(random()*26)::int),
(random()*300)::int) FROM generate_series(1, 1);
INSERT 0 1
fabrizio=# CREATE INDEX t_idx2 ON t(lower(f2));
CREATE INDEX
fabrizio=# SELECT count(*) FROM pg_statistic WHERE starelid =
't_pkey'::regclass;
 count
---
 0
(1 row)

fabrizio=# SELECT count(*) FROM pg_statistic WHERE starelid =
't_idx2'::regclass;
 count
---
 0
(1 row)

fabrizio=# ANALYZE t;
ANALYZE
fabrizio=# SELECT count(*) FROM pg_statistic WHERE starelid =
't_pkey'::regclass;
 count
---
 0
(1 row)

fabrizio=# SELECT count(*) FROM pg_statistic WHERE starelid =
't_idx2'::regclass;
 count
---
 1
(1 row)

fabrizio=# REINDEX INDEX t_idx2;
REINDEX
fabrizio=# REINDEX INDEX t_pkey;
REINDEX
fabrizio=# SELECT count(*) FROM pg_statistic WHERE starelid =
't_pkey'::regclass;
 count
---
 0
(1 row)

fabrizio=# SELECT count(*) FROM pg_statistic WHERE starelid =
't_idx2'::regclass;
 count
---
 1
(1 row)

-- A regular REINDEX don't lose the statistics.


fabrizio=# REINDEX INDEX CONCURRENTLY t_idx2;
REINDEX
fabrizio=# SELECT count(*) FROM pg_statistic WHERE starelid =
't_idx2'::regclass;
 count
---
 0
(1 row)


-- But the REINDEX CONCURRENTLY loses.

So IMHO here is the place we should rework a bit to execute ANALYZE as a
last step.

Regards,

-- 
   Fabrízio de Royes Mello
   PostgreSQL Developer at OnGres Inc. - https://ongres.com

Re: Internal key management system

2020-10-27 Thread Craig Ringer

On Tue, 27 Oct 2020, 19:15 Bruce Momjian,  wrote:

> We could implement a 'command:' prefix now, and maybe
> a 'pass:' one, and allow other methods like 'pkcs11' later.
>

We don't need to do anything except provide a way to tell OpenSSL where to
get the KEK from, for situations where having Pg generate it internally
undesirable.

I proposed a simple GUC that we could supply to OpenSSL as a key path
because it's simple. It's definitely not best.

In my prior mail I outlined what I think is a better way. Abstract key key
initialisation -  passphrase fetching KEK/HMAC loading and all of it -
behind a pluggable interface. Looking at the patch, it's mostly there
already. We just need a way to hook the key loading and setup so it can be
overridden to use whatever method is required. Then KEK operations to
encrypt and decrypt the heap and WAL keys happen via that abstraction.

That way Pg does not have to care about the details of hardware key
management, PKCS#11 or OpenSSL engines, etc.

A little thought is needed to make key rotation work well. Especially when
you want to switch from cluster passphrase to a plugin that supports use of
a HVM escrowed key, or vice versa.

But most of what's needed looks like it's there already. It's just down to
making sure the key loading and initialisation is overrideable.

Re: duplicate function oid symbols

2020-10-27 Thread Tom Lane

John Naylor  writes:
> I noticed that the table AM abstraction introduced the symbol
> HEAP_TABLE_AM_HANDLER_OID, although we already have a convention for
> defining symbols automatically for builtin functions, which in this case is
> (currently unused) F_HEAP_TABLEAM_HANDLER.

Yeah, that seems wrong.  I'd just remove HEAP_TABLE_AM_HANDLER_OID.
As long as we're not back-patching the change, it seems like a very
minor thing to fix, if anyone outside core is referencing the old name.

regards, tom lane

Re: "unix_socket_directories" should be GUC_LIST_INPUT?

2020-10-27 Thread Tom Lane

Peter Eisentraut  writes:
> If we're going to change it I think we need an updated patch that covers 
> pg_dump.  (Even if we argue that pg_dump would not normally dump this 
> variable, keeping it up to date with respect to GUC_LIST_QUOTE seems 
> proper.)

Right, I was definitely assuming that that would happen.

regards, tom lane

Re: SQL-standard function body

2020-10-27 Thread Peter Eisentraut

Here is another updated patch.  I did some merging and some small fixes 
and introduced the pg_proc column prosqlbody to store the parsed 
function body separately from probin.  Aside from one TODO left it seems 
feature-complete to me for now.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 566a30b33df2bca79dd7c6f45f2f55be42403b27 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 27 Oct 2020 14:40:14 +0100
Subject: [PATCH v5] SQL-standard function body

This adds support for writing CREATE FUNCTION and CREATE PROCEDURE
statements for language SQL with a function body that conforms to the
SQL standard and is portable to other implementations.

Instead of the PostgreSQL-specific AS $$ string literal $$ syntax,
this allows writing out the SQL statements making up the body
unquoted, either as a single statement:

CREATE FUNCTION add(a integer, b integer) RETURNS integer
LANGUAGE SQL
RETURN a + b;

or as a block

CREATE PROCEDURE insert_data(a integer, b integer)
LANGUAGE SQL
BEGIN ATOMIC
  INSERT INTO tbl VALUES (a);
  INSERT INTO tbl VALUES (b);
END;

The function body is parsed at function definition time and stored as
expression nodes in a new pg_proc column prosqlbody.  So at run time,
no further parsing is required.

However, this form does not support polymorphic arguments, because
there is no more parse analysis done at call time.

Dependencies between the function and the objects it uses are fully
tracked.

A new RETURN statement is introduced.  This can only be used inside
function bodies.  Internally, it is treated much like a SELECT
statement.

psql needs some new intelligence to keep track of function body
boundaries so that it doesn't send off statements when it sees
semicolons that are inside a function body.

Also, per SQL standard, LANGUAGE SQL is the default, so it does not
need to be specified anymore.

Discussion: 
https://www.postgresql.org/message-id/flat/1c11f1eb-f00c-43b7-799d-2d44132c0...@2ndquadrant.com
---
 doc/src/sgml/catalogs.sgml|  10 +
 doc/src/sgml/ref/create_function.sgml | 126 +--
 doc/src/sgml/ref/create_procedure.sgml|  62 -
 src/backend/catalog/pg_aggregate.c|   1 +
 src/backend/catalog/pg_proc.c | 116 ++
 src/backend/commands/aggregatecmds.c  |   2 +
 src/backend/commands/functioncmds.c   | 119 --
 src/backend/commands/typecmds.c   |   1 +
 src/backend/executor/functions.c  |  79 ---
 src/backend/nodes/copyfuncs.c |  15 ++
 src/backend/nodes/equalfuncs.c|  13 ++
 src/backend/nodes/outfuncs.c  |  12 +
 src/backend/nodes/readfuncs.c |   1 +
 src/backend/optimizer/util/clauses.c  | 126 +++
 src/backend/parser/analyze.c  |  35 +++
 src/backend/parser/gram.y | 129 ---
 src/backend/tcop/postgres.c   |   3 +-
 src/backend/utils/adt/ruleutils.c | 106 -
 src/bin/pg_dump/pg_dump.c |  45 +++-
 src/bin/psql/describe.c   |  15 +-
 src/fe_utils/psqlscan.l   |  23 +-
 src/include/catalog/pg_proc.dat   |   4 +
 src/include/catalog/pg_proc.h |   6 +-
 src/include/commands/defrem.h |   2 +
 src/include/executor/functions.h  |  15 ++
 src/include/fe_utils/psqlscan_int.h   |   2 +
 src/include/nodes/nodes.h |   1 +
 src/include/nodes/parsenodes.h|  13 ++
 src/include/parser/kwlist.h   |   2 +
 src/include/tcop/tcopprot.h   |   1 +
 src/interfaces/ecpg/preproc/ecpg.addons   |   6 +
 src/interfaces/ecpg/preproc/ecpg.trailer  |   4 +-
 .../regress/expected/create_function_3.out| 212 +-
 .../regress/expected/create_procedure.out |  58 +
 src/test/regress/sql/create_function_3.sql|  99 
 src/test/regress/sql/create_procedure.sql |  26 +++
 36 files changed, 1286 insertions(+), 204 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 5bd54cb218..a8ae9594e6 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -5972,6 +5972,16 @@ pg_proc Columns
   
  
 
+ 
+  
+   prosqlbody pg_node_tree
+  
+  
+   Pre-parsed SQL function body.  This will be used for language SQL
+   functions if the body is not specified as a string constant.
+   
+ 
+
  
   
proconfig text[]
diff --git a/doc/src/sgml/ref/create_function.sgml 
b/doc/src/sgml/ref/create_function.sgml
index 3c1eaea651..1b5b9420db 100644
--- a/doc/src/sgml/ref/create_function.sgml
+++ b/doc/src/sgml/ref/create_function.sgml
@@ -38,6 +38,7 @@
 | SET configur

Re: partition routing layering in nodeModifyTable.c

2020-10-27 Thread Heikki Linnakangas


On 23/10/2020 12:37, Amit Langote wrote:

To explain these numbers a bit, "overheaul update/delete processing"
patch improves the performance of that benchmark by allowing the
updates to use run-time pruning when executing generic plans, which
they can't today.

However without "lazy-ResultRelInfo-initialization" patch,
ExecInitModifyTable() (or InitPlan() when I ran those benchmarks) can
be seen to be spending time initializing all of those result
relations, whereas only one of those will actually be used.

As mentioned further in that email, it's really the locking of all
relations by AcquireExecutorLocks() that occurs even before we enter
the executor that's a much thornier bottleneck for this benchmark.
But the ResultRelInfo initialization bottleneck sounded like it could
get alleviated in a relatively straightforward manner.  The patches
that were developed for attacking the locking bottleneck would require
further reflection on whether they are correct.

(Note: I've just copy pasted the numbers I reported in that email.  To
reproduce, I'll have to rebase the "overhaul update/delete processing"
patch on this one, which I haven't yet done.)


Ok, thanks for the explanation, now I understand.

This patch looks reasonable to me at a quick glance. I'm a bit worried 
or unhappy about the impact on FDWs, though. It doesn't seem nice that 
the ResultRelInfo is not available in the BeginDirectModify call. It's 
not too bad, the FDW can call ExecGetResultRelation() if it needs it, 
but still. Perhaps it would be better to delay calling 
BeginDirectModify() until the first modification is performed, to avoid 
any initialization overhead there, like establishing the connection in 
postgres_fdw.


But since this applies on top of the "overhaul update/delete processing" 
patch, let's tackle that patch set next. Could you rebase that, please?


- Heikki

duplicate function oid symbols

2020-10-27 Thread John Naylor

Hi,

I noticed that the table AM abstraction introduced the symbol
HEAP_TABLE_AM_HANDLER_OID, although we already have a convention for
defining symbols automatically for builtin functions, which in this case is
(currently unused) F_HEAP_TABLEAM_HANDLER.

It seems a wart to have two symbols for the same function (seemingly
accidentally). I suppose it's unacceptable to remove the non-standard
symbol since it's been referred to in code for a while now. We could remove
the unused (in core anyway) standard one by arranging fmgroids.h to use
explicit symbols from pg_proc.dat where they exist, as well as prevent such
symbols from being emitted into pg_proc_d.h. But then again there is
no guarantee the standard symbol is not being used elsewhere. Thoughts?

-- 
John Naylor
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: parallel distinct union and aggregate support patch

2020-10-27 Thread Robert Haas

On Thu, Oct 22, 2020 at 5:08 AM Dilip Kumar  wrote:
> Interesting idea.  So IIUC, whenever a worker is scanning the tuple it
> will directly put it into the respective batch(shared tuple store),
> based on the hash on grouping column and once all the workers are
> doing preparing the batch then each worker will pick those baches one
> by one, perform sort and finish the aggregation.  I think there is a
> scope of improvement that instead of directly putting the tuple to the
> batch what if the worker does the partial aggregations and then it
> places the partially aggregated rows in the shared tuple store based
> on the hash value and then the worker can pick the batch by batch.  By
> doing this way, we can avoid doing large sorts.  And then this
> approach can also be used with the hash aggregate, I mean the
> partially aggregated data by the hash aggregate can be put into the
> respective batch.

I am not sure if this would be a win if the typical group size is
small and the transition state has to be serialized/deserialized.
Possibly we need multiple strategies, but I guess we'd have to test
performance to be sure.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Move catalog toast table and index declarations

2020-10-27 Thread John Naylor

On Tue, Oct 27, 2020 at 7:43 AM Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 2020-10-24 15:23, John Naylor wrote:
> > Style: In genbki.h, "extern int no_such_variable" is now out of place.
> > Also, the old comments like "The macro definitions are just to keep the
> > C compiler from spitting up." are now redundant in their new setting.
>
> Hmm, I don't really see what's wrong there.  Do you mean the macro
> definitions should be different, or the comments are wrong, or something
> else?
>

There's nothing wrong; it's just a minor point of consistency. For the
first part, I mean defined symbols in this file that are invisible to the C
compiler are written

#define SOMETHING()

If some are written

#define SOMETHING() extern int no_such_variable

I imagine some future reader will wonder why there's a difference.

As for the comments, the entire file is for things meant for scripts to
read, but have to be put in macro form to be invisible to the compiler. The
header comment has

"genbki.h defines CATALOG(), BKI_BOOTSTRAP and related macros
 * so that the catalog header files can be read by the C compiler."

I'm just saying we don't need to carry over the comments I mentioned from
the toasting/indexing headers that were specially for those macros.

-- 
John Naylor
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Online checksums verification in the backend

2020-10-27 Thread Julien Rouhaud

On Tue, Oct 27, 2020 at 3:07 PM Michael Paquier  wrote:
>
> On Fri, Oct 23, 2020 at 06:06:30PM +0900, Michael Paquier wrote:
> > On Fri, Oct 23, 2020 at 04:31:56PM +0800, Julien Rouhaud wrote:
> >> Mmm, is it really an improvement to report warnings during this
> >> function execution?  Note also that PageIsVerified as-is won't report
> >> a warning if a page is found as PageIsNew() but isn't actually all
> >> zero, while still being reported as corrupted by the SRF.
> >
> > Yep, joining the point of above to just have no WARNINGs at all.
>
> Now that we have d401c57, I got to consider more this one, and opted
> for not generating a WARNING for now.  Hence, PageisVerifiedExtended()
> is disabled regarding that, but we still report a checksum failure in
> it.

Great, that's also what I had in mind.

> I have spent some time reviewing the tests, and as I felt this was
> bulky.  In the reworked version attached, I have reduced the number of
> tests by half, without reducing the coverage, mainly:
> - Removed all the stderr and the return code tests, as we always
> expected the commands to succeed, and safe_psql() can do the job
> already.
> - Merged of the queries using pg_relation_check_pages into a single
> routine, with the expected output (set of broken pages returned in the
> SRF) in the arguments.
> - Added some prefixes to the tests, to generate unique test names.
> That makes debug easier.
> - The query on pg_stat_database is run once at the beginning, once at
> the end with the number of checksum failures correctly updated.
> - Added comments to document all the routines, and renamed some of
> them mostly for consistency.
> - Skipped system relations from the scan of pg_class, making the test
> more costly for nothing.
> - I ran some tests on Windows, just-in-case.
>
> I have also added a SearchSysCacheExists1() to double-check if the
> relation is missing before opening it, added a
> CHECK_FOR_INTERRUPTS() within the main check loop (where the error
> context is really helpful), indented the code, bumped the catalogs
> (mostly a self-reminder), etc.
>
> So, what do you think?

I think it's also worth noting that the IOLock is now acquired just
before getting the buffer state, and released after the read (or after
finding that the buffer is dirty).  This is consistent with how it's
done elsewhere, so I'm fine.

Other than that I'm quite happy with the changes you made, thanks a lot!

Re: Yet another fast GiST build

2020-10-27 Thread Heikki Linnakangas


On 21/10/2020 20:13, Andrey Borodin wrote:

7 окт. 2020 г., в 17:38, Heikki Linnakangas  написал(а):

On 07/10/2020 15:27, Andrey Borodin wrote:

Here's draft patch with implementation of sortsupport for ints and floats.



+static int
+gbt_int4_cmp(Datum a, Datum b, SortSupport ssup)
+{
+   int32KEY   *ia = (int32KEY *) DatumGetPointer(a);
+   int32KEY   *ib = (int32KEY *) DatumGetPointer(b);
+
+   if (ia->lower == ib->lower)
+   {
+   if (ia->upper == ib->upper)
+   return 0;
+
+   return (ia->upper > ib->upper) ? 1 : -1;
+   }
+
+   return (ia->lower > ib->lower) ? 1 : -1;
+}


We're only dealing with leaf items during index build, so the 'upper' and 
'lower' should always be equal here, right? Maybe add a comment and an 
assertion on that.

(It's pretty sad that the on-disk representation is identical for leaf and 
internal items, because that wastes a lot of disk space, but that's way out of 
scope for this patch.)


Thanks, I've added assert() where is was easy to test equalty.

PFA patch with all types.


gbt_ts_cmp(), gbt_time_cmp_sort() and gbt_date_cmp_sort() still have the 
above issue, they still compare "upper" for no good reason.



+static Datum
+gbt_bit_abbrev_convert(Datum original, SortSupport ssup)
+{
+   return (Datum) 0;
+}
+
+static int
+gbt_bit_cmp_abbrev(Datum z1, Datum z2, SortSupport ssup)
+{
+   return 0;
+}


If an abbreviated key is not useful, just don't define abbrev functions 
and don't set SortSupport->abbrev_converter in the first place.



static bool
gbt_inet_abbrev_abort(int memtupcount, SortSupport ssup)
{
#if SIZEOF_DATUM == 8
return false;
#else
return true;
#endif
}


Better to not set the 'abbrev_converter' function in the first place. Or 
would it be better to cast the float8 to float4 if SIZEOF_DATUM == 4?



I had a plan to implement and test one type each day. I did not quite 
understood how rich our type system is.


:-)

- Heikki

Re: Move catalog toast table and index declarations

2020-10-27 Thread Peter Eisentraut


On 2020-10-24 15:23, John Naylor wrote:

This part created a syntax error:

--- a/src/include/catalog/unused_oids
+++ b/src/include/catalog/unused_oids
@@ -28,7 +28,7 @@ chdir $FindBin::RealBin or die "could not cd to 
$FindBin::RealBin: $!\n";

  use lib "$FindBin::RealBin/../../backend/catalog/";
  use Catalog;

-my @input_files = (glob("pg_*.h"), qw(indexing.h));
+my @input_files = (glob("pg_*.h");


OK, fixing that.

Style: In genbki.h, "extern int no_such_variable" is now out of place. 
Also, the old comments like "The macro definitions are just to keep the 
C compiler from spitting up." are now redundant in their new setting.


Hmm, I don't really see what's wrong there.  Do you mean the macro 
definitions should be different, or the comments are wrong, or something 
else?


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Prevent printing "next step instructions" in initdb and pg_upgrade

2020-10-27 Thread Bruce Momjian

On Tue, Oct 27, 2020 at 12:35:56PM +0100, Peter Eisentraut wrote:
> On 2020-10-27 11:53, Bruce Momjian wrote:
> > On Tue, Oct 27, 2020 at 11:35:25AM +0100, Peter Eisentraut wrote:
> > > On 2020-10-06 12:26, Magnus Hagander wrote:
> > > > I went with the name --no-instructions to have the same name for both
> > > > initdb and pg_upgrade. The downside is that "no-instructions" also
> > > > causes the scripts not to be written in pg_upgrade, which arguably is a
> > > > different thing. We could go with "--no-instructions" and
> > > > "--no-scripts", but that would leave the parameters different. I also
> > > > considered "--no-next-step", but that one didn't quite have the right
> > > > ring to me. I'm happy for other suggestions on the parameter names.
> > > 
> > > What scripts are left after we remove the analyze script, as discussed in 
> > > a
> > > different thread?
> > 
> > There is still delete_old_cluster.sh.
> 
> Well, that one can trivially be replaced by a printed instruction, too.

True. On my system is it simply:

rm -rf '/u/pgsql.old/data'

The question is whether the user is going to record the vacuumdb and rm
instructions that display at the end of the pg_upgrade run, or do we
need to write it down for them in script files.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee

Re: Prevent printing "next step instructions" in initdb and pg_upgrade

2020-10-27 Thread Peter Eisentraut


On 2020-10-27 11:53, Bruce Momjian wrote:

On Tue, Oct 27, 2020 at 11:35:25AM +0100, Peter Eisentraut wrote:

On 2020-10-06 12:26, Magnus Hagander wrote:

I went with the name --no-instructions to have the same name for both
initdb and pg_upgrade. The downside is that "no-instructions" also
causes the scripts not to be written in pg_upgrade, which arguably is a
different thing. We could go with "--no-instructions" and
"--no-scripts", but that would leave the parameters different. I also
considered "--no-next-step", but that one didn't quite have the right
ring to me. I'm happy for other suggestions on the parameter names.


What scripts are left after we remove the analyze script, as discussed in a
different thread?


There is still delete_old_cluster.sh.


Well, that one can trivially be replaced by a printed instruction, too.

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Internal key management system

2020-10-27 Thread Bruce Momjian

On Tue, Oct 27, 2020 at 03:07:22PM +0800, Craig Ringer wrote:
> On Mon, Oct 26, 2020 at 11:02 PM Stephen Frost  wrote:
> 
> 
> TL;DR:
> 
> * Important to check that key rotation is possible on a replica, i.e.
> primary and standby can have different cluster passphrase and KEK
> encrypting the same WAL and heap keys;
> * with a HSM we can't read the key out, so a pluggable KEK operations
> context or a configurable URI for the KEK is necessary
> * I want the SQL key and SQL wrap/unwrap part in a separate patch, I
> don't think it's fully baked and oppose its inclusion in its current
> form
> * Otherwise this looks good so far
> 
> Explanation and argument for why below.
> 
> > I've not been following very closely, but I definitely agree with the
> > general feedback here (more on that below), but to this point- I do
> > believe that was the intent, or at least I sure hope that it was.  Being
> > able to have user/role keys would certainly be good.  Having a way for a
> > user to log in and unlock their key would also be really nice.
> 
> Right. AFAICS this is supposed to provide the foundation layer for
> whole-cluster encryption, and it looks ok for that, caveat about HSMs
> aside. I see nothing wrong with using a single key for heap (and one
> for WAL, or even the same key). Finer grained and autovacuum etc
> becomes seriously painful.

You need to use separate keys for heap/index and WAL so you can
replicate to another server that uses a different heap/index key, but
the same WAL.  You can then fail-over to the replica and change the WAL
key to complete full key rotation.  The replication protocol needs to
decrypt, and the receiving end has to encrypt with a different
heap/index key.  This is the key rotation method this is planned.  This
is another good reason the keys should be in a separate directory so
they can be easily copied or replaced.

> I want to take a closer look at how the current implementation will
> play with physical replication. I assume the WAL and heap keys have to
> be constant for the full cluster lifetime, short of a dump and reload.

The WAL key can change if you are willing to stop/start the server.  You
only read the WAL during crash recovery.

> The main issue I have so far is that I don't think the SQL key
> actually fits well with the current proposal. Its proposed interface
> and use cases are incomplete, it doesn't fully address key leak risks,
> there's no user access control, etc. Also the SQL key part could be
> implemented on top of the base cluster encryption part, I don't see
> why it actually has to integrate with the whole-cluster key management
> directly.

Agreed.  Maybe we should just focus on the TDE use now.  I do think the
current patch is not commitable since, because there are no defined
keys, there is no way to validate the boot-time password.  The no-key
case should be an unsupported configuration.  Maybe we need to just
create one key just to verify the boot password.

> 
> SQL KEY
> 
> 
> I'm not against the SQL key and wrap/unwrap functionality - quite the
> contrary, I think it's really important to have something like it. But
> is it appropriate to have a single, fixed-for-cluster-lifetime key for
> this, one with no SQL-level access control over who can or cannot use
> it, etc? The material encrypted with the key is user-exposed so key
> rotation is an issue, but is not addressed here. And the interface
> doesn't really solve the numerous potential problems with key material
> leaks through logs and error messages.
> 
> I just think that if we bake in the proposed user visible wrap/unwrap
> interface now we're going to regret it later. How will it work when we
> want to add user- or role- level access control for database-stored
> keys? When we want to introduce a C-level API for extensions to work
> directly with encrypted data like they can work currently with TOASTed
> data, to prevent decrypted data from ever becoming SQL function
> arguments subject to possible leakage and to allow implementation of
> always-encrypted data types, etc?
> 
> Most importantly - I don't think the SQL key adds anything really
> crucial that we cannot do at the SQL level with an extension.  An
> extension "pg_wrap" could provide pg_wrap() and pg_unwrap() already,
> using a single master key much like the SQL key proposed in this
> patch. To store the master key it could:

The idea of the SQL key was to give the boot key a use, but I am now
seeing that the SQL key is just holding us back, and is preventing the
boot testing that is a requirement.  Maybe we just need to forget the
SQL part and focus on the TDE usage now, and come back to the SQL part. 
I am also not 100% clear on the usefulness of the SQL key.

> OTHER TRANSPARENT ENCRYPTION USE CASES
> 
> 
> Does this patch get in the way of supporting other kinds of
> transparent encryption that are frequently requested and are in use on
> other systems already?
> 
> I don't think so. Whole-cluster encryption

Re: Internal key management system

2020-10-27 Thread Bruce Momjian

On Mon, Oct 26, 2020 at 10:05:10PM +0800, Craig Ringer wrote:
> For example if I want to lock my database with a YubiHSM I would configure
> something like:
> 
>     cluster_encryption_key = 'pkcs11:token=YubiHSM;id=0:0001;type=private'

Well, openssl uses a prefix before the password string, e.g.:

*  pass:password
*  env:var
*  file:pathname
*  fd:number
*  stdin

See 'man openssl'.  I always thought that API was ugly, but I now see
the value in it.  We could implement a 'command:' prefix now, and maybe
a 'pass:' one, and allow other methods like 'pkcs11' later.

I can also imagine using the 'file' one to allow the key to be placed on
an encrypted file system that has to be mounted for Postgres to start. 
You could also have the key on a USB device that has to be inserted to
be used, and the 'file' is on the USB key --- seems clearer than having
to create a script to 'cat' the file.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee

Re: Prevent printing "next step instructions" in initdb and pg_upgrade

2020-10-27 Thread Magnus Hagander

On Tue, Oct 27, 2020 at 11:53 AM Bruce Momjian  wrote:

> On Tue, Oct 27, 2020 at 11:35:25AM +0100, Peter Eisentraut wrote:
> > On 2020-10-06 12:26, Magnus Hagander wrote:
> > > I went with the name --no-instructions to have the same name for both
> > > initdb and pg_upgrade. The downside is that "no-instructions" also
> > > causes the scripts not to be written in pg_upgrade, which arguably is a
> > > different thing. We could go with "--no-instructions" and
> > > "--no-scripts", but that would leave the parameters different. I also
> > > considered "--no-next-step", but that one didn't quite have the right
> > > ring to me. I'm happy for other suggestions on the parameter names.
> >
> > What scripts are left after we remove the analyze script, as discussed
> in a
> > different thread?
>
> There is still delete_old_cluster.sh.
>

Yeah, that's the one I was thinking of, but I was looking for something
more generic than explicitly saying --no-delete-script.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/

Re: [patch] ENUM errdetail should mention bytes, not chars

2020-10-27 Thread Peter Eisentraut


On 2020-10-19 06:34, Julien Rouhaud wrote:

 ERROR:  invalid enum label "ああ"
 DETAIL:  Labels must be 63 characters or less.

Attached trivial patch changes the message to:

 DETAIL:  Labels must be 63 bytes or less.

This matches the documentation, which states:

 The length of an enum value's textual label is limited by the NAMEDATALEN
 setting compiled into PostgreSQL; in standard builds this means at most
 63 bytes.

 https://www.postgresql.org/docs/current/datatype-enum.html

I don't see any particular need to backpatch this.


Indeed the message is wrong, and patch LGTM.


Committed.  Btw., the patch didn't update the regression test output.

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Prevent printing "next step instructions" in initdb and pg_upgrade

2020-10-27 Thread Bruce Momjian

On Tue, Oct 27, 2020 at 11:35:25AM +0100, Peter Eisentraut wrote:
> On 2020-10-06 12:26, Magnus Hagander wrote:
> > I went with the name --no-instructions to have the same name for both
> > initdb and pg_upgrade. The downside is that "no-instructions" also
> > causes the scripts not to be written in pg_upgrade, which arguably is a
> > different thing. We could go with "--no-instructions" and
> > "--no-scripts", but that would leave the parameters different. I also
> > considered "--no-next-step", but that one didn't quite have the right
> > ring to me. I'm happy for other suggestions on the parameter names.
> 
> What scripts are left after we remove the analyze script, as discussed in a
> different thread?

There is still delete_old_cluster.sh.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee

Re: deferred primary key and logical replication

2020-10-27 Thread Amit Kapila

On Sun, Oct 25, 2020 at 9:39 PM Euler Taveira
 wrote:
>
> On Mon, 5 Oct 2020 at 08:34, Amit Kapila  wrote:
>>
>> On Mon, May 11, 2020 at 2:41 AM Euler Taveira
>>  wrote:
>> >
>> > Hi,
>> >
>> > While looking at an old wal2json issue, I stumbled on a scenario that a 
>> > table
>> > with a deferred primary key is not updatable in logical replication. 
>> > AFAICS it
>> > has been like that since the beginning of logical decoding and seems to be 
>> > an
>> > oversight while designing logical decoding.
>> >
>>
>> I am not sure if it is an oversight because we document that the index
>> must be non-deferrable, see "USING INDEX records the old values of the
>> columns covered by the named index, which must be unique, not partial,
>> not deferrable, and include only columns marked NOT NULL." in docs
>> [1].
>>
>
> Inspecting this patch again, I forgot to consider that RelationGetIndexList()
> is called by other backend modules. Since logical decoding deals with finished
> transactions, it is ok to use a deferrable primary key.
>

But starting PG-14, we do support logical decoding of in-progress
transactions as well.


-- 
With Regards,
Amit Kapila.

Re: "unix_socket_directories" should be GUC_LIST_INPUT?

2020-10-27 Thread Peter Eisentraut


On 2020-10-23 10:02, Michael Paquier wrote:

So maybe we could get away with just changing it.  It'd be good to
verify though that this doesn't break existing string values for
the variable, assuming they contain no unlikely characters that'd
need quoting.


Yeah, that's the kind of things I wanted to check anyway before
considering doing the switch.


If we're going to change it I think we need an updated patch that covers 
pg_dump.  (Even if we argue that pg_dump would not normally dump this 
variable, keeping it up to date with respect to GUC_LIST_QUOTE seems 
proper.)


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Prevent printing "next step instructions" in initdb and pg_upgrade

2020-10-27 Thread Peter Eisentraut


On 2020-10-06 12:26, Magnus Hagander wrote:
I went with the name --no-instructions to have the same name for both 
initdb and pg_upgrade. The downside is that "no-instructions" also 
causes the scripts not to be written in pg_upgrade, which arguably is a 
different thing. We could go with "--no-instructions" and 
"--no-scripts", but that would leave the parameters different. I also 
considered "--no-next-step", but that one didn't quite have the right 
ring to me. I'm happy for other suggestions on the parameter names.


What scripts are left after we remove the analyze script, as discussed 
in a different thread?


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: pg_upgrade analyze script

2020-10-27 Thread Peter Eisentraut


On 2020-10-06 11:43, Magnus Hagander wrote:
For a long time now, pg_upgrade drops a script (analyze_new_cluster.sh) 
that just calls vacuumdb to run the analyze in stages. This script made 
a lot of sense back when it actually implemented the stages, but these 
days since it's just calling a single command, I think it's just 
unnecessary complication.


I suggest we drop it and just replace it with instructions to call 
vacuumdb directly.


Attached patch does this. It also removes the support in the 
instructions that talk about pre-8.4 databases, which I believe is dead 
code per 
https://postgr.es/m/CABUevEx-D0PNVe00tkeQRGennZQwDtBJn=493mjt-x6sppb...@mail.gmail.com.


I agree that the script should be removed.  It makes a lot of things 
simpler.


Here is the thread that proposed implementing vacuumdb-in-stages.  There 
were discussions then about removing the script also, but didn't come to 
a conclusion.


https://www.postgresql.org/message-id/flat/1389237323.30068.8.camel%40vanquo.pezone.net

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Use-after-free in 12- EventTriggerAlterTableEnd

2020-10-27 Thread Arseny Sher

Hi,

Valgrind on our internal buildfarm complained about use-after-free
during currentEventTriggerState->commandList manipulations, e.g. lappend
in EventTriggerCollectSimpleCommand. I've discovered that the source of
problem is EventTriggerAlterTableEnd not bothering to switch into its
own context before appending to the list. ced138e8cba fixed this in
master and 13 but wasn't backpatched further, so I see the problem in
12-.

The particular reproducing scenario is somewhat involved; it combines
pg_pathman [1] extension and SQL interface to it created in our fork of
Postgres. Namely, we allow to create partitions in CREATE TABLE
PARTITIONED by statement (similar to what [2] proposes). Each partition
is created via separate SPI call which in turn ends up making
AlterTableStmt as ProcessUtility PROCESS_UTILITY_SUBCOMMAND; so
EventTriggerAlterTableStart/End is done, but additional
EventTriggerQueryState is not allocated and commands are collected in
toplevel EventTriggerQueryState. Of course SPI frees its memory between
the calls which makes valgrind scream.

Admittedly our case is tricky and I'm not sure it is possible to make up
something like that in the pure core code, but I do believe other
extension writers might run into this, so I propose to backpatch (the
attached) 3 lines healing to all supported branches.

Technically, all you (an extension author) have to do to encounter this
is
 1) Have toplevel EvenTriggerQueryState, i.e. catch utility statement.
 2) Inside it, run AlterTable as PROCESS_UTILITY_SUBCOMMAND in some
short-living context.

[1] https://github.com/postgrespro/pg_pathman
[2] 
https://www.postgresql.org/message-id/flat/CALT9ZEFBv05OhLMKO1Lbo_Zg9a0v%2BU9q9twe%3Dt-dixfR45RmVQ%40mail.gmail.com#f86f0fcfa62d5108fb81556a43f42457

diff --git a/src/backend/commands/event_trigger.c b/src/backend/commands/event_trigger.c
index f7ee9838f7..d1972e2d7f 100644
--- a/src/backend/commands/event_trigger.c
+++ b/src/backend/commands/event_trigger.c
@@ -1807,9 +1807,15 @@ EventTriggerAlterTableEnd(void)
 	/* If no subcommands, don't collect */
 	if (list_length(currentEventTriggerState->currentCommand->d.alterTable.subcmds) != 0)
 	{
+		MemoryContext oldcxt;
+
+		oldcxt = MemoryContextSwitchTo(currentEventTriggerState->cxt);
+
 		currentEventTriggerState->commandList =
 			lappend(currentEventTriggerState->commandList,
 	currentEventTriggerState->currentCommand);
+
+		MemoryContextSwitchTo(oldcxt);
 	}
 	else
 		pfree(currentEventTriggerState->currentCommand);

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Re: parallel distinct union and aggregate support patch

2020-10-27 Thread Dilip Kumar

On Fri, Oct 23, 2020 at 11:58 AM bu...@sohu.com  wrote:
>
> > Interesting idea.  So IIUC, whenever a worker is scanning the tuple it
> > will directly put it into the respective batch(shared tuple store),
> > based on the hash on grouping column and once all the workers are
> > doing preparing the batch then each worker will pick those baches one
> > by one, perform sort and finish the aggregation.  I think there is a
> > scope of improvement that instead of directly putting the tuple to the
> > batch what if the worker does the partial aggregations and then it
> > places the partially aggregated rows in the shared tuple store based
> > on the hash value and then the worker can pick the batch by batch.  By
> > doing this way, we can avoid doing large sorts.  And then this
> > approach can also be used with the hash aggregate, I mean the
> > partially aggregated data by the hash aggregate can be put into the
> > respective batch.
>
> Good idea. Batch sort suitable for large aggregate result rows,
> in large aggregate result using partial aggregation maybe out of memory,
> and all aggregate functions must support partial(using batch sort this is 
> unnecessary).
>
> Actually i written a batch hash store for hash aggregate(for pg11) like this 
> idea,
> but not write partial aggregations to shared tuple store, it's write origin 
> tuple and hash value
> to shared tuple store, But it's not support parallel grouping sets.
> I'am trying to write parallel hash aggregate support using batch shared tuple 
> store for PG14,
> and need support parallel grouping sets hash aggregate.

I was trying to look into this patch to understand the logic in more
detail.  Actually, there are no comments at all so it's really hard to
understand what the code is trying to do.

I was reading the below functions, which is the main entry point for
the batch sort.

+static TupleTableSlot *ExecBatchSortPrepare(PlanState *pstate)
+{
...
+ for (;;)
+ {
...
+ tuplesort_puttupleslot(state->batches[hash%node->numBatches], slot);
+ }
+
+ for (i=node->numBatches;i>0;)
+ tuplesort_performsort(state->batches[--i]);
+build_already_done_:
+ if (parallel)
+ {
+ for (i=node->numBatches;i>0;)
+ {
+ --i;
+ if (state->batches[i])
+ {
+ tuplesort_end(state->batches[i]);
+ state->batches[i] = NULL;
+ }
+ }

I did not understand this part, that once each worker has performed
their local batch-wise sort why we are clearing the baches?  I mean
individual workers have their on batches so eventually they supposed
to get merged.  Can you explain this part and also it will be better
if you can add the comments.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Parallel INSERT (INTO ... SELECT ...)

2020-10-27 Thread Amit Kapila

On Thu, Oct 22, 2020 at 9:47 AM Greg Nancarrow  wrote:
>
> On Fri, Oct 16, 2020 at 9:26 PM Amit Kapila  wrote:
> >
>
> Posting an update to the smaller patch (Parallel SELECT for INSERT
> INTO...SELECT...).
>
> Most of this patch feeds into the larger Parallel INSERT patch, for
> which I'll also be posting an update soon.
>
> Patch updates include:
> - Removed explicit trigger-type checks (instead rely on declared
> trigger parallel safety)
> - Restored parallel-related XID checks that previous patch altered;
> now assign XID prior to entering parallel-mode
> - Now considers parallel-SELECT for parallel RESTRICTED cases (not
> just parallel SAFE cases)
> - Added parallel-safety checks for partition key expressions and
> support functions
> - Workaround added for test failure in "partition-concurrent-attach"
> test;
>

IIUC, below is code for this workaround:

+MaxRelParallelHazardForModify(Oid relid,
+ CmdType commandType,
+ max_parallel_hazard_context *context)
+{
+ Relationrel;
+ TupleDesc tupdesc;
+ int attnum;
+
+ LOCKMODE lockmode = AccessShareLock;
+
+ /*
+ * It's possible that this relation is locked for exclusive access
+ * in another concurrent transaction (e.g. as a result of a
+ * ALTER TABLE ... operation) until that transaction completes.
+ * If a share-lock can't be acquired on it now, we have to assume this
+ * could be the worst-case, so to avoid blocking here until that
+ * transaction completes, conditionally try to acquire the lock and
+ * assume and return UNSAFE on failure.
+ */
+ if (ConditionalLockRelationOid(relid, lockmode))
+ {
+ rel = table_open(relid, NoLock);
+ }
+ else
+ {
+ context->max_hazard = PROPARALLEL_UNSAFE;
+ return context->max_hazard;
+ }

Do we need this workaround if we lock just the parent table instead of
locking all the tables? Basically, can we safely identify the
parallel-safety of partitioned relation if we just have a lock on
parent relation? One more thing I have noticed is that for scan
relations (Select query), we do such checks much later based on
RelOptInfo (see set_rel_consider_parallel) which seems to have most of
the information required to perform parallel-safety checks but I guess
for ModifyTable (aka the Insert table) the equivalent doesn't seem
feasible but have you thought of doing at the later stage in planner?

Few other comments on latest patch:
===
1.
MaxRelParallelHazardForModify()
{
..
+ if (commandType == CMD_INSERT || commandType == CMD_UPDATE)
+ {
+ /*
..

Why to check CMD_UPDATE here?

2.
+void PrepareParallelModeForModify(CmdType commandType, bool
isParallelModifyLeader)
+{
+ Assert(!IsInParallelMode());
+
+ if (isParallelModifyLeader)
+ (void)GetCurrentCommandId(true);
+
+ (void)GetCurrentFullTransactionId();

Here, we should use GetCurrentTransactionId() similar to heap_insert
or other heap operations. I am not sure why you have used
GetCurrentFullTransactionId?

3. Can we have a test to show why we need to check all the partitions
for parallel-safety? I think it would be possible when there is a
trigger on only one of the partitions and that trigger has
corresponding parallel_unsafe function. But it is good to verify that
once.

4. Have you checked the overhead of this on the planner for different
kinds of statements like inserts into tables having 100 or 500
partitions? Similarly, it is good to check the overhead of domain
related checks added in the patch.

5. Can we have a separate patch for parallel-selects for Insert? It
will make review easier.

-- 
With Regards,
Amit Kapila.

Re: Deleting older versions in unique indexes to avoid page splits

2020-10-27 Thread Simon Riggs

On Mon, 26 Oct 2020 at 21:15, Peter Geoghegan  wrote:

> Now for the not-so-good news. The TPS numbers looked like this
> (results in original chronological order of the runs, which I've
> interleaved):

While it is important we investigate the worst cases, I don't see this
is necessarily bad.

HOT was difficult to measure, but on a 2+ hour run on a larger table,
the latency graph was what showed it was a winner. Short runs and
in-memory data masked the benefits in our early analyses.

So I suggest not looking at the totals and averages but on the peaks
and the long term trend. Showing that in graphical form is best.

> The patch adds a backstop. It seems to me that that's really what we
> need here. Predictability over time and under a large variety of
> different conditions. Real workloads constantly fluctuate.

Yeh, agreed. This is looking like a winner now, but lets check.

> Even if people end up not buying my argument that it's worth it for
> workloads like this, there are various options. And, I bet I can
> further improve the high contention cases without losing the valuable
> part -- there are a number of ways in which I can get the CPU costs
> down further that haven't been fully explored (yes, it really does
> seem to be CPU costs, especially due to TID sorting). Again, this
> patch is all about extreme pathological workloads, system stability,
> and system efficiency over time -- it is not simply about increasing
> system throughput. There are some aspects of this design (that come up
> with extreme workloads) that may in the end come down to value
> judgments. I'm not going to tell somebody that they're wrong for
> prioritizing different things (within reason, at least). In my opinion
> almost all of the problems we have with VACUUM are ultimately
> stability problems, not performance problems per se. And, I suspect
> that we do very well with stupid benchmarks like this compared to
> other DB systems precisely because we currently allow non-HOT updaters
> to "live beyond their means" (which could in theory be great if you
> frame it a certain way that seems pretty absurd to me). This suggests
> we can "afford" to go a bit slower here as far as the competitive
> pressures determine what we should do (notice that this is a distinct
> argument to my favorite argument, which is that we cannot afford to
> *not* go a bit slower in certain extreme cases).
>
> I welcome debate about this.

Agreed, we can trade initial speed for long term consistency. I guess
there are some heuristics there on that tradeoff.

-- 
Simon Riggshttp://www.EnterpriseDB.com/

RE: Multiple hosts in connection string failed to failover in non-hot standby mode

2020-10-27 Thread tsunakawa.ta...@fujitsu.com

Please send emails in text format.  Your email was in HTML, and I changed this 
reply to text format.


From: Hubert Zhang  
> Libpq has supported to specify multiple hosts in connection string and enable 
> auto failover when the previous PostgreSQL instance cannot be accessed.
> But when I tried to enable this feature for a non-hot standby, it cannot do 
> the failover with the following messages.
> 
> psql: error: could not connect to server: FATAL:  the database system is 
> starting up

Was the primary running and accepting connections when you encountered this 
error?  That is, if you specified host="host1 host2", host1 was the non-hot 
standby and host2 was a running primary?  Or only the non-hot standby was 
running?

If a primary was running, I'd say it's a bug...  Perhaps the following part in 
libpq gives up connection attempts wen the above FATAL error is returned from 
the server.  Maybe libpq should differentiate errors using SQLSTATE and 
continue connection attempts on other hosts.

[fe-connect.c]
/* Handle errors. */
if (beresp == 'E')
{
if (PG_PROTOCOL_MAJOR(conn->pversion) >= 3)
...
#endif

goto error_return;
}

/* It is an authentication request. */
conn->auth_req_received = true;

/* Get the type of request. */


Regards
Takayuki Tsunakawa

Re: Question about make coverage-html

2020-10-27 Thread Heikki Linnakangas


On 27/10/2020 10:09, Peter Smith wrote:

Hi hackers.

The example of test coverage in the documentation [1] works as advertised.

But I wanted to generate test coverage results only of some TAP tests
in src/test/subscription.

The documentation [1] also says "The make commands also work in
subdirectories." so I tried running them all in that folder.

However, when I run "make coverage-html" in that subdirectory
src/test/subscription it does not work:

=
[postgres@CentOS7-x64 subscription]$ make coverage-html
/usr/local/bin/lcov --gcov-tool /usr/bin/gcov -q --no-external -c -i
-d . -d . -o lcov_base.info
geninfo: WARNING: no .gcno files found in . - skipping!
geninfo: WARNING: no .gcno files found in . - skipping!
/usr/local/bin/lcov --gcov-tool /usr/bin/gcov -q --no-external -c -d .
-d . -o lcov_test.info
geninfo: WARNING: no .gcda files found in . - skipping!
geninfo: WARNING: no .gcda files found in . - skipping!
rm -rf coverage
/usr/local/bin/genhtml -q --legend -o coverage --title='PostgreSQL
14devel' --num-spaces=4 --prefix='/home/postgres/oss_postgres_2PC'
lcov_base.info lcov_test.info
genhtml: ERROR: no valid records found in tracefile lcov_base.info
make: *** [coverage-html-stamp] Error 255
[postgres@CentOS7-x64 subscription]$
=

OTOH, running the "make coverage-html" at the top folder after running
my TAP tests does give the desired coverage results.

~

QUESTION:

Was that documentation [1] just being misleading by saying it can work
in the subdirectories?
e.g. Are you only supposed to run "make coverage-html" from the top folder?

Or is it supposed to work but I did something wrong?


Running "make coverage-html" in src/test/subscription doesn't work, 
because there is no C code in that directory.


Creating a coverage report is a two-step process. First, you run the 
test you're interested in, with "make check" or similar. Then you create 
a report for the source files you're interested in, with "make 
coverage-html". You can run these commands in different subdirectories.


In this case, you want to do "cd src/test/subscription; make check", to 
run those TAP tests, and then run "make coverage-html" from the top 
folder. Or if you wanted to create coverage report that covers only 
replication-related source code, for example, you could run it in the 
src/backend/replication directory ("cd src/backend/replication; make 
coverage-html").


- Heikki

Re: Commitfest manager 2020-11

2020-10-27 Thread gkokolatos







‐‐‐ Original Message ‐‐‐
On Sunday, October 25, 2020 8:01 PM, Anastasia Lubennikova 
 wrote:

> On 20.10.2020 10:30, gkokola...@pm.me wrote:
>
> > ‐‐‐ Original Message ‐‐‐
> >
> > > [snip]
> > > Wow, that was well in advance) I am willing to assist if you need any 
> > > help.
> >
> > Indeed it was a bit early. I left for vacation after that. For the record, 
> > I am newly active to the community. In our PUG, in Stockholm, we held a 
> > meetup during which a contributor presented ways to contribute to the 
> > community, one of which is becoming CFM. So, I thought of picking up the 
> > recommendation.
> > I have taken little part in CFs as reviewer/author and I have no idea how a 
> > CF is actually run. A contributor from Stockholm has been willing to mentor 
> > me to the part.
> > Since you have both the knowledge and specific ideas on improving the CF, 
> > how about me assisting you? I could shadow you and you can groom me to the 
> > part, so that I can take the lead to a future CF more effectively.
> > This is just a suggestion of course. I am happy with anything that can help 
> > the community as a whole.
>
> Even though, I've worked a lot with community, I have never been CFM
> before as well. So, I think we can just follow these articles:
>
> https://www.2ndquadrant.com/en/blog/managing-a-postgresql-commitfest/
> https://wiki.postgresql.org/wiki/CommitFest_Checklist
>
> Some parts are a bit outdated, but in general the checklist is clear.
> I've already requested CFM privileges in pgsql-www and I'm going to
> spend next week sending pings and updates to the patches at commitfest.

Awesome. I will start with requesting the privileges then.

>
> There are already 219 patches, so I will appreciate if you join me in
> this task.

Count me in.

>
> --
>
> Anastasia Lubennikova
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company

Re: [PATCH] remove deprecated v8.2 containment operators

2020-10-27 Thread Magnus Hagander

On Tue, Oct 27, 2020 at 9:38 AM Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 2020-10-27 04:25, Justin Pryzby wrote:
> > Forking this thread:
> >
> https://www.postgresql.org/message-id/fd93f1c5-7818-a02c-01e5-1075ac0d4...@iki.fi
> >
> > They have been deprecated for a Long Time, so finally remove them for
> v14.
> > Four fewer exclamation marks makes the documentation less exciting,
> which is a
> > good thing.
>
> I don't know the reason or context why they were deprecated, but I agree
> that the timeline for removing them now is good.
>

IIRC it was to align things so that "containment" had the same operator for
all different kinds of datatypes?

But whether that memory is right nor not it was indeed a long time ago,
so +1 that it's definitely time to get rid of them.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/

Multiple hosts in connection string failed to failover in non-hot standby mode

2020-10-27 Thread Hubert Zhang

Hi hackers,

Libpq has supported to specify multiple hosts in connection string and enable 
auto failover when the previous PostgreSQL instance cannot be accessed.
But when I tried to enable this feature for a non-hot standby, it cannot do the 
failover with the following messages.

psql: error: could not connect to server: FATAL:  the database system is 
starting up

Document says ' If a connection is established successfully, but authentication 
fails, the remaining hosts in the list are not tried.'
I'm wondering is it a feature by design or a bug? If it's a bug, I plan to fix 
it.

Thanks,
Hubert Zhang

1 2 >

1 - 100 of 108 matches

Mail list logo