Retry in pgbench

2021-04-12 Thread Tatsuo Ishii
Currently the standard pgbench scenario produces transaction serialization
errors ("could not serialize access due to concurrent update") if
PostgreSQL runs at the REPEATABLE READ or SERIALIZABLE isolation level,
and the session aborts. In order to achieve meaningful results even at
these transaction isolation levels, I would like to propose an automatic
retry feature for when a "could not serialize access due to concurrent
update" error occurs.

Probably just adding a switch to retry is not enough; a retry
method (random interval, etc.) and a maximum number of retries probably
need to be added as well.
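
As a rough illustration of what such a retry loop could look like (a
standalone C sketch with placeholder names and parameters only, not actual
pgbench code):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define TXN_OK                   0
#define TXN_SERIALIZATION_ERROR  1   /* "could not serialize access ..." */

/* Placeholder for running one transaction script once. */
static int
run_transaction(void)
{
    return (rand() % 4 == 0) ? TXN_SERIALIZATION_ERROR : TXN_OK;
}

/* Retry with a random back-off, up to a configurable maximum. */
static int
run_with_retry(int max_tries, int max_backoff_us)
{
    for (int attempt = 1; attempt <= max_tries; attempt++)
    {
        if (run_transaction() == TXN_OK)
            return attempt;                  /* success: report attempts used */
        usleep(rand() % max_backoff_us);     /* random retry interval */
    }
    return -1;                               /* give up after max_tries */
}

int
main(void)
{
    srand(42);
    for (int i = 0; i < 5; i++)
    {
        int n = run_with_retry(10, 1000);

        if (n < 0)
            printf("gave up after 10 attempts\n");
        else
            printf("transaction finished after %d attempt(s)\n", n);
    }
    return 0;
}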

I would like to hear your thoughts,

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp




Re: Extensions not dumped when --schema is used

2021-04-12 Thread Michael Paquier
On Sun, Apr 04, 2021 at 03:08:02PM -0700, Noah Misch wrote:
> I noticed the patch's behavior for relations that are members of non-dumped
> extensions and are also registered using pg_extension_config_dump().  It
> depends on the schema:
> 
> - If extschema='public', "pg_dump -e plpgsql" makes no mention of the
>   relations.

This one is expected to me.  The caller of pg_dump is not specifying
the extension that should be dumped, hence it looks logical to me not to
dump the tables registered with pg_extension_config_dump() when their
extension is not listed.

> - If extschema='public', "pg_dump -e plpgsql --schema=public" includes
>   commands to dump the relation data.  This surprised me.  (The
>   --schema=public argument causes selectDumpableNamespace() to set
>   nsinfo->dobj.dump=DUMP_COMPONENT_ALL instead of DUMP_COMPONENT_ACL.)

This one would be expected to me.  Per the discussion upthread, we
want --schema and --extension to be two separate and exclusive
switches.  So, once the caller specifies --schema we should dump the
contents of the schema, even if its extension is not listed with
--extension.  Anyway, the behavior that selects whether a schema can be
dumped or not is not really related to this new code, right?  And "public"
is a mixed beast, being treated as both a system object and a user object
by selectDumpableNamespace().

> - If extschema is not any sort of built-in schema, "pg_dump -e plpgsql"
>   includes commands to dump the relation data.  This surprised me.

Hmm.  But you are right that this one is inconsistent with the first
case where the extension is not listed.  I would have said that, as the
extension is not directly specified and the schema is not passed
down either, we should not dump it at all, but this combination
actually does so.  Maybe we should add some extra logic to
selectDumpableNamespace(), close to the end of it, to decide,
depending on the contents of the extensions to include, whether we
should dump its associated schema or not?  Another solution would be to
make use of schema_include_oids to filter out the schemas we don't want,
but that would mean that --extension gets priority over --schema or
--table, and we did not want that per the original discussion.
--
Michael




Re: pgsql: Move tablespace path re-creation from the makefiles to pg_regres

2021-04-12 Thread Noah Misch
On Mon, Apr 12, 2021 at 02:25:53PM +0900, Michael Paquier wrote:
> On Fri, Apr 09, 2021 at 08:07:10PM -0700, Noah Misch wrote:
> > "pg_regress --outputdir" is not a great location for a file or directory
> > created by a user other than the user running pg_regress.  If one does "make
> > check" and then "make installcheck" against a cluster running as a different
> > user, the rmtree() will fail, assuming typical umask values.  An rmtree() at
> > the end of the tablespace test would mostly prevent that, but that can't 
> > help
> > in the event of a mid-test crash.
> 
> Yeah, I really don't think that we need to worry about multi-user
> scenarios with pg_regress like that though.

Christoph Berg's first message on this thread reported doing that.  If
supporting server_user!=pg_regress_user is unwarranted and Christoph Berg
should stop, then already-committed code suffices.

> > I'm not sure we should support installcheck against a server running as a
> > different user.  If we should support it, then I'd probably look at letting
> > the caller pass in a server-writable directory.  That directory would house
> > the tablespace instead of outputdir doing so.
> 
> But, then, we would be back to the pre-13 position where we'd need to
> have something external to pg_regress in charge of cleaning up and
> creating the tablespace path, no?

Correct.

> That's basically what we want to
> avoid with the Makefile rules.

The race that commit 6c788d9 fixed is not inherent to Makefile rules.  For
example, you could have instead caused test.sh to issue 'make
installcheck-parallel TABLESPACEDIR="$outputdir"/testtablespace' and have the
makefiles consult $(TABLESPACEDIR) rather than hard-code ./testtablespace.
(That said, I like how 6c788d9 consolidated Windows and non-Windows paths.)

> I can get that it could be interesting
> to be able to pass down a custom path for the test tablespace, but do
> we really have a need for that?

I don't know.  I never considered server_user!=pg_regress_user before this
thread, and I don't plan to use it myself.  Your latest patch originated to
make that case work, and my last message was reporting that the patch works
for a rather-narrow interpretation of server_user!=pg_regress_user, failing on
variations thereof.  That might be fine.




Re: PG Docs - logical decoding output plugins - fix typo

2021-04-12 Thread Fujii Masao




On 2021/04/13 13:23, Peter Smith wrote:

PSA a patch to fix a typo found on this page [1],

"preapre_end_lsn" -> "prepare_end_lsn"


Pushed. Thanks!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: TRUNCATE on foreign table

2021-04-12 Thread Kohei KaiGai
2021年4月9日(金) 23:49 Kohei KaiGai :
>
> 2021年4月9日(金) 22:51 Fujii Masao :
> >
> > On 2021/04/09 12:33, Kohei KaiGai wrote:
> > > 2021年4月8日(木) 22:14 Fujii Masao :
> > >>
> > >> On 2021/04/08 22:02, Kohei KaiGai wrote:
> >  Anyway, attached is the updated version of the patch. This is still 
> >  based on the latest Kazutaka-san's patch. That is, extra list for ONLY 
> >  is still passed to FDW. What about committing this version at first? 
> >  Then we can continue the discussion and change the behavior later if 
> >  necessary.
> > >>
> > >> Pushed! Thank all involved in this development!!
> > >> For record, I attached the final patch I committed.
> > >>
> > >>
> > >>> Ok, it's fair enough for me.
> > >>>
> > >>> I'll try to sort out my thought, then raise a follow-up discussion if 
> > >>> necessary.
> > >>
> > >> Thanks!
> > >>
> > >> The followings are the open items and discussion points that I'm 
> > >> thinking of.
> > >>
> > >> 1. Currently the extra information (TRUNCATE_REL_CONTEXT_NORMAL, 
> > >> TRUNCATE_REL_CONTEXT_ONLY or TRUNCATE_REL_CONTEXT_CASCADING) about how a 
> > >> foreign table was specified as the target to truncate in TRUNCATE 
> > >> command is collected and passed to FDW. Does this really need to be 
> > >> passed to FDW? Seems Stephen, Michael and I think that's necessary. But 
> > >> Kaigai-san does not. I also think that TRUNCATE_REL_CONTEXT_CASCADING 
> > >> can be removed because there seems no use case for that maybe.
> > >>
> > >> 2. Currently when the same foreign table is specified multiple times in 
> > >> the command, the extra information only for the foreign table found 
> > >> first is collected. For example, when "TRUNCATE ft, ONLY ft" is 
> > >> executed, TRUNCATE_REL_CONTEXT_NORMAL is collected and _ONLY is ignored 
> > >> because "ft" is found first. Is this OK? Or we should collect all, e.g., 
> > >> both _NORMAL and _ONLY should be collected in that example? I think that 
> > >> the current approach (i.e., collect the extra info about table found 
> > >> first if the same table is specified multiple times) is good because 
> > >> even local tables are also treated the same way. But Kaigai-san does not.
> > >>
> > >> 3. Currently postgres_fdw specifies ONLY clause in TRUNCATE command that 
> > >> it constructs. That is, if the foreign table is specified with ONLY, 
> > >> postgres_fdw also issues the TRUNCATE command for the corresponding 
> > >> remote table with ONLY to the remote server. Then only root table is 
> > >> truncated in remote server side, and the tables inheriting that are not 
> > >> truncated. Is this behavior desirable? Seems Michael and I think this 
> > >> behavior is OK. But Kaigai-san does not.
> > >>
> > > Prior to the discussion of 1-3, I would like to clarify the role of
> > > foreign tables.
> > > (Likely, it will lead to a natural conclusion for the above open items.)
> > >
> > > As the literal meaning of SQL/MED (Management of External Data) suggests,
> > > a foreign table is a representation of external data in PostgreSQL.
> > > It allows us to read and (optionally) write the external data wrapped by
> > > FDW drivers, just as we read / write heap tables.
> > > Through the FDW APIs, core PostgreSQL does not care about the
> > > structure, location, volume and other characteristics of
> > > the external data itself. It expects FDW API invocations to behave
> > > as if we were accessing a regular heap table.
> > >
> > > On the other hand, we can say local tables are a representation of
> > > "internal" data in PostgreSQL.
> > > A heap table consists of one or more files (of BLCKSZ *
> > > RELSEG_SIZE each), and the table AM mediates between
> > > the on-disk data and the in-memory structure (TupleTableSlot).
> > > There is no big difference in the concept. Ok?
> > >
> > > As you know, the ONLY clause controls whether the TRUNCATE command runs
> > > on the child tables as well, not only the parent.
> > > If "ONLY parent_table" is given, its child tables are not picked up by
> > > ExecuteTruncate(), unless the child tables are
> > > listed individually.
> > > Then, once ExecuteTruncate() has picked up the relations, it makes the
> > > relations empty using the table AM
> > > (relation_set_new_filenode), and the callee
> > > (heapam_relation_set_new_filenode) does not care whether the
> > > table was specified with ONLY or not. It just makes the data
> > > represented by the table empty (in a transactional way).
> > >
> > > So, how should foreign tables behave?
> > >
> > > Once ExecuteTruncate() has picked up a foreign table according to the
> > > ONLY clause, should the FDW driver consider
> > > the context in which the foreign table was specified? And what behavior
> > > is consistent?
> > > I think that the FDW driver should make the external data represented by
> > > the foreign table empty, regardless of its
> > > structure, location, volume and so on.
> > >
> > > Therefore, if we follow the above assumption, we don't need to inform
> > > the context where foreign-tables are

Re: Replication slot stats misgivings

2021-04-12 Thread Masahiko Sawada
On Mon, Apr 12, 2021 at 9:16 PM vignesh C  wrote:
>
> On Mon, Apr 12, 2021 at 4:46 PM Amit Kapila  wrote:
> >
> > On Sat, Apr 10, 2021 at 6:51 PM vignesh C  wrote:
> > >
> >
> > Thanks, 0001 and 0002 look good to me. I have a minor comment for 0002.
> >
> > 
> > +total_bytesbigint
> > +   
> > +   
> > +Amount of decoded transactions data sent to the decoding output 
> > plugin
> > +while decoding the changes from WAL for this slot. This can be 
> > used to
> > +gauge the total amount of data sent during logical decoding.
> >
> > Can we slightly extend it to say something like: Note that this
> > includes the bytes streamed and/or spilled. Similarly, we can extend
> > it for total_txns.
> >
>
> Thanks for the comments, the comments are fixed in the v8 patch attached.
> Thoughts?

Here are review comments on new TAP tests:

+# Create replication slots.
+$node->safe_psql('postgres',
+   "SELECT 'init' FROM
pg_create_logical_replication_slot('regression_slot1',
'test_decoding')");
+$node->safe_psql('postgres',
+   "SELECT 'init' FROM
pg_create_logical_replication_slot('regression_slot2',
'test_decoding')");
+$node->safe_psql('postgres',
+   "SELECT 'init' FROM
pg_create_logical_replication_slot('regression_slot3',
'test_decoding')");
+$node->safe_psql('postgres',
+   "SELECT 'init' FROM
pg_create_logical_replication_slot('regression_slot4',
'test_decoding')");

and

+
+$node->safe_psql('postgres',
+   "SELECT data FROM
pg_logical_slot_get_changes('regression_slot1', NULL, NULL,
'include-xids', '0', 'skip-empty-xacts', '1')");
+$node->safe_psql('postgres',
+   "SELECT data FROM
pg_logical_slot_get_changes('regression_slot2', NULL, NULL,
'include-xids', '0', 'skip-empty-xacts', '1')");
+$node->safe_psql('postgres',
+"SELECT data FROM
pg_logical_slot_get_changes('regression_slot3', NULL, NULL,
'include-xids', '0', 'skip-empty-xacts', '1')");
+$node->safe_psql('postgres',
+   "SELECT data FROM
pg_logical_slot_get_changes('regression_slot4', NULL, NULL,
'include-xids', '0', 'skip-empty-xacts', '1')");

I think we can do those similar queries in a single psql connection
as follows:

 # Create replication slots.
 $node->safe_psql('postgres',
   qq[
SELECT pg_create_logical_replication_slot('regression_slot1', 'test_decoding');
SELECT pg_create_logical_replication_slot('regression_slot2', 'test_decoding');
SELECT pg_create_logical_replication_slot('regression_slot3', 'test_decoding');
SELECT pg_create_logical_replication_slot('regression_slot4', 'test_decoding');
]);

and

$node->safe_psql('postgres',
   qq[
SELECT data FROM pg_logical_slot_get_changes('regression_slot1', NULL,
NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
SELECT data FROM pg_logical_slot_get_changes('regression_slot2', NULL,
NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
SELECT data FROM pg_logical_slot_get_changes('regression_slot3', NULL,
NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
SELECT data FROM pg_logical_slot_get_changes('regression_slot4', NULL,
NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
  ]);

---
+# Wait for the statistics to be updated.
+my $slot1_stat_check_query =
+  "SELECT count(1) = 1 FROM pg_stat_replication_slots WHERE slot_name
= 'regression_slot1' AND total_txns > 0 AND total_bytes > 0;";
+my $slot2_stat_check_query =
+  "SELECT count(1) = 1 FROM pg_stat_replication_slots WHERE slot_name
= 'regression_slot2' AND total_txns > 0 AND total_bytes > 0;";
+my $slot3_stat_check_query =
+  "SELECT count(1) = 1 FROM pg_stat_replication_slots WHERE slot_name
= 'regression_slot3' AND total_txns > 0 AND total_bytes > 0;";
+my $slot4_stat_check_query =
+  "SELECT count(1) = 1 FROM pg_stat_replication_slots WHERE slot_name
= 'regression_slot4' AND total_txns > 0 AND total_bytes > 0;";
+
+# Verify that the statistics have been updated.
+$node->poll_query_until('postgres', $slot1_stat_check_query)
+  or die "Timed out while waiting for statistics to be updated";
+$node->poll_query_until('postgres', $slot2_stat_check_query)
+  or die "Timed out while waiting for statistics to be updated";
+$node->poll_query_until('postgres', $slot3_stat_check_query)
+  or die "Timed out while waiting for statistics to be updated";
+$node->poll_query_until('postgres', $slot4_stat_check_query)
+  or die "Timed out while waiting for statistics to be updated";

We can simplify the above code to something like:

$node->poll_query_until(
   'postgres', qq[
SELECT count(slot_name) >= 4
FROM pg_stat_replication_slots
WHERE slot_name ~ 'regression_slot'
AND total_txns > 0
AND total_bytes > 0;
]) or die "Timed out while waiting for statistics to be updated";

---
+# Test to remove one of the replication slots and adjust max_replication_slots
+# accordingly to the number of slots and verify replication statistics data is
+# fine after restart.

I think it's better if we explain in detail what cases we're trying to
test. How about the following 

Re: doc review for v14

2021-04-12 Thread Justin Pryzby
On Fri, Apr 09, 2021 at 02:03:27PM +0900, Michael Paquier wrote:
> On Thu, Apr 08, 2021 at 11:40:08AM -0500, Justin Pryzby wrote:
> > Another round of doc review, not yet including all of yesterday's commits.
> 
> Thanks for compiling all that.  I got through the whole set and
> applied the most relevant parts on HEAD.  Some of them applied down to
> 9.6, so I have fixed it down where needed, for the parts that did not
> conflict too heavily.

Thanks.  Rebased with remaining, queued fixes.

-- 
Justin
>From 58b6a3ae7d76f6160bc7ad531f400da2fe0710eb Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Thu, 4 Mar 2021 19:32:05 -0600
Subject: [PATCH 1/7] doc review: Add support for PROVE_TESTS and PROVE_FLAGS
 in MSVC scripts

5bca69a76b3046a85c60c48271c1831fd5affa51
---
 doc/src/sgml/install-windows.sgml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/install-windows.sgml b/doc/src/sgml/install-windows.sgml
index 64687b12e6..cb6bb05dc5 100644
--- a/doc/src/sgml/install-windows.sgml
+++ b/doc/src/sgml/install-windows.sgml
@@ -499,8 +499,8 @@ $ENV{PERL5LIB}=$ENV{PERL5LIB} . ';c:\IPC-Run-0.94\lib';
 
   
The TAP tests run with vcregress support the
-   environment variables PROVE_TESTS, that is expanded
-   automatically using the name patterns given, and
+   environment variables PROVE_TESTS, which is
+   expanded as a glob pattern, and
PROVE_FLAGS. These can be set on a Windows terminal,
before running vcregress:
 
-- 
2.17.0

>From 705507708c977e8172a2e091378099e630bad1b0 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Tue, 23 Mar 2021 13:39:36 -0500
Subject: [PATCH 2/7] doc review: Pass all scan keys to BRIN consistent
 function at once

commit a1c649d889bdf6e74e9382e1e28574d7071568de
---
 doc/src/sgml/brin.sgml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/src/sgml/brin.sgml b/doc/src/sgml/brin.sgml
index d2f12bb605..d2476481af 100644
--- a/doc/src/sgml/brin.sgml
+++ b/doc/src/sgml/brin.sgml
@@ -833,7 +833,7 @@ typedef struct BrinOpcInfo
   Returns whether all the ScanKey entries are consistent with the given
   indexed values for a range.
   The attribute number to use is passed as part of the scan key.
-  Multiple scan keys for the same attribute may be passed at once, the
+  Multiple scan keys for the same attribute may be passed at once; the
   number of entries is determined by the nkeys parameter.
  
 
-- 
2.17.0

>From b9bf3485cbd243bc09fcfd0f043ee8162b779e2f Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Fri, 26 Mar 2021 23:14:57 -0500
Subject: [PATCH 3/7] doc review: BRIN minmax-multi indexes

ab596105b55f1d7fbd5a66b66f65227d210b047d
---
 doc/src/sgml/brin.sgml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/brin.sgml b/doc/src/sgml/brin.sgml
index d2476481af..ce7c210575 100644
--- a/doc/src/sgml/brin.sgml
+++ b/doc/src/sgml/brin.sgml
@@ -730,7 +730,7 @@ LOG:  request for BRIN range summarization for index "brin_wi_idx" page 128 was
  for . When set to a positive value,
  each block range is assumed to contain this number of distinct non-null
  values. When set to a negative value, which must be greater than or
- equal to -1, the number of distinct non-null is assumed linear with
+ equal to -1, the number of distinct non-null values is assumed to grow linearly with
  the maximum possible number of tuples in the block range (about 290
  rows per block). The default value is -0.1, and
  the minimum number of distinct non-null values is 16.
@@ -1214,7 +1214,7 @@ typedef struct BrinOpcInfo
 
  
   The minmax-multi operator class is also intended for data types implementing
-  a totally ordered sets, and may be seen as a simple extension of the minmax
+  a totally ordered set, and may be seen as a simple extension of the minmax
   operator class. While minmax operator class summarizes values from each block
   range into a single contiguous interval, minmax-multi allows summarization
   into multiple smaller intervals to improve handling of outlier values.
-- 
2.17.0

>From 61ee22354dc00eb60fcbd88d83dbc5b9316f Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sat, 10 Apr 2021 20:56:35 -0500
Subject: [PATCH 4/7] reltuples not relpages: Set pg_class.reltuples for
 partitioned tables

commit 0e69f705cc1a3df273b38c9883fb5765991e04fe
Author: Alvaro Herrera 
Date:   Fri Apr 9 11:29:08 2021 -0400
---
 src/backend/commands/analyze.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index cffcd54302..8aa329a2a0 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -660,7 +660,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 	{
 		/*
 		 * Partitioned tables don't have storage, so we don't set any fields in
-		 * their pg_class entries except for relpages, which is necessary for
+		 * their pg_class 

PG Docs - logical decoding output plugins - fix typo

2021-04-12 Thread Peter Smith
PSA a patch to fix a typo found on this page [1],

"preapre_end_lsn" -> "prepare_end_lsn"

--
[1] https://www.postgresql.org/docs/devel/logicaldecoding-output-plugin.html

Kind Regards,
Peter Smith.
Fujitsu Australia


fix_typo.patch
Description: Binary data


Re: TRUNCATE on foreign table

2021-04-12 Thread Justin Pryzby
On Tue, Apr 13, 2021 at 12:38:35PM +0900, Fujii Masao wrote:
> + Relation data structures for each
> + foreign tables to be truncated.
> 
> "foreign tables" should be "foreign table" because it follows "each"?

Yes, you're right.

> +
> + behavior is either
> + DROP_RESTRICT or DROP_CASCADE.
> + DROP_CASCADE indicates that the
> + CASCADE option was specified in the original
>   TRUNCATE command.
> 
> Why did you remove the description for DROP_RESTRICT?

Because in order to handle the default/unspecified case, the description was
going to need to be something like:

| DROP_RESTRICT indicates that the RESTRICT option was specified in the original
| truncate command (or CASCADE option was NOT specified).

-- 
Justin




Re: TRUNCATE on foreign table

2021-04-12 Thread Fujii Masao




On 2021/04/13 10:21, Bharath Rupireddy wrote:

I agree to convert to bits and pass it as int value which is
extensible i.e. we can pass more extra parameters across if required.


Looks good to me.



Also, I'm not in favour of removing relids_extra altogether; we might
need it to send some info in the future.

Both 0001 and 0002 (along with the new phrasings) look good to me.


Thanks for updating the patches!

One question is: is the "CONTEXT" in "TRUNCATE_REL_CONTEXT_ONLY" required?
If not, I'm tempted to shorten the name to "TRUNCATE_REL_ONLY" or something.

+ Relation data structures for each
+ foreign tables to be truncated.

"foreign tables" should be "foreign table" because it follows "each"?

+
+ behavior is either
+ DROP_RESTRICT or DROP_CASCADE.
+ DROP_CASCADE indicates that the
+ CASCADE option was specified in the original
  TRUNCATE command.

Why did you remove the description for DROP_RESTRICT?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: [PATCH] Identify LWLocks in tracepoints

2021-04-12 Thread Andres Freund
Hi,

On 2021-04-13 10:34:18 +0800, Craig Ringer wrote:
> > But it's near trivial to add that.
> 
> Really?

Yes.


> Each backend can have different tranche IDs (right?)

No, they have to be the same in each. Note how the tranche ID is part of
struct LWLock. Which is why LWLockNewTrancheId() has to acquire a lock
etc.


> But if I'm looking for performance issues caused by excessive LWLock
> contention or waits, LWLocks held too long, [...] or the like, it's
> something I want to capture across the whole postgres instance.

Sure.

Although I still don't really buy that static tracepoints are the best
way to measure this kind of thing, given the delay introducing them and
the cost of having them around. I think I pointed out
https://postgr.es/m/20200813004233.hdsdfvufqrbdwzgr%40alap3.anarazel.de
before.


> LWLock lock-ordering deadlocks

This seems unrelated to tracepoints to me.


> and there's no way to know what a given non-built-in tranche ID means
> for any given backend without accessing backend-specific in-memory
> state. Including for non-user-accessible backends like bgworkers and
> auxprocs, where it's not possible to just query the state from a view
> directly.

The only per-backend part is that some backends might not know the
tranche name for dynamically registered tranches where the
LWLockRegisterTranche() hasn't been executed in a backend. Which should
pretty much never be an aux process or such. And even for bgworkers it
seems like a pretty rare thing, because those need to be started by
something...

It might be worth proposing a shared hashtable with tranche names and
just reserving enough space for ~a hundred entries...

> And you can always build without `--enable-dtrace` and ... just not care.

Practically speaking, distributions enable it, which then incurs the
cost for everyone.



> Take a look at "sudo perf list".
> 
> 
>   sched:sched_kthread_work_execute_end   [Tracepoint event]
>   sched:sched_kthread_work_execute_start [Tracepoint event]
>   ...
>   sched:sched_migrate_task   [Tracepoint event]
>   ...
>   sched:sched_process_exec   [Tracepoint event]
>   ...
>   sched:sched_process_fork   [Tracepoint event]
>   ...
>   sched:sched_stat_iowait[Tracepoint event]
>   ...
>   sched:sched_stat_sleep [Tracepoint event]
>   sched:sched_stat_wait  [Tracepoint event]
>   ...
>   sched:sched_switch [Tracepoint event]
>   ...
>   sched:sched_wakeup [Tracepoint event]
> 
> The kernel is packed with extremely useful trace events, and for very
> good reasons. Some on very hot paths.

IIRC those aren't really comparable - the kernel actually does modify
the executable code to replace the tracepoints with nops.


Greetings,

Andres Freund




Re: pg_amcheck contrib application

2021-04-12 Thread Mark Dilger


> On Apr 9, 2021, at 1:51 PM, Robert Haas  wrote:
> 
> On Fri, Apr 9, 2021 at 2:50 PM Mark Dilger  
> wrote:
>> I think #4, above, requires some clarification.  If there are missing 
>> chunks, the very definition of how large we expect subsequent chunks to be 
>> is ill-defined.  I took a fairly conservative approach to avoid lots of 
>> bogus complaints about chunks that are of unexpected size.   Not all such 
>> complaints are removed, but enough are removed that I needed to add a final 
>> complaint at the end about the total size seen not matching the total size 
>> expected.
> 
> My instinct is to suppose that the size that we expect for future
> chunks is independent of anything being wrong with previous chunks. So
> if each chunk is supposed to be 2004 bytes (which probably isn't the
> real number) and the value is 7000 bytes long, we expect chunks 0-2 to
> be 2004 bytes each, chunk 3 to be 988 bytes, and chunk 4 and higher to
> not exist. If chunk 1 happens to be missing or the wrong length or
> whatever, our expectations for chunks 2 and 3 are utterly unchanged.

Fair enough.
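
(For reference, the rule Robert describes above can be written down as a tiny
standalone helper; the 2004-byte chunk size and the function name below are
just the example's numbers, not pg_amcheck internals.)

#include <stdio.h>

#define CHUNK_SIZE 2004         /* per-chunk payload size used in the example */

/* Expected size of a given chunk, based only on the total value size. */
static int
expected_chunk_size(int total_size, int chunk_seq)
{
    int last_chunk = (total_size - 1) / CHUNK_SIZE;   /* highest valid chunk_seq */

    if (chunk_seq < last_chunk)
        return CHUNK_SIZE;                            /* full-sized chunk */
    if (chunk_seq == last_chunk)
        return total_size - last_chunk * CHUNK_SIZE;  /* trailing partial chunk */
    return 0;                                         /* chunk should not exist */
}

int
main(void)
{
    /* 7000-byte value: chunks 0-2 are 2004 bytes, chunk 3 is 988, chunk 4 absent */
    for (int seq = 0; seq <= 4; seq++)
        printf("chunk %d: expected size %d\n", seq, expected_chunk_size(7000, seq));
    return 0;
}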

>> Corruption #1:
>> 
>>UPDATE $toastname SET chunk_seq = chunk_seq + 1000
>> 
>> Before:
>> 
>> # heap table "postgres"."public"."test", block 0, offset 2, attribute 2:
>> # toast value 16445 chunk 0 has sequence number 1000, but expected 
>> sequence number 0
>> # heap table "postgres"."public"."test", block 0, offset 2, attribute 2:
>> # toast value 16445 chunk 1 has sequence number 1001, but expected 
>> sequence number 1
>> # heap table "postgres"."public"."test", block 0, offset 2, attribute 2:
>> # toast value 16445 chunk 2 has sequence number 1002, but expected 
>> sequence number 2
>> # heap table "postgres"."public"."test", block 0, offset 2, attribute 2:
>> # toast value 16445 chunk 3 has sequence number 1003, but expected 
>> sequence number 3
>> # heap table "postgres"."public"."test", block 0, offset 2, attribute 2:
>> # toast value 16445 chunk 4 has sequence number 1004, but expected 
>> sequence number 4
>> # heap table "postgres"."public"."test", block 0, offset 2, attribute 2:
>> # toast value 16445 chunk 5 has sequence number 1005, but expected 
>> sequence number 5
>> 
>> After:
>> 
>> # heap table "postgres"."public"."test", block 0, offset 2, attribute 2:
>> # toast value 16445 missing chunks 0 through 999
> 
> Applying the above principle would lead to complaints that chunks 0-5
> are missing, and 1000-1005 are extra.

That sounds right.  It now reports:

# heap table "postgres"."public"."test", block 0, offset 16, attribute 2:
# toast value 16459 missing chunks 0 through 4 with expected size 1996 and 
chunk 5 with expected size 20
# heap table "postgres"."public"."test", block 0, offset 16, attribute 2:
# toast value 16459 unexpected chunks 1000 through 1004 each with size 1996 
followed by chunk 1005 with size 20


>> Corruption #2:
>> 
>>UPDATE $toastname SET chunk_seq = chunk_seq * 1000
> 
> Similarly here, except the extra chunk numbers are different.

It now reports:

# heap table "postgres"."public"."test", block 0, offset 17, attribute 2:
# toast value 16460 missing chunks 1 through 4 with expected size 1996 and 
chunk 5 with expected size 20
# heap table "postgres"."public"."test", block 0, offset 17, attribute 2:
# toast value 16460 unexpected chunk 1000 with size 1996
# heap table "postgres"."public"."test", block 0, offset 17, attribute 2:
# toast value 16460 unexpected chunk 2000 with size 1996
# heap table "postgres"."public"."test", block 0, offset 17, attribute 2:
# toast value 16460 unexpected chunk 3000 with size 1996
# heap table "postgres"."public"."test", block 0, offset 17, attribute 2:
# toast value 16460 unexpected chunk 4000 with size 1996
# heap table "postgres"."public"."test", block 0, offset 17, attribute 2:
# toast value 16460 unexpected chunk 5000 with size 20

I don't see any good way in this case to report the extra chunks in one row, as 
in the general case there could be arbitrarily many of them, with the message 
text getting arbitrarily large if it reported the chunks as "chunks 1000, 2000, 
3000, 4000, 5000, ...".  I don't expect this sort of corruption to be 
particularly common, though, so I'm not too bothered about it.

> 
>> Corruption #3:
>> 
>>UPDATE $toastname SET chunk_id = (chunk_id::integer + 1000)::oid 
>> WHERE chunk_seq = 3
> 
> And here we'd just get a complaint that chunk 3 is missing.

It now reports:

# heap table "postgres"."public"."test", block 0, offset 18, attribute 2:
# toast value 16461 missing chunk 3 with expected size 1996
# heap table "postgres"."public"."test", block 0, offset 18, attribute 2:
# toast value 16461 was expected to end at chunk 6 with total size 1, 
but ended at chunk 5 with total size 8004

It sounds like you weren't expecting the second of these reports.  I think it 
is valuable, especially when there are multiple 

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

2021-04-12 Thread Melanie Plageman
On Sun, Jan 26, 2020 at 11:21 PM Kyotaro Horiguchi
 wrote:
> At Sun, 26 Jan 2020 12:22:03 -0800, Andres Freund  wrote 
> in
> > On 2020-01-26 16:20:03 +0100, Magnus Hagander wrote:
> > > On Sun, Jan 26, 2020 at 1:44 AM Andres Freund  wrote:
> > > > On 2020-01-25 15:43:41 +0100, Magnus Hagander wrote:
> > > > > On Fri, Jan 24, 2020 at 8:52 PM Andres Freund  
> > > > > wrote:
> > > > > > Lastly, I don't understand what the point of sending fixed size 
> > > > > > stats,
> > > > > > like the stuff underlying pg_stat_bgwriter, through pgstats IPC. 
> > > > > > While
> > > > > > I don't like it's architecture, we obviously need something like 
> > > > > > pgstat
> > > > > > to handle variable amounts of stats (database, table level etc
> > > > > > stats). But that doesn't at all apply to these types of global 
> > > > > > stats.
> > > > >
> > > > > That part has annoyed me as well a few times. +1 for just moving that
> > > > > into a global shared memory. Given that we don't really care about
> > > > > things being in sync between those different counters *or* if we lose
> > > > > a bit of data (which the stats collector is designed to do), we could
> > > > > even do that without a lock?
> > > >
> > > > I don't think we'd quite want to do it without any (single counter)
> > > > synchronization - high concurrency setups would be pretty likely to
> > > > loose values that way. I suspect the best would be to have a struct in
> > > > shared memory that contains the potential counters for each potential
> > > > process. And then sum them up when actually wanting the concrete
> > > > value. That way we avoid unnecessary contention, in contrast to having a
> > > > single shared memory value for each(which would just pingpong between
> > > > different sockets and store buffers).  There's a few details like how
> > > > exactly to implement resetting the counters, but ...
> > >
> > > Right. Each process gets to do their own write, but still in shared
> > > memory. But do you need to lock them when reading them (for the
> > > summary)? That's the part where I figured you could just read and
> > > summarize them, and accept the possible loss.
> >
> > Oh, yea, I'd not lock for that. On nearly all machines aligned 64bit
> > integers can be read / written without a danger of torn values, and I
> > don't think we need perfect cross counter accuracy. To deal with the few
> > platforms without 64bit "single copy atomicity", we can just use
> > pg_atomic_read/write_u64. These days (e8fdbd58fe) they automatically
> > fall back to using locked operations for those platforms.  So I don't
> > think there's actually a danger of loss.
> >
> > Obviously we could also use atomic ops to increment the value, but I'd
> > rather not add all those atomic operations, even if it's on uncontended
> > cachelines. It'd allow us to reset the backend values more easily by
> > just swapping in a 0, which we can't do if the backend increments
> > non-atomically. But I think we could instead just have one global "bias"
> > value to implement resets (by subtracting that from the summarized
> > value, and storing the current sum when resetting). Or use the new
> > global barrier to trigger a reset. Or something similar.
>
> Fixed or global stats are a suitable starting point for the shared-memory
> stats collector. In the case of buffers_*_write, the global stats
> entry for each process needs just 8 bytes, plus maybe an extra 8 bytes for
> the bias value.  I'm not sure how many counters like this there are,
> but is such a footprint acceptable?  (Each backend already uses
> the same amount of local memory for pgstat use, though.)
>
> Anyway I will do something like that as a trial, maybe by adding a
> member in PgBackendStatus and one global shared value for the bias.
>
>   int64   st_progress_param[PGSTAT_NUM_PROGRESS_PARAM];
> + PgBackendStatsCounters counters;
>   } PgBackendStatus;
>
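
(For reference, the per-backend-slot idea quoted above boils down to something
like the following standalone sketch: each process writes only its own slot,
readers sum the slots, and a reset is implemented as a bias subtracted at read
time. The struct and field names here are made up for illustration, not the
actual PgBackendStatus layout.)

#include <stdint.h>
#include <stdio.h>

#define MAX_BACKENDS 64              /* illustrative cap */

/* One counter slot per backend, as if it lived in shared memory. */
typedef struct BackendIOStats
{
    uint64_t buffers_backend_written;
} BackendIOStats;

static BackendIOStats slots[MAX_BACKENDS];
static uint64_t reset_bias;          /* subtracted at read time to implement reset */

/* Writer side: each backend touches only its own slot, so no lock is needed. */
static void
count_backend_write(int backend_id)
{
    slots[backend_id].buffers_backend_written++;
}

/*
 * Reader side: sum all slots.  In the real multi-process case, aligned 64-bit
 * reads are assumed not to tear, per the discussion above; slight staleness
 * is acceptable.
 */
static uint64_t
total_backend_writes(void)
{
    uint64_t sum = 0;

    for (int i = 0; i < MAX_BACKENDS; i++)
        sum += slots[i].buffers_backend_written;
    return sum - reset_bias;
}

/* Reset: record the current total as the bias instead of zeroing writers' slots. */
static void
reset_backend_writes(void)
{
    reset_bias += total_backend_writes();
}

int
main(void)
{
    count_backend_write(0);
    count_backend_write(1);
    count_backend_write(1);
    printf("total = %llu\n", (unsigned long long) total_backend_writes());       /* 3 */
    reset_backend_writes();
    printf("after reset = %llu\n", (unsigned long long) total_backend_writes()); /* 0 */
    return 0;
}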

So, I took a stab at implementing this in PgBackendStatus. The attached
patch is not quite on top of current master, so, alas, don't try and
apply it. I went to rebase today and realized I needed to make some
changes in light of e1025044cd4, however, I wanted to share this WIP so
that I could pose a few questions that I imagine will still be relevant
after I rewrite the patch.

I removed buffers_backend and buffers_backend_fsync from
pg_stat_bgwriter and have created a new view which tracks
  - number of shared buffers the checkpointer and bgwriter write out
  - number of shared buffers a regular backend is forced to flush
  - number of extends done by a regular backend through shared buffers
  - number of buffers flushed by a backend or autovacuum using a
BufferAccessStrategy which, were they not to use this strategy,
could perhaps have been avoided if a clean shared buffer was
available
  - number of fsyncs done by a backend which could have been done by
checkpointer if sync queue had not been full

This view currently does only track 

Re: [PATCH] Identify LWLocks in tracepoints

2021-04-12 Thread Craig Ringer
On Tue, 13 Apr 2021 at 02:23, Andres Freund  wrote:

[I've changed the order of the quoted sections a little to prioritize
the key stuff]

>
> On 2021-04-12 14:31:32 +0800, Craig Ringer wrote:
>
> > It's annoying that we have to pay the cost of computing the tranche name
> > though. It never used to matter, but now that T_NAME() expands to
> > GetLWTrancheName() calls as of 29c3e2dd5a6 it's going to cost a little more
> > on such a hot path. I might see if I can do a little comparison and see how
> > much.  I could add TRACE_POSTGRESQL_<>_ENABLED() guards 
> > since we
> > do in fact build with SDT semaphore support. That adds a branch for each
> > tracepoint, but they're already marshalling arguments and making a function
> > call that does lots more than a single branch, so that seems pretty
> > sensible.
>
> I am against adding any overhead for this feature. I honestly think the
> probes we have right now in postgres do not provide a commensurate
> benefit.

I agree that the probes we have now are nearly useless, if not
entirely useless. The transaction management ones are misplaced and
utterly worthless. The LWLock ones don't carry enough info to be much
use and are incomplete. I doubt anybody uses any of them at all, or
would even notice their absence.

In terms of overhead, what is in place right now is not free. It used
to be very cheap, but since 29c3e2dd5a6 it's not. I'd like to reduce
the current cost and improve functionality at the same time, so it's
actually useful.


> > * There is no easy way to look up the tranche name by ID from outside the
> > backend
>
> But it's near trivial to add that.

Really?

We can expose a pg_catalog.lwlock_tranches view that lets you observe
the current mappings for any given user backend I guess.

But if I'm looking for performance issues caused by excessive LWLock
contention or waits, LWLocks held too long, LWLock lock-ordering
deadlocks, or the like, it's something I want to capture across the
whole postgres instance. Each backend can have different tranche IDs
(right?) and there's no way to know what a given non-built-in tranche
ID means for any given backend without accessing backend-specific
in-memory state. Including for non-user-accessible backends like
bgworkers and auxprocs, where it's not possible to just query the
state from a view directly.

So we'd be looking at some kind of shm based monstrosity. That doesn't
sound appealing. Worse, there's no way to solve races with it - is a
given tranche ID already allocated when you see it? If not, can you
look it up from the backend before the backend exits/dies? For that
matter, how do you do that, since the connection to the backend is
likely under the control of an application, not your monitoring and
diagnostic tooling.

Some trace tools can poke backend memory directly, but it generally
requires debuginfo, is fragile and Pg version specific, slow, and a
real pain to use. If we don't attach the LWLock names to the
tracepoints in some way they're pretty worthless.

Again, I don't plan to add new costs here. I'm actually proposing to
reduce an existing cost.

And you can always build without `--enable-dtrace` and ... just not care.

Anyway - I'll do some `perf` runs shortly to quantify this:

* With/without tracepoints at all
* With/without names in tracepoints
* With/without tracepoint refcounting (_ENABLED() semaphores)

so as to rely less on handwaving.
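
(For context, the _ENABLED() guard being measured boils down to a cheap test of
an SDT-style semaphore before any expensive probe-argument marshalling is done.
A standalone sketch of the pattern, with stub names rather than the real
probes.d-generated macros:)

#include <stdio.h>

/*
 * Stand-in for an SDT semaphore: with real USDT probes this is a counter
 * that the tracing tool bumps while it is attached.
 */
static unsigned short lwlock_acquire_semaphore = 0;

/* Stand-ins for the expensive work the guard is meant to avoid. */
static const char *
lookup_tranche_name(int tranche_id)
{
    static char buf[32];

    snprintf(buf, sizeof(buf), "tranche_%d", tranche_id);  /* pretend this is costly */
    return buf;
}

static void
emit_probe(const char *tranche_name, int mode)
{
    printf("probe: lwlock-acquire %s mode=%d\n", tranche_name, mode);
}

static void
trace_lwlock_acquire(int tranche_id, int mode)
{
    /* The guard: a single predictable branch when no tracer is attached. */
    if (lwlock_acquire_semaphore > 0)
        emit_probe(lookup_tranche_name(tranche_id), mode);
}

int
main(void)
{
    trace_lwlock_acquire(42, 0);      /* no tracer attached: near-free */
    lwlock_acquire_semaphore = 1;     /* simulate a tracer attaching */
    trace_lwlock_acquire(42, 0);      /* now the name lookup and probe fire */
    return 0;
}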

> > (Those can also be used with systemtap guru mode scripts to do things
> > like turn a particular elog(DEBUG) into a PANIC at runtime for
> > diagnostic purposes).
>
> Yikes.
>

Well, it's not like it can happen by accident. You have to
deliberately write a script that twiddles process memory, using a tool
that requires special privileges and

I recently had to prepare a custom build for a customer that converted
an elog(DEBUG) into an elog(PANIC) in order to capture a core with
much better diagnostic info for a complex, hard to reproduce and
intermittent memory management issue. It would've been rather nice to
be able to do so with a trace marker instead of a custom build.

> > There are a TON of probes I want to add, and I have a tree full of them
> > waiting to submit progressively. Yes, ability to probe all GUCs is in
> > there. So is detail on walsender, reorder buffer, and snapshot builder
> > activity. Heavyweight lock SDTs. A probe that identifies the backend type
> > at startup. SDT probe events emitted for every wait-event. Probes in elog.c
> > to let probes observe error unwinding, capture error messages,
> > etc. [...] A probe that fires whenever debug_query_string
> > changes. Lots. But I can't submit them all at once, especially without
> > some supporting use cases and scripts that other people can use so
> > they can understand why these probes are useful.
>
> -1. This is not scalable. Adding static probes all over has both a
> runtime (L1I, branches, code optimization) and maintenance overhead.

Take a look at "sudo perf list".


  

RE: WIP: WAL prefetch (another approach)

2021-04-12 Thread Shinoda, Noriyoshi (PN Japan FSIP)
Hi, 

Thank you for developing a great feature. I tested this feature and checked the 
documentation.
Currently, the documentation for the pg_stat_prefetch_recovery view is included 
in the description for the pg_stat_subscription view.

https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-PG-STAT-SUBSCRIPTION

It is also not displayed in the list of "28.2. The Statistics Collector".
https://www.postgresql.org/docs/devel/monitoring.html

The attached patch modifies the pg_stat_prefetch_recovery view to appear as a 
separate view.

Regards,
Noriyoshi Shinoda

-Original Message-
From: Thomas Munro [mailto:thomas.mu...@gmail.com] 
Sent: Saturday, April 10, 2021 5:46 AM
To: Justin Pryzby 
Cc: Tomas Vondra ; Stephen Frost 
; Andres Freund ; Jakub Wartak 
; Alvaro Herrera ; Tomas 
Vondra ; Dmitry Dolgov <9erthali...@gmail.com>; 
David Steele ; pgsql-hackers 
Subject: Re: WIP: WAL prefetch (another approach)

On Sat, Apr 10, 2021 at 8:37 AM Justin Pryzby  wrote:
> Did you see this?
> INVALID URI REMOVED
>
> I meant to mail you so you could include it in the same commit, but 
> forgot until now.

Done, thanks.




pg_stat_prefetch_recovery_doc_v1.diff
Description: pg_stat_prefetch_recovery_doc_v1.diff


Re: vacuum freeze - possible improvements

2021-04-12 Thread Masahiko Sawada
On Mon, Apr 12, 2021 at 5:38 PM Virender Singla  wrote:
>
> Hi Postgres Community,
>
> Regarding anti-wraparound vacuums (to freeze tuples), I see that they have to
> scan all the pages which are not all-frozen (looking at the visibility map). That
> means even if we want to freeze fewer transactions (for example, by
> increasing the parameter vacuum_freeze_min_age to 1B), it will still scan all the
> pages not marked all-frozen in the visibility map, which is a time-consuming process.

If vacuum_freeze_min_age is 1 billion, autovacuum_freeze_max_age is 2
billion (vacuum_freeze_min_age is limited to half of
autovacuum_freeze_max_age). So vacuum freeze will still have to
process tuples that were inserted/modified while consuming 1 billion
transactions. It seems to me that it's not fewer transactions. What is
the use case where users want to freeze fewer transactions, meaning
invoking anti-wraparound vacuums more frequently?

>
> Can there be any improvement to this process so that VACUUM knows the
> tuples/pages of those transactions which need to be frozen?
>
> The benefit of such an improvement is that if we are reaching a transaction ID
> close to 2B (and facing downtime), we can quickly recover the database by
> freeze-vacuuming only a few million rows with a quick lookup rather than
> going through all the pages in the visibility map.

Apart from this idea, in terms of speeding up vacuum, the
vacuum_failsafe_age parameter, introduced in PG14 [1], would also be
helpful. When the failsafe is triggered, cost-based delay is no longer
applied, and index vacuuming is bypassed in order to finish the vacuum
work and advance relfrozenxid as quickly as possible.

Regards

[1] 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=1e55e7d1755cefbb44982fbacc7da461fa8684e6

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

2021-04-12 Thread Amit Langote
On Mon, Apr 12, 2021 at 6:23 AM Justin Pryzby  wrote:
> On Sun, Apr 11, 2021 at 05:20:35PM -0400, Alvaro Herrera wrote:
> > On 2021-Mar-31, Tom Lane wrote:
> >
> > > diff -U3 
> > > /home/buildfarm/trilobite/buildroot/HEAD/pgsql.build/src/test/isolation/expected/detach-partition-concurrently-4.out
> > >  
> > > /home/buildfarm/trilobite/buildroot/HEAD/pgsql.build/src/test/isolation/output_iso/results/detach-partition-concurrently-4.out
> > > --- 
> > > /home/buildfarm/trilobite/buildroot/HEAD/pgsql.build/src/test/isolation/expected/detach-partition-concurrently-4.out
> > > 2021-03-29 20:14:21.258199311 +0200
> > > +++ 
> > > /home/buildfarm/trilobite/buildroot/HEAD/pgsql.build/src/test/isolation/output_iso/results/detach-partition-concurrently-4.out
> > >   2021-03-30 18:54:34.272839428 +0200
> > > @@ -324,6 +324,7 @@
> > >  1
> > >  2
> > >  step s1insert: insert into d4_fk values (1);
> > > +ERROR:  insert or update on table "d4_fk" violates foreign key 
> > > constraint "d4_fk_a_fkey"
> > >  step s1c: commit;
> > >
> > >  starting permutation: s2snitch s1b s1s s2detach s1cancel s3vacfreeze s1s 
> > > s1insert s1c
> >
> > Hmm, actually, looking at this closely, I think the expected output is
> > bogus and trilobite is doing the right thing by throwing this error
> > here.  The real question is why isn't this case behaving in that way in
> > every *other* animal.
>
> I was looking/thinking at it, and wondered whether it could be a race 
> condition
> involving pg_cancel_backend()

I thought about it some and couldn't come up with an explanation as to
why pg_cancel_backend() race might be a problem.

Actually it occurred to me this morning that CLOBBER_CACHE_ALWAYS is
what exposed this problem on this animal (not sure if other such
animals did too though).  With CLOBBER_CACHE_ALWAYS, a PartitionDesc
will be built afresh on most uses.  In this particular case, the RI
query executed by the insert has to build a new one (for d4_primary),
correctly excluding the detach-pending partition (d4_primary1) per the
snapshot with which it is run.  In normal builds, it would reuse the
one built by an earlier query in the transaction, which does include
the detach-pending partition, thus allowing the insert trying to
insert a row referencing that partition to succeed.  There is a
provision in RelationGetPartitionDesc() to rebuild if any
detach-pending partitions in the existing copy of PartitionDesc are
not to be seen by the current query, but a bug mentioned in my earlier
reply prevents that from kicking in.


--
Amit Langote
EDB: http://www.enterprisedb.com




Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2021-04-12 Thread Bruce Momjian
On Thu, Apr  8, 2021 at 01:01:42PM -0400, Bruce Momjian wrote:
> On Thu, Apr  8, 2021 at 12:51:06PM -0400, Álvaro Herrera wrote:
> > On 2021-Apr-08, Bruce Momjian wrote:
> > 
> > > pg_stat_activity.queryid is new, but I can imagine cases where you would
> > > join pg_stat_activity to pg_stat_statements to get an estimate of how
> > > long the query will take --- having one using an underscore and another
> > > one not seems odd.
> > 
> > OK.  So far, you have one vote for queryid (your own) and two votes for
> > query_id (mine and Julien's).  And even yourself were hesitating about
> > it earlier in the thread.
> 
> OK, if people are fine with pg_stat_activity.query_id not matching
> pg_stat_statements.queryid, I am fine with that.  I just don't want
> someone to say it was a big mistake later.  ;-)

OK, the attached patch renames pg_stat_activity.queryid to 'query_id'. I
have not changed any of the APIs which existed before this feature was
added, and are called "queryid" or "queryId" --- it is kind of a mess. 
I assume I should leave those unchanged.  It will also need a catversion
bump.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.



query_id.diff.gz
Description: application/gzip


Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

2021-04-12 Thread James Coleman
On Mon, Apr 12, 2021 at 7:49 PM David Rowley  wrote:
>
> On Tue, 13 Apr 2021 at 11:42, Tom Lane  wrote:
> >
> > David Rowley  writes:
> > > I realised when working on something unrelated last night that we can
> > > also do hash lookups for NOT IN too.
> >
> > ... and still get the behavior right for nulls?
>
> Yeah, it will. There are already some special cases for NULLs in the
> IN version. Those would need to be adapted for NOT IN.

I hadn't thought about using the negator operator directly that way
when I initially wrote the patch.

But also I didn't think a whole lot about the NOT IN case at all --
and there's no mention of it that I can see in this thread or the
precursor thread. It's pretty obvious that it wasn't part of my
immediate need, but obviously it'd be nice to have the consistency.

All that to say this: my vote would be to put it into PG15 also.
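
(For reference, the NULL behavior that a hashed NOT IN has to preserve follows
SQL's three-valued logic. A standalone sketch of that rule for a non-null
left-hand value, purely illustrative and not the executor's ScalarArrayOpExpr
code:)

#include <stdbool.h>
#include <stdio.h>

typedef enum { TV_FALSE, TV_TRUE, TV_NULL } TernaryValue;

/*
 * x NOT IN (list), for non-null x: FALSE if x matches any non-null element;
 * otherwise NULL if the list contains a null; otherwise TRUE.
 */
static TernaryValue
not_in(int x, const int *elems, const bool *elem_is_null, int nelems)
{
    bool saw_null = false;

    for (int i = 0; i < nelems; i++)
    {
        if (elem_is_null[i])
            saw_null = true;
        else if (elems[i] == x)
            return TV_FALSE;            /* a definite match: NOT IN is false */
    }
    return saw_null ? TV_NULL : TV_TRUE;
}

int
main(void)
{
    static const char *tv_name[] = {"false", "true", "null"};
    int  elems[] = {1, 2, 0};
    bool nulls[] = {false, false, true};   /* third element is NULL */

    printf("%s\n", tv_name[not_in(2, elems, nulls, 3)]);  /* false: 2 is in the list */
    printf("%s\n", tv_name[not_in(5, elems, nulls, 3)]);  /* null: no match, but a NULL exists */
    printf("%s\n", tv_name[not_in(5, elems, nulls, 2)]);  /* true: no match, no NULLs */
    return 0;
}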

James




Re: PANIC: wrong buffer passed to visibilitymap_clear

2021-04-12 Thread Peter Geoghegan
On Mon, Apr 12, 2021 at 6:33 PM Tom Lane  wrote:
> Thanks for looking it over.  Do you have an opinion on whether or not
> to back-patch?  As far as we know, these bugs aren't exposed in the
> back branches for lack of code that would set the all-visible flag
> without superexclusive lock.  But I'd still say that heap_update
> is failing to honor its API contract in these edge cases, and that
> seems like something that could bite us after future back-patches.

If we assume that a scenario like the one we've been debugging can
never happen in the backbranches, then we must also assume that your
fix has negligible risk in the backbranches, because of how it is
structured. And so I agree -- I lean towards backpatching.

-- 
Peter Geoghegan




Re: PANIC: wrong buffer passed to visibilitymap_clear

2021-04-12 Thread Tom Lane
Peter Geoghegan  writes:
> On Mon, Apr 12, 2021 at 9:19 AM Tom Lane  wrote:
>> Hence, I propose the attached.  It passes check-world, but that proves
>> absolutely nothing of course :-(.  I wonder if there is any way to
>> exercise these code paths deterministically.

> This approach seems reasonable to me. At least you've managed to
> structure the visibility map page pin check as concomitant with the
> existing space recheck.

Thanks for looking it over.  Do you have an opinion on whether or not
to back-patch?  As far as we know, these bugs aren't exposed in the
back branches for lack of code that would set the all-visible flag
without superexclusive lock.  But I'd still say that heap_update
is failing to honor its API contract in these edge cases, and that
seems like something that could bite us after future back-patches.
Or there might be third-party code that can set all-visible flags.
So I'm kind of tempted to back-patch, but it's a judgment call.

regards, tom lane




Re: TRUNCATE on foreign table

2021-04-12 Thread Bharath Rupireddy
On Tue, Apr 13, 2021 at 6:27 AM Justin Pryzby  wrote:
>
> On Sun, Apr 11, 2021 at 03:45:36PM +0530, Bharath Rupireddy wrote:
> > On Sun, Apr 11, 2021 at 9:47 AM Justin Pryzby  wrote:
> > > Also, you currently test:
> > > > + if (extra & TRUNCATE_REL_CONTEXT_ONLY)
> > >
> > > but TRUNCATE_REL_ aren't independent bits, so shouldn't be tested with "&".
> >
> > Yeah this is an issue. We could just change the #defines to values
> > 0x0001, 0x0002, 0x0004, 0x0008 ... 0x0020 and so on and then testing
> > with & would work. So, this way, more than one option can be multiplexed
> > into the same int value. To multiplex, we need to think: will there be
> > a scenario where a single rel in the truncate can have multiple
> > options at a time and do we want to distinguish these options while
> > deparsing?
> >
> > #define TRUNCATE_REL_CONTEXT_NORMAL  0x0001 /* specified without
> > ONLY clause */
> > #define TRUNCATE_REL_CONTEXT_ONLY 0x0002 /* specified with
> > ONLY clause */
> > #define TRUNCATE_REL_CONTEXT_CASCADING0x0004   /* not specified
> > but truncated
> >
> > And I'm not sure what the agreement is on retaining or removing the #define
> > values; currently I see only TRUNCATE_REL_CONTEXT_ONLY being used, the
> > others are just being set but not used. As I said upthread, it would be
> > good to remove the unused macros/enums and retain only the ones that are
> > used. In particular, TRUNCATE_REL_CONTEXT_CASCADING is not required, I
> > feel, because we can add the child partitions that are foreign tables
> > to relids as just normal foreign tables with the TRUNCATE_REL_CONTEXT_ONLY
> > option.
>
> Converting to "bits" would collapse TRUNCATE_REL_CONTEXT_ONLY and
> TRUNCATE_REL_CONTEXT_NORMAL into a single bit.  TRUNCATE_REL_CONTEXT_CASCADING
> could optionally be removed.
>
> +1 to convert to bits instead of changing "&" to "==".

Thanks for the patches.

I agree to convert to bits and pass it as an int value, which is
extensible, i.e. we can pass more extra parameters across if required.
Also, I'm not in favour of removing relids_extra altogether; we might
need it to send some info in the future.

Both 0001 and 0002 (along with the new phrasings) look good to me.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Teaching users how they can get the most out of HOT in Postgres 14

2021-04-12 Thread Peter Geoghegan
On Mon, Apr 12, 2021 at 5:37 PM Andres Freund  wrote:
> Well, one argument is that you made a fairly significant behavioural
> change, with hard-coded logic for when the optimization kicks in. It's
> not at all clear that your constants are the right ones for every
> workload.

(Apparently nobody wants to talk about HOT and the documentation.)

The BYPASS_THRESHOLD_PAGES cutoff was chosen conservatively, so that
it would avoid index vacuuming in truly marginal cases -- and it tends
to only delay it there.

A table-level threshold is not the best way of constraining the
problem. In the future, the table threshold should be treated as only
one factor among several. Plus there will be more than a simple yes/no
question to consider. We should eventually be able to do index
vacuuming for some indexes but not others. Bottom-up index deletion
has totally changed things here, because roughly speaking it makes
index bloat proportionate to the number of logical changes to indexed
columns -- you could have one super-bloated index on the table, but
several others that perfectly retain their original size. You still
need to do heap vacuuming eventually, which necessitates vacuuming
indexes too, but the right strategy is probably to vacuum much more
frequently, vacuuming the bloated index each time. You only do a full
round of index vacuuming when the table starts to accumulate way too
many LP_DEAD items. You need a much more sophisticated model for this.
It might also need to hook into autovacuum's scheduling.

One of the dangers of high BYPASS_THRESHOLD_PAGES settings is that
it'll work well for some indexes but not others. To a dramatic degree,
even.

That said, nbtree isn't the only index AM, and it is hard to be
completely sure that you've caught everything. So an off switch seems
like a good idea now.

> We'll likely only get to know whether they're right in > 1 year
> - not having a real out at that point imo is somewhat scary.
>
> That said, adding more and more reloptions has a significant cost, so I
> don't think it's clear cut that it's the right decision to add
> one. Perhaps vacuum_cleanup_index_scale_factor should just be reused for
> BYPASS_THRESHOLD_PAGES?

I think that the right way to do this is to generalize INDEX_CLEANUP
to support a mode of operation that disallows vacuumlazy.c from
applying this optimization, as well as any similar optimizations which
will be added in the future.

Even if you don't buy my argument about directly parameterizing
BYPASS_THRESHOLD_PAGES undermining future work, allowing it to be set
much higher than 5% - 10% would be a pretty big footgun. It might
appear to help at first, but risks destabilizing things much later on.


--
Peter Geoghegan




Re: Possible SSI bug in heap_update

2021-04-12 Thread Thomas Munro
On Mon, Apr 12, 2021 at 10:36 AM Thomas Munro  wrote:
> Yeah.  Patch attached.

Pushed.




Re: TRUNCATE on foreign table

2021-04-12 Thread Justin Pryzby
On Sun, Apr 11, 2021 at 03:45:36PM +0530, Bharath Rupireddy wrote:
> On Sun, Apr 11, 2021 at 9:47 AM Justin Pryzby  wrote:
> > Also, you currently test:
> > > + if (extra & TRUNCATE_REL_CONTEXT_ONLY)
> >
> > but TRUNCATE_REL_ aren't independent bits, so shouldn't be tested with "&".
> 
> Yeah this is an issue. We could just change the #defines to values
> 0x0001, 0x0002, 0x0004, 0x0008 ... 0x0020 and so on and then testing
> with & would work. So, this way, more than one option can be multiplexed
> into the same int value. To multiplex, we need to think: will there be
> a scenario where a single rel in the truncate can have multiple
> options at a time and do we want to distinguish these options while
> deparsing?
> 
> #define TRUNCATE_REL_CONTEXT_NORMAL  0x0001 /* specified without
> ONLY clause */
> #define TRUNCATE_REL_CONTEXT_ONLY 0x0002 /* specified with
> ONLY clause */
> #define TRUNCATE_REL_CONTEXT_CASCADING0x0004   /* not specified
> but truncated
> 
> And I'm not sure what's the agreement on retaining or removing #define
> values, currently I see only TRUNCATE_REL_CONTEXT_ONLY is being used,
> others are just being set but not used. As I said upthread, it will be
> good to remove the unused macros/enums, retain only the ones that are
> used, especially TRUNCATE_REL_CONTEXT_CASCADING, which is not required I
> feel, because we can add the child partitions that are foreign tables
> to relids as just normal foreign tables with TRUNCATE_REL_CONTEXT_ONLY
> option.

Converting to "bits" would collapse TRUNCATE_REL_CONTEXT_ONLY and
TRUNCATE_REL_CONTEXT_NORMAL into a single bit.  TRUNCATE_REL_CONTEXT_CASCADING
could optionally be removed.

+1 to convert to bits instead of changing "&" to "==".

-- 
Justin
>From 26e1015de344352c4cd73deda28ad59594914067 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Mon, 12 Apr 2021 19:07:34 -0500
Subject: [PATCH 1/2] Convert TRUNCATE_REL_CONTEXT_* to bits

The values were defined as consecutive integers, but TRUNCATE_REL_CONTEXT_ONLY
was tested with a bitwise test.  If specified as bits, NORMAL vs ONLY is
redundant (there's only one bit).  And CASCADING was unused.

See also: 8ff1c94649f5c9184ac5f07981d8aea9dfd7ac19
---
 src/backend/commands/tablecmds.c | 9 +++--
 src/backend/replication/logical/worker.c | 4 ++--
 src/include/commands/tablecmds.h | 6 +-
 3 files changed, 6 insertions(+), 13 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 096a6f2891..e8215d8dc8 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1654,8 +1654,7 @@ ExecuteTruncate(TruncateStmt *stmt)
 
 		rels = lappend(rels, rel);
 		relids = lappend_oid(relids, myrelid);
-		relids_extra = lappend_int(relids_extra, (recurse ?
-  TRUNCATE_REL_CONTEXT_NORMAL :
+		relids_extra = lappend_int(relids_extra, (recurse ? 0 :
   TRUNCATE_REL_CONTEXT_ONLY));
 		/* Log this relation only if needed for logical decoding */
 		if (RelationIsLogicallyLogged(rel))
@@ -1704,8 +1703,7 @@ ExecuteTruncate(TruncateStmt *stmt)
 
 rels = lappend(rels, rel);
 relids = lappend_oid(relids, childrelid);
-relids_extra = lappend_int(relids_extra,
-		   TRUNCATE_REL_CONTEXT_CASCADING);
+relids_extra = lappend_int(relids_extra, 0);
 /* Log this relation only if needed for logical decoding */
 if (RelationIsLogicallyLogged(rel))
 	relids_logged = lappend_oid(relids_logged, childrelid);
@@ -1799,8 +1797,7 @@ ExecuteTruncateGuts(List *explicit_rels,
 truncate_check_activity(rel);
 rels = lappend(rels, rel);
 relids = lappend_oid(relids, relid);
-relids_extra = lappend_int(relids_extra,
-		   TRUNCATE_REL_CONTEXT_CASCADING);
+relids_extra = lappend_int(relids_extra, 0);
 /* Log this relation only if needed for logical decoding */
 if (RelationIsLogicallyLogged(rel))
 	relids_logged = lappend_oid(relids_logged, relid);
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index fb3ba5c415..986b634525 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1825,7 +1825,7 @@ apply_handle_truncate(StringInfo s)
 		remote_rels = lappend(remote_rels, rel);
 		rels = lappend(rels, rel->localrel);
 		relids = lappend_oid(relids, rel->localreloid);
-		relids_extra = lappend_int(relids_extra, TRUNCATE_REL_CONTEXT_NORMAL);
+		relids_extra = lappend_int(relids_extra, 0);
 		if (RelationIsLogicallyLogged(rel->localrel))
 			relids_logged = lappend_oid(relids_logged, rel->localreloid);
 
@@ -1864,7 +1864,7 @@ apply_handle_truncate(StringInfo s)
 rels = lappend(rels, childrel);
 part_rels = lappend(part_rels, childrel);
 relids = lappend_oid(relids, childrelid);
-relids_extra = lappend_int(relids_extra, TRUNCATE_REL_CONTEXT_CASCADING);
+relids_extra = lappend_int(relids_extra, 0);
 /* Log this relation only 

Re: Teaching users how they can get the most out of HOT in Postgres 14

2021-04-12 Thread Andres Freund
Hi,

On 2021-04-12 16:53:47 -0700, Peter Geoghegan wrote:
> On Mon, Apr 12, 2021 at 4:52 PM Michael Paquier  wrote:
> > While going through this commit a couple of days ago, I really got to
> > wonder why you are controlling this stuff with a hardcoded value and I
> > found that scary, while what you should be using are two GUCs with the
> > reloptions that come with the feature (?):
> > - A threshold, as an integer, to define a number of pages.
> > - A scale factor to define a percentage of pages.
> 
> Why?

Well, one argument is that you made a fairly significant behavioural
change, with hard-coded logic for when the optimization kicks in. It's
not at all clear that your constants are the right ones for every
workload. We'll likely only get to know whether they're right in > 1 year
- not having a real out at that point imo is somewhat scary.

That said, adding more and more reloptions has a significant cost, so I
don't think it's clear cut that it's the right decision to add
one. Perhaps vacuum_cleanup_index_scale_factor should just be reused for
BYPASS_THRESHOLD_PAGES?

Greetings,

Andres Freund




Re: wal stats questions

2021-04-12 Thread Fujii Masao




On 2021/03/30 20:37, Masahiro Ikeda wrote:

OK, I added the condition to the fast-return check. I noticed that I had
misunderstood: I thought the purpose was to avoid expanding a clock check using WAL
stats counters, but the purpose is to make the conditions stricter, right?


Yes. Currently if the following condition is false even when the WAL counters
are updated, nothing is sent to the stats collector. But with your patch,
in this case the WAL stats are sent.

if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
!have_function_stats && !disconnect)

Thanks for the patch! It no longer applies cleanly to master,
so could you rebase it?

pgstat_initialize() should initialize pgWalUsage with zero?

+   /*
+* set the counters related to generated WAL data.
+*/
+   WalStats.m_wal_records = pgWalUsage.wal_records;
+   WalStats.m_wal_fpi = pgWalUsage.wal_fpi;
+   WalStats.m_wal_bytes = pgWalUsage.wal_bytes;

This should be skipped if pgWalUsage.wal_records is zero?

+ * Be careful that the counters are cleared after reporting them to
+ * the stats collector although you can use WalUsageAccumDiff()
+ * to computing differences to previous values. For backends,
+ * the counters may be reset after a transaction is finished and
+ * pgstat_send_wal() is invoked, so you can compute the difference
+ * in the same transaction only.

On second thought, I'm afraid that this is likely to be a foot-gun
in the future. So I'm now wondering if the current approach (i.e., calculate
the diff between the current and previous pgWalUsage and don't reset it
to zero) is better. Thoughts? Since the similar data structure pgBufferUsage
doesn't have this kind of restriction, a developer could easily miss that
only pgWalUsage has it.

But currently the diff is calculated (i.e., WalUsageAccumDiff() is called)
even when the WAL counters are not updated. Then if that calculated diff is
zero, we skip sending the WAL stats. This logic should be improved. That is,
probably we should be able to check whether the WAL counters are updated
or not, without calculating the diff, because the calculation is not free.
We can implement this by introducing new flag variable that we discussed
upthread. This flag is set to true whenever the WAL counters are incremented
and used to determine whether the WAL stats need to be sent.

If we do this, another issue is that the data types for wal_records and wal_fpi
in pgWalUsage are long, which may lead to overflow. If so, they should be
replaced with uint64.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Have I found an interval arithmetic bug?

2021-04-12 Thread Bruce Momjian
On Mon, Apr 12, 2021 at 05:20:43PM -0700, Bryn Llewellyn wrote:
> I’d argue that the fact that this:
>
> ('0.3 months'::interval) + ('0.7 months'::interval)
>
> Is reported as '30 days' and not '1 month' is yet another
> bug—precisely because of what I said in my previous email (sorry
> that I forked the thread) where I referred to the fact that, in the
> right test, adding 1 month gets a different answer than adding 30
> days. 

Flowing _up_ is what these functions do:

	\df *justify*
	                               List of functions
	   Schema   |       Name       | Result data type | Argument data types | Type
	------------+------------------+------------------+---------------------+------
	 pg_catalog | justify_days     | interval         | interval            | func
	 pg_catalog | justify_hours    | interval         | interval            | func
	 pg_catalog | justify_interval | interval         | interval            | func
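
For example, a quick sketch of what "flowing up" looks like (results match
the documented behavior of these functions):

    SELECT justify_hours(interval '27 hours');          -- 1 day 03:00:00
    SELECT justify_days(interval '35 days');            -- 1 mon 5 days
    SELECT justify_interval(interval '1 mon -1 hour');  -- 29 days 23:00:00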

> Yet another convincing reason to get rid of this flow down
> business altogether.

We can certainly get rid of all downflow, which in the current patch is
only when fractional internal units are specified.

> If some application wants to model flow-down, then it can do so with
> trivial programming and full control over its own definition of the
> rules.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Have I found an interval arithmetic bug?

2021-04-12 Thread Bryn Llewellyn
> On 12-Apr-2021, at 17:00, Bruce Momjian  wrote:
> 
> On Mon, Apr 12, 2021 at 07:38:21PM -0400, Tom Lane wrote:
>> Bruce Momjian  writes:
>>> On Mon, Apr 12, 2021 at 03:09:48PM -0700, Bryn Llewellyn wrote:
 After all, you've bitten the bullet now and changed the behavior. This 
 means that the semantics of some extant applications will change. So... in 
 for a penny, in for a pound?
>> 
>>> The docs now say:
>> 
>>> Field values can have fractional parts;  for example, '1.5
>>> weeks' or '01:02:03.45'.  The fractional
>>> -->  parts are used to compute appropriate values for the next lower-order
>>> internal fields (months, days, seconds).
>> 
>>> meaning fractional years flows to the next lower internal unit, months,
>>> and no further.  Fractional months would flow to days.  The idea of not
>>> flowing past the next lower-order internal field is that the
>>> approximations between units are not precise enough to flow accurately.
>> 
>> Um, what's the argument for down-converting AT ALL?  The problem is
>> precisely that any such conversion is mostly fictional.
> 
> True.
> 
>>> With my patch, the output is now:
>> 
>>> SELECT INTERVAL '3 years 10.241604 months';
>>> interval
>>> 
>>>  3 years 10 mons 7 days
>> 
>>> It used to flow to seconds.
>> 
>> Yeah, that's better than before, but I don't see any principled argument
>> for it not to be "3 years 10 months", full stop.
> 
> Well, the case was:
> 
>   SELECT INTERVAL '0.1 months';
>interval
>   --
>3 days
>   
>   SELECT INTERVAL '0.1 months' + interval '0.9 months';
>?column?
>   --
>30 days
> 
> If you always truncate, you basically lose the ability to specify
> fractional internal units, which I think is useful.  I would say if you
> use fractional units of one of the internal units, you basically know
> you are asking for an approximation --- that is not true of '3.5
> years', for example.

I’d argue that the fact that this:

('0.3 months'::interval) + ('0.7 months'::interval)

Is reported as '30 days' and not '1 month' is yet another bug—precisely because 
of what I said in my previous email (sorry that I forked the thread) where I 
referred to the fact that, in the right test, adding 1 month gets a different 
answer than adding 30 days. Yet another convincing reason to get rid of this 
flow down business altogether.

If some application wants to model flow-down, then it can do so with trivial 
programming and full control over its own definition of the rules.





Re: Have I found an interval arithmetic bug?

2021-04-12 Thread Bryn Llewellyn
> t...@sss.pgh.pa.us wrote:
> 
> br...@momjian.us writes:
>> b...@yugabyte.com wrote:
>>> After all, you've bitten the bullet now and changed the behavior. This 
>>> means that the semantics of some extant applications will change. So... in 
>>> for a penny, in for a pound?
> 
>> The docs now say:
> 
>> Field values can have fractional parts;  for example, '1.5
>> weeks' or '01:02:03.45'.  The fractional
>> -->  parts are used to compute appropriate values for the next lower-order
>> internal fields (months, days, seconds).
> 
>> meaning fractional years flows to the next lower internal unit, months, and 
>> no further.  Fractional months would flow to days.  The idea of not flowing 
>> past the next lower-order internal field is that the approximations between 
>> units are not precise enough to flow accurately.
> 
> Um, what's the argument for down-converting AT ALL?  The problem is precisely 
> that any such conversion is mostly fictional.
> 
>> With my patch, the output is now:
> 
>>  SELECT INTERVAL '3 years 10.241604 months';
>>  interval
>>  
>>   3 years 10 mons 7 days
> 
>> It used to flow to seconds.
> 
> Yeah, that's better than before, but I don't see any principled argument for 
> it not to be "3 years 10 months", full stop.

Tom, I fully support your suggestion to have no flow down at all. Please may 
this happen! However, the new rule must be described in terms of the three 
fields of the internal representation: [months, days, seconds]. This 
representation is already documented.

Don’t forget that '731.42587 hours’ is read back as "731:25:33.132" (or, if you 
prefer, 731 hours, 25 minutes, and 33.132 seconds if you use "extract" and your 
own pretty print). But we don’t think of this as “flow down”. Rather, it’s just 
a conventional representation of the seconds field of the internal 
representation. I could go on. But you all know what I mean.

By the way, I made a nice little demo (for my doc project). It shows that:

(1) if you pick the right date-time just before a DST change, and do the 
addition in the right time zone, then adding 24 hours gets a different answer 
than adding one day.

(2) if you pick the right date-time just before 29-Feb in a leap year, then 
adding 30 days gets a different answer than adding one month.

You all know why. And though the doc could explain and illustrate this better, 
it does tell you to expect this. It also says that the difference in semantics 
that these examples show is the reason for the three-field internal 
representation.
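
A rough sketch of that demo (the timestamps and the zone are just
illustrative choices; exact output depends on server settings):

    SET timezone = 'America/Los_Angeles';
    -- (1) across the spring-forward gap, 24 hours != 1 day
    SELECT timestamptz '2021-03-13 20:00' + interval '24 hours';  -- 2021-03-14 21:00:00-07
    SELECT timestamptz '2021-03-13 20:00' + interval '1 day';     -- 2021-03-14 20:00:00-07
    -- (2) just before 29-Feb in a leap year, 30 days != 1 month
    SELECT date '2020-01-31' + interval '30 days';  -- 2020-03-01 00:00:00
    SELECT date '2020-01-31' + interval '1 month';  -- 2020-02-29 00:00:00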

It seems to me that both the age-old behavior that vanilla 13.2 exhibits, and 
the behavior in the regime of Bruce’s patch are like adding 2.2 oranges to 1.3 
oranges and getting 3 oranges and 21 apples (because 1 orange is conventionally 
the same as 42 apples). Bruce made a step in the right direction by stopping 
oranges from converting all the way down to bananas. But it would be so much better to 
get rid of this false equivalence business altogether.




Re: Have I found an interval arithmetic bug?

2021-04-12 Thread Bruce Momjian
On Mon, Apr 12, 2021 at 07:38:21PM -0400, Tom Lane wrote:
> Bruce Momjian  writes:
> > On Mon, Apr 12, 2021 at 03:09:48PM -0700, Bryn Llewellyn wrote:
> >> After all, you've bitten the bullet now and changed the behavior. This 
> >> means that the semantics of some extant applications will change. So... in 
> >> for a penny, in for a pound?
> 
> > The docs now say:
> 
> >  Field values can have fractional parts;  for example, '1.5
> >  weeks' or '01:02:03.45'.  The fractional
> > -->  parts are used to compute appropriate values for the next lower-order
> >  internal fields (months, days, seconds).
> 
> > meaning fractional years flows to the next lower internal unit, months,
> > and no further.  Fractional months would flow to days.  The idea of not
> > flowing past the next lower-order internal field is that the
> > approximations between units are not precise enough to flow accurately.
> 
> Um, what's the argument for down-converting AT ALL?  The problem is
> precisely that any such conversion is mostly fictional.

True.

> > With my patch, the output is now:
> 
> > SELECT INTERVAL '3 years 10.241604 months';
> > interval
> > 
> >  3 years 10 mons 7 days
> 
> > It used to flow to seconds.
> 
> Yeah, that's better than before, but I don't see any principled argument
> for it not to be "3 years 10 months", full stop.

Well, the case was:

SELECT INTERVAL '0.1 months';
 interval
--
 3 days

SELECT INTERVAL '0.1 months' + interval '0.9 months';
 ?column?
--
 30 days

If you always truncate, you basically lose the ability to specify
fractional internal units, which I think is useful.  I would say if you
use fractional units of one of the internal units, you basically know
you are asking for an approximation --- that is not true of '3.5
years', for example.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Teaching users how they can get the most out of HOT in Postgres 14

2021-04-12 Thread Peter Geoghegan
On Mon, Apr 12, 2021 at 4:52 PM Michael Paquier  wrote:
> While going through this commit a couple of days ago, I really got to
> wonder why you are controlling this stuff with a hardcoded value and I
> found that scary, while what you should be using are two GUCs with the
> reloptions that come with the feature (?):
> - A threshold, as an integer, to define a number of pages.
> - A scale factor to define a percentage of pages.

Why?


-- 
Peter Geoghegan




Re: Teaching users how they can get the most out of HOT in Postgres 14

2021-04-12 Thread Michael Paquier
On Mon, Apr 12, 2021 at 04:35:13PM -0700, Peter Geoghegan wrote:
> On Mon, Apr 12, 2021 at 4:30 PM Andres Freund  wrote:
> > As far as I can see there's no reasonable way to disable this
> > "optimization", which scares me.
> 
> I'm fine with adding a simple 'off' switch. What I'd like to avoid
> doing is making the behavior tunable, since it's likely to change in
> Postgres 15 and Postgres 16 anyway.

While going through this commit a couple of days ago, I really got to
wonder why you are controlling this stuff with a hardcoded value and I
found that scary, while what you should be using are two GUCs with the
reloptions that come with the feature (?):
- A threshold, as an integer, to define a number of pages.
- A scale factor to define a percentage of pages.

Also, I am a bit confused with the choice of BYPASS_THRESHOLD_PAGES as
parameter name.  For all the other parameters of autovacuum, we use
"threshold" for a fixed number of items, not a percentage of a given
item.
--
Michael


signature.asc
Description: PGP signature


Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

2021-04-12 Thread David Rowley
On Tue, 13 Apr 2021 at 11:42, Tom Lane  wrote:
>
> David Rowley  writes:
> > I realised when working on something unrelated last night that we can
> > also do hash lookups for NOT IN too.
>
> ... and still get the behavior right for nulls?

Yeah, it will. There are already some special cases for NULLs in the
IN version. Those would need to be adapted for NOT IN.

David




Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

2021-04-12 Thread Tom Lane
David Rowley  writes:
> I realised when working on something unrelated last night that we can
> also do hash lookups for NOT IN too.

... and still get the behavior right for nulls?

regards, tom lane




Re: Have I found an interval arithmetic bug?

2021-04-12 Thread Tom Lane
Bruce Momjian  writes:
> On Mon, Apr 12, 2021 at 03:09:48PM -0700, Bryn Llewellyn wrote:
>> After all, you've bitten the bullet now and changed the behavior. This means 
>> that the semantics of some extant applications will change. So... in for a 
>> penny, in for a pound?

> The docs now say:

>  Field values can have fractional parts;  for example, '1.5
>  weeks' or '01:02:03.45'.  The fractional
> -->  parts are used to compute appropriate values for the next lower-order
>  internal fields (months, days, seconds).

> meaning fractional years flows to the next lower internal unit, months,
> and no further.  Fractional months would flow to days.  The idea of not
> flowing past the next lower-order internal field is that the
> approximations between units are not precise enough to flow accurately.

Um, what's the argument for down-converting AT ALL?  The problem is
precisely that any such conversion is mostly fictional.

> With my patch, the output is now:

>   SELECT INTERVAL '3 years 10.241604 months';
>   interval
>   
>3 years 10 mons 7 days

> It used to flow to seconds.

Yeah, that's better than before, but I don't see any principled argument
for it not to be "3 years 10 months", full stop.

regards, tom lane




Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

2021-04-12 Thread David Rowley
On Fri, 9 Apr 2021 at 00:00, David Rowley  wrote:
> I push this with some minor cleanup from the v6 patch I posted earlier.

I realised when working on something unrelated last night that we can
also do hash lookups for NOT IN too.

We'd just need to check if the operator's negator operator is
hashable.  No new fields would need to be added to ScalarArrayOpExpr.
We'd just set the hashfuncid to the correct value and then set the
opfuncid to the negator function.  In the executor, we'd know to check
if the value is in the table or not in the table based on the useOr
value.
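
The NULL semantics that any hashed NOT IN implementation would have to
preserve are the usual ones, e.g.:

    SELECT 1 NOT IN (2, 3);     -- true
    SELECT 1 NOT IN (2, NULL);  -- null, not true
    SELECT 4 NOT IN (4, NULL);  -- false
    SELECT NULL NOT IN (2, 3);  -- null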

I'm not really sure whether lack of NOT IN support is going to be a
source of bug reports for PG14 or not.  If it was, then it might be
worth doing something about that for PG14.   Otherwise, we can just
leave it for future work for PG15 and beyond.  I personally don't have
any strong feelings either way, but I'm leaning towards just writing a
patch and thinking of pushing it sometime after we branch for PG15.

I've included the RMT, just in case they want to voice an opinion on that.

David




Re: Teaching users how they can get the most out of HOT in Postgres 14

2021-04-12 Thread Peter Geoghegan
On Mon, Apr 12, 2021 at 4:30 PM Andres Freund  wrote:
> As far as I can see there's no reasonable way to disable this
> "optimization", which scares me.

I'm fine with adding a simple 'off' switch. What I'd like to avoid
doing is making the behavior tunable, since it's likely to change in
Postgres 15 and Postgres 16 anyway.

-- 
Peter Geoghegan




Re: pg_upgrade check for invalid role-specific default config

2021-04-12 Thread Tom Lane
I wrote:
> Another answer is that maybe the processing of the "role" case
> in particular is just broken.

After digging around a bit more, I think that that is indeed the
right answer.  Most of the GUC check functions that have
database-state-dependent behavior are programmed to behave specially
when checking a proposed ALTER USER/DATABASE setting; but check_role
and check_session_authorization did not get that memo.  I also
noted that check_temp_buffers would throw an error for no very good
reason.  There don't look to be any other troublesome cases.
So I ended up with the attached.

It feels a bit unsatisfactory to have these various check functions
know about this explicitly.  However, I'm not sure that there's a
good way to centralize it.  Only the check function knows whether
the check it's making is immutable or dependent on DB state --- and
in the former case, not throwing an error wouldn't be an improvement.

Anyway, I'm inclined to think that we should not only apply this
but back-patch it.  Two complaints is enough to suggest we have
an issue.  Plus, I think there is a dump/reload ordering problem
here: as things stand, "alter user joe set role = bob" would work
or not work depending on the order the roles are created in
and/or the order privileges are granted in.

regards, tom lane

diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index c5cf08b423..0c85679420 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -767,6 +767,17 @@ check_session_authorization(char **newval, void **extra, GucSource source)
 	roleTup = SearchSysCache1(AUTHNAME, PointerGetDatum(*newval));
 	if (!HeapTupleIsValid(roleTup))
 	{
+		/*
+		 * When source == PGC_S_TEST, we don't throw a hard error for a
+		 * nonexistent user name, only a NOTICE.  See comments in guc.h.
+		 */
+		if (source == PGC_S_TEST)
+		{
+			ereport(NOTICE,
+	(errcode(ERRCODE_UNDEFINED_OBJECT),
+	 errmsg("role \"%s\" does not exist", *newval)));
+			return true;
+		}
 		GUC_check_errmsg("role \"%s\" does not exist", *newval);
 		return false;
 	}
@@ -837,10 +848,23 @@ check_role(char **newval, void **extra, GucSource source)
 			return false;
 		}
 
+		/*
+		 * When source == PGC_S_TEST, we don't throw a hard error for a
+		 * nonexistent user name or insufficient privileges, only a NOTICE.
+		 * See comments in guc.h.
+		 */
+
 		/* Look up the username */
 		roleTup = SearchSysCache1(AUTHNAME, PointerGetDatum(*newval));
 		if (!HeapTupleIsValid(roleTup))
 		{
+			if (source == PGC_S_TEST)
+			{
+ereport(NOTICE,
+		(errcode(ERRCODE_UNDEFINED_OBJECT),
+		 errmsg("role \"%s\" does not exist", *newval)));
+return true;
+			}
 			GUC_check_errmsg("role \"%s\" does not exist", *newval);
 			return false;
 		}
@@ -859,6 +883,14 @@ check_role(char **newval, void **extra, GucSource source)
 		if (!InitializingParallelWorker &&
 			!is_member_of_role(GetSessionUserId(), roleid))
 		{
+			if (source == PGC_S_TEST)
+			{
+ereport(NOTICE,
+		(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+		 errmsg("permission will be denied to set role \"%s\"",
+*newval)));
+return true;
+			}
 			GUC_check_errcode(ERRCODE_INSUFFICIENT_PRIVILEGE);
 			GUC_check_errmsg("permission denied to set role \"%s\"",
 			 *newval);
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index d0a51b507d..2ffefdf943 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -11777,8 +11777,9 @@ check_temp_buffers(int *newval, void **extra, GucSource source)
 {
 	/*
 	 * Once local buffers have been initialized, it's too late to change this.
+	 * However, if this is only a test call, allow it.
 	 */
-	if (NLocBuffer && NLocBuffer != *newval)
+	if (source != PGC_S_TEST && NLocBuffer && NLocBuffer != *newval)
 	{
 		GUC_check_errdetail("\"temp_buffers\" cannot be changed after any temporary tables have been accessed in the session.");
 		return false;


Re: psql - add SHOW_ALL_RESULTS option

2021-04-12 Thread Michael Paquier
On Mon, Apr 12, 2021 at 07:08:21PM +, Bossart, Nathan wrote:
> I think I've found another issue with this patch.  If AcceptResult()
> returns false in SendQueryAndProcessResults(), it seems to result in
> an infinite loop of "unexpected PQresultStatus" messages.  This can be
> reproduced by trying to run "START_REPLICATION" via psql.

Yes, that's another problem, and this causes an infinite loop where
we would just report one error previously :/
--
Michael


signature.asc
Description: PGP signature


Re: Teaching users how they can get the most out of HOT in Postgres 14

2021-04-12 Thread Andres Freund
Hi,

On 2021-04-12 16:11:59 -0700, Peter Geoghegan wrote:
> Recent work from commit 5100010e taught VACUUM that it doesn't have to
> do index vacuuming in cases where there are practically zero (not
> necessarily exactly zero) tuples to delete from indexes.

FWIW, I'd not at all be surprised if this causes some issues. Consider
cases where all index lookups are via bitmap scans (which does not
support killtuples) - if value ranges are looked up often the repeated
heap fetches can absolutely kill query performance. I've definitely had
to make autovacuum more aggressive for cases like this or schedule
manual vacuums, and now that's silently not good enough anymore. Yes, 2%
of the table isn't all that much, but often enough all the updates and
lookups concentrate in one value range.
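
For what it's worth, the per-table tuning alluded to above usually looks
something like this (a sketch; the table name and values are made up):

    ALTER TABLE hot_range_tab SET (autovacuum_vacuum_scale_factor = 0.01,
                                   autovacuum_vacuum_threshold = 1000);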

As far as I can see there's no reasonable way to disable this
"optimization", which scares me.

Greetings,

Andres Freund




Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

2021-04-12 Thread David Rowley
On Sun, 11 Apr 2021 at 10:38, Tomas Vondra
 wrote:
> I wonder what's the relationship between the length of the IN list and
> the minimum number of rows needed for the hash to start winning.

I made the attached spreadsheet which demonstrates the crossover point
using the costs that I coded into cost_qual_eval_walker().

It basically shows, for large arrays, that there are fairly
significant benefits to hashing for just 2 lookups and not hashing
only just wins for 1 lookup.  However, the cost model does not account
for allocating memory for the hash table, which is far from free.

You can adjust the number of items in the IN clause by changing the
value in cell B1. The values in B2 and B3 are what I saw the planner
set when I tested with both INT and TEXT types.

David


cost_comparison_hashed_vs_non-hashed_saops.ods
Description: application/vnd.oasis.opendocument.spreadsheet


Re: Have I found an interval arithmetic bug?

2021-04-12 Thread Bruce Momjian
On Mon, Apr 12, 2021 at 03:09:48PM -0700, Bryn Llewellyn wrote:
> I showed you all this example a long time ago:
> 
> select (
> '
>   3.853467 years
> '::interval
>   )::text as i;
> 
> This behavior is the same in the env. of Bruce’s patch as in unpatched PG 
> 13.2. This is the result.
> 
> 3 years 10 mons
> 
> Notice that "3.853467 years" is "3 years" plus "10.241604 months". This 
> explains the "10 mons" in the result. But the 0.241604 months remainder did 
> not spill down into days.
> 
> Can anybody defend this quirk? An extension of this example with a real 
> number of month in the user input is correspondingly yet more quirky. The 
> rules can be written down. But they’re too tortuos to allow an ordinary 
> mortal confidently to design code that relies on them.
> 
> (I was unable to find any rule statement that lets the user predict this in 
> the doc. But maybe that’s because of my feeble searching skills.)
> 
> If there is no defense (and I cannot imagine one) might Bruce’s patch 
> normalize this too to follow this rule:
> 
> — convert 'y years m months' to the real number y*12 + m.
> 
> — record trunc(y*12 + m) in the "months" field of the internal representation
> 
> — flow the remainder down to days (but no further)
> 
> After all, you've bitten the bullet now and changed the behavior. This means 
> that the semantics of some extant applications will change. So... in for a 
> penny, in for a pound?

The docs now say:

 Field values can have fractional parts;  for example, '1.5
 weeks' or '01:02:03.45'.  The fractional
-->  parts are used to compute appropriate values for the next lower-order
 internal fields (months, days, seconds).

meaning fractional years flows to the next lower internal unit, months,
and no further.  Fractional months would flow to days.  The idea of not
flowing past the next lower-order internal field is that the
approximations between units are not precise enough to flow accurately.

With my patch, the output is now:

SELECT INTERVAL '3 years 10.241604 months';
interval

 3 years 10 mons 7 days

It used to flow to seconds.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Teaching users how they can get the most out of HOT in Postgres 14

2021-04-12 Thread Peter Geoghegan
Recent work from commit 5100010e taught VACUUM that it doesn't have to
do index vacuuming in cases where there are practically zero (not
necessarily exactly zero) tuples to delete from indexes. It also
surfaces the information used to decide whether or not we skip index
vacuuming in the logs, via the log_autovacuum_min_duration mechanism.
This log output can be used to get a sense of how effective HOT is
over time.

There is one number of particular interest: the proportion of heap
pages that have one or more LP_DEAD items across successive VACUUMs
(this is expressed as a percentage of the table). The immediate reason
to expose this is that it is crucial to the skipping behavior from
commit 5100010e -- the threshold for skipping is 2% of all heap pages.
But that's certainly not the only reason to pay attention to the
percentage. It can also be used to understand HOT in general. It can
be correlated with workload spikes and stressors that tend to make HOT
less effective.

A number of interesting workload-specific patterns seem to emerge by
focussing on how this number changes/grows over time. I think that
this should be pointed out directly in the docs. What's more, it seems
like a good vehicle for discussing how HOT works in general. Why did
we never really get around to documenting HOT? There should at least
be some handling of how DBAs can get the most out of HOT through
monitoring and through tuning -- especially by lowering heap
fillfactor.
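
As a sketch of what that tuning and monitoring can look like (the
fillfactor value is only an example):

    -- leave free space on heap pages so more updates can stay HOT
    ALTER TABLE bmsql_order_line SET (fillfactor = 85);
    -- track the share of updates that were HOT over time
    SELECT n_tup_upd, n_tup_hot_upd,
           round(100.0 * n_tup_hot_upd / NULLIF(n_tup_upd, 0), 2) AS hot_update_pct
      FROM pg_stat_user_tables
     WHERE relname = 'bmsql_order_line';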

It's very hard to get all UPDATEs to use HOT. It's much easier to get
UPDATEs to mostly use HOT most of the time. How things change over
time seems crucially important.

I'll show one realistic example, just to give some idea of what it
might look like. This is output for 3 successive autovacuums against
the largest TPC-C table:

automatic vacuum of table "postgres.public.bmsql_order_line": index scans: 0
pages: 0 removed, 4668405 remain, 0 skipped due to pins, 696997 skipped frozen
tuples: 324571 removed, 186702373 remain, 333888 are dead but not yet
removable, oldest xmin: 7624965
buffer usage: 3969937 hits, 3931997 misses, 1883255 dirtied
index scan bypassed: 42634 pages from table (0.91% of total) have
324364 dead item identifiers
avg read rate: 62.469 MB/s, avg write rate: 29.920 MB/s
I/O Timings: read=42359.501 write=11867.903
system usage: CPU: user: 43.62 s, system: 38.17 s, elapsed: 491.74 s
WAL usage: 4586766 records, 1850599 full page images, 849931 bytes

automatic vacuum of table "postgres.public.bmsql_order_line": index scans: 0
pages: 0 removed, 5976391 remain, 0 skipped due to pins, 2516643 skipped frozen
tuples: 759956 removed, 239787517 remain, 1848431 are dead but not yet
removable, oldest xmin: 18489326
buffer usage: 3432019 hits, 3385757 misses, 2426571 dirtied
index scan bypassed: 103941 pages from table (1.74% of total) have
790233 dead item identifiers
avg read rate: 50.338 MB/s, avg write rate: 36.077 MB/s
I/O Timings: read=49252.721 write=17003.987
system usage: CPU: user: 45.86 s, system: 34.47 s, elapsed: 525.47 s
WAL usage: 5598040 records, 2274556 full page images, 10510281959 bytes

automatic vacuum of table "postgres.public.bmsql_order_line": index scans: 1
pages: 0 removed, 7483804 remain, 1 skipped due to pins, 4208295 skipped frozen
tuples: 972778 removed, 299295048 remain, 1970910 are dead but not yet
removable, oldest xmin: 30987445
buffer usage: 3384994 hits, 4593727 misses, 2891003 dirtied
index scan needed: 174243 pages from table (2.33% of total) had
1325752 dead item identifiers removed
index "bmsql_order_line_pkey": pages: 1250660 in total, 0 newly
deleted, 0 currently deleted, 0 reusable
avg read rate: 60.505 MB/s, avg write rate: 38.078 MB/s
I/O Timings: read=72881.986 write=21872.615
system usage: CPU: user: 65.24 s, system: 42.24 s, elapsed: 593.14 s
WAL usage: 6668353 records, 2684040 full page images, 12374536939 bytes

These autovacuums occur every 60-90 minutes with the workload in
question (with pretty aggressive autovacuum settings). We see that HOT
works rather well here -- but not so well that index vacuuming can be
avoided consistently, which happens in the final autovacuum (it has
"index scans: 1"). There was slow but steady growth in the percentage
of LP_DEAD-containing heap pages over time here, which is common
enough.
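
For the record, the percentages in those log lines are simply the number of
pages with LP_DEAD items divided by the total heap pages; the third run is
the one that crosses the 2% cutoff:

    SELECT round(100.0 *  42634 / 4668405, 2);   -- 0.91
    SELECT round(100.0 * 103941 / 5976391, 2);   -- 1.74
    SELECT round(100.0 * 174243 / 7483804, 2);   -- 2.33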

The point of HOT is not to avoid having to do index vacuuming, of
course -- that has it backwards. But framing HOT as doing work in
backends so autovacuum doesn't have to do similar work later on is a
good mental model to encourage users to adopt. There are also
significant advantages to reducing the effectiveness of HOT to this
one number -- HOT must be working well if it's close to 0%, almost
always below 2%, with the occasional aberration that sees it go up to
maybe 5%. But if it ever goes too high (in the absence of DELETEs),
you might have trouble on your hands. It might not go down again.

There are other interesting patterns from other tables within the same
database -- including 

Re: Curious test case added by collation version tracking patch

2021-04-12 Thread Tom Lane
Thomas Munro  writes:
> On Tue, Apr 13, 2021 at 8:59 AM Tom Lane  wrote:
>> The reason I ask is that this case started failing after I fixed
>> a parse_coerce.c bug that allowed a CollateExpr node to survive
>> in this WHERE expression, which by rights it should not.  I'm
>> inclined to think that the test case is wrong and should be removed,
>> but maybe there's some reason to have a variant of it.

> Indeed, this doesn't do anything useful, other than prove that we
> record a collation dependency where it is (uselessly) allowed in an
> expression.  Since you're not going to allow that anymore, it should
> be dropped.

OK, I'll go clean it up.  Thanks!

regards, tom lane




Re: Have I found an interval arithmetic bug?

2021-04-12 Thread Bryn Llewellyn
br...@momjian.us wrote:
> 
> z...@yugabyte.com wrote:
>> Among previous examples given by Bryn, the following produces correct result 
>> based on Bruce's patch.
>> 
>> # select interval '-1.7 years 29.4 months';
>> interval
>> 
>>  9 mons 12 days
> 
> Yes, that change is caused by the rounding fixes, and not by the unit 
> pushdown adjustments.

I showed you all this example a long time ago:

select (
'
  3.853467 years
'::interval
  )::text as i;

This behavior is the same in the env. of Bruce’s patch as in unpatched PG 13.2. 
This is the result.

3 years 10 mons

Notice that "3.853467 years" is "3 years" plus "10.241604 months". This 
explains the "10 mons" in the result. But the 0.241604 months remainder did not 
spill down into days.

Can anybody defend this quirk? An extension of this example with a real number 
of month in the user input is correspondingly yet more quirky. The rules can be 
written down. But they're too tortuous to allow an ordinary mortal confidently 
to design code that relies on them.

(I was unable to find any rule statement that lets the user predict this in the 
doc. But maybe that’s because of my feeble searching skills.)

If there is no defense (and I cannot imagine one) might Bruce’s patch normalize 
this too to follow this rule:

— convert 'y years m months' to the real number y*12 + m.

— record trunc(y*12 + m) in the "months" field of the internal representation

— flow the remainder down to days (but no further)
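
For instance, under that rule '3.853467 years' would come out as 3 years 10
months 7 days, roughly like this (a sketch of the arithmetic only, not of any
actual implementation):

    SELECT trunc(3.853467 * 12) AS months_field,                             -- 46 = 3 years 10 mons
           round((3.853467 * 12 - trunc(3.853467 * 12)) * 30) AS days_field; -- 7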

After all, you've bitten the bullet now and changed the behavior. This means 
that the semantics of some extant applications will change. So... in for a 
penny, in for a pound?



Re: Curious test case added by collation version tracking patch

2021-04-12 Thread Thomas Munro
On Tue, Apr 13, 2021 at 8:59 AM Tom Lane  wrote:
> I am wondering what was the intent of this test case added by commit
> 257836a75:
>
> CREATE INDEX icuidx16_mood ON collate_test(id) WHERE mood > 'ok' COLLATE 
> "fr-x-icu";
>
> where "mood" is of an enum type, which surely does not respond to
> collations.
>
> The reason I ask is that this case started failing after I fixed
> a parse_coerce.c bug that allowed a CollateExpr node to survive
> in this WHERE expression, which by rights it should not.  I'm
> inclined to think that the test case is wrong and should be removed,
> but maybe there's some reason to have a variant of it.

Indeed, this doesn't do anything useful, other than prove that we
record a collation dependency where it is (uselessly) allowed in an
expression.  Since you're not going to allow that anymore, it should
be dropped.




Re: pg_upgrade check for invalid role-specific default config

2021-04-12 Thread Tom Lane
I wrote:
> I'm not sure I buy the premise that "it is possible to write a query
> to identify these cases".  It seems to me that the general problem is
> that ALTER ROLE/DATABASE SET values might have become incorrect since
> they were installed and would thus fail when reloaded in dump/restore.
> We're not going to be able to prevent that in the general case, and
> it's not obvious to me what special case might be worth going after.

Actually, after thinking about that a bit more, this is a whole lot
like the issues we have with reloading function bodies and function
SET clauses.  The solution we've adopted for that is to allow dumps
to turn off validation via the check_function_bodies GUC.  Maybe there
should be a GUC to disable validation of ALTER ROLE/DATABASE SET values.
If you fat-finger a setting, you might not be able to log in, but you
couldn't log in in the old database either.

Another answer is that maybe the processing of the "role" case
in particular is just broken.  Compare the behavior here:

regression=# create role joe;
CREATE ROLE
regression=# alter role joe set role = 'notthere';
ERROR:  role "notthere" does not exist
regression=# alter role joe set default_text_search_config to 'notthere';
NOTICE:  text search configuration "notthere" does not exist
ALTER ROLE
# \drds
              List of settings
 Role | Database |              Settings
------+----------+--------------------------------------
 joe  |          | default_text_search_config=notthere

despite the fact that a direct SET fails:

regression=# set default_text_search_config to 'notthere';
ERROR:  invalid value for parameter "default_text_search_config": "notthere"

It's intentional that we don't throw a hard error for
default_text_search_config, because that would create
a problematic ordering dependency for pg_dump: the
desired config might just not have been reloaded yet.
Maybe the right answer here is that the processing of
"set role" in particular failed to get that memo.

regards, tom lane




Re: Allowing to create LEAKPROOF functions to non-superuser

2021-04-12 Thread Andres Freund
Hi,

On 2021-04-12 17:14:20 -0400, Tom Lane wrote:
> I doubt that falsely labeling a function LEAKPROOF can get you more
> than the ability to read data you're not supposed to be able to read
> ... but that ability is then available to all users, or at least all
> users who can execute the function in question.  So it definitely is a
> fairly serious security hazard, and one that's not well modeled by
> role labels.  If you give somebody e.g. pg_read_all_data privileges,
> you don't expect that that means they can give it to other users.

A user with BYPASSRLS can create public security definer functions
returning data. If the concern is a BYPASSRLS user intentionally
exposing data, then allowing them to define LEAKPROOF functions is not a
meaningful increase in risk.
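
A minimal sketch of that point (table and function names invented): a role
that can already read the protected data can republish it regardless of
LEAKPROOF:

    -- run as a BYPASSRLS role that can read the protected rows
    CREATE FUNCTION expose_rows() RETURNS SETOF protected_tab
        LANGUAGE sql SECURITY DEFINER
        AS 'SELECT * FROM protected_tab';
    GRANT EXECUTE ON FUNCTION expose_rows() TO PUBLIC;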

To me the more relevant concern is that it's hard to determine
LEAKPROOF-ness and that many use-cases for BYPASSRLS do not require the
target to have the technical chops to determine if a function actually
is leakproof. But that seems more an argument for providing a separate
control over allowing to specify LEAKPROOF than against separating it
from superuser.

Greetings,

Andres Freund




Re: pg_upgrade check for invalid role-specific default config

2021-04-12 Thread Tom Lane
Bruce Momjian  writes:
> On Mon, Apr 12, 2021 at 01:28:19PM +, Charlie Hornsby wrote:
>> While troubleshooting a failed upgrade from v11 -> v12 I realised I had
>> encountered a bug previously reported on the pgsql-bugs mailing list:
>> #14242 Role with a setconfig "role" setting to a nonexistent role causes
>> pg_upgrade to fail
>> https://www.postgresql.org/message-id/20160711223641.1426.86096%40wrigleys.postgresql.org
>> Since it is possible to write a query to identify these cases, would there
>> be appetite for me to submit a patch to add a check for this to 
>> pg_upgrade?

> Yes, I think a patch would be good, but the fix might be for pg_dump
> instead, which pg_upgrade uses.

I'm not sure I buy the premise that "it is possible to write a query
to identify these cases".  It seems to me that the general problem is
that ALTER ROLE/DATABASE SET values might have become incorrect since
they were installed and would thus fail when reloaded in dump/restore.
We're not going to be able to prevent that in the general case, and
it's not obvious to me what special case might be worth going after.

I do find it interesting that we now have two reports of somebody
doing "ALTER ROLE SET role = something".  In the older thread,
I was skeptical that that had any real use-case, so I wonder if
Charlie has a rationale for having done that.

regards, tom lane




Re: pg_upgrade check for invalid role-specific default config

2021-04-12 Thread Bruce Momjian
On Mon, Apr 12, 2021 at 01:28:19PM +, Charlie Hornsby wrote:
> Hi all,
> 
> While troubleshooting a failed upgrade from v11 -> v12 I realised I had
> encountered a bug previously reported on the pgsql-bugs mailing list:
> 
> #14242 Role with a setconfig "role" setting to a nonexistent role causes
> pg_upgrade to fail
> 
> https://www.postgresql.org/message-id/20160711223641.1426.86096%40wrigleys.postgresql.org

...

> Since it is possible to write a query to identify these cases, would there
> be appetite for me to submit a patch to add a check for this to 
> pg_upgrade?
> 
> First time mailing list user here so many apologies for any missteps I have
> made in this message.

Yes, I think a patch would be good, but the fix might be for pg_dump
instead, which pg_upgrade uses.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Allowing to create LEAKPROOF functions to non-superuser

2021-04-12 Thread Tom Lane
Andres Freund  writes:
> On 2021-04-12 23:51:02 +0300, Andrey Borodin wrote:
>> Do I risk having some extra superusers in my installation if I allow
>> everyone to create LEAKPROOF functions?

> I think that depends on what you define "superuser" to exactly
> be. Defining it as "has a path to executing arbitrary native code", I
> don't think, if implemented sensibly, allowing to set LEAKPROOF on new
> functions would equate superuser permissions. But you soon after might
> hit further limitations where lifting them would have such a risk,
> e.g. defining new types with in/out functions.

I think the issue here is more that superuser = "able to break the
security guarantees of the database".  I doubt that falsely labeling
a function LEAKPROOF can get you more than the ability to read data
you're not supposed to be able to read ... but that ability is then
available to all users, or at least all users who can execute the
function in question.  So it definitely is a fairly serious security
hazard, and one that's not well modeled by role labels.  If you
give somebody e.g. pg_read_all_data privileges, you don't expect
that that means they can give it to other users.

regards, tom lane




Re: Allowing to create LEAKPROOF functions to non-superuser

2021-04-12 Thread Andrey Borodin



> On 13 Apr 2021, at 00:01, Andres Freund  wrote:
> 
> Hi,
> 
> On 2021-04-12 23:51:02 +0300, Andrey Borodin wrote:
>> Do I risk having some extra superusers in my installation if I allow
>> everyone to create LEAKPROOF functions?
> 
> I think that depends on what you define "superuser" to exactly
> be. Defining it as "has a path to executing arbitrary native code", I
> don't think, if implemented sensibly, allowing to set LEAKPROOF on new
> functions would equate superuser permissions.
Thanks!


> But you soon after might
> hit further limitations where lifting them would have such a risk,
> e.g. defining new types with in/out functions.

I think, real extensibility of a managed DB service is a very distant challenge.
Currently we just allow-list extensions.

Best regards, Andrey Borodin.



Re: Allowing to create LEAKPROOF functions to non-superuser

2021-04-12 Thread Andres Freund
Hi,

On 2021-04-12 23:51:02 +0300, Andrey Borodin wrote:
> Do I risk having some extra superusers in my installation if I allow
> everyone to create LEAKPROOF functions?

I think that depends on what you define "superuser" to exactly
be. Defining it as "has a path to executing arbitrary native code", I
don't think, if implemented sensibly, allowing to set LEAKPROOF on new
functions would equate superuser permissions. But you soon after might
hit further limitations where lifting them would have such a risk,
e.g. defining new types with in/out functions.

Greetings,

Andres Freund




Re: Allowing to create LEAKPROOF functions to non-superuser

2021-04-12 Thread Andrey Borodin
Thanks, Tomas!

> On 12 Apr 2021, at 23:42, Tomas Vondra  wrote:
> 
> I guess for the cloud services it's not an issue - they're mostly
> concerned about manageability and restricting access to the OS.
In fact, we would happily give a client access to the OS too. It's the client's VM 
after all, and all the software is open source. But it opens a way to attack the 
control plane, which in turn opens a way for clients to attack each other. And we 
really do not want that.

Thanks!

Best regards, Andrey Borodin.



Curious test case added by collation version tracking patch

2021-04-12 Thread Tom Lane
I am wondering what was the intent of this test case added by commit
257836a75:

CREATE INDEX icuidx16_mood ON collate_test(id) WHERE mood > 'ok' COLLATE 
"fr-x-icu";

where "mood" is of an enum type, which surely does not respond to
collations.

The reason I ask is that this case started failing after I fixed
a parse_coerce.c bug that allowed a CollateExpr node to survive
in this WHERE expression, which by rights it should not.  I'm
inclined to think that the test case is wrong and should be removed,
but maybe there's some reason to have a variant of it.

regards, tom lane




Re: Allowing to create LEAKPROOF functions to non-superuser

2021-04-12 Thread Andres Freund
Hi,

On 2021-04-12 22:42:03 +0200, Tomas Vondra wrote:
> It's unfortunate that we tie this capability to being superuser,
> so maybe the right solution would be to introduce a separate role with
> this privilege?

Perhaps DB owner + BYPASSRLS would be enough?

Greetings,

Andres Freund




Re: Allowing to create LEAKPROOF functions to non-superuser

2021-04-12 Thread Andres Freund
Hi,

On 2021-04-12 16:37:01 -0400, Tom Lane wrote:
> Andrey Borodin  writes:
> > Currently only superuser is allowed to create LEAKPROOF functions
> > because leakproof functions can see tuples which have not yet been
> > filtered out by security barrier views or row level security
> > policies.
>
> Yeah.
>
> > But managed cloud services typically do not provide superuser roles.
>
> This is not a good argument for relaxing superuser requirements.

IDK. I may have been adjacent to people operating database-as-a-service
for too long, but ISTM there's decent reasons for (and also against) not
providing full superuser access. Even outside of managed services it
seems like a decent idea to split the "can execute native code" role
from the "administers an application" role. That reduces the impact a
bug in the application can incur.

There's certain things that are pretty intrinsically "can execute native
code", like defining new 'C' functions, arbitrary ALTER SYSTEM,
arbitrary file reads/writes, etc. Splitting them off from superuser is a
fool's errand. But it's not at all clear why adding LEAKPROOF to
functions falls into that category?

Greetings,

Andres Freund




Re: Allowing to create LEAKPROOF functions to non-superuser

2021-04-12 Thread Andrey Borodin
Thanks for so quick response, Tom!

> On 12 Apr 2021, at 23:37, Tom Lane  wrote:
> 
>> But managed cloud services typically do not provide superuser roles.
> 
> This is not a good argument for relaxing superuser requirements.
Ok, let's put aside question about relaxing requirements in upstream.

Do I risk having some extra superusers in my installation if I allow everyone 
to create LEAKPROOF functions?

Thanks!

Best regards, Andrey Borodin.



Re: Allowing to create LEAKPROOF functions to non-superuser

2021-04-12 Thread Tomas Vondra


On 4/12/21 10:37 PM, Tom Lane wrote:
> Andrey Borodin  writes:
>> Currently only superuser is allowed to create LEAKPROOF functions
>> because leakproof functions can see tuples which have not yet been
>> filtered out by security barrier views or row level security
>> policies.
> 
> Yeah.
> 
>> But managed cloud services typically do not provide superuser
>> roles.
> 
> This is not a good argument for relaxing superuser requirements.
> 

I guess for the cloud services it's not an issue - they're mostly
concerned about manageability and restricting access to the OS. It's
unfortunate that we tie this capability to being superuser, so maybe
the right solution would be to introduce a separate role with this
privilege?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PANIC: wrong buffer passed to visibilitymap_clear

2021-04-12 Thread Andres Freund
Hi,

On 2021-04-11 13:55:30 -0400, Tom Lane wrote:
> Either way, it's hard to argue that heap_update hasn't crossed the
> complexity threshold where it's impossible to maintain safely.  We
> need to simplify it.

Yea, I think we're well beyond that point. I can see a few possible
steps to wrangle the existing complexity into an easier to understand
shape:

- Rename heapam.c goto labels; they're useless for understanding what is
  happening.

- Move HeapTupleSatisfiesUpdate() call and the related branches
  afterwards into its own function.

- Move "temporarily mark it locked" branch into its own function. It's a
  minimal implementation of tuple locking, so it seems more than
  separate enough.

- Move the "store the new tuple" part into its own function (pretty much
  the critical section).

- Probably worth unifying the exit paths - there's a fair bit of
  duplication by now...

Half related:

- I think we might also need to do something about the proliferation of
  bitmaps in heap_update(). We now separately allocate 5 bitmapsets -
  that strikes me as fairly insane.


However, these would not really address the complexity in itself, just
make it easier to manage.

ISTM that a lot of the complexity is related to needing to retry (and
avoiding doing so unnecessarily), which in turn is related to avoiding
deadlocks. We actually know how to not need that to the same degree -
the (need_toast || newtupsize > pagefree) path locks the tuple and afterwards
has a lot more freedom. We obviously can't just always do that, due to
the WAL logging overhead.

I wonder if we could make that path avoid the WAL logging overhead. We
don't actually need a full blown tuple lock, potentially even with its
own multixact, here.

The relevant comment (in heap_lock_tuple()) says:
	/*
	 * XLOG stuff.  You might think that we don't need an XLOG record because
	 * there is no state change worth restoring after a crash.  You would be
	 * wrong however: we have just written either a TransactionId or a
	 * MultiXactId that may never have been seen on disk before, and we need
	 * to make sure that there are XLOG entries covering those ID numbers.
	 * Else the same IDs might be re-used after a crash, which would be
	 * disastrous if this page made it to disk before the crash.  Essentially
	 * we have to enforce the WAL log-before-data rule even in this case.
	 * (Also, in a PITR log-shipping or 2PC environment, we have to have XLOG
	 * entries for everything anyway.)
	 */

But I don't really think that doing full-blown WAL tuple-locking WAL
logging is really the right solution.

- The "next xid" concerns are at least as easily solved by WAL logging a
  distinct "highest xid assigned" record when necessary. Either by
  having a shared memory variable saying "latestLoggedXid" or such, or
  by having end-of-recovery advance nextXid to beyond what ExtendCLOG()
  extended to. That reduces the overhead to at most once-per-xact (and
  commonly smaller) or nothing, respectively.

- While there's obviously a good bit of simplicity ensuring that a
  replica is exactly the same ("Also, in a PITR log-shipping or 2PC
  environment ..."), we don't actually enforce that strictly anyway -
  so I am not sure why it's necessary to pay the price here.

But maybe I'm all wet here, I certainly haven't had enough coffee.

Greetings,

Andres Freund




Re: Allowing to create LEAKPROOF functions to non-superuser

2021-04-12 Thread Tom Lane
Andrey Borodin  writes:
> Currently only superuser is allowed to create LEAKPROOF functions because 
> leakproof functions can see tuples which have not yet been filtered out by 
> security barrier views or row level security policies.

Yeah.

> But managed cloud services typically do not provide superuser roles.

This is not a good argument for relaxing superuser requirements.

regards, tom lane




Allowing to create LEAKPROOF functions to non-superuser

2021-04-12 Thread Andrey Borodin
Hi hackers!

This thread continues the discussion of allowing certain capabilities to 
non-superusers; AFAIK the previous one was [0].

Currently only superuser is allowed to create LEAKPROOF functions because 
leakproof functions can see tuples which have not yet been filtered out by 
security barrier views or row level security policies.
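
For reference, a minimal example of the restriction as it stands today: a 
superuser can run this, while an ordinary role fails with an error along the 
lines of "only superuser can define a leakproof function".

CREATE FUNCTION is_positive(int) RETURNS boolean
    LANGUAGE sql IMMUTABLE STRICT LEAKPROOF
    AS 'SELECT $1 > 0';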

But managed cloud services typically do not provide superuser roles. I'm 
thinking about allowing the database owner, or someone with the BYPASSRLS flag, 
to create these functions. Or, perhaps, pg_read_all_data.

And I'm trying to figure out if there are any security implications. Consider a 
user who already has access to all user data in a DB and the ability to create 
LEAKPROOF functions. Can they gain a superuser role or access something else 
that is available only to a superuser?
Is it possible to relax requirements for the creator of LEAKPROOF functions in 
upstream Postgres?

I'll appreciate any comments. Thanks!


Best regards, Andrey Borodin.

[0] 
https://www.postgresql.org/message-id/flat/CACqFVBbx6PDq%2B%3DvHM0n78kHzn8tvOM-kGO_2q_q0zNAMT%2BTzdA%40mail.gmail.com



Re: Proposal for working on open source with PostgreSQL

2021-04-12 Thread Alvaro Herrera
On 2021-Apr-12, Laurenz Albe wrote:

> I couldn't see any detail information about the project in your proposal, 
> except
> that the project is called "plsample".  Is there more information somewhere?
> 
> If it is a procedural language as the name suggests, you probably don't have
> to modify PostgreSQL core code to make it work.

plsample is in src/test/modules/plsample.  Possible improvements are
handling of DO blocks, support for trigger functions, routine
validation.
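
For instance, DO-block handling would be about making something like this
work (the body is arbitrary here, since plsample is just a skeleton handler
that processes the source text):

DO LANGUAGE plsample $$
  whatever you like
$$;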

-- 
Álvaro Herrera39°49'30"S 73°17'W




Re: psql - add SHOW_ALL_RESULTS option

2021-04-12 Thread Alvaro Herrera
On 2021-Apr-12, Bossart, Nathan wrote:

> The following patch seems to resolve the issue, although I'll admit I
> haven't dug into this too deeply.  In any case, +1 for reverting the
> patch for now.

Please note that there's no "for now" about it -- if the patch is
reverted, the only way to get it back is to wait for PG15.  That's
undesirable.  A better approach is to collect all those bugs and get
them fixed.  There's plenty of time to do that.

I, for one, would prefer to see the feature repaired in this cycle.

-- 
Álvaro Herrera   Valdivia, Chile




Re: psql - add SHOW_ALL_RESULTS option

2021-04-12 Thread Bossart, Nathan
On 4/12/21, 9:25 AM, "Tom Lane"  wrote:
> Fabien COELHO  writes:
>>> Between this and the known breakage of control-C, it seems clear
>>> to me that this patch was nowhere near ready for prime time.
>>> I think shoving it in on the last day before feature freeze was
>>> ill-advised, and it ought to be reverted.  We can try again later.
>
>> The patch has been asleep for quite a while, and was resurrected, possibly
>> too late in the process. ISTM that fixing it for 14 is manageable,
>> but this is not my call.
>
> I just observed an additional issue that I assume was introduced by this
> patch, which is that psql's response to a server crash has gotten
> repetitive:
>
> regression=# CREATE VIEW v1(c1) AS (SELECT ('4' COLLATE "C")::INT FROM 
> generate_series(1, 10));
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
> The connection to the server was lost. Attempting reset: Failed.
> !?>
>
> I've never seen that before, and it's not because I don't see
> server crashes regularly.

I think I've found another issue with this patch.  If AcceptResult()
returns false in SendQueryAndProcessResults(), it seems to result in
an infinite loop of "unexpected PQresultStatus" messages.  This can be
reproduced by trying to run "START_REPLICATION" via psql.

The following patch seems to resolve the issue, although I'll admit I
haven't dug into this too deeply.  In any case, +1 for reverting the
patch for now.

diff --git a/src/bin/psql/common.c b/src/bin/psql/common.c
index 028a357991..abafd41763 100644
--- a/src/bin/psql/common.c
+++ b/src/bin/psql/common.c
@@ -1176,7 +1176,7 @@ SendQueryAndProcessResults(const char *query, double 
*pelapsed_msec, bool is_wat

 /* and switch to next result */
 result = PQgetResult(pset.db);
-continue;
+break;
 }

 /* must handle COPY before changing the current result */

Nathan



Re: Proposal for working on open source with PostgreSQL

2021-04-12 Thread Laurenz Albe
On Mon, 2021-04-12 at 23:26 +0530, Nandni Mehla wrote:
> I'm Nandni Mehla, a sophomore currently pursuing B.Tech in IT from Indira 
> Gandhi
>  Delhi Technical University for Women, Delhi. I've recently started working on
>  open source and I think I will be a positive addition to your organization 
> for
>  working on projects using C and SQL, as I have experience in these, and I am
>  willing to learn more from you.
> I am attaching my proposal in this email for your reference, please guide me 
> through this.
> Regards.
> 
> https://docs.google.com/document/d/1H84WmzZbMERPrjsnXbvoQ7W2AaKsM8eJU02SNw7vQBk/edit?usp=sharing

Thanks for your willingness to help with PostgreSQL!

I couldn't see any detailed information about the project in your proposal, except
that the project is called "plsample".  Is there more information somewhere?

If it is a procedural language as the name suggests, you probably don't have
to modify PostgreSQL core code to make it work.
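
A procedural language plugs into the server through a handler function
supplied by an extension, so it can be developed entirely out of core.
Roughly (the names here are only illustrative):

CREATE FUNCTION plfoo_call_handler() RETURNS language_handler
    AS 'MODULE_PATHNAME', 'plfoo_call_handler' LANGUAGE C;
CREATE LANGUAGE plfoo HANDLER plfoo_call_handler;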

Yours,
Laurenz Albe





Re: [PATCH] Identify LWLocks in tracepoints

2021-04-12 Thread Andres Freund
Hi,

On 2021-04-12 14:31:32 +0800, Craig Ringer wrote:
> * There is no easy way to look up the tranche name by ID from outside the
> backend

But it's near trivial to add that.


> It's annoying that we have to pay the cost of computing the tranche name
> though. It never used to matter, but now that T_NAME() expands to
> GetLWTrancheName() calls as of 29c3e2dd5a6 it's going to cost a little more
> on such a hot path. I might see if I can do a little comparison and see how
> much.  I could add TRACE_POSTGRESQL_<>_ENABLED() guards since 
> we
> do in fact build with SDT semaphore support. That adds a branch for each
> tracepoint, but they're already marshalling arguments and making a function
> call that does lots more than a single branch, so that seems pretty
> sensible.

I am against adding any overhead for this feature. I honestly think the
probes we have right now in postgres do not provide a commensurate
benefit.


> (It'd be possible to instead generate probes for each GUC at compile-time
> using the preprocessor and the DTRACE_ macros. But as noted above, that
> doesn't currently work properly in the same compilation unit that a dtrace
> script-generated probes.h is included in. I think it's probably nicer to
> have specific probes for GUCs of high interest, then generic probes that
> capture all GUCs anyway.)
>
> There are a TON of probes I want to add, and I have a tree full of them
> waiting to submit progressively. Yes, ability to probe all GUCs is in
> there. So is detail on walsender, reorder buffer, and snapshot builder
> activity. Heavyweight lock SDTs. A probe that identifies the backend type
> at startup. SDT probe events emitted for every wait-event. Probes in elog.c
> to let probes observe error unwinding, capture error messages,
> etc. [...] A probe that fires whenever debug_query_string
> changes. Lots. But I can't submit them all at once, especially without
> some supporting use cases and scripts that other people can use so
> they can understand why these probes are useful.

-1. This is not scalable. Adding static probes all over has both a
runtime (L1I, branches, code optimization) and maintenance overhead.


> (Those can also be used with systemtap guru mode scripts to do things
> like turn a particular elog(DEBUG) into a PANIC at runtime for
> diagnostic purposes).

Yikes.

Greetings,

Andres Freund




Re: PANIC: wrong buffer passed to visibilitymap_clear

2021-04-12 Thread Peter Geoghegan
On Mon, Apr 12, 2021 at 9:19 AM Tom Lane  wrote:
> So I think we have to stick with the current basic design, and just
> tighten things up to make sure that visibility pins are accounted for
> in the places that are missing it.
>
> Hence, I propose the attached.  It passes check-world, but that proves
> absolutely nothing of course :-(.  I wonder if there is any way to
> exercise these code paths deterministically.

This approach seems reasonable to me. At least you've managed to
structure the visibility map page pin check as concomitant with the
existing space recheck.

> (I have realized BTW that I was exceedingly fortunate to reproduce
> the buildfarm report here --- I have run hundreds of additional
> cycles of the same test scenario without getting a second failure.)

In the past I've had luck with RR's chaos mode (most notably with the
Jepsen SSI bug). That didn't work for me here, though I might just
have not persisted with it for long enough. I should probably come up
with a shell script that runs the same thing hundreds of times or more
in chaos mode, while making sure that useless recordings don't
accumulate.

The feature is described here:

https://robert.ocallahan.org/2016/02/introducing-rr-chaos-mode.html

You only have to be lucky once. Once that happens, you're left with a
recording to review and re-review at your leisure. This includes all
Postgres backends, maybe even pg_regress and other scaffolding (if
that's what you're after).

But that's for debugging, not testing. The only way that we'll ever be
able to test stuff like this is with something like Alexander
Korotkov's stop events patch [1]. That infrastructure should be added
sooner rather than later.

[1] 
https://postgr.es/m/capphfdtseohx8dsk9qp+z++i4bgqoffkip6jdwngea+g7z-...@mail.gmail.com
--
Peter Geoghegan




Proposal for working on open source with PostgreSQL

2021-04-12 Thread Nandni Mehla
Hello Sir/Madam,
I'm Nandni Mehla, a sophomore currently pursuing B.Tech in IT from Indira
Gandhi Delhi Technical University for Women, Delhi. I've recently started
working on open source and I think I will be a positive addition to
your organization for working on projects using C and SQL, as I have
experience in these, and I am willing to learn more from you.
I am attaching my proposal in this email for your reference, please guide
me through this.
Regards.

https://docs.google.com/document/d/1H84WmzZbMERPrjsnXbvoQ7W2AaKsM8eJU02SNw7vQBk/edit?usp=sharing


Re: Uninitialized scalar variable (UNINIT) (src/backend/statistics/extended_stats.c)

2021-04-12 Thread Justin Pryzby
On Mon, Apr 12, 2021 at 01:55:13PM -0300, Ranier Vilela wrote:
> Em seg., 12 de abr. de 2021 às 03:04, Tom Lane  escreveu:
> > Michael Paquier  writes:
> > > On Sun, Apr 11, 2021 at 07:42:20PM -0300, Ranier Vilela wrote:
> > >> Em dom., 11 de abr. de 2021 às 16:25, Justin Pryzby <
> > pry...@telsasoft.com>
> > >>> I think it's cleanest to write:
> > >>> |HeapTupleData tmptup = {0};
> >
> > > I agree that this would be cleaner.
> >
> > It would be wrong, though, or at least not have the same effect.
> >
> I think that you speak about fill pointers with 0 is not the same as fill
> pointers with NULL.
> 
> 
> > ItemPointerSetInvalid does not set the target to all-zeroes.
> >
> ItemPointerSetInvalid set or not set the target to all-zeroes?

I think Tom means that it does:
BlockIdSet(&((pointer)->ip_blkid), InvalidBlockNumber),
(pointer)->ip_posid = InvalidOffsetNumber

but it's not zero, as I thought:

src/include/storage/block.h:#define InvalidBlockNumber  ((BlockNumber) 0xFFFFFFFF)

> > (Regardless of that detail, it's generally best to accomplish
> > objective X in the same way that existing code does.  Deciding
> > that you have a better way is often wrong, and even if you
> > are right, you should then submit a patch to change all the
> > existing cases.)

FYI, I'd gotten the idea from here:

$ git grep 'HeapTupleData.*='
src/backend/executor/execTuples.c:  HeapTupleData tuple = {0};

-- 
Justin




Re: Uninitialized scalar variable (UNINIT) (src/backend/statistics/extended_stats.c)

2021-04-12 Thread Tom Lane
Ranier Vilela  writes:
> Em seg., 12 de abr. de 2021 às 03:04, Tom Lane  escreveu:
>> It would be wrong, though, or at least not have the same effect.

> I think that you speak about fill pointers with 0 is not the same as fill
> pointers with NULL.

No, I mean that InvalidBlockNumber isn't 0.

> I was confused here, does the patch follow the pattern and fix the problem
> or not?

Your patch seems fine.  Justin's proposed improvement isn't.

(I'm not real sure whether there's any *actual* bug here --- would we
really be looking at either ctid or tableoid of this temporary tuple?
But it's probably best to ensure that they're valid anyway.)

regards, tom lane




Re: Uninitialized scalar variable (UNINIT) (src/backend/statistics/extended_stats.c)

2021-04-12 Thread Tomas Vondra



On 4/12/21 6:55 PM, Ranier Vilela wrote:
> 
> 
> Em seg., 12 de abr. de 2021 às 03:04, Tom Lane  > escreveu:
> 
> Michael Paquier mailto:mich...@paquier.xyz>>
> writes:
> > On Sun, Apr 11, 2021 at 07:42:20PM -0300, Ranier Vilela wrote:
> >> Em dom., 11 de abr. de 2021 às 16:25, Justin Pryzby
> mailto:pry...@telsasoft.com>>
> >>> I think it's cleanest to write:
> >>> |HeapTupleData tmptup = {0};
> 
> > I agree that this would be cleaner.
> 
> It would be wrong, though, or at least not have the same effect.
> 
> I think that you speak about fill pointers with 0 is not the same as
> fill pointers with NULL.
>  
> 
> ItemPointerSetInvalid does not set the target to all-zeroes.
> 
> ItemPointerSetInvalid set or not set the target to all-zeroes?
> 

Not sure what exactly are you asking about? What Tom said is that if you
do 'struct = {0}' it sets all fields to 0, but we only want/need to set
the t_self/t_tableOid fields to 0.

> 
> (Regardless of that detail, it's generally best to accomplish
> objective X in the same way that existing code does.  Deciding
> that you have a better way is often wrong, and even if you
> are right, you should then submit a patch to change all the
> existing cases.)
> 
> I was confused here, does the patch follow the pattern and fix the
> problem or not?
> 

I believe it does, and it's doing it in the same way as most other
similar places.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Core dump happens when execute sql CREATE VIEW v1(c1) AS (SELECT ('4' COLLATE "C")::INT FROM generate_series(1, 10));

2021-04-12 Thread Tom Lane
Yulin PEI  writes:
> I found it could cause a crash when executing sql statement: `CREATE VIEW 
> v1(c1) AS (SELECT ('4' COLLATE "C")::INT FROM generate_series(1, 10)); ` in 
> postgres 13.2 release.

Nice catch.  I don't think the code in DefineVirtualRelation is wrong:
exprCollation shouldn't report any collation for an expression of a
non-collatable type.  Rather the problem is with an old kluge in
coerce_type(), which will push a type coercion underneath a CollateExpr
... without any mind for the possibility that the coercion result isn't
collatable.  So the right fix is more or less the attached.

regards, tom lane

diff --git a/src/backend/parser/parse_coerce.c b/src/backend/parser/parse_coerce.c
index d5310f27db..228ee8e7d6 100644
--- a/src/backend/parser/parse_coerce.c
+++ b/src/backend/parser/parse_coerce.c
@@ -95,6 +95,7 @@ coerce_to_target_type(ParseState *pstate, Node *expr, Oid exprtype,
 	 * *must* know that to avoid possibly calling hide_coercion_node on
 	 * something that wasn't generated by coerce_type.  Note that if there are
 	 * multiple stacked CollateExprs, we just discard all but the topmost.
+	 * Also, if the target type isn't collatable, we discard the CollateExpr.
 	 */
 	origexpr = expr;
 	while (expr && IsA(expr, CollateExpr))
@@ -114,7 +115,7 @@ coerce_to_target_type(ParseState *pstate, Node *expr, Oid exprtype,
 ccontext, cformat, location,
 (result != expr && !IsA(result, Const)));
 
-	if (expr != origexpr)
+	if (expr != origexpr && type_is_collatable(targettype))
 	{
 		/* Reinstall top CollateExpr */
 		CollateExpr *coll = (CollateExpr *) origexpr;
@@ -384,7 +385,7 @@ coerce_type(ParseState *pstate, Node *node,
 		if (result)
 			return result;
 	}
-	if (IsA(node, CollateExpr))
+	if (IsA(node, CollateExpr) && type_is_collatable(targetTypeId))
 	{
 		/*
 		 * If we have a COLLATE clause, we have to push the coercion


Re: Uninitialized scalar variable (UNINIT) (src/backend/statistics/extended_stats.c)

2021-04-12 Thread Ranier Vilela
Em seg., 12 de abr. de 2021 às 03:04, Tom Lane  escreveu:

> Michael Paquier  writes:
> > On Sun, Apr 11, 2021 at 07:42:20PM -0300, Ranier Vilela wrote:
> >> Em dom., 11 de abr. de 2021 às 16:25, Justin Pryzby <
> pry...@telsasoft.com>
> >>> I think it's cleanest to write:
> >>> |HeapTupleData tmptup = {0};
>
> > I agree that this would be cleaner.
>
> It would be wrong, though, or at least not have the same effect.
>
I think you mean that filling pointers with 0 is not the same as filling
pointers with NULL.


> ItemPointerSetInvalid does not set the target to all-zeroes.
>
Does ItemPointerSetInvalid set the target to all-zeroes or not?


> (Regardless of that detail, it's generally best to accomplish
> objective X in the same way that existing code does.  Deciding
> that you have a better way is often wrong, and even if you
> are right, you should then submit a patch to change all the
> existing cases.)
>
I was confused here, does the patch follow the pattern and fix the problem
or not?

regards,
Ranier Vilela


Re: Contribution to PostgreSQL - please give an advice

2021-04-12 Thread Justin Pryzby
On Mon, Apr 12, 2021 at 03:21:41PM +, Ian Zagorskikh wrote:
> I would like to contribute my time and efforts to the PostgreSQL project
> development. I have some [hope not too bad] experience in software
> development primarily for Linux/BSD/Windows platforms with C/C++ though
> almost no experience in RDBMS internals. I have read the "Development
> Information" section in wiki and checked the official TODO list. It's
> really great but it's a bit huge and "all is so tasty" so I don't know
> where to start. Can you please advise me how to proceed? Maybe there are
> active tasks or bugs or issues that require man power? I would appreciate
> any advice or mentorship. Thank you!

I think the best way is to start following this mailing list.
Since it's very "busy", it may be easier to follow the web archives.
https://www.postgresql.org/list/pgsql-hackers/2021-04/

When someone reports a problem, you can try to reproduce it to see if they've
provided enough information to confirm the issue, or test any proposed patch.

You can see the patches being proposed for future release:
https://commitfest.postgresql.org/

However, right now, we've just passed the "feature freeze" for v14, so new
development is on hold for a while.  What's most useful is probably testing the
changes that have been committed.  You can check that everything works as
described, that the implemented behavior doesn't have any rough edges, that the
features work together well, and work under your own use-cases/workloads.

You can see a list of commits for v14 like so:
> git log --cherry-pick origin/REL_13_STABLE...origin/master
(Any commits that show up twice are also in v13, so aren't actually "new in
v14", but the patches differ so GIT couldn't figure that out)

-- 
Justin




Re: psql - add SHOW_ALL_RESULTS option

2021-04-12 Thread Tom Lane
Fabien COELHO  writes:
>> Between this and the known breakage of control-C, it seems clear
>> to me that this patch was nowhere near ready for prime time.
>> I think shoving it in on the last day before feature freeze was
>> ill-advised, and it ought to be reverted.  We can try again later.

> The patch has been asleep for quite a while, and was resurrected, possibly 
> too late in the process. ISTM that fixing it for 14 is manageable, 
> but this is not my call.

I just observed an additional issue that I assume was introduced by this
patch, which is that psql's response to a server crash has gotten
repetitive:

regression=# CREATE VIEW v1(c1) AS (SELECT ('4' COLLATE "C")::INT FROM 
generate_series(1, 10));
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
The connection to the server was lost. Attempting reset: Failed.
!?> 

I've never seen that before, and it's not because I don't see
server crashes regularly.

regards, tom lane




Re: PANIC: wrong buffer passed to visibilitymap_clear

2021-04-12 Thread Tom Lane
Peter Geoghegan  writes:
> On Sun, Apr 11, 2021 at 11:16 AM Tom Lane  wrote:
>> It wasn't very clear, because I hadn't thought it through very much;
>> but what I'm imagining is that we discard most of the thrashing around
>> all-visible rechecks and have just one such test somewhere very late
>> in heap_update, after we've successfully acquired a target buffer for
>> the update and are no longer going to possibly need to release any
>> buffer lock.  If at that one point we see the page is all-visible
>> and we don't have the vmbuffer, then we have to release all our locks
>> and go back to "l2".  Which is less efficient than some of the existing
>> code paths, but given how hard this problem is to reproduce, it seems
>> clear that optimizing for the occurrence is just not worth it.

> Oh! That sounds way better.

After poking at this for awhile, it seems like it won't work very nicely.
The problem is that once we've invoked the toaster, we really don't want
to just abandon that work; we'd leak any toasted out-of-line data that
was created.

So I think we have to stick with the current basic design, and just
tighten things up to make sure that visibility pins are accounted for
in the places that are missing it.

Hence, I propose the attached.  It passes check-world, but that proves
absolutely nothing of course :-(.  I wonder if there is any way to
exercise these code paths deterministically.

(I have realized BTW that I was exceedingly fortunate to reproduce
the buildfarm report here --- I have run hundreds of additional
cycles of the same test scenario without getting a second failure.)

regards, tom lane

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 03d4abc938..265b31e981 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3784,7 +3784,7 @@ l2:
 		 * overhead would be unchanged, that doesn't seem necessarily
 		 * worthwhile.
 		 */
-		if (PageIsAllVisible(BufferGetPage(buffer)) &&
+		if (PageIsAllVisible(page) &&
 			visibilitymap_clear(relation, block, vmbuffer,
 VISIBILITYMAP_ALL_FROZEN))
 			cleared_all_frozen = true;
@@ -3846,36 +3846,46 @@ l2:
 		 * first".  To implement this, we must do RelationGetBufferForTuple
 		 * while not holding the lock on the old page, and we must rely on it
 		 * to get the locks on both pages in the correct order.
+		 *
+		 * Another consideration is that we need visibility map page pin(s) if
+		 * we will have to clear the all-visible flag on either page.  If we
+		 * call RelationGetBufferForTuple, we rely on it to acquire any such
+		 * pins; but if we don't, we have to handle that here.  Hence we need
+		 * a loop.
 		 */
-		if (newtupsize > pagefree)
-		{
-			/* Assume there's no chance to put heaptup on same page. */
-			newbuf = RelationGetBufferForTuple(relation, heaptup->t_len,
-			   buffer, 0, NULL,
-			   &vmbuffer_new, &vmbuffer);
-		}
-		else
+		for (;;)
 		{
+			if (newtupsize > pagefree)
+			{
+/* It doesn't fit, must use RelationGetBufferForTuple. */
+newbuf = RelationGetBufferForTuple(relation, heaptup->t_len,
+   buffer, 0, NULL,
+   &vmbuffer_new, &vmbuffer);
+/* We're all done. */
+break;
+			}
+			/* Acquire VM page pin if needed and we don't have it. */
+			if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
+visibilitymap_pin(relation, block, &vmbuffer);
 			/* Re-acquire the lock on the old tuple's page. */
 			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 			/* Re-check using the up-to-date free space */
 			pagefree = PageGetHeapFreeSpace(page);
-			if (newtupsize > pagefree)
+			if (newtupsize > pagefree ||
+(vmbuffer == InvalidBuffer && PageIsAllVisible(page)))
 			{
 /*
- * Rats, it doesn't fit anymore.  We must now unlock and
- * relock to avoid deadlock.  Fortunately, this path should
- * seldom be taken.
+ * Rats, it doesn't fit anymore, or somebody just now set the
+ * all-visible flag.  We must now unlock and loop to avoid
+ * deadlock.  Fortunately, this path should seldom be taken.
  */
 LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
-newbuf = RelationGetBufferForTuple(relation, heaptup->t_len,
-   buffer, 0, NULL,
-   &vmbuffer_new, &vmbuffer);
 			}
 			else
 			{
-/* OK, it fits here, so we're done. */
+/* We're all done. */
 newbuf = buffer;
+break;
 			}
 		}
 	}
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 37a1be4114..ffc89685bf 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -293,9 +293,13 @@ RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
  *	happen if space is freed in that page after heap_update finds there's not
  *	enough there).  In that case, the page will be pinned and locked only once.
  *
- *	For the vmbuffer and vmbuffer_other arguments, we avoid deadlock by
- *	locking them only after locking the corresponding heap page, and taking
- *	no 

Re: Have I found an interval arithmetic bug?

2021-04-12 Thread Bruce Momjian
On Sun, Apr 11, 2021 at 07:33:34PM -0700, Zhihong Yu wrote:
> Among previous examples given by Bryn, the following produces correct result
> based on Bruce's patch.
> 
> # select interval '-1.7 years 29.4 months';
>     interval
> 
>  9 mons 12 days

Yes, that change is caused by the rounding fixes, and not by the unit
pushdown adjustments.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Contribution to PostgreSQL - please give an advice

2021-04-12 Thread Pavel Stehule
Hi

po 12. 4. 2021 v 17:22 odesílatel Ian Zagorskikh  napsal:

> Hi all!
>
> I would like to contribute my time and efforts to the PostgreSQL project
> development. I have some [hope not too bad] experience in software
> development primarily for Linux/BSD/Windows platforms with C/C++ though
> almost no experience in RDBMS internals. I have read the "Development
> Information" section in wiki and checked the official TODO list. It's
> really great but it's a bit huge and "all is so tasty" so I don't know
> where to start. Can you please advise me how to proceed? Maybe there are
> active tasks or bugs or issues that require man power? I would appreciate
> any advice or mentorship. Thank you!
>

It is great - any hands and eyes are welcome.

I think the best start now (when you do not have your own topic) is to review
one or more patches from the commitfest application:

https://commitfest.postgresql.org/33/

Regards

Pavel



> --
> Best Regards,
> Ian Zagorskikh
>


Re: Replication slot stats misgivings

2021-04-12 Thread vignesh C
On Mon, Apr 12, 2021 at 7:03 PM Masahiko Sawada  wrote:
>
> On Mon, Apr 12, 2021 at 9:36 PM Amit Kapila  wrote:
> >
> > On Mon, Apr 12, 2021 at 5:29 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Mon, Apr 12, 2021 at 8:08 PM Amit Kapila  
> > > wrote:
> > > >
> > > > On Mon, Apr 12, 2021 at 4:34 PM Masahiko Sawada  
> > > > wrote:
> > > > >
> > > > > On Mon, Apr 12, 2021 at 6:19 PM Amit Kapila  
> > > > > wrote:
> > > > > >
> > > > > > On Mon, Apr 12, 2021 at 10:27 AM Masahiko Sawada 
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Sat, Apr 10, 2021 at 9:53 PM Amit Kapila 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > It seems Vignesh has changed patches based on the latest set of
> > > > > > > > comments so you might want to rebase.
> > > > > > >
> > > > > > > I've merged my patch into the v6 patch set Vignesh submitted.
> > > > > > >
> > > > > > > I've attached the updated version of the patches. I didn't change
> > > > > > > anything in the patch that changes char[NAMEDATALEN] to NameData 
> > > > > > > (0001
> > > > > > > patch) and patches that add tests.
> > > > > > >
> > > > > >
> > > > > > I think we can push 0001. What do you think?
> > > > >
> > > > > +1
> > > > >
> > > > > >
> > > > > > > In 0003 patch I reordered the
> > > > > > > output parameters of pg_stat_replication_slots; showing total 
> > > > > > > number
> > > > > > > of transactions and total bytes followed by statistics for 
> > > > > > > spilled and
> > > > > > > streamed transactions seems appropriate to me.
> > > > > > >
> > > > > >
> > > > > > I am not sure about this because I think we might want to add some
> > > > > > info of stream/spill bytes in total_bytes description (something 
> > > > > > like
> > > > > > stream/spill bytes are not in addition to total_bytes).
> > >
> > > BTW doesn't it confuse users that stream/spill bytes are not in
> > > addition to total_bytes? User will need to do "total_bytes +
> > > spill/stream_bytes" to know the actual total amount of data sent to
> > > the decoding output plugin, is that right?
> > >
> >
> > No, total_bytes includes the spill/stream bytes. So, the user doesn't
> > need to do any calculation to compute totel_bytes sent to output
> > plugin.
>
> The following test for the latest v8 patch seems to show different.
> total_bytes is 1808 whereas spill_bytes is 1320. Am I missing
> something?

I will check this issue and post my analysis.

Regards,
Vignesh




Core dump happens when execute sql CREATE VIEW v1(c1) AS (SELECT ('4' COLLATE "C")::INT FROM generate_series(1, 10));

2021-04-12 Thread Yulin PEI
HI hackers,
I found it could cause a crash when executing sql statement: `CREATE VIEW 
v1(c1) AS (SELECT ('4' COLLATE "C")::INT FROM generate_series(1, 10)); ` in 
postgres 13.2 release.

The crash happens at view.c:89 and I did some analysis:

```

ColumnDef  *def = makeColumnDef(tle->resname,
exprType((Node *) tle->expr),
exprTypmod((Node *) tle->expr),
exprCollation((Node *) tle->expr));



/*
 * It's possible that the column is of a collatable type but the
 * collation could not be resolved, so double-check.
 */

// Here is the analysis:

// Example:  ('4' COLLATE "C")::INT

// exprCollation((Node *) tle->expr) is the OID of the collation "C", so
// def->collOid is valid.
// exprType((Node *) tle->expr) is 23, which is the OID of type int4.
// We know that int4 is not collatable by calling type_is_collatable().

if (type_is_collatable(exprType((Node *) tle->expr)))
{
   if (!OidIsValid(def->collOid))
  ereport(ERROR,
(errcode(ERRCODE_INDETERMINATE_COLLATION),
 errmsg("could not determine which collation to use for view column 
\"%s\"",
  def->colname),
 errhint("Use the COLLATE clause to set the collation 
explicitly.")));
}
else

   // So we are here! int is not collatable and def->collOid is valid.
   Assert(!OidIsValid(def->collOid));

```

I am not sure whether to fix this bug in DefineVirtualRelation or in the 
parse/analysis phase, so maybe we can discuss it.
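
For comparison, nearby variants that do not hit the assertion (the problem 
seems specific to a COLLATE clause underneath a cast to a non-collatable type):

CREATE VIEW v2(c1) AS (SELECT ('4')::INT FROM generate_series(1, 10));
CREATE VIEW v3(c1) AS (SELECT ('4'::text COLLATE "C") FROM generate_series(1, 10));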




Best Regard!
Yulin PEI




Re: SQL/JSON: JSON_TABLE

2021-04-12 Thread Erik Rijkers
> On 2021.03.27. 02:12 Nikita Glukhov  wrote:
> Attached 47th version of the patches.

We're past feature freeze for 14 and alas, JSON_TABLE has not made it.

I have tested quite a bit with it and because I didn't find any trouble with 
functionality or speed, I wanted to at least mention that here once.
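
To give an idea of what that testing involved, a minimal JSON_TABLE query of 
the sort the patch set accepts (quoted from memory, so the exact syntax may 
differ slightly):

SELECT jt.*
FROM JSON_TABLE('[{"a": 1}, {"a": 2}]', '$[*]'
                COLUMNS (a int PATH '$.a')) AS jt;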

I looked at v47, these files
> [0001-SQL-JSON-functions-v47.patch]
> [0002-JSON_TABLE-v47.patch]
> [0003-JSON_TABLE-PLAN-DEFAULT-clause-v47.patch]
> [0004-JSON_TABLE-PLAN-clause-v47.patch]
> [manual_addition_fixed.patch]  # for this see [1], [2]

   (v47 doesn't apply anymore, as cfbot shows, but instances can still be built 
on top of 6131ffc43ff from 30 March 2021)

I hope it will fare better next round, version 15.

Thanks,

Erik Rijkers

[1] 
https://www.postgresql.org/message-id/69eefc5a-cabc-8dd3-c689-93da038c0d6a%40postgrespro.ru
[2] 
https://www.postgresql.org/message-id/19181987.22943.1617141503618%40webmailclassic.xs4all.nl


> -- 
> Nikita Glukhov
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company




Contribution to PostgreSQL - please give an advice

2021-04-12 Thread Ian Zagorskikh
Hi all!

I would like to contribute my time and efforts to the PostgreSQL project
development. I have some [hope not too bad] experience in software
development primarily for Linux/BSD/Windows platforms with C/C++ though
almost no experience in RDBMS internals. I have read the "Development
Information" section in wiki and checked the official TODO list. It's
really great but it's a bit huge and "all is so tasty" so I don't know
where to start. Can you please advise me how to proceed? Maybe there are
active tasks or bugs or issues that require man power? I would appreciate
any advice or mentorship. Thank you!

-- 
Best Regards,
Ian Zagorskikh


Re: multi-install PostgresNode fails with older postgres versions

2021-04-12 Thread Jehan-Guillaume de Rorthais
On Mon, 12 Apr 2021 09:52:24 -0400
Andrew Dunstan  wrote:

> On 4/12/21 8:59 AM, Jehan-Guillaume de Rorthais wrote:
> > Hi,
> >
> > On Wed, 7 Apr 2021 20:07:41 +0200
> > Jehan-Guillaume de Rorthais  wrote:
> > [...]  
>  Let me know if it worth that I work on an official patch.  
> >>> Let's give it a try ...
> >> OK  
> > So, as promised, here is my take to port my previous work on PostgreSQL
> > source tree.
> >
> > Make check pass with no errors. The new class probably deserve some own TAP
> > tests.
> >
> > Note that I added a PostgresVersion class for easier and cleaner version
> > comparisons. This could be an interesting take away no matter what.
> >
> > I still have some more ideas to cleanup, revamp and extend the base class,
> > but I prefer to go incremental to keep things review-ability.
> >  
> 
> Thanks for this. We have been working somewhat on parallel lines. With
> your permission I'm going to take some of what you've done and
> incorporate it in the patch I'm working on.

The current context makes my weeks difficult to plan and quite chaotic, that's
why it took me some days to produce the patch I promised.

I'm fine with working with a common base code, thanks. Feel free to merge both
code; we'll trade patches during review. However, I'm not sure what the status
of your patch is, so I can't judge what would be the easiest way to incorporate
it. Mine is tested down to 9.1 (9.0 was not meaningful because it lacks
pg_stat_replication) with:

* get_new_node
* init(allows_streaming => 1)
* start
* stop
* backup
* init_from_backup
* wait_for_catchup
* command_checks_all

Note the various changes in my init() and new method allow_streaming(), etc.

FYI (to avoid duplicate work), the next step on my todo list was to produce some
meaningful TAP tests to exercise the class.

> A PostgresVersion class is a good idea - I was already contemplating
> something of the kind.

Thanks!

Regards,




Re: psql - add SHOW_ALL_RESULTS option

2021-04-12 Thread Fabien COELHO



Hello Tom,


It's right: this is dead code because all paths through the if-nest
starting at line 1373 now leave results = NULL.  Hence, this patch
has broken the autocommit logic;


Do you mean yet another feature without a single non-regression test? :-(

I tend to rely on non-regression tests to catch bugs in complex, 
multi-purpose, hard-to-maintain functions when the code is modified.


I have submitted a patch to improve psql coverage to about 90%, but given 
the lack of enthusiasm, I simply dropped it. Not sure I was right not 
to insist.


it's no longer possible to tell whether we should do anything with our 
savepoint.



Between this and the known breakage of control-C, it seems clear
to me that this patch was nowhere near ready for prime time.
I think shoving it in on the last day before feature freeze was
ill-advised, and it ought to be reverted.  We can try again later.


The patch has been asleep for quite a while, and was resurrected, possibly 
too late in the process. ISTM that fixing it for 14 is manageable, 
but this is not my call.


--
Fabien.




[GSoC 2021 proposal] pl/julia extension

2021-04-12 Thread Konstantina Skovola
Hello community,

I’m Konstantina, a GSoC candidate for the project “Create Procedural
language extension for the Julia programming language”. The mentors have
already looked at my proposal and I’m attaching the finalized document.
There is still some time for corrections, in case anyone would like to
offer an opinion.

Best regards,

Konstantina Skovola


Konstantina Skovola pl_julia proposal.pdf
Description: Adobe PDF document


Re: multi-install PostgresNode fails with older postgres versions

2021-04-12 Thread Andrew Dunstan


On 4/12/21 8:59 AM, Jehan-Guillaume de Rorthais wrote:
> Hi,
>
> On Wed, 7 Apr 2021 20:07:41 +0200
> Jehan-Guillaume de Rorthais  wrote:
> [...]
 Let me know if it worth that I work on an official patch.
>>> Let's give it a try ...  
>> OK
> So, as promised, here is my take to port my previous work on PostgreSQL
> source tree.
>
> Make check pass with no errors. The new class probably deserve some own TAP
> tests.
>
> Note that I added a PostgresVersion class for easier and cleaner version
> comparisons. This could be an interesting take away no matter what.
>
> I still have some more ideas to cleanup, revamp and extend the base class, but
> I prefer to go incremental to keep things review-ability.
>

Thanks for this. We have been working somewhat on parallel lines. With
your permission I'm going to take some of what you've done and
incorporate it in the patch I'm working on.


A PostgresVersion class is a good idea - I was already contemplating
something of the kind.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Replication slot stats misgivings

2021-04-12 Thread Masahiko Sawada
On Mon, Apr 12, 2021 at 9:36 PM Amit Kapila  wrote:
>
> On Mon, Apr 12, 2021 at 5:29 PM Masahiko Sawada  wrote:
> >
> > On Mon, Apr 12, 2021 at 8:08 PM Amit Kapila  wrote:
> > >
> > > On Mon, Apr 12, 2021 at 4:34 PM Masahiko Sawada  
> > > wrote:
> > > >
> > > > On Mon, Apr 12, 2021 at 6:19 PM Amit Kapila  
> > > > wrote:
> > > > >
> > > > > On Mon, Apr 12, 2021 at 10:27 AM Masahiko Sawada 
> > > > >  wrote:
> > > > > >
> > > > > > On Sat, Apr 10, 2021 at 9:53 PM Amit Kapila 
> > > > > >  wrote:
> > > > > > >
> > > > > > >
> > > > > > > It seems Vignesh has changed patches based on the latest set of
> > > > > > > comments so you might want to rebase.
> > > > > >
> > > > > > I've merged my patch into the v6 patch set Vignesh submitted.
> > > > > >
> > > > > > I've attached the updated version of the patches. I didn't change
> > > > > > anything in the patch that changes char[NAMEDATALEN] to NameData 
> > > > > > (0001
> > > > > > patch) and patches that add tests.
> > > > > >
> > > > >
> > > > > I think we can push 0001. What do you think?
> > > >
> > > > +1
> > > >
> > > > >
> > > > > > In 0003 patch I reordered the
> > > > > > output parameters of pg_stat_replication_slots; showing total number
> > > > > > of transactions and total bytes followed by statistics for spilled 
> > > > > > and
> > > > > > streamed transactions seems appropriate to me.
> > > > > >
> > > > >
> > > > > I am not sure about this because I think we might want to add some
> > > > > info of stream/spill bytes in total_bytes description (something like
> > > > > stream/spill bytes are not in addition to total_bytes).
> >
> > BTW doesn't it confuse users that stream/spill bytes are not in
> > addition to total_bytes? User will need to do "total_bytes +
> > spill/stream_bytes" to know the actual total amount of data sent to
> > the decoding output plugin, is that right?
> >
>
> No, total_bytes includes the spill/stream bytes. So, the user doesn't
> need to do any calculation to compute totel_bytes sent to output
> plugin.

The following test for the latest v8 patch seems to show different.
total_bytes is 1808 whereas spill_bytes is 1320. Am I missing
something?

postgres(1:85969)=# select pg_create_logical_replication_slot('s',
'test_decoding');
 pg_create_logical_replication_slot

 (s,0/1884468)
(1 row)

postgres(1:85969)=# create table a (i int);
CREATE TABLE
postgres(1:85969)=# insert into a select generate_series(1, 10);
INSERT 0 10
postgres(1:85969)=# set logical_decoding_work_mem to 64;
SET
postgres(1:85969)=# select * from pg_stat_replication_slots ;
 slot_name | total_txns | total_bytes | spill_txns | spill_count |
spill_bytes | stream_txns | stream_count | stream_bytes | stats_reset
---++-++-+-+-+--+--+-
 s |  0 |   0 |  0 |   0 |
  0 |   0 |0 |0 |
(1 row)

postgres(1:85969)=# select count(*) from
pg_logical_slot_peek_changes('s', NULL, NULL);
 count

 14
(1 row)

postgres(1:85969)=# select * from pg_stat_replication_slots ;
 slot_name | total_txns | total_bytes | spill_txns | spill_count |
spill_bytes | stream_txns | stream_count | stream_bytes | stats_reset
---++-++-+-+-+--+--+-
 s |  2 |1808 |  1 | 202 |
1320 |   0 |0 |0 |
(1 row)

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




pg_upgrade check for invalid role-specific default config

2021-04-12 Thread Charlie Hornsby
Hi all,

While troubleshooting a failed upgrade from v11 -> v12 I realised I had
encountered a bug previously reported on the pgsql-bugs mailing list:

#14242 Role with a setconfig "role" setting to a nonexistent role causes
pg_upgrade to fail

https://www.postgresql.org/message-id/20160711223641.1426.86096%40wrigleys.postgresql.org

To quote the previous report:

> It is possible to modify the "role" setting in setconfig in the
> pg_db_role_setting table such that it points to a nonexistent role.  When
> this is the case, restoring the output of pg_dumpall will fail due to the
> missing role.

> Steps to reproduce:

> 1. As superuser, execute "create role foo with login password 'test'"
> 2. As foo, execute "alter role foo set role = 'foo'"
> 3. As superuser, execute "alter role foo rename to bar"
> a. At this point, the setconfig entry in pg_db_role_setting for
> 'bar' will contain '{role=foo}', which no longer exists
> 4. Execute pg_upgrade with the recommended steps in
> https://www.postgresql.org/docs/current/static/pgupgrade.html

> During pg_upgrade (more specifically, during the restore of the output from
> pg_dumpall), the "ALTER ROLE "bar" SET "role" TO 'foo'" command generated
> will fail with "ERROR: role "foo" does not exist".

> This issue was identified by Jordan Lange and Nathan Bossart.

The steps in the original report reproduce the problem on all currently
supported pg versions. I appreciate that the invalid role-specific default 
settings are ultimately self-inflicted by the user, but as a third party 
performing the upgrade I was caught by surprise.

Since it is possible to write a query to identify these cases, would there
be appetite for me to submit a patch to add a check for this to 
pg_upgrade?
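
For illustration, a rough sketch of the sort of catalog query I mean (it just 
pulls the "role" entry out of setconfig and checks it against pg_authid; the 
details would need polishing for a real pg_upgrade check):

SELECT r.rolname AS role_with_setting,
       split_part(s.cfg, '=', 2) AS missing_role
FROM pg_db_role_setting d
JOIN pg_authid r ON r.oid = d.setrole
CROSS JOIN LATERAL unnest(d.setconfig) AS s(cfg)
WHERE s.cfg LIKE 'role=%'
  AND NOT EXISTS (SELECT 1 FROM pg_authid a
                  WHERE a.rolname = split_part(s.cfg, '=', 2));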

First time mailing list user here so many apologies for any missteps I have
made in this message.

Best regards,
Charlie Hornsby



Re: multi-install PostgresNode fails with older postgres versions

2021-04-12 Thread Jehan-Guillaume de Rorthais
Hi,

On Wed, 7 Apr 2021 20:07:41 +0200
Jehan-Guillaume de Rorthais  wrote:
[...]
> > > Let me know if it worth that I work on an official patch.
> > 
> > Let's give it a try ...  
> 
> OK

So, as promised, here is my take to port my previous work on PostgreSQL
source tree.

Make check pass with no errors. The new class probably deserve some own TAP
tests.

Note that I added a PostgresVersion class for easier and cleaner version
comparisons. This could be an interesting take away no matter what.

I still have some more ideas to cleanup, revamp and extend the base class, but
I prefer to go incremental to keep things review-ability.

Thanks,
From 077e447995b3b8bb4c0594a6616e1350acd9d48b Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais 
Date: Mon, 12 Apr 2021 14:45:17 +0200
Subject: [PATCH] Draft: Makes PostgresNode multi-version aware

---
 src/test/perl/PostgresNode.pm| 708 ---
 src/test/perl/PostgresVersion.pm |  65 +++
 2 files changed, 704 insertions(+), 69 deletions(-)
 create mode 100644 src/test/perl/PostgresVersion.pm

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index e26b2b3f30..d7a9e54efd 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -102,6 +102,7 @@ use Test::More;
 use TestLib ();
 use Time::HiRes qw(usleep);
 use Scalar::Util qw(blessed);
+use PostgresVersion;
 
 our @EXPORT = qw(
   get_new_node
@@ -376,9 +377,42 @@ sub dump_info
 	return;
 }
 
+# Internal routine to allows streaming replication on a primary node.
+sub allows_streaming
+{
+	my ($self, $allows_streaming) = @_;
+	my $pgdata = $self->data_dir;
+	my $wal_level;
+
+	if ($allows_streaming eq "logical")
+	{
+		$wal_level = "logical";
+	}
+	else
+	{
+		$wal_level = "replica";
+	}
+
+	$self->append_conf( 'postgresql.conf', qq{
+		wal_level = $wal_level
+		max_wal_senders = 10
+		max_replication_slots = 10
+		wal_log_hints = on
+		hot_standby = on
+		# conservative settings to ensure we can run multiple postmasters:
+		shared_buffers = 1MB
+		max_connections = 10
+		# limit disk space consumption, too:
+		max_wal_size = 128MB
+	});
 
-# Internal method to set up trusted pg_hba.conf for replication.  Not
-# documented because you shouldn't use it, it's called automatically if needed.
+	$self->set_replication_conf;
+
+	return;
+}
+
+# Internal to set up trusted pg_hba.conf for replication.  Not documented
+# because you shouldn't use it, it's called automatically if needed.
 sub set_replication_conf
 {
 	my ($self) = @_;
@@ -398,6 +432,15 @@ sub set_replication_conf
 	return;
 }
 
+# Internal methods to track capabilities depending on the PostgreSQL major
+# version used
+
+sub can_slots   { return 1 }
+sub can_log_hints   { return 1 }
+sub can_restart_after_crash { return 1 }
+sub can_skip_init_fsync { return 1 }
+
+
 =pod
 
 =item $node->init(...)
@@ -429,6 +472,7 @@ sub init
 	my $port   = $self->port;
 	my $pgdata = $self->data_dir;
 	my $host   = $self->host;
+	my @cmd;
 
 	local %ENV = $self->_get_env();
 
@@ -438,19 +482,25 @@ sub init
 	mkdir $self->backup_dir;
 	mkdir $self->archive_dir;
 
-	TestLib::system_or_bail('initdb', '-D', $pgdata, '-A', 'trust', '-N',
-		@{ $params{extra} });
+	@cmd = ( 'initdb', '-D', $pgdata, '-A', 'trust' );
+	push @cmd, '-N' if $self->can_skip_init_fsync;
+	push @cmd, @{ $params{extra} } if defined $params{extra};
+
+	TestLib::system_or_bail(@cmd);
 	TestLib::system_or_bail($ENV{PG_REGRESS}, '--config-auth', $pgdata,
 		@{ $params{auth_extra} });
 
 	open my $conf, '>>', "$pgdata/postgresql.conf";
 	print $conf "\n# Added by PostgresNode.pm\n";
 	print $conf "fsync = off\n";
-	print $conf "restart_after_crash = off\n";
+	print $conf "restart_after_crash = off\n" if $self->can_restart_after_crash;
 	print $conf "log_line_prefix = '%m [%p] %q%a '\n";
 	print $conf "log_statement = all\n";
-	print $conf "log_replication_commands = on\n";
-	print $conf "wal_retrieve_retry_interval = '500ms'\n";
+	if ($self->version >= '9.5.0')
+	{
+		print $conf "log_replication_commands = on\n";
+		print $conf "wal_retrieve_retry_interval = '500ms'\n";
+	}
 
 	# If a setting tends to affect whether tests pass or fail, print it after
 	# TEMP_CONFIG.  Otherwise, print it before TEMP_CONFIG, thereby permitting
@@ -463,31 +513,8 @@ sub init
 	# concurrently must not share a stats_temp_directory.
 	print $conf "stats_temp_directory = 'pg_stat_tmp'\n";
 
-	if ($params{allows_streaming})
-	{
-		if ($params{allows_streaming} eq "logical")
-		{
-			print $conf "wal_level = logical\n";
-		}
-		else
-		{
-			print $conf "wal_level = replica\n";
-		}
-		print $conf "max_wal_senders = 10\n";
-		print $conf "max_replication_slots = 10\n";
-		print $conf "wal_log_hints = on\n";
-		print $conf "hot_standby = on\n";
-		# conservative settings to ensure we can run multiple postmasters:
-		print $conf "shared_buffers = 1MB\n";
-		print $conf "max_connections = 10\n";
-		# limit disk space consumption, 

Re: "could not find pathkey item to sort" for TPC-DS queries 94-96

2021-04-12 Thread Tomas Vondra
On 4/12/21 2:24 PM, Luc Vlaming wrote:
> Hi,
> 
> When trying to run on master (but afaik also PG-13) TPC-DS queries 94,
> 95 and 96 on a SF10 I get the error "could not find pathkey item to sort".
> When I disable enable_gathermerge the problem goes away and then the
> plan for query 94 looks like below. I tried figuring out what the
> problem is but to be honest I would need some pointers as the code that
> tries to matching equivalence members in prepare_sort_from_pathkeys is
> something i'm really not familiar with.
> 

Could be related to incremental sort, which allowed some gather merge
paths that were impossible before. We had a couple issues related to
that fixed in November, IIRC.

> To reproduce you can either ingest and test using the toolkit I used too
> (see https://github.com/swarm64/s64da-benchmark-toolkit/), or
> alternatively just use the schema (see
> https://github.com/swarm64/s64da-benchmark-toolkit/tree/master/benchmarks/tpcds/schemas/psql_native)
> 

Thanks, I'll see if I can reproduce that with your schema.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Replication slot stats misgivings

2021-04-12 Thread Amit Kapila
On Mon, Apr 12, 2021 at 5:29 PM Masahiko Sawada  wrote:
>
> On Mon, Apr 12, 2021 at 8:08 PM Amit Kapila  wrote:
> >
> > On Mon, Apr 12, 2021 at 4:34 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Mon, Apr 12, 2021 at 6:19 PM Amit Kapila  
> > > wrote:
> > > >
> > > > On Mon, Apr 12, 2021 at 10:27 AM Masahiko Sawada 
> > > >  wrote:
> > > > >
> > > > > On Sat, Apr 10, 2021 at 9:53 PM Amit Kapila  
> > > > > wrote:
> > > > > >
> > > > > >
> > > > > > It seems Vignesh has changed patches based on the latest set of
> > > > > > comments so you might want to rebase.
> > > > >
> > > > > I've merged my patch into the v6 patch set Vignesh submitted.
> > > > >
> > > > > I've attached the updated version of the patches. I didn't change
> > > > > anything in the patch that changes char[NAMEDATALEN] to NameData (0001
> > > > > patch) and patches that add tests.
> > > > >
> > > >
> > > > I think we can push 0001. What do you think?
> > >
> > > +1
> > >
> > > >
> > > > > In 0003 patch I reordered the
> > > > > output parameters of pg_stat_replication_slots; showing total number
> > > > > of transactions and total bytes followed by statistics for spilled and
> > > > > streamed transactions seems appropriate to me.
> > > > >
> > > >
> > > > I am not sure about this because I think we might want to add some
> > > > info of stream/spill bytes in total_bytes description (something like
> > > > stream/spill bytes are not in addition to total_bytes).
>
> BTW doesn't it confuse users that stream/spill bytes are not in
> addition to total_bytes? User will need to do "total_bytes +
> spill/stream_bytes" to know the actual total amount of data sent to
> the decoding output plugin, is that right?
>

No, total_bytes includes the spill/stream bytes. So, the user doesn't
need to do any calculation to compute total_bytes sent to the output
plugin.
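
In other words, with the proposed columns the following should never return a
row (column names as in the patch):

SELECT slot_name
FROM pg_stat_replication_slots
WHERE total_bytes < spill_bytes + stream_bytes;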

-- 
With Regards,
Amit Kapila.




Re: TRUNCATE on foreign table

2021-04-12 Thread Fujii Masao




On 2021/04/09 23:10, Bharath Rupireddy wrote:

On Fri, Apr 9, 2021 at 7:06 PM Fujii Masao  wrote:

  > 4. Tab-completion for TRUNCATE should be updated so that also foreign 
tables are displayed.

 It will be good to have.


Patch attached.


Tab completion patch LGTM and it works as expected.


Thanks for the review! Pushed.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

2021-04-12 Thread Amit Langote
On Mon, Apr 12, 2021 at 4:42 PM Amit Langote  wrote:
> On Mon, Apr 12, 2021 at 6:20 AM Alvaro Herrera  
> wrote:
> > On 2021-Mar-31, Tom Lane wrote:
> >
> > > diff -U3 
> > > /home/buildfarm/trilobite/buildroot/HEAD/pgsql.build/src/test/isolation/expected/detach-partition-concurrently-4.out
> > >  
> > > /home/buildfarm/trilobite/buildroot/HEAD/pgsql.build/src/test/isolation/output_iso/results/detach-partition-concurrently-4.out
> > > --- 
> > > /home/buildfarm/trilobite/buildroot/HEAD/pgsql.build/src/test/isolation/expected/detach-partition-concurrently-4.out
> > >   2021-03-29 20:14:21.258199311 +0200
> > > +++ 
> > > /home/buildfarm/trilobite/buildroot/HEAD/pgsql.build/src/test/isolation/output_iso/results/detach-partition-concurrently-4.out
> > > 2021-03-30 18:54:34.272839428 +0200
> > > @@ -324,6 +324,7 @@
> > >  1
> > >  2
> > >  step s1insert: insert into d4_fk values (1);
> > > +ERROR:  insert or update on table "d4_fk" violates foreign key 
> > > constraint "d4_fk_a_fkey"
> > >  step s1c: commit;
> > >
> > >  starting permutation: s2snitch s1b s1s s2detach s1cancel s3vacfreeze s1s 
> > > s1insert s1c
> >
> > Hmm, actually, looking at this closely, I think the expected output is
> > bogus and trilobite is doing the right thing by throwing this error
> > here.  The real question is why isn't this case behaving in that way in
> > every *other* animal.
>
> Indeed.
>
> I can see a wrong behavior of RelationGetPartitionDesc() in a case
> that resembles the above test case.
>
> drop table if exists d4_primary, d4_primary1, d4_fk, d4_pid;
> create table d4_primary (a int primary key) partition by list (a);
> create table d4_primary1 partition of d4_primary for values in (1);
> create table d4_primary2 partition of d4_primary for values in (2);
> insert into d4_primary values (1);
> insert into d4_primary values (2);
> create table d4_fk (a int references d4_primary);
> insert into d4_fk values (2);
> create table d4_pid (pid int);
>
> Session 1:
> begin isolation level serializable;
> select * from d4_primary;
>  a
> ---
>  1
>  2
> (2 rows)
>
> Session 2:
> alter table d4_primary detach partition d4_primary1 concurrently;
> 
>
> Session 1:
> -- can see d4_primary1 as detached at this point, though still scans!
> select * from d4_primary;
>  a
> ---
>  1
>  2
> (2 rows)
> insert into d4_fk values (1);
> INSERT 0 1
> end;
>
> Session 2:
> 
> ALTER TABLE
>
> Session 1:
> -- FK violated
> select * from d4_primary;
>  a
> ---
>  2
> (1 row)
> select * from d4_fk;
>  a
> ---
>  1
> (1 row)
>
> The 2nd select that session 1 performs adds d4_primary1, whose detach
> it *sees* is pending, to the PartitionDesc, but without setting its
> includes_detached.  The subsequent insert's RI query, because it uses
> that PartitionDesc as-is, returns 1 as being present in d4_primary,
> leading to the insert succeeding.  When session 1's transaction
> commits, the waiting ALTER proceeds with committing the 2nd part of
> the DETACH, without having a chance again to check if the DETACH would
> lead to the foreign key of d4_fk being violated.
>
> It seems problematic to me that the logic of setting includes_detached
> is oblivious of the special check in find_inheritance_children() to
> decide whether "force"-include a detach-pending child based on
> cross-checking its pg_inherit tuple's xmin against the active
> snapshot.  Maybe fixing that would help, although I haven't tried that
> myself yet.

I tried that in the attached.  It is indeed the above failing
isolation test whose output needed to be adjusted.

While at it, I tried rewording the comment around that special
visibility check done to force-include detached partitions, as I got
confused by the way it's worded currently.  Actually, it may be a good
idea to add some comments around the intended include_detached
behavior in the places where PartitionDesc is used; e.g.
set_relation_partition_info() lacks a one-liner on why it's okay for
the planner to not see detached partitions.  Or perhaps, a comment for
includes_detached of PartitionDesc should describe the various cases
in which it is true and the cases in which it is not.
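
To make the intended behavior after the fix concrete, here is a hedged
sketch reusing the hypothetical d4_* tables from the repro above; the
expected error text is taken from the adjusted isolation output:

-- Session 1:
begin isolation level serializable;
select * from d4_primary;        -- returns 1 and 2

-- Session 2 (blocks, waiting for session 1's transaction):
-- alter table d4_primary detach partition d4_primary1 concurrently;

-- Session 1 again:
select * from d4_primary;        -- still expected to return 1 and 2
                                 -- (the snapshot predates the detach)
insert into d4_fk values (1);
-- Expected with the fix, since the RI check must not treat the
-- detach-pending d4_primary1 as containing a valid referent:
--   ERROR:  insert or update on table "d4_fk" violates foreign key constraint "d4_fk_a_fkey"
end;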

-- 
Amit Langote
EDB: http://www.enterprisedb.com


PartitionDesc-includes_detached-thinko-fix.patch
Description: Binary data


interaction between csps with dummy tlists and set_customscan_references

2021-04-12 Thread Luc Vlaming

Hi,

While developing a custom scan provider (CSP) that can sit directly above 
e.g. a UNION or anything else with a dummy tlist, we observed some problems: 
set_customscan_references cannot handle dummy tlists and raises invalid 
varno errors. I am wondering how this can be fixed, and also why there is 
actually no callback in the CSP interface for set_customscan_references. 
Can someone clarify this for me?


Thanks!

Regards,
Luc




"could not find pathkey item to sort" for TPC-DS queries 94-96

2021-04-12 Thread Luc Vlaming

Hi,

When trying to run TPC-DS queries 94, 95 and 96 on an SF10 data set on 
master (but AFAIK also on PG 13), I get the error "could not find pathkey 
item to sort". When I disable enable_gathermerge the problem goes away, and 
the plan for query 94 then looks like the one below. I tried figuring out 
what the problem is, but to be honest I would need some pointers, as the 
code that matches equivalence members in prepare_sort_from_pathkeys is 
something I'm really not familiar with.
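
For reference, switching between the failing and the working case only
requires toggling the planner GUC mentioned above (a sketch; the query text
is assumed to come from the standard TPC-DS templates and is not repeated
here):

-- Against the SF10 TPC-DS schema linked below.
set enable_gathermerge = on;    -- default; query 94 then fails with
                                -- ERROR:  could not find pathkey item to sort
set enable_gathermerge = off;   -- workaround; query 94 then completes and
                                -- produces the plan shown below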


To reproduce you can either ingest and test using the toolkit I used too 
(see https://github.com/swarm64/s64da-benchmark-toolkit/), or 
alternatively just use the schema (see 
https://github.com/swarm64/s64da-benchmark-toolkit/tree/master/benchmarks/tpcds/schemas/psql_native)


Best,
Luc


 Limit  (cost=229655.62..229655.63 rows=1 width=72)
   ->  Sort  (cost=229655.62..229655.63 rows=1 width=72)
 Sort Key: (count(DISTINCT ws1.ws_order_number))
 ->  Aggregate  (cost=229655.60..229655.61 rows=1 width=72)
   ->  Nested Loop Semi Join  (cost=1012.65..229655.59 
rows=1 width=16)
 ->  Nested Loop  (cost=1012.22..229653.73 rows=1 
width=20)
   Join Filter: (ws1.ws_web_site_sk = 
web_site.web_site_sk)
   ->  Nested Loop  (cost=1012.22..229651.08 
rows=1 width=24)
 ->  Gather  (cost=1011.80..229650.64 
rows=1 width=28)

   Workers Planned: 2
   ->  Nested Loop Anti Join 
(cost=11.80..228650.54 rows=1 width=28)
 ->  Hash Join 
(cost=11.37..227438.35 rows=2629 width=28)
   Hash Cond: 
(ws1.ws_ship_date_sk = date_dim.d_date_sk)
   ->  Parallel Seq 
Scan on web_sales ws1  (cost=0.00..219548.92 rows=3000992 width=32)
   ->  Hash 
(cost=10.57..10.57 rows=64 width=4)
 ->  Index Scan 
using idx_d_date on date_dim  (cost=0.29..10.57 rows=64 width=4)
   Index 
Cond: ((d_date >= '2000-03-01'::date) AND (d_date <= '2000-04-30'::date))
 ->  Index Only Scan using 
idx_wr_order_number on web_returns wr1  (cost=0.42..0.46 rows=2 width=4)
   Index Cond: 
(wr_order_number = ws1.ws_order_number)
 ->  Index Scan using 
customer_address_pkey on customer_address  (cost=0.42..0.44 rows=1 width=4)
   Index Cond: (ca_address_sk = 
ws1.ws_ship_addr_sk)
   Filter: ((ca_state)::text = 
'GA'::text)
   ->  Seq Scan on web_site  (cost=0.00..2.52 
rows=10 width=4)
 Filter: ((web_company_name)::text = 
'pri'::text)
 ->  Index Scan using idx_ws_order_number on 
web_sales ws2  (cost=0.43..1.84 rows=59 width=8)
   Index Cond: (ws_order_number = 
ws1.ws_order_number)

   Filter: (ws1.ws_warehouse_sk <> ws_warehouse_sk)

The top of the stacktrace is:
#0  errfinish (filename=0x5562dc1a5125 "createplan.c", lineno=6186, 
funcname=0x5562dc1a54d0 <__func__.14> "prepare_sort_from_pathkeys") at 
elog.c:514
#1  0x5562dbc2d7de in prepare_sort_from_pathkeys 
(lefttree=0x5562dc5a2f58, pathkeys=0x5562dc4eabc8, relids=0x0, 
reqColIdx=0x0, adjust_tlist_in_place=<optimized out>, 
p_numsortkeys=0x7ffc0b8cda84, p_sortColIdx=0x7ffc0b8cda88, 
p_sortOperators=0x7ffc0b8cda90, p_collations=0x7ffc0b8cda98, 
p_nullsFirst=0x7ffc0b8cdaa0) at createplan.c:6186
#2  0x5562dbe8d695 in make_sort_from_pathkeys (lefttree=<optimized out>, pathkeys=<optimized out>, relids=<optimized out>) at createplan.c:6313
#3  0x5562dbe8eba3 in create_sort_plan (flags=1, 
best_path=0x5562dc548d68, root=0x5562dc508cf8) at createplan.c:2118
#4  create_plan_recurse (root=0x5562dc508cf8, best_path=0x5562dc548d68, 
flags=1) at createplan.c:489
#5  0x5562dbe8f315 in create_gather_merge_plan 
(best_path=0x5562dc5782f8, root=0x5562dc508cf8) at createplan.c:1885
#6  create_plan_recurse (root=0x5562dc508cf8, best_path=0x5562dc5782f8, 
flags=<optimized out>) at createplan.c:541
#7  0x5562dbe8ddad in create_nestloop_plan 
(best_path=0x5562dc585668, root=0x5562dc508cf8) at createplan.c:4237
#8  create_join_plan (best_path=0x5562dc585668, root=0x5562dc508cf8) at 
createplan.c:1062
#9  create_plan_recurse (root=0x5562dc508cf8, best_path=0x5562dc585668, 
flags=<optimized out>) at createplan.c:418
#10 0x5562dbe8ddad in create_nestloop_plan 
(best_path=0x5562dc5c4428, root=0x5562dc508cf8) at createplan.c:4237
#11 create_join_plan (best_path=0x5562dc5c4428, 

Re: Replication slot stats misgivings

2021-04-12 Thread vignesh C
On Mon, Apr 12, 2021 at 4:46 PM Amit Kapila  wrote:
>
> On Sat, Apr 10, 2021 at 6:51 PM vignesh C  wrote:
> >
>
> Thanks, 0001 and 0002 look good to me. I have a minor comment for 0002.
>
> 
> +total_bytesbigint
> +   
> +   
> +Amount of decoded transactions data sent to the decoding output 
> plugin
> +while decoding the changes from WAL for this slot. This can be used 
> to
> +gauge the total amount of data sent during logical decoding.
>
> Can we slightly extend it to say something like: Note that this
> includes the bytes streamed and or spilled. Similarly, we can extend
> it for total_txns.
>

Thanks for the comments, the comments are fixed in the v8 patch attached.
Thoughts?

Regards,
Vignesh
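
For context, the counters being discussed are exposed through the
pg_stat_replication_slots view; a hedged sketch of reading and resetting
them follows (the column list mirrors this thread and may differ in detail,
and 'my_slot_name' is a placeholder):

-- Per-slot logical decoding statistics, including the spilled, streamed
-- and total transaction counts and byte amounts discussed above.
select slot_name,
       spill_txns, spill_count, spill_bytes,
       stream_txns, stream_count, stream_bytes,
       total_txns, total_bytes
  from pg_stat_replication_slots;

-- Reset the counters of a single slot:
select pg_stat_reset_replication_slot('my_slot_name');
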
From b3288d575d9b8c26381fa8da773960d16ae2a1f3 Mon Sep 17 00:00:00 2001
From: vignesh 
Date: Sat, 10 Apr 2021 08:14:52 +0530
Subject: [PATCH v8 1/5] Changed char datatype to NameData datatype for
 slotname.

Changed char datatype to NameData datatype for slotname.
---
 src/backend/postmaster/pgstat.c   | 32 +++
 src/backend/replication/logical/logical.c | 13 ++---
 src/backend/replication/slot.c|  7 -
 src/backend/utils/adt/pgstatfuncs.c   |  2 +-
 src/include/pgstat.h  | 11 +++-
 5 files changed, 36 insertions(+), 29 deletions(-)

diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f4467625f7..666ce95d08 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -64,6 +64,7 @@
 #include "storage/pg_shmem.h"
 #include "storage/proc.h"
 #include "storage/procsignal.h"
+#include "utils/builtins.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
 #include "utils/ps_status.h"
@@ -1539,7 +1540,7 @@ pgstat_reset_replslot_counter(const char *name)
 		if (SlotIsPhysical(slot))
 			return;
 
-		strlcpy(msg.m_slotname, name, NAMEDATALEN);
+		namestrcpy(&msg.m_slotname, name);
 		msg.clearall = false;
 	}
 	else
@@ -1812,10 +1813,7 @@ pgstat_report_tempfile(size_t filesize)
  * --
  */
 void
-pgstat_report_replslot(const char *slotname, PgStat_Counter spilltxns,
-	   PgStat_Counter spillcount, PgStat_Counter spillbytes,
-	   PgStat_Counter streamtxns, PgStat_Counter streamcount,
-	   PgStat_Counter streambytes)
+pgstat_report_replslot(const PgStat_ReplSlotStats *repSlotStat)
 {
 	PgStat_MsgReplSlot msg;
 
@@ -1823,14 +1821,14 @@ pgstat_report_replslot(const char *slotname, PgStat_Counter spilltxns,
 	 * Prepare and send the message
 	 */
 	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_REPLSLOT);
-	strlcpy(msg.m_slotname, slotname, NAMEDATALEN);
+	namestrcpy(&msg.m_slotname, NameStr(repSlotStat->slotname));
 	msg.m_drop = false;
-	msg.m_spill_txns = spilltxns;
-	msg.m_spill_count = spillcount;
-	msg.m_spill_bytes = spillbytes;
-	msg.m_stream_txns = streamtxns;
-	msg.m_stream_count = streamcount;
-	msg.m_stream_bytes = streambytes;
+	msg.m_spill_txns = repSlotStat->spill_txns;
+	msg.m_spill_count = repSlotStat->spill_count;
+	msg.m_spill_bytes = repSlotStat->spill_bytes;
+	msg.m_stream_txns = repSlotStat->stream_txns;
+	msg.m_stream_count = repSlotStat->stream_count;
+	msg.m_stream_bytes = repSlotStat->stream_bytes;
 	pgstat_send(&msg, sizeof(PgStat_MsgReplSlot));
 }
 
@@ -1846,7 +1844,7 @@ pgstat_report_replslot_drop(const char *slotname)
 	PgStat_MsgReplSlot msg;
 
 	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_REPLSLOT);
-	strlcpy(msg.m_slotname, slotname, NAMEDATALEN);
+	namestrcpy(&msg.m_slotname, slotname);
 	msg.m_drop = true;
 	pgstat_send(&msg, sizeof(PgStat_MsgReplSlot));
 }
@@ -5202,7 +5200,7 @@ pgstat_recv_resetreplslotcounter(PgStat_MsgResetreplslotcounter *msg,
 	else
 	{
 		/* Get the index of replication slot statistics to reset */
-		idx = pgstat_replslot_index(msg->m_slotname, false);
+		idx = pgstat_replslot_index(NameStr(msg->m_slotname), false);
 
 		/*
 		 * Nothing to do if the given slot entry is not found.  This could
@@ -5538,7 +5536,7 @@ pgstat_recv_replslot(PgStat_MsgReplSlot *msg, int len)
 	 * Get the index of replication slot statistics.  On dropping, we don't
 	 * create the new statistics.
 	 */
-	idx = pgstat_replslot_index(msg->m_slotname, !msg->m_drop);
+	idx = pgstat_replslot_index(NameStr(msg->m_slotname), !msg->m_drop);
 
 	/*
 	 * The slot entry is not found or there is no space to accommodate the new
@@ -5763,7 +5761,7 @@ pgstat_replslot_index(const char *name, bool create_it)
 	Assert(nReplSlotStats <= max_replication_slots);
 	for (i = 0; i < nReplSlotStats; i++)
 	{
-		if (strcmp(replSlotStats[i].slotname, name) == 0)
+		if (namestrcmp(&replSlotStats[i].slotname, name) == 0)
 			return i;			/* found */
 	}
 
@@ -5776,7 +5774,7 @@ pgstat_replslot_index(const char *name, bool create_it)
 
 	/* Register new slot */
 	memset(&replSlotStats[nReplSlotStats], 0, sizeof(PgStat_ReplSlotStats));
-	strlcpy(replSlotStats[nReplSlotStats].slotname, name, NAMEDATALEN);
+	namestrcpy(&replSlotStats[nReplSlotStats].slotname, name);
 
 	return nReplSlotStats++;
 }
diff --git 

Re: Lazy JIT IR code generation to increase JIT speed with partitions

2021-04-12 Thread Luc Vlaming

On 18-01-2021 08:47, Luc Vlaming wrote:

Hi everyone, Andres,

On 03-01-2021 11:05, Luc Vlaming wrote:

On 30-12-2020 14:23, Luc Vlaming wrote:

On 30-12-2020 02:57, Andres Freund wrote:

Hi,

Great to see work in this area!


I would like this topic to make progress somehow and was wondering what 
other benchmarks / tests would be needed for that. I've so far provided 
benchmarks for small(ish) queries and some TPC-H numbers, assuming those 
would be enough.




On 2020-12-28 09:44:26 +0100, Luc Vlaming wrote:
I would like to propose a small patch to the JIT machinery which 
makes the
IR code generation lazy. The reason for postponing the generation 
of the IR

code is that with partitions we get an explosion in the number of JIT
functions generated as many child tables are involved, each with 
their own
JITted functions, especially when e.g. partition-aware 
joins/aggregates are
enabled. However, only a fraction of those functions is actually 
executed
because the Parallel Append node distributes the workers among the 
nodes.
With the attached patch we get a lazy generation which makes that 
this is no

longer a problem.


I unfortunately don't think this is quite good enough, because it'll
lead to emitting all functions separately, which can also lead to very
substantial increases of the required time (as emitting code is an
expensive step). Obviously that is only relevant in the cases where the
generated functions actually end up being used - which isn't the 
case in

your example.

If you e.g. look at a query like
   SELECT blub, count(*),sum(zap) FROM foo WHERE blarg = 3 GROUP BY 
blub;

on a table without indexes, you would end up with functions for

- WHERE clause (including deforming)
- projection (including deforming)
- grouping key
- aggregate transition
- aggregate result projection

with your patch each of these would be emitted separately, instead of
one go. Which IIRC increases the required time by a significant amount,
especially if inlining is done (where each separate code generation 
ends

up with copies of the inlined code).


As far as I can see you've basically falsified the second part of this
comment (which you moved):


+
+    /*
+ * Don't immediately emit nor actually generate the function.
+ * instead do so the first time the expression is actually 
evaluated.
+ * That allows to emit a lot of functions together, avoiding a 
lot of
+ * repeated llvm and memory remapping overhead. It also helps 
with not
+ * compiling functions that will never be evaluated, as can be 
the case
+ * if e.g. a parallel append node is distributing workers 
between its

+ * child nodes.
+ */



-    /*
- * Don't immediately emit function, instead do so the first 
time the

- * expression is actually evaluated. That allows to emit a lot of
- * functions together, avoiding a lot of repeated llvm and memory
- * remapping overhead.
- */


Greetings,

Andres Freund



Hi,

Happy to help out, and thanks for the info and suggestions.
Also, I should have searched pgsql-hackers and the like first, as I just 
found out there are already discussions about this in [1] and [2]. 
However, I think the approach I took can be taken independently, and the 
other solutions could then be added on top.


Assuming I understood all suggestions correctly, the ideas so far are:
1. add a LLVMAddMergeFunctionsPass so that duplicate code is removed 
and not optimized several times (see [1]). Requires all code to be 
emitted in the same module.

2. JIT only parts of the plan, based on cost (see [2]).
3. Cache compilation results to avoid recompilation. this would 
either need a shm capable optimized IR cache or would not work with 
parallel workers.

4. Lazily jitting (this patch)

An idea that might not have been presented on the mailing list yet(?):
5. Only JIT in nodes that process a certain number of rows, assuming 
there is a constant overhead for JITting and the goal is to gain 
runtime.


Going forward I would first try to see if my current approach can 
work out. The only idea that would be counterproductive to my 
solution would be solution 1. Afterwards I'd like to continue with 
either solution 2, 5, or 3 in the hopes that we can reduce JIT 
overhead to a minimum and can therefore apply it more broadly.


To find out why and where the JIT performance decreased with my 
solution, I improved the test script and added various queries to model 
some of the cases I think we should care about. I have not (yet) done 
large-scale benchmarks, as these queries already seemed to show enough 
problems for now. There are now 4 queries which test JITting with and 
without partitions, and with varying numbers of workers and row counts. 
I hope these are indeed a somewhat representative set of queries.
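
As an illustration of the kind of measurement involved (a sketch, not the
actual test script: foo/blub/zap/blarg come from the example quoted
upthread, and the GUC settings merely force JIT so its overhead shows up
in EXPLAIN):

set jit = on;
set jit_above_cost = 0;             -- force JIT even for cheap plans
set jit_inline_above_cost = 0;
set jit_optimize_above_cost = 0;

explain (analyze)
select blub, count(*), sum(zap)
  from foo
 where blarg = 3
 group by blub;
-- The "JIT:" section at the end of the output reports how many functions
-- were emitted and the time spent generating, inlining, optimizing and
-- emitting them, which is the overhead discussed here.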


As pointed out, the current patch does create a performance degradation 
for queries that are not partitioned (basically q3 and q4). After 
looking into those queries I noticed two things:
- 

Re: allow partial union-all and improve parallel subquery costing

2021-04-12 Thread Luc Vlaming

Hi David,

On 15-03-2021 14:09, David Steele wrote:

Hi Luc,

On 12/30/20 8:54 AM, Luc Vlaming wrote:


Created a commitfest entry assuming this is the right thing to do so 
that someone can potentially pick it up during the commitfest.


Providing an updated patch based on latest master.


Looks like you need another rebase: 
http://cfbot.cputube.org/patch_32_2787.log. Marked as Waiting for Author.


You may also want to give a more detailed description of what you have 
done here and why it improves execution plans. This may help draw some 
reviewers.


Regards,


Here's an improved and rebased patch. Hope the description helps some 
people. I will resubmit it to the next commitfest.


Regards,
Luc
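
A hedged example of the kind of query the patch targets (t is a
hypothetical table; per the commit message, branches whose column types do
not match exactly cannot be flattened by flatten_simple_union_all):

-- The differing output types prevent flatten_simple_union_all from
-- applying, so the union goes through a subquery scan.
explain (costs off)
select a::bigint  as a from t
union all
select a::numeric as a from t;
-- With the partial UNION ALL path the intent is a single Gather over a
-- Parallel Append of both branches, as in the incremental_sort regression
-- output change included in the patch.
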
From e918e7cf8c9fe628c7daba2ccf37ad767691e4c7 Mon Sep 17 00:00:00 2001
From: Luc Vlaming 
Date: Mon, 12 Apr 2021 09:55:30 +0200
Subject: [PATCH v4] Add explicit partial UNION ALL path and improve parallel
 subquery rowcounts and costing.

By adding the partial union-all path we get parallel plans whenever
flatten_simple_union_all cannot be applied, e.g. whenever the column
types do not match exactly. A simple test case in the regression tests
shows this.
Also, for e.g. TPC-DS query 5 we now get a more parallel plan for the
part that processes the csr CTE.
To make it more likely that the improved path is chosen, another small
fix is added which corrects the row counts when subquery nodes are used
in parallel plans.
---
 src/backend/optimizer/path/costsize.c | 11 
 src/backend/optimizer/prep/prepunion.c|  4 ++
 .../regress/expected/incremental_sort.out | 10 ++--
 src/test/regress/expected/union.out   | 52 +++
 src/test/regress/sql/union.sql| 37 +
 5 files changed, 108 insertions(+), 6 deletions(-)

diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 8577c7b138..1da6879c6d 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1426,6 +1426,17 @@ cost_subqueryscan(SubqueryScanPath *path, PlannerInfo *root,
 	startup_cost += path->path.pathtarget->cost.startup;
 	run_cost += path->path.pathtarget->cost.per_tuple * path->path.rows;
 
+	/* Adjust costing for parallelism, if used. */
+	if (path->path.parallel_workers > 0)
+	{
+		double  parallel_divisor = get_parallel_divisor(&path->path);
+
+		path->path.rows = clamp_row_est(path->path.rows / parallel_divisor);
+
+		/* The CPU cost is divided among all the workers. */
+		run_cost /= parallel_divisor;
+	}
+
 	path->path.startup_cost += startup_cost;
 	path->path.total_cost += startup_cost + run_cost;
 }
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 037dfaacfd..7d4a6a19c2 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -679,6 +679,10 @@ generate_union_paths(SetOperationStmt *op, PlannerInfo *root,
 			   NIL, NULL,
 			   parallel_workers, enable_parallel_append,
 			   -1);
+
+		if (op->all && enable_parallel_append)
+			add_partial_path(result_rel, ppath);
+
 		ppath = (Path *)
 			create_gather_path(root, result_rel, ppath,
 			   result_rel->reltarget, NULL, NULL);
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
index a417b566d9..a0a31ba053 100644
--- a/src/test/regress/expected/incremental_sort.out
+++ b/src/test/regress/expected/incremental_sort.out
@@ -1487,14 +1487,12 @@ explain (costs off) select * from t union select * from t order by 1,3;
->  Unique
  ->  Sort
Sort Key: t.a, t.b, t.c
-   ->  Append
- ->  Gather
-   Workers Planned: 2
+   ->  Gather
+ Workers Planned: 2
+ ->  Parallel Append
->  Parallel Seq Scan on t
- ->  Gather
-   Workers Planned: 2
->  Parallel Seq Scan on t t_1
-(13 rows)
+(11 rows)
 
 -- Full sort, not just incremental sort can be pushed below a gather merge path
 -- by generate_useful_gather_paths.
diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out
index 75f78db8f5..cf7660f524 100644
--- a/src/test/regress/expected/union.out
+++ b/src/test/regress/expected/union.out
@@ -1420,3 +1420,55 @@ where (x = 0) or (q1 >= q2 and q1 <= q2);
  4567890123456789 |  4567890123456789 | 1
 (6 rows)
 
+-- Test handling of appendrel with different types which disables the path flattening and
+-- forces a subquery node. for the subquery node ensure the rowcounts are correct.
+create function check_estimated_rows(text) returns table (estimated int)
+language plpgsql as
+$$
+declare
+ln text;
+tmp text[];
+first_row bool := true;
+begin
+for ln in
+execute format('explain %s', $1)
+loop
+tmp := 
