Re: Can we do something to help stop users mistakenly using force_parallel_mode?

2022-06-28 Thread Laurenz Albe
On Wed, 2022-06-29 at 15:23 +1200, David Rowley wrote:
> Over on [1] I noticed that the user had set force_parallel_mode to
> "on" in the hope that would trick the planner into making their query
> run more quickly.  Of course, that's not what they want since that GUC
> is only there to inject some parallel nodes into the plan in order to
> verify the tuple communication works.
> 
> I get the idea that Robert might have copped some flak about this at
> some point, given that he wrote the blog post at [2].
> 
> The user would have realised this if they'd read the documentation
> about the GUC. However, I imagine they only went as far as finding a
> GUC with a name which appears to be exactly what they need.  I mean,
> what else could force_parallel_mode possibly do?
> 
> Should we maybe rename it to something less tempting? Maybe
> debug_parallel_query?
> 
> I wonder if \dconfig *parallel* is going to make force_parallel_mode
> even easier to find once PG15 is out.
> 
> [1] 
> https://www.postgresql.org/message-id/DB4PR02MB8774E06D595D3088BE04ED92E7B99%40DB4PR02MB8774.eurprd02.prod.outlook.com
> [2] 
> https://www.enterprisedb.com/postgres-tutorials/using-forceparallelmode-correctly-postgresql

I share the sentiment, but at the same time am worried about an unnecessary
compatibility break.  The parameter is not in "postgresql.conf" and is
documented as a "developer option", which should already be warning enough.

Perhaps some stronger wording in the documentation would be beneficial.
I have little sympathy with people who set unusual parameters without
even glancing at the documentation.

Yours,
Laurenz Albe




Re: PostgreSQL 15 beta 2 release announcement draft

2022-06-28 Thread Pantelis Theodosiou
> Upgrading to PostgreSQL 15 Beta 2
> -
>
> To upgrade to PostgreSQL 15 Beta 2 from an earlier version of PostgreSQL,
> you will need to use a strategy similar to upgrading between major versions of
> PostgreSQL (e.g. `pg_upgrade` or `pg_dump` / `pg_restore`). For more
> information, please visit the documentation section on
> [upgrading](https://www.postgresql.org/docs/15/static/upgrading.html).

Is the major version upgrade still needed if they are upgrading from 15 Beta 1?




Re: Hardening PostgreSQL via (optional) ban on local file system access

2022-06-28 Thread Laurenz Albe
On Tue, 2022-06-28 at 16:27 -0700, Andres Freund wrote:
> > Experience shows that 99% of the time one can run PostgreSQL just fine
> > without a superuser
> 
> IME that's not at all true. It might not be needed interactively, but that's
> not at all the same as not being needed at all.

I also disagree with that.  Not having a superuser is one of the pain
points with using a hosted database: no untrusted procedural languages,
no untrusted extensions (unless someone hacked up PostgreSQL or provided
a workaround akin to a SECURITY DEFINER function), etc.
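
As a small illustration of that pain point (a sketch; plpython3u stands in for
any untrusted extension or language):

CREATE EXTENSION plpython3u;
-- ERROR:  permission denied to create extension "plpython3u"
-- HINT:  Must be superuser to create this extension.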

Yours,
Laurenz Albe




Re: CREATE INDEX CONCURRENTLY on partitioned index

2022-06-28 Thread Alexander Pyhalov

Justin Pryzby wrote on 2022-06-28 21:33:

Hi,

On Thu, Feb 10, 2022 at 06:07:08PM +0300, Alexander Pyhalov wrote:
I've rebased patches and tried to fix issues I've seen. I've fixed a reference
after table_close() in the first patch (can be seen while building with
CPPFLAGS='-DRELCACHE_FORCE_RELEASE').


Thanks for finding that.

The patches other than 0001 are more experimental, and need someone to check if
it's even a good approach to use, so I kept them separate from the essential
patch.

Your latest 0005 patch (mark intermediate partitioned indexes as valid) is
probably fixing a bug in my SKIPVALID patch, right?  I'm not sure whether the
SKIPVALID patch should be merged into 0001, and I've been awaiting feedback on
the main patch before handling progress reporting.


Hi. I think it's more about fixing the ReindexPartitions-to-set-indisvalid
patch, as we should also mark intermediate indexes as valid when reindex
succeeds.


Sorry for not responding sooner.  The patch saw no activity for ~11 months so I
wasn't prepared to pick it up in March, at least not without guidance from a
committer.

Would you want to take over this patch?  I wrote it following someone's
question, but don't expect that I'd use the feature myself.  I can help review
it or try to clarify the organization of my existing patches (but still haven't
managed to work my way through your amendments to my patches).



Yes, I'm glad to work on the patches, as this is a very important feature
for us.

--
Best regards,
Alexander Pyhalov,
Postgres Professional




Re: PostgreSQL 15 beta 2 release announcement draft

2022-06-28 Thread Erik Rijkers

Op 29-06-2022 om 02:04 schreef Jonathan S. Katz:

Hi,



'not advise you to run PostgreSQL 15 Beta 1' should be
'not advise you to run PostgreSQL 15 Beta 2'


Erik




add test: pg_rowlocks extension

2022-06-28 Thread Dong Wook Lee
Hi Hackers,
I just wrote a test for the pg_rowlocks extension.
I added SQL and isolation spec tests for the row locking state.

---
Regards
DongWook Lee

diff --git a/contrib/pgrowlocks/Makefile b/contrib/pgrowlocks/Makefile
index 294c05dd0f..128f345b7e 100644
--- a/contrib/pgrowlocks/Makefile
+++ b/contrib/pgrowlocks/Makefile
@@ -9,6 +9,9 @@ EXTENSION = pgrowlocks
 DATA = pgrowlocks--1.2.sql pgrowlocks--1.1--1.2.sql pgrowlocks--1.0--1.1.sql
 PGFILEDESC = "pgrowlocks - display row locking information"
 
+REGRESS = pgrowlocks
+ISOLATION = show_rowlocks_info
+
 ifdef USE_PGXS
 PG_CONFIG = pg_config
 PGXS := $(shell $(PG_CONFIG) --pgxs)
diff --git a/contrib/pgrowlocks/expected/pgrowlocks.out b/contrib/pgrowlocks/expected/pgrowlocks.out
new file mode 100644
index 00..9daf600e50
--- /dev/null
+++ b/contrib/pgrowlocks/expected/pgrowlocks.out
@@ -0,0 +1,28 @@
+--
+-- Test pgrowlocks extension
+--
+CREATE EXTENSION pgrowlocks;
+-- set up
+CREATE TABLE test(c int);
+-- check permission to use pgrowlocks extension;
+CREATE ROLE regress_priv_user1 superuser login;
+CREATE ROLE regress_priv_user2;
+CREATE ROLE regress_priv_user3;
+SET SESSION AUTHORIZATION regress_priv_user1;
+SELECT locked_row, locker, multi, xids, modes FROM pgrowlocks('test'); -- ok
+ locked_row | locker | multi | xids | modes 
+++---+--+---
+(0 rows)
+
+SET SESSION AUTHORIZATION regress_priv_user2;
+SELECT locked_row, locker, multi, xids, modes FROM pgrowlocks('test'); -- fail
+ERROR:  permission denied for table test
+-- switch to superuser
+\c -
+GRANT SELECT ON test TO regress_priv_user3;
+SET SESSION AUTHORIZATION regress_priv_user3;
+SELECT locked_row, locker, multi, xids, modes FROM pgrowlocks('test'); -- ok
+ locked_row | locker | multi | xids | modes 
+++---+--+---
+(0 rows)
+
diff --git a/contrib/pgrowlocks/expected/show_rowlocks_info.out b/contrib/pgrowlocks/expected/show_rowlocks_info.out
new file mode 100644
index 00..7f06f55785
--- /dev/null
+++ b/contrib/pgrowlocks/expected/show_rowlocks_info.out
@@ -0,0 +1,17 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1b s1u1 s2_rowlocks s1c s2_rowlocks
+step s1b: BEGIN;
+step s1u1: UPDATE t1 SET c1 = 2 WHERE c1 = 1;
+step s2_rowlocks: select locked_row, multi, modes from pgrowlocks('t1');
+locked_row|multi|modes
+--+-+-
+(0,1) |f|{"No Key Update"}
+(1 row)
+
+step s1c: COMMIT;
+step s2_rowlocks: select locked_row, multi, modes from pgrowlocks('t1');
+locked_row|multi|modes
+--+-+-
+(0 rows)
+
diff --git a/contrib/pgrowlocks/output_iso/results/show_rowlocks_info.out b/contrib/pgrowlocks/output_iso/results/show_rowlocks_info.out
new file mode 100644
index 00..7f06f55785
--- /dev/null
+++ b/contrib/pgrowlocks/output_iso/results/show_rowlocks_info.out
@@ -0,0 +1,17 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1b s1u1 s2_rowlocks s1c s2_rowlocks
+step s1b: BEGIN;
+step s1u1: UPDATE t1 SET c1 = 2 WHERE c1 = 1;
+step s2_rowlocks: select locked_row, multi, modes from pgrowlocks('t1');
+locked_row|multi|modes
+--+-+-
+(0,1) |f|{"No Key Update"}
+(1 row)
+
+step s1c: COMMIT;
+step s2_rowlocks: select locked_row, multi, modes from pgrowlocks('t1');
+locked_row|multi|modes
+--+-+-
+(0 rows)
+
diff --git a/contrib/pgrowlocks/results/pgrowlocks.out b/contrib/pgrowlocks/results/pgrowlocks.out
new file mode 100644
index 00..9daf600e50
--- /dev/null
+++ b/contrib/pgrowlocks/results/pgrowlocks.out
@@ -0,0 +1,28 @@
+--
+-- Test pgrowlocks extension
+--
+CREATE EXTENSION pgrowlocks;
+-- set up
+CREATE TABLE test(c int);
+-- check permission to use pgrowlocks extension;
+CREATE ROLE regress_priv_user1 superuser login;
+CREATE ROLE regress_priv_user2;
+CREATE ROLE regress_priv_user3;
+SET SESSION AUTHORIZATION regress_priv_user1;
+SELECT locked_row, locker, multi, xids, modes FROM pgrowlocks('test'); -- ok
+ locked_row | locker | multi | xids | modes 
+++---+--+---
+(0 rows)
+
+SET SESSION AUTHORIZATION regress_priv_user2;
+SELECT locked_row, locker, multi, xids, modes FROM pgrowlocks('test'); -- fail
+ERROR:  permission denied for table test
+-- switch to superuser
+\c -
+GRANT SELECT ON test TO regress_priv_user3;
+SET SESSION AUTHORIZATION regress_priv_user3;
+SELECT locked_row, locker, multi, xids, modes FROM pgrowlocks('test'); -- ok
+ locked_row | locker | multi | xids | modes 
+++---+--+---
+(0 rows)
+
diff --git a/contrib/pgrowlocks/specs/show_rowlocks_info.spec b/contrib/pgrowlocks/specs/show_rowlocks_info.spec
new file mode 100644
index 00..0d5d629702
--- /dev/null
+++ b/contrib/pgrowlocks/specs/show_rowlocks_info.spec
@@ -0,0 +1,27 @@
+setup
+{   CREATE EXTENSION pgrowlocks;
+
+DROP TABLE IF EXISTS t1;
+CREATE TABLE t1(c1 integer);
+
+INSERT INTO t1 (c1) VALUES (1);
+}
+
+teardown
+{
+DRO

Add test of pg_prewarm extension

2022-06-28 Thread Dong Wook Lee
Hi hackers,
I wrote a test for the pg_prewarm extension with the aim of improving test
coverage. Feedback is always welcome.

---
Regards
DongWook Lee

diff --git a/contrib/pg_prewarm/Makefile b/contrib/pg_prewarm/Makefile
index b13ac3c813..617ac8e09b 100644
--- a/contrib/pg_prewarm/Makefile
+++ b/contrib/pg_prewarm/Makefile
@@ -10,6 +10,10 @@ EXTENSION = pg_prewarm
 DATA = pg_prewarm--1.1--1.2.sql pg_prewarm--1.1.sql pg_prewarm--1.0--1.1.sql
 PGFILEDESC = "pg_prewarm - preload relation data into system buffer cache"
 
+REGRESS = pg_prewarm
+
+TAP_TESTS = 1
+
 ifdef USE_PGXS
 PG_CONFIG = pg_config
 PGXS := $(shell $(PG_CONFIG) --pgxs)
diff --git a/contrib/pg_prewarm/expected/pg_prewarm.out b/contrib/pg_prewarm/expected/pg_prewarm.out
new file mode 100644
index 00..325e545cee
--- /dev/null
+++ b/contrib/pg_prewarm/expected/pg_prewarm.out
@@ -0,0 +1,35 @@
+--
+-- Test pg_prewarm extension
+--
+CREATE EXTENSION pg_prewarm;
+CREATE TABLE t1 (c1 integer);
+INSERT INTO t1 VALUES (1);
+INSERT INTO t1 VALUES (2);
+INSERT INTO t1 VALUES (3);
+--
+-- prefetch mode
+--
+SELECT pg_prewarm('t1', 'prefetch');
+ pg_prewarm 
+
+  1
+(1 row)
+
+--
+-- read mode
+--
+SELECT pg_prewarm('t1', 'read');
+ pg_prewarm 
+
+  1
+(1 row)
+
+--
+-- buffer_mode
+--
+SELECT pg_prewarm('t1', 'buffer');
+ pg_prewarm 
+
+  1
+(1 row)
+
diff --git a/contrib/pg_prewarm/sql/pg_prewarm.sql b/contrib/pg_prewarm/sql/pg_prewarm.sql
new file mode 100644
index 00..29646f5835
--- /dev/null
+++ b/contrib/pg_prewarm/sql/pg_prewarm.sql
@@ -0,0 +1,28 @@
+--
+-- Test pg_prewarm extension
+--
+
+CREATE EXTENSION pg_prewarm;
+
+CREATE TABLE t1 (c1 integer);
+
+INSERT INTO t1 VALUES (1);
+INSERT INTO t1 VALUES (2);
+INSERT INTO t1 VALUES (3);
+
+
+--
+-- prefetch mode
+--
+SELECT pg_prewarm('t1', 'prefetch');
+
+--
+-- read mode
+--
+SELECT pg_prewarm('t1', 'read');
+
+--
+-- buffer_mode
+--
+SELECT pg_prewarm('t1', 'buffer');
+
diff --git a/contrib/pg_prewarm/t/001_basic.pl b/contrib/pg_prewarm/t/001_basic.pl
new file mode 100644
index 00..3fcb438d07
--- /dev/null
+++ b/contrib/pg_prewarm/t/001_basic.pl
@@ -0,0 +1,34 @@
+
+# Copyright (c) 2021-2022, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+
+my $node = PostgreSQL::Test::Cluster->new('main');
+
+$node->init;
+$node->append_conf('postgresql.conf',
+qq{shared_preload_libraries = 'pg_prewarm'
+pg_prewarm.autoprewarm = true
+pg_prewarm.autoprewarm_interval = 0});
+$node->start;
+$node->stop;
+$node->append_conf('postgresql.conf',
+qq{shared_preload_libraries = 'pg_prewarm'
+pg_prewarm.autoprewarm = true
+pg_prewarm.autoprewarm_interval = 1s});
+
+$node->restart;
+
+my $logfile = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+
+ok ($logfile =~
+qr/autoprewarm successfully prewarmed 0 of 1 previously-loaded blocks/);
+
+done_testing();
+


Re: Export log_line_prefix(); useful for emit_log_hook.

2022-06-28 Thread Jeff Davis
On Wed, 2022-06-29 at 10:17 +0900, Michael Paquier wrote:
> On Tue, Jun 28, 2022 at 11:52:56AM -0700, Jeff Davis wrote:
> > Patch attached. Some kinds of emit log hooks might find it useful
> > to
> > also compute the log_line_prefix.
> 
> Have you played with anything specific that would require that?  I
> am fine to expose this routine, being mostly curious about what kind
> of recent format implemented with the elog hook would use it.

Just a slightly different format that is directly digestible by another
system, while still preserving what the original messages in the file
would look like.

There are other ways to do it, but it's convenient. If we use, e.g.,
csv or json format, we lose the log_line_prefix and would need to
regenerate it from the individual fields.

Regards,
Jeff Davis






Re: Hardening PostgreSQL via (optional) ban on local file system access

2022-06-28 Thread Jeff Davis
On Tue, 2022-06-28 at 23:18 +0200, Hannu Krosing wrote:
> I was not after *completely* removing it, but just having an option
> which makes the superuser() function always return false.

Did you test that? I'm guessing that would cause lots of problems,
e.g., installing extensions.

Regards,
Jeff Davis






Re: Allowing REINDEX to have an optional name

2022-06-28 Thread Michael Paquier
On Tue, Jun 28, 2022 at 11:02:25AM +0100, Simon Riggs wrote:
> Attached patch is tested, documented and imho ready to be committed,
> so I will mark it so in CFapp.

The behavior introduced by this patch should be reflected in
reindexdb.  See in particular reindex_one_database(), where a
REINDEX_SYSTEM is enforced first on the catalogs for the
non-concurrent mode when running the reindex on a database.

+-- unqualified reindex database
+-- if you want to test REINDEX DATABASE, uncomment the following line,
+-- but note that this adds about 0.5s to the regression tests and the
+-- results are volatile when run in parallel to other tasks. Note also
+-- that REINDEX SYSTEM is specifically not tested because it can deadlock.
+-- REINDEX (VERBOSE) DATABASE;

No need to add that IMHO.

 REINDEX [ ( option [,
 ...] ) ] { INDEX | TABLE | SCHEMA | DATABASE | SYSTEM } [
 CONCURRENTLY ] name
+REINDEX [ ( option [,
 ...] ) ] { DATABASE | SYSTEM } [ CONCURRENTLY ] [ name ]

Shouldn't you remove DATABASE and SYSTEM from the first line, keeping
only INDEX, TABLE and SCHEMA?  The second line, with its optional
"name", would cover the DATABASE and SYSTEM cases at 100%.

-if (strcmp(objectName, get_database_name(objectOid)) != 0)
+if (objectName && strcmp(objectName, get_database_name(objectOid)) != 
0)
 ereport(ERROR,
 (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
  errmsg("can only reindex the currently open database")));
 if (!pg_database_ownercheck(objectOid, GetUserId()))
 aclcheck_error(ACLCHECK_NOT_OWNER, OBJECT_DATABASE,
-   objectName);
+   get_database_name(objectOid));

This could call get_database_name() just once.

+* You might think it would be good to include catalogs,
+* but doing that can deadlock, so isn't much use in real world,
+* nor can we safely test that it even works.

Not sure what you mean here exactly.
--
Michael


signature.asc
Description: PGP signature


Re: pg_upgrade allows itself to be run twice

2022-06-28 Thread Michael Paquier
On Sat, Jun 25, 2022 at 11:04:37AM -0500, Justin Pryzby wrote:
> I expect pg_upgrade to fail if I run it twice in a row.

Yep.

> It would be reasonable if it were to fail during the "--check" phase,
> maybe by failing like this:
> | New cluster database "..." is not empty: found relation "..."

So, we get a complaint that the new cluster is not empty after one
pg_upgrade run with a new command of pg_upgrade, with or without
--check.  This happens in check_new_cluster(), where we'd fatal if a
relation uses a namespace different than pg_catalog.

> But that fails to happen if the cluster has neither tables nor matviews, in
> which case, it passes the check phase and then fails like this:

Indeed, as of get_rel_infos().  

> I'll concede that a cluster which has no tables sounds a lot like a toy, but I
> find it disturbing that nothing prevents running the process twice, up to the
> point that it's evidently corrupted the catalog.

I have heard of cases where instances were only used with a set of
foreign tables, for example.  Not sure that this is widespread enough to
worry about, but this would fail as much as your case.

> While looking at this, I noticed that starting postgres --single immediately
> after initdb allows creating objects with OIDs below FirstNormalObjectId
> (thereby defeating pg_upgrade's check).  I'm not familiar with the behavioral
> differences of single user mode, and couldn't find anything in the
> documentation.

This one comes from NextOID in the control data file after a fresh
initdb, and GetNewObjectId() would enforce that in a postmaster
environment to be FirstNormalObjectId when assigning the first user
OID.  Would you imply an extra step at the end of initdb to update the
control data file of the new cluster to reflect FirstNormalObjectId?
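
(As a side note, a rough way to spot user objects created below that boundary,
assuming FirstNormalObjectId's value of 16384 from the sources:)

SELECT oid, relname, relnamespace::regnamespace
FROM pg_class
WHERE oid < 16384
  AND relnamespace NOT IN ('pg_catalog'::regnamespace,
                           'information_schema'::regnamespace,
                           'pg_toast'::regnamespace);
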
--
Michael


signature.asc
Description: PGP signature


Re: margay fails assertion in stats/dsa/dsm code

2022-06-28 Thread Thomas Munro
On Wed, Jun 29, 2022 at 6:04 AM Robert Haas  wrote:
> My first thought was that the return value of the call to
> dsm_impl_op() at the end of dsm_attach() is not checked and that maybe
> it was returning NULL, but it seems like whoever wrote
> dsm_impl_posix() was pretty careful to ereport(elevel, ...) in every
> failure path, and elevel is ERROR here, so I don't see any issue. My
> second thought was that maybe control had escaped from dsm_attach()
> due to an error before we got to the step where we actually map the
> segment, but then the dsm_segment * would be returned to the caller.
> Maybe they could retrieve it later using dsm_find_mapping(), but that
> function has no callers in core.

Thanks for looking.  Yeah.  I also read through that code many times
and drew the same conclusion.

> So I'm kind of stumped too, but did you by any chance check whether
> there are any DSM-related messages in the logs before the assertion
> failure?

Marcel kindly granted me access to his test machine, where the failure
can be reproduced by running make check lots of times.  I eventually
figured out that the problem control flow is ... of course ... the one
path that doesn't ereport(), and that's when errno == EEXIST.  That is
a path that is intended to handle DSM_OP_CREATE.  Here we are handling
DSM_OP_ATTACH, and I have verified that we're passing in just O_RDWR.
EEXIST is a nonsensical error for shm_open() without flags containing
O_CREAT | O_EXCL (according to POSIX and Solaris's man page).

On this OS, shm_open() opens plain files in /tmp (normally a RAM disk,
so kinda like /dev/shm on Linux), that much I can tell with a plain
old "ls" command.  We can also read its long lost open source cousin
(which may be completely different for all I know, but I'd doubt it):

https://github.com/illumos/illumos-gate/blob/master/usr/src/lib/libc/port/rt/shm.c
https://github.com/illumos/illumos-gate/blob/master/usr/src/lib/libc/port/rt/pos4obj.c

Erm.  It looks like __pos4obj_lock() could possibly return -1 and
leave errno == EEXIST, if it runs out of retries?  Then shm_open()
would return -1, and we'd blow up.  However, for that to happen, one
of those "SHM_LOCK_TYPE" files would have to linger for 64 sleep
loops, and I'm not sure why that'd happen, or what to do about it.  (I
don't immediately grok what that lock file is even for.)

I suppose this could indicate that the machine and/or RAM disk is
overloaded/swapping and one of those open() or unlink() calls is
taking a really long time, and that could be fixed with some system
tuning.  I suppose it's also remotely possible that the process is
getting peppered with signals so that funky shell script-style locking
scheme is interrupted and doesn't really wait very long.  Or maybe I
guessed wrong and some other closed source path is to blame *shrug*.

As for whether PostgreSQL needs to do anything, perhaps we should
ereport for this unexpected error as a matter of self-preservation, to
avoid the NULL dereference you'd presumably get on a non-cassert build
with the current coding?  Maybe just:

-   if (errno != EEXIST)
+   if (op == DSM_OP_ATTACH || errno != EEXIST)
ereport(elevel,
(errcode_for_dynamic_shared_memory(),
 errmsg("could not open shared
memory segment \"%s\": %m",

margay would probably still fail until that underlying problem is
addressed, but less mysteriously on our side at least.




Can we do something to help stop users mistakenly using force_parallel_mode?

2022-06-28 Thread David Rowley
Over on [1] I noticed that the user had set force_parallel_mode to
"on" in the hope that would trick the planner into making their query
run more quickly.  Of course, that's not what they want since that GUC
is only there to inject some parallel nodes into the plan in order to
verify the tuple communication works.

I get the idea that Robert might have copped some flak about this at
some point, given that he wrote the blog post at [2].

The user would have realised this if they'd read the documentation
about the GUC. However, I imagine they only went as far as finding a
GUC with a name which appears to be exactly what they need.  I mean,
what else could force_parallel_mode possibly do?

Should we maybe rename it to something less tempting? Maybe
debug_parallel_query?

I wonder if \dconfig *parallel* is going to make force_parallel_mode
even easier to find once PG15 is out.

David

[1] 
https://www.postgresql.org/message-id/DB4PR02MB8774E06D595D3088BE04ED92E7B99%40DB4PR02MB8774.eurprd02.prod.outlook.com
[2] 
https://www.enterprisedb.com/postgres-tutorials/using-forceparallelmode-correctly-postgresql
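
(For context, what the GUC really does is wrap the cheapest serial plan in a
Gather node so that leader/worker tuple transfer gets exercised; it does not
make the plan itself run in parallel.  A rough sketch:)

SET force_parallel_mode = on;
EXPLAIN (COSTS OFF) SELECT count(*) FROM pg_class;
-- Approximate plan shape:
--  Gather
--    Workers Planned: 1
--    Single Copy: true
--    ->  Aggregate
--          ->  Seq Scan on pg_class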




Re: Support logical replication of DDLs

2022-06-28 Thread Amit Kapila
On Tue, Jun 28, 2022 at 5:43 PM Amit Kapila  wrote:
>
> 5.
> +static ObjTree *
> +deparse_CreateStmt(Oid objectId, Node *parsetree)
> {
> ...
> + tmp = new_objtree_VA("TABLESPACE %{tablespace}I", 0);
> + if (node->tablespacename)
> + append_string_object(tmp, "tablespace", node->tablespacename);
> + else
> + {
> + append_null_object(tmp, "tablespace");
> + append_bool_object(tmp, "present", false);
> + }
> + append_object_object(createStmt, "tablespace", tmp);
> ...
> }
>
> Why do we need to append the objects (tablespace, with clause, etc.)
> when they are not present in the actual CREATE TABLE statement? The
> reason to ask this is that this makes the string that we want to send
> downstream much longer than the actual statement given by the user on
> the publisher.
>

After thinking some more on this, it seems the user may want to
optionally change some of these attributes, for example, on the
subscriber, it may want to associate the table with a different
tablespace. I think to address that we can append these additional
attributes optionally, say via an additional parameter
(append_all_options/append_all_attributes or something like that) in
exposed APIs like deparse_utility_command().

-- 
With Regards,
Amit Kapila.




Re: better error description on logical replication

2022-06-28 Thread Amit Kapila
On Tue, Jun 28, 2022 at 5:50 PM Marcos Pegoraro  wrote:
>
> I don't know how to create a patch, maybe someday, but for now I'm just
> sending this little problem if somebody can solve it.
>
> In a multi-schema environment where several tables have the same structure,
> it is a little bit hard to know which one already has that primary key.
>
> In the log I now see, on the replica server:
> Message: duplicate key value violates unique constraint "pkcustomer"
> Detail:  Key (customer_id)=(530540) already exists.
>
> So, I know which table it is, but I don't know which schema it belongs to.
>

On which version, have you tried this? In HEAD, I am getting below information:
ERROR:  duplicate key value violates unique constraint "idx_t1"
DETAIL:  Key (c1)=(1) already exists.
CONTEXT:  processing remote data for replication origin "pg_16388"
during "INSERT" for replication target relation "public.t1" in
transaction 739 finished at 0/150D640

You can see that CONTEXT has schema information. Will that serve your purpose?

-- 
With Regards,
Amit Kapila.




Re: Fix proposal for comparison bugs in PostgreSQL::Version

2022-06-28 Thread Michael Paquier
On Tue, Jun 28, 2022 at 06:17:40PM -0400, Andrew Dunstan wrote:
> Nice catch, but this looks like massive overkill. I think we can very
> simply fix the test in just a few lines of code, instead of a 190 line
> fix and a 130 line TAP test.
> 
> It was never intended to be able to compare markers like rc1 vs rc2, and
> I don't see any need for it. If you can show me a sane use case I'll
> have another look, but right now it seems quite unnecessary.
> 
> Here's my proposed fix.

Do you think that we should add some tests for that?  One place that
comes into mind is test_misc/, and this would be cheap as this does
not require setting up a node or such.
--
Michael


signature.asc
Description: PGP signature


Re: Export log_line_prefix(); useful for emit_log_hook.

2022-06-28 Thread Michael Paquier
On Tue, Jun 28, 2022 at 11:52:56AM -0700, Jeff Davis wrote:
> Patch attached. Some kinds of emit log hooks might find it useful to
> also compute the log_line_prefix.

Have you played with anything specific that would require that?  I
am fine to expose this routine, being mostly curious about what kind
of recent format implemented with the elog hook would use it.
--
Michael


signature.asc
Description: PGP signature


tuplesort Generation memory contexts don't play nicely with index builds

2022-06-28 Thread David Rowley
Hackers,

I noticed while doing some memory context related work that since we
now use generation.c memory contexts for tuplesorts (40af10b57) that
tuplesort_putindextuplevalues() causes memory "leaks" in the
generation context due to index_form_tuple() being called while we're
switched into the state->tuplecontext.

I use the word "leak" here slightly loosely.  It's only a leak due to
how generation.c uses no free lists to allow reuse of pfree'd memory.

It looks like the code has been this way ever since 9f03ca915 (Avoid
copying index tuples when building an index.) That commit did add a
big warning at the top of index_form_tuple() that the function must be
careful to not leak any memory.

A quick fix would be just to add a new bool field, e.g., usegencxt, to
tuplesort_begin_common() and pass that as false in all functions apart
from tuplesort_begin_heap(). That way we'll always be using an aset.c
context when we call index_form_tuple().

However, part of me thinks that 9f03ca915 is standing in the way of us
doing more in the future to optimize how we store tuples during sorts.
We might, one day, want to consider using a hand-rolled bump
allocator. If we ever do that we'd need to undo the work done by
9f03ca915.

Does anyone have any thoughts on this?

Here's a reproducer from the regression tests:

CREATE TABLE no_index_cleanup (i INT PRIMARY KEY, t TEXT);
-- Use uncompressed data stored in toast.
CREATE INDEX no_index_cleanup_idx ON no_index_cleanup(t);
ALTER TABLE no_index_cleanup ALTER COLUMN t SET STORAGE EXTERNAL;
INSERT INTO no_index_cleanup(i, t) VALUES (generate_series(1,30),
repeat('1234567890',269));
-- index cleanup option is ignored if VACUUM FULL
VACUUM (INDEX_CLEANUP TRUE, FULL TRUE) no_index_cleanup;

David




Re: First draft of the PG 15 release notes

2022-06-28 Thread Peter Geoghegan
On Tue, Jun 28, 2022 at 1:35 PM Bruce Momjian  wrote:
> Okay, text updated, thanks.  Applied patch attached.

I have some notes on these items:

1. "Allow vacuum to be more aggressive in setting the oldest frozenxid
(Peter Geoghegan)"

2. "Add additional information to VACUUM VERBOSE and autovacuum
logging messages (Peter Geoghegan)"

The main enhancement to VACUUM for Postgres 15 was item 1, which
taught VACUUM to dynamically track the oldest remaining XID (and the
oldest remaining MXID) that will remain in the table at the end of the
same VACUUM operation. These final/oldest XID/MXID values are what we
now use to set relfrozenxid and relminmxid in pg_class. Previously we
just set relfrozenxid/relminmxid to whatever XID/MXID value was used
to determine which XIDs/MXIDs needed to be frozen. These values often
indicated more about VACUUM implementation details (like the
vacuum_freeze_min_age GUC's value) than the actual contents of the
table at the end of the most recent VACUUM.

It might be worth explaining the shift directly in the release notes.
The new approach is simpler and makes a lot more sense -- why should
the relfrozenxid be closely tied to freezing? We don't necessarily
have to freeze any tuple to advance relfrozenxid right up to the
removable cutoff/OldestXmin used by VACUUM. For example,
anti-wraparound VACUUMs that run against static tables now set
relfrozenxid/relminmxid to VACUUM's removable cutoff/OldestXmin
directly, without freezing anything (after the first time). Same with
tables that happen to have every row deleted -- only the actual
unfrozen XIDs/MXIDs left in the table matter, and if there happen to
be none at all then we can use the same relfrozenxid as we would for
a CREATE TABLE. All depends on what the workload allows.

There will also be a real practical benefit for users that allocate a
lot of MultiXactIds: We'll now have pg_class.relminmxid values that
are much more reliable indicators of what is really going on in the
table, MultiXactId-wise. I expect that this will make it much less
likely that anti-wraparound VACUUMs will run needlessly against the
largest tables, where there probably wasn't ever one single
MultiXactId. In other words, the implementation will have more
accurate information at the level of each table, and won't .

I think that very uneven consumption of MultiXactIds at the table
level is probably common in real databases. Plus VACUUM can usually
remove a non-running MultiXact from a tuple's xmax, regardless of
whether or not the mxid happens to be before the
vacuum_multixact_freeze_min_age-based MXID cutoff -- VACUUM has
always just set xmax to InvalidXid in passing when it's possible to do so
easily. MultiXacts are inherently pretty short-lived information about
row lockers at a point in time. We don't really need to keep them
around for very long. We may now be able to truncate the two MultiXact
related SLRUs much more frequently with some workloads.

Finally, note that the new VACUUM VERBOSE output (which is now pretty
much the same as the autovacuum log output) shows when and how
relfrozenxid/relminmxid have advanced. This should make it relatively
easy to observe these effects where they exist.
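
A quick sketch of observing that (table and column names are only
illustrative):

CREATE TABLE static_tbl AS SELECT g AS id FROM generate_series(1, 1000) g;
VACUUM (VERBOSE) static_tbl;
-- The per-table VERBOSE output now includes a line roughly like
--   new relfrozenxid: ..., which is N XIDs ahead of previous value
-- and pg_class.relfrozenxid can advance even when nothing needed freezing:
SELECT relfrozenxid, relminmxid FROM pg_class WHERE relname = 'static_tbl';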

Thanks

--
Peter Geoghegan




Re: Logging query parmeters in auto_explain

2022-06-28 Thread Michael Paquier
On Tue, Jun 28, 2022 at 10:12:18AM +0100, Dagfinn Ilmari Mannsåker wrote:
> That makes sense, but I still think the query_log() function definition
> should go at the end (after done_testing()), so the machinery doesn't
> distract from what's actually being tested.

The majority of TAP scripts have their subroutines at the beginning of
the script, and only a few have them at the end.  I won't fight
you on that, but the former is more consistent.

> Also, the second paragraph of the second commit message now belongs in
> the first commit (without the word "Also").

Yep, will fix.  I usually rewrite commit messages before merging them.
--
Michael


signature.asc
Description: PGP signature


PostgreSQL 15 beta 2 release announcement draft

2022-06-28 Thread Jonathan S. Katz

Hi,

Attached is a draft of the release announcement for PostgreSQL 15 Beta 2.
Please provide feedback on technical accuracy and on any glaring omissions.


Please provide any feedback prior to 2022-06-22 0:00 AoE.

Thanks,

Jonathan
The PostgreSQL Global Development Group announces that the second beta release 
of
PostgreSQL 15 is now [available for 
download](https://www.postgresql.org/download/).
This release contains previews of all features that will be available when
PostgreSQL 15 is made generally available, though some details of the release
can change during the beta period.

You can find information about all of the features and changes found in
PostgreSQL 15 in the [release 
notes](https://www.postgresql.org/docs/15/release-15.html):

  
[https://www.postgresql.org/docs/15/release-15.html](https://www.postgresql.org/docs/15/release-15.html)

In the spirit of the open source PostgreSQL community, we strongly encourage you
to test the new features of PostgreSQL 15 on your systems to help us eliminate
bugs or other issues that may exist. While we do not advise you to run
PostgreSQL 15 Beta 1 in production environments, we encourage you to find ways
to run your typical application workloads against this beta release.

Your testing and feedback will help the community ensure that the PostgreSQL 15
release upholds our standards of delivering a stable, reliable release of the
world's most advanced open source relational database. Please read more about
our [beta testing process](https://www.postgresql.org/developer/beta/) and how
you can contribute:

  
[https://www.postgresql.org/developer/beta/](https://www.postgresql.org/developer/beta/)

Upgrading to PostgreSQL 15 Beta 2
-

To upgrade to PostgreSQL 15 Beta 2 from an earlier version of PostgreSQL,
you will need to use a strategy similar to upgrading between major versions of
PostgreSQL (e.g. `pg_upgrade` or `pg_dump` / `pg_restore`). For more
information, please visit the documentation section on
[upgrading](https://www.postgresql.org/docs/15/static/upgrading.html).

Changes Since Beta 1


There have been many bug fixes for PostgreSQL 15 reported during the Beta 1
period and applied to the Beta 2 release. This includes:

* `JSON_TABLE` output columns now use the collations of their data type.
* `pg_publication_tables` now provides information on column lists and row 
filters.
* Prohibit combining publications with different column lists.
* `string` is now an unreserved keyword.
* Several fixes for the output of `EXPLAIN MERGE`.
* Multiple fixes for `COPY .. WITH (HEADER MATCH)`.
* Revert ignoring HOT updates for BRIN indexes.
* Internal fix for amcheck.
* Fix for `psql` to show `NOTICE` statements immediately, not at the end of a
transaction.
* Fix `\timing` in `psql` so that it will still return a time even if there is
an error.
* The `\dconfig` command in `psql` reduces the number of default settings
displayed when used without any arguments.
* Fix for `pg_upgrade` to improve its idempotence.
* Fix check in `pg_upgrade` for ICU collations.
* Allow `--partitions=0` to work with `pgbench`.

Please see the [release 
notes](https://www.postgresql.org/docs/15/release-15.html)
for a complete list of new and changed features:

  
[https://www.postgresql.org/docs/15/release-15.html](https://www.postgresql.org/docs/15/release-15.html)

Testing for Bugs & Compatibility


The stability of each PostgreSQL release greatly depends on you, the community,
to test the upcoming version with your workloads and testing tools in order to
find bugs and regressions before the general availability of PostgreSQL 15. As
this is a Beta, changes to database behaviors, feature details, and APIs are
still possible. Your feedback and testing will help determine the final tweaks
on the new features, so please test in the near future. The quality of user
testing helps determine when we can make a final release.

A list of [open 
issues](https://wiki.postgresql.org/wiki/PostgreSQL_15_Open_Items)
is publicly available in the PostgreSQL wiki.  You can
[report bugs](https://www.postgresql.org/account/submitbug/) using this form on
the PostgreSQL website:

  
[https://www.postgresql.org/account/submitbug/](https://www.postgresql.org/account/submitbug/)

Beta Schedule
-

This is the second beta release of version 15. The PostgreSQL Project will
release additional betas as required for testing, followed by one or more
release candidates, until the final release in late 2022. For further
information please see the [Beta 
Testing](https://www.postgresql.org/developer/beta/) page.

Links
-

* [Download](https://www.postgresql.org/download/)
* [Beta Testing Information](https://www.postgresql.org/developer/beta/)
* [PostgreSQL 15 Beta Release 
Notes](https://www.postgresql.org/docs/15/release-15.html)
* [PostgreSQL 15 Open 
Issues](https://wiki.postgresql.org/wiki/PostgreSQL_15_Open

Re: PG 15 (and to a smaller degree 14) regression due to ExprEvalStep size

2022-06-28 Thread David Rowley
On Sat, 18 Jun 2022 at 08:06, Andres Freund  wrote:
> I also attached my heavily-WIP patches for the ExprEvalStep issues, I
> accidentally had only included a small part of the contents of the json fix.

I've now looked at the 0003 patch.  I like the idea you have about
moving some of the additional fields into ScalarArrayOpExprHashTable.
I think the patch can even go a little further and move the hash_finfo
into there too. This means we don't need to dereference the "op" in
saop_element_hash().

To make this work, I did need to tag the ScalarArrayOpExpr into the
ExprEvalStep.  That's required now since some of the initialization of
the hash function fields is delayed until
ExecEvalHashedScalarArrayOp().  We need to know the
ScalarArrayOpExpr's hashfuncid and inputcollid.

Your v2 patch did shift off some of this initialization work to
ExecEvalHashedScalarArrayOp().  The attached v3 takes that a bit
further.  This saves a bit more work for ScalarArrayOpExprs that are
evaluated 0 times.

Another small thing which I considered doing was to put the
hash_fcinfo_data field as the final field in
ScalarArrayOpExprHashTable so that we could allocate the memory for
the hash_fcinfo_data in the same allocation as the
ScalarArrayOpExprHashTable.  This would reduce the pointer
dereferencing done in saop_element_hash() a bit further.  I just
didn't notice anywhere else where we do that for FunctionCallInfo, so
I resisted doing this.

(There was also a small bug in your patch where you mistakenly cast to
an OpExpr instead of ScalarArrayOpExpr when you were fetching the
inputcollid)

David
From 7a1e71a6b622eff72537ae86187b312d01a5e11d Mon Sep 17 00:00:00 2001
From: David Rowley 
Date: Wed, 29 Jun 2022 10:43:41 +1200
Subject: [PATCH v3] Remove size increase in ExprEvalStep caused by hashed
 saops

50e17ad28 increased the size of ExprEvalStep from 64 bytes up to 88 bytes.
Lots of effort was spent during the development of the current expression
evaluation code to make an instance of this struct fit on a single cache
line.

Here we remove the fn_addr field. The values from this field can be
obtained via fcinfo, just with some extra pointer dereferencing.  The
extra indirection does not seem to cause any noticeable slowdowns.

Various other fields have been been moved into the
ScalarArrayOpExprHashTable struct. These fields are only used when the
ScalarArrayOpExprHashTable pointer has already been dereferenced, so no
additional pointer dereferences occur for these.

Author: Andres Freund
Discussion: 
https://postgr.es/m/20220616233130.rparivafipt6d...@alap3.anarazel.de
---
 src/backend/executor/execExpr.c   | 19 +--
 src/backend/executor/execExprInterp.c | 25 +
 src/include/executor/execExpr.h   |  7 +--
 3 files changed, 23 insertions(+), 28 deletions(-)

diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 2831e7978b..2e75f3204d 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -1203,8 +1203,6 @@ ExecInitExprRec(Expr *node, ExprState *state,
FmgrInfo   *finfo;
FunctionCallInfo fcinfo;
AclResult   aclresult;
-   FmgrInfo   *hash_finfo;
-   FunctionCallInfo hash_fcinfo;
Oid cmpfuncid;
 
/*
@@ -1262,18 +1260,6 @@ ExecInitExprRec(Expr *node, ExprState *state,
 */
if (OidIsValid(opexpr->hashfuncid))
{
-   hash_finfo = palloc0(sizeof(FmgrInfo));
-   hash_fcinfo = 
palloc0(SizeForFunctionCallInfo(1));
-   fmgr_info(opexpr->hashfuncid, 
hash_finfo);
-   fmgr_info_set_expr((Node *) node, 
hash_finfo);
-   InitFunctionCallInfoData(*hash_fcinfo, 
hash_finfo,
-   
 1, opexpr->inputcollid, NULL,
-   
 NULL);
-
-   
scratch.d.hashedscalararrayop.hash_finfo = hash_finfo;
-   
scratch.d.hashedscalararrayop.hash_fcinfo_data = hash_fcinfo;
-   
scratch.d.hashedscalararrayop.hash_fn_addr = hash_finfo->fn_addr;
-
/* Evaluate scalar directly into left 
function argument */
ExecInitExprRec(scalararg, state,

&fcinfo->args[0].value, &fcinfo->args[0].isnull);
@@ -1292,11 +1278,8 @@ ExecInitExprRec(Expr *

Re: Hardening PostgreSQL via (optional) ban on local file system access

2022-06-28 Thread Andres Freund
Hi,

On 2022-06-27 23:36:53 +0200, Hannu Krosing wrote:
> My current thinking is (based on more insights from Andres) that we
> should also have a startup flag to disable superuser altogether to
> avoid bypasses via direct manipulation of pg_proc.

To me that makes no sense whatsoever. You're not going to be able to create
extensions etc anymore.


> Experience shows that 99% of the time one can run PostgreSQL just fine
> without a superuser

IME that's not at all true. It might not be needed interactively, but that's
not at all the same as not being needed at all.


IMO this whole thread is largely poking at the wrong side of the issue. A
superuser is a superuser is a superuser. There are reasons superusers exist,
because lots of operations are fundamentally not safe. IMO removing superuser
or making superuser not be a superuser is a fool's errand - time is much
better spent reducing the number of tasks that need superuser.

Greetings,

Andres Freund




doc: BRIN indexes and autosummarize

2022-06-28 Thread Roberto Mello
Here's a patch to clarify the BRIN indexes documentation, particularly with
regard to autosummarize, vacuum and autovacuum. It breaks down a big blob of a
paragraph into multiple paragraphs for clarity, and explicitly explains how
summarization happens manually or automatically.

I also added cross-references to various relevant sections, including the
CREATE INDEX page.
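
As a quick recap of the behaviour the new text spells out (a sketch; the table
and index names are made up):

CREATE INDEX brin_ts_idx ON events USING brin (created_at)
  WITH (autosummarize = on);                      -- summarize via autovacuum work items
SELECT brin_summarize_new_values('brin_ts_idx');  -- summarize on demand
VACUUM events;                                    -- vacuum also summarizes new ranges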

On this topic... I'm not familiar with the internals of BRIN indexes, and in
backend/access/common/reloptions.c I see:

{
"autosummarize",
"Enables automatic summarization on this BRIN index",
RELOPT_KIND_BRIN,
AccessExclusiveLock
},

Is the exclusive lock on the index why autosummarize is off by default?

What would be the downside (if any) of having autosummarize=on by default?

Roberto

--
Crunchy Data - passion for open source PostgreSQL


brin-autosummarize-docs.patch
Description: Binary data


Re: Fix proposal for comparison bugs in PostgreSQL::Version

2022-06-28 Thread Andrew Dunstan

On 2022-06-28 Tu 16:53, Jehan-Guillaume de Rorthais wrote:
> Hi,
>
> I found a comparison bug when using the PostgreSQL::Version module. See:
>
>   $ perl -I. -MPostgreSQL::Version -le '
> my $v = PostgreSQL::Version->new("9.6");
>   
> print "not 9.6 > 9.0" unless $v >  9.0;
> print "not 9.6 < 9.0" unless $v <  9.0;
> print "9.6 <= 9.0"if $v <= 9.0;
> print "9.6 >= 9.0"if $v >= 9.0;'
>   not 9.6 > 9.0
>   not 9.6 < 9.0
>   9.6 <= 9.0
>   9.6 >= 9.0
>
> When using < or >, 9.6 is neither greater nor less than 9.0.
> When using <= or >=, 9.6 is both less than or equal to and greater than or
> equal to 9.0.
> The bug does not show up if you compare with "9.0" instead of 9.0.
> This bug is triggered with devel versions, eg. 14beta1 <=> 14.
>
> The bug appears when both objects have a different number of digits in the
> internal array representation:
>
>   $ perl -I. -MPostgreSQL::Version -MData::Dumper -le '
> print Dumper(PostgreSQL::Version->new("9.0")->{num});
> print Dumper(PostgreSQL::Version->new(9.0)->{num});
> print Dumper(PostgreSQL::Version->new(14)->{num});
> print Dumper(PostgreSQL::Version->new("14beta1")->{num});'
>   $VAR1 = [ '9', '0' ];
>   $VAR1 = [ '9' ];
>   $VAR1 = [ '14' ];
>   $VAR1 = [ '14', -1 ];
>
> Because of this, the following loop in "_version_cmp" is wrong because we are
> comparing two versions with different sizes of the 'num' array:
>
>   for (my $idx = 0;; $idx++)
>   {
>   return 0 unless (defined $an->[$idx] && defined $bn->[$idx]);
>   return $an->[$idx] <=> $bn->[$idx]
> if ($an->[$idx] <=> $bn->[$idx]);
>   }
>
>
> If we want to keep this internal array representation, the only fix I can 
> think
> of would be to always use a 4 element array defaulted to 0. Previous examples
> would be:
>
>   $VAR1 = [ 9, 0, 0, 0 ];
>   $VAR1 = [ 9, 0, 0, 0 ];
>   $VAR1 = [ 14, 0, 0, 0 ];
>   $VAR1 = [ 14, 0, 0, -1 ];
>
> A better fix would be to store the version internally as version_num that are
> trivial to compute and compare. Please, find in attachment an implementation 
> of
> this.
>
> The patch is a bit bigger because it improved the devel version to support
> rc/beta/alpha comparison like 14rc2 > 14rc1.
>
> Moreover, it adds a bunch of TAP tests to check various use cases.


Nice catch, but this looks like massive overkill. I think we can very
simply fix the test in just a few lines of code, instead of a 190 line
fix and a 130 line TAP test.

It was never intended to be able to compare markers like rc1 vs rc2, and
I don't see any need for it. If you can show me a sane use case I'll
have another look, but right now it seems quite unnecessary.

Here's my proposed fix.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com
diff --git a/src/test/perl/PostgreSQL/Version.pm b/src/test/perl/PostgreSQL/Version.pm
index 8f70491189..8d4dbbf694 100644
--- a/src/test/perl/PostgreSQL/Version.pm
+++ b/src/test/perl/PostgreSQL/Version.pm
@@ -123,9 +123,12 @@ sub _version_cmp
 
 	for (my $idx = 0;; $idx++)
 	{
-		return 0 unless (defined $an->[$idx] && defined $bn->[$idx]);
-		return $an->[$idx] <=> $bn->[$idx]
-		  if ($an->[$idx] <=> $bn->[$idx]);
+		return 0
+		  if ($idx >= @$an && $idx >= @$bn);
+		# treat a missing number as 0
+		my ($anum, $bnum) = ($an->[$idx] || 0, $bn->[$idx] || 0);
+		return $anum <=> $bnum
+		  if ($anum <=> $bnum);
 	}
 }
 


Re: Hardening PostgreSQL via (optional) ban on local file system access

2022-06-28 Thread Hannu Krosing
I was not after *completely* removing it, but just having an option
which makes the superuser() function always return false.

For known cases of needing a superuser there would be a way to enable
it, perhaps via a sentinel file or a pg_hba-like configuration file.

And as a first cut I would advocate for disabling SECURITY DEFINER
functions for simplicity and robustness of defense. A short term
solution for these would be to re-write them in C (or Go or Rust or
any other compiled language) as C functions have full access anyway.
But C functions fall in the same category as other defenses discussed
at the start of this thread - namely to use them, you already need
access to the file system.

Running production databases without superuser available is not as
impossible as you may think  - Cloud SQL version of PostgreSQL has
been in use with great success for years without exposing a real
superuser to end users (there are some places where
`cloudsqlsuperuser` gives you partial superuser'y abilities).

Letting the user turn off superuser access when no known need for it
exists (which is 99.9% of the time in most use cases) would improve secondary
defenses noticeably.

It would also be a good start toward figuring out the set of roles into
which one can decompose superuser access in the longer run.

--
Hannu


On Tue, Jun 28, 2022 at 8:30 PM Robert Haas  wrote:
>
> On Mon, Jun 27, 2022 at 5:37 PM Hannu Krosing  wrote:
> > My current thinking is (based on more insights from Andres) that we
> > should also have a startup flag to disable superuser altogether to
> > avoid bypasses via direct manipulation of pg_proc.
> >
> > Experience shows that 99% of the time one can run PostgreSQL just fine
> > without a superuser, so having a superuser available all the time is
> > kind of like leaving a loaded gun on the kitchen table because you
> > sometimes need to go hunting.
> >
> > I am especially waiting for Andres' feedback on viability this approach.
>
> Well, I'm not Andres but I don't think not having a superuser at all
> is in any way a viable approach. It's necessary to be able to
> administer the database system, and the bootstrap superuser can't be
> removed outright in any case because it owns a ton of objects.
>
> There are basically two ways of trying to solve this problem. On the
> one hand we could try to create a mode in which the privileges of the
> superuser are restricted enough that the superuser can't break out to
> the operating system. The list of things that would need to be blocked
> is, I think, more extensive than any list you've give so far. The
> other is to stick with the idea of an unrestricted superuser but come
> up with ways of giving a controlled subset of the superuser's
> privileges to a non-superuser. I believe this is the more promising
> approach, and there have been multiple discussion threads about it in
> the last six months.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com




Fix proposal for comparison bugs in PostgreSQL::Version

2022-06-28 Thread Jehan-Guillaume de Rorthais
Hi,

I found a comparison bug when using the PostgreSQL::Version module. See:

  $ perl -I. -MPostgreSQL::Version -le '
my $v = PostgreSQL::Version->new("9.6");
  
print "not 9.6 > 9.0" unless $v >  9.0;
print "not 9.6 < 9.0" unless $v <  9.0;
print "9.6 <= 9.0"if $v <= 9.0;
print "9.6 >= 9.0"if $v >= 9.0;'
  not 9.6 > 9.0
  not 9.6 < 9.0
  9.6 <= 9.0
  9.6 >= 9.0

When using < or >, 9.6 is neither greater nor less than 9.0.
When using <= or >=, 9.6 is both less than or equal to and greater than or
equal to 9.0.
The bug does not show up if you compare with "9.0" instead of 9.0.
This bug is triggered with devel versions, eg. 14beta1 <=> 14.

The bug appears when both objects have a different number of digits in the
internal array representation:

  $ perl -I. -MPostgreSQL::Version -MData::Dumper -le '
print Dumper(PostgreSQL::Version->new("9.0")->{num});
print Dumper(PostgreSQL::Version->new(9.0)->{num});
print Dumper(PostgreSQL::Version->new(14)->{num});
print Dumper(PostgreSQL::Version->new("14beta1")->{num});'
  $VAR1 = [ '9', '0' ];
  $VAR1 = [ '9' ];
  $VAR1 = [ '14' ];
  $VAR1 = [ '14', -1 ];

Because of this, the following loop in "_version_cmp" is wrong because we are
comparing two versions with different sizes of the 'num' array:

for (my $idx = 0;; $idx++)
{
return 0 unless (defined $an->[$idx] && defined $bn->[$idx]);
return $an->[$idx] <=> $bn->[$idx]
  if ($an->[$idx] <=> $bn->[$idx]);
}


If we want to keep this internal array representation, the only fix I can think
of would be to always use a 4 element array defaulted to 0. Previous examples
would be:

  $VAR1 = [ 9, 0, 0, 0 ];
  $VAR1 = [ 9, 0, 0, 0 ];
  $VAR1 = [ 14, 0, 0, 0 ];
  $VAR1 = [ 14, 0, 0, -1 ];

A better fix would be to store the version internally as version_num that are
trivial to compute and compare. Please, find in attachment an implementation of
this.

The patch is a bit bigger because it improved the devel version to support
rc/beta/alpha comparison like 14rc2 > 14rc1.

Moreover, it adds a bunch of TAP tests to check various use cases.

Regards,
>From d6b28440ad8eb61fd83be6a0df8f40108296bc4e Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais 
Date: Tue, 28 Jun 2022 22:17:48 +0200
Subject: [PATCH] Fix and improve PostgreSQL::Version

The internal array representation of a version was failing some
comparaisons because the loop comparing each digit expected both array
to have the same number of digit.

Comparaison like 9.6 <=> 9.0 or 14beta1 <=> 14 were failing.

This patch compute and use the numeric version internaly to compare
objects.

Moreover, it improve the comparaison on development versions paying
attention to the iteration number for rc, beta and alpha.
---
 src/test/perl/PostgreSQL/Version.pm  | 112 --
 src/test/perl/t/01-PostgreSQL::Version.t | 141 +++
 2 files changed, 216 insertions(+), 37 deletions(-)
 create mode 100644 src/test/perl/t/01-PostgreSQL::Version.t

diff --git a/src/test/perl/PostgreSQL/Version.pm b/src/test/perl/PostgreSQL/Version.pm
index 8f70491189..e42003d0d1 100644
--- a/src/test/perl/PostgreSQL/Version.pm
+++ b/src/test/perl/PostgreSQL/Version.pm
@@ -46,6 +46,7 @@ package PostgreSQL::Version;
 
 use strict;
 use warnings;
+use Carp;
 
 use Scalar::Util qw(blessed);
 
@@ -73,35 +74,73 @@ of a Postgres command like `psql --version` or `pg_config --version`;
 
 sub new
 {
-	my $class = shift;
-	my $arg   = shift;
-
-	chomp $arg;
-
-	# Accept standard formats, in case caller has handed us the output of a
-	# postgres command line tool
-	my $devel;
-	($arg, $devel) = ($1, $2)
-	  if (
-		$arg =~ m!^ # beginning of line
-  (?:\(?PostgreSQL\)?\s)? # ignore PostgreSQL marker
-  (\d+(?:\.\d+)*) # version number, dotted notation
-  (devel|(?:alpha|beta|rc)\d+)?   # dev marker - see version_stamp.pl
-		 !x);
-
-	# Split into an array
-	my @numbers = split(/\./, $arg);
-
-	# Treat development versions as having a minor/micro version one less than
-	# the first released version of that branch.
-	push @numbers, -1 if ($devel);
+	my $class  = shift;
+	my $arg= shift;
+	my $ver= '';
+	my $devel  = '';
+	my $vernum = 0;
+	my $devnum = 0;
+
+	$arg =~ m!^(?:[^\s]+\s)?   # beginning of line + optionnal command
+	(?:\(?PostgreSQL\)?\s)?# ignore PostgreSQL marker
+	(\d+)(?:\.(\d+))?(?:\.(\d+))?  # version number, dotted notation
+	([^\s]+)?  # dev marker - see version_stamp.pl
+	!x;
+
+	croak "could not parse version: $arg" unless defined $1;
+
+	$ver = "$1";
+	$vernum += 1 * $1;
+	$ver .= ".$2" if defined $2;
+
+	if ($vernum >= 10)
+	{
+		croak "versions 10 and after can not have a third digit"
+			if defined $3;
+		$vernum += $2 if defined $2;
+	}
+	else
+	{
+		my $minor;
+
+		if ( defined $2 )
+		{
+			$vernum += 100 * $2;
+		}
+		

Re: First draft of the PG 15 release notes

2022-06-28 Thread Bruce Momjian
On Mon, Jun 27, 2022 at 11:37:19PM -0700, Noah Misch wrote:
> On Tue, May 10, 2022 at 11:44:15AM -0400, Bruce Momjian wrote:
> > I have completed the first draft of the PG 15 release notes
> 
> > 
> > 
> > 
> >  
> >   Remove PUBLIC creation permission on the  >   linkend="ddl-schemas-public">public schema
> >   (Noah Misch)
> >  
> > 
> >  
> >   This is a change in the default for newly-created databases in
> >   existing clusters and for new clusters;  USAGE
> 
> If you dump/reload an unmodified v14 template1 (as pg_dumpall and pg_upgrade
> do), your v15 template1 will have a v14 ACL on its public schema.  At that
> point, the fate of "newly-created databases in existing clusters" depends on
> whether you clone template1 or template0.  Does any of that detail belong
> here, or does the existing text suffice?

I think it is very confusing to have template0 have one value and
template1 have a different one, but as I understand it template0 will
only be used for pg_dump comparison, and that will keep template1 with
the same permissions, so I guess it is okay.

> >   permissions on the public schema has not
> >   been changed.  Databases restored from previous Postgres releases
> >   will be restored with their current permissions.  Users wishing
> >   to have the old permissions on new objects will need to grant
> 
> The phrase "old permissions on new objects" doesn't sound right to me, but I'm
> not sure why.  I think you're aiming for the fact that this is just a default;
> one can still change the ACL to anything, including to the old default.  If
> these notes are going to mention the old default like they do so far, I think
> they should also urge readers to understand
> https://www.postgresql.org/docs/devel/ddl-schemas.html#DDL-SCHEMAS-PATTERNS
> before returning to the old default.  What do you think?

Agreed, the new text is:

Users wishing to have the former permissions will need to grant
CREATE permission for PUBLIC on
the public schema; this change can be made on
template1 to cause all new databases to have these
permissions.
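
In SQL terms, that wording amounts to (a sketch):

GRANT CREATE ON SCHEMA public TO PUBLIC;   -- restore the pre-15 default in one database
-- and, run against template1, it makes all newly created databases inherit it.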

> 
> >   CREATE permission for PUBLIC
> >   on the public schema; this change can be made
> >   on template1 to cause all new databases
> >   to have these permissions.  template1
> >   permissions for pg_dumpall and
> >   pg_upgrade?
> 
> pg_dumpall will change template1.  I think pg_upgrade will too, and neither
> program will change template0.

Okay, I will remove that question mark sentence.

> >  
> > 
> > 
> > 
> > 
> > 
> >  
> >   Change the owner of the public schema to
> >   pg_database_owner (Noah Misch)
> >  
> > 
> >  
> >   Previously it was the literal user name of the database owner.
> 
> It was the bootstrap superuser.

Okay, text updated, thanks.  Applied patch attached.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson

diff --git a/doc/src/sgml/release-15.sgml b/doc/src/sgml/release-15.sgml
index 6da3f89d08..47ac329e79 100644
--- a/doc/src/sgml/release-15.sgml
+++ b/doc/src/sgml/release-15.sgml
@@ -63,13 +63,11 @@ Author: Noah Misch 
   permissions on the public schema has not
   been changed.  Databases restored from previous Postgres releases
   will be restored with their current permissions.  Users wishing
-  to have the old permissions on new objects will need to grant
+  to have the former permissions will need to grant
   CREATE permission for PUBLIC
   on the public schema; this change can be made
   on template1 to cause all new databases
-  to have these permissions.  template1
-  permissions for pg_dumpall and
-  pg_upgrade?
+  to have these permissions.
  
 
 
@@ -85,7 +83,7 @@ Author: Noah Misch 
  
 
  
-  Previously it was the literal user name of the database owner.
+  Previously it was the literal user name of the bootstrap superuser.
   Databases restored from previous Postgres releases will be restored
   with their current owner specification.
  


Re: Separate the attribute physical order from logical order

2022-06-28 Thread Tom Lane
Peter Geoghegan  writes:
> On Tue, Jun 28, 2022 at 11:47 AM Tom Lane  wrote:
>> Now you do need something that will make the three meanings different
>> in order to test that step.  But I'd suggest some bit of throwaway code
>> that just assigns randomly different logical and physical orders.

> That seems like a good idea. Might also make sense to make the
> behavior configurable via a developer-only GUC, to enable exhaustive
> tests that use every possible permutation of physical/logical mappings
> for a given table.
> Perhaps the random behavior itself should work by selecting a value
> for the GUC at various key points via a PRNG. During CREATE TABLE, for
> example. This approach could make it easier to reproduce failures on the
> buildfarm.

Yeah, it can't be *too* random or debugging failures will be a nightmare.
My point is just to not spend a lot of engineering on this part, because
it won't be a long-term user feature.

regards, tom lane




Re: PG 15 (and to a smaller degree 14) regression due to ExprEvalStep size

2022-06-28 Thread David Rowley
On Sat, 18 Jun 2022 at 05:21, Andres Freund  wrote:
>
> On 2022-06-17 14:14:54 +1200, David Rowley wrote:
> > So, there appears to be no performance regression due to the extra
> > indirection. There's maybe even some gains due to the smaller step
> > size.
>
> "smaller step size"?

I mean smaller sizeof(ExprEvalStep).

David




Emit extra debug message when executing extension script.

2022-06-28 Thread Jeff Davis
Patch attached.

Helpful for debugging complex extension script problems.
From b278eba9ba35ec1c52a8c1aa5c080a6731f86fbe Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Tue, 28 Jun 2022 12:06:03 -0700
Subject: [PATCH] Emit debug message when executing extension script.

Outputting script filenames helps extension authors understand which
upgrade path is followed when creating an extension.
---
 src/backend/commands/extension.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/src/backend/commands/extension.c b/src/backend/commands/extension.c
index 767d9b96190..0b4d428f6bf 100644
--- a/src/backend/commands/extension.c
+++ b/src/backend/commands/extension.c
@@ -887,6 +887,8 @@ execute_extension_script(Oid extensionOid, ExtensionControlFile *control,
 
 	filename = get_extension_script_filename(control, from_version, version);
 
+	ereport(DEBUG1, errmsg("executing extension script: %s", filename));
+
 	/*
 	 * If installing a trusted extension on behalf of a non-superuser, become
 	 * the bootstrap superuser.  (This switch will be cleaned up automatically
-- 
2.17.1



Re: Separate the attribute physical order from logical order

2022-06-28 Thread Peter Geoghegan
On Tue, Jun 28, 2022 at 11:47 AM Tom Lane  wrote:
> Now you do need something that will make the three meanings different
> in order to test that step.  But I'd suggest some bit of throwaway code
> that just assigns randomly different logical and physical orders.

That seems like a good idea. Might also make sense to make the
behavior configurable via a developer-only GUC, to enable exhaustive
tests that use every possible permutation of physical/logical mappings
for a given table.

Perhaps the random behavior itself should work by selecting a value
for the GUC at various key points via a PRNG. During CREATE TABLE, for
example. This approach could make it easier to reproduce failures on the
buildfarm.

--
Peter Geoghegan




Export log_line_prefix(); useful for emit_log_hook.

2022-06-28 Thread Jeff Davis
Patch attached. Some kinds of emit log hooks might find it useful to
also compute the log_line_prefix.

Regards,
Jeff Davis

From c3d4f14602c043918b8b6dab88a976dac1923208 Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Tue, 28 Jun 2022 11:39:33 -0700
Subject: [PATCH] Export log_line_prefix(); useful for emit_log_hook.

---
 src/backend/utils/error/elog.c | 3 +--
 src/include/utils/elog.h   | 3 +++
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index 55ee5423afb..4cd6a713ebb 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -176,7 +176,6 @@ static pg_noinline void set_backtrace(ErrorData *edata, int num_skip);
 static void set_errdata_field(MemoryContextData *cxt, char **ptr, const char *str);
 static void write_console(const char *line, int len);
 static const char *process_log_prefix_padding(const char *p, int *padding);
-static void log_line_prefix(StringInfo buf, ErrorData *edata);
 static void send_message_to_server_log(ErrorData *edata);
 static void send_message_to_frontend(ErrorData *edata);
 static void append_with_tabs(StringInfo buf, const char *str);
@@ -2441,7 +2440,7 @@ process_log_prefix_padding(const char *p, int *ppadding)
 /*
  * Format tag info for log lines; append to the provided buffer.
  */
-static void
+void
 log_line_prefix(StringInfo buf, ErrorData *edata)
 {
 	/* static counter for line numbers */
diff --git a/src/include/utils/elog.h b/src/include/utils/elog.h
index f5c6cd904de..71aaf40614c 100644
--- a/src/include/utils/elog.h
+++ b/src/include/utils/elog.h
@@ -16,6 +16,8 @@
 
 #include 
 
+#include "lib/stringinfo.h"
+
 /* Error level codes */
 #define DEBUG5		10			/* Debugging messages, in categories of
  * decreasing detail. */
@@ -439,6 +441,7 @@ extern PGDLLIMPORT bool syslog_split_messages;
 #define LOG_DESTINATION_JSONLOG	16
 
 /* Other exported functions */
+extern void log_line_prefix(StringInfo buf, ErrorData *edata);
 extern void DebugFileOpen(void);
 extern char *unpack_sql_state(int sql_state);
 extern bool in_error_recursion_trouble(void);
-- 
2.17.1
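
For illustration, here is a minimal sketch (not part of the patch) of how an
emit_log_hook module could call the newly exported function. The module and
function names are hypothetical, and hook chaining and error handling are
kept to the bare minimum:

#include "postgres.h"

#include "fmgr.h"
#include "lib/stringinfo.h"
#include "utils/elog.h"

PG_MODULE_MAGIC;

void		_PG_init(void);

static emit_log_hook_type prev_emit_log_hook = NULL;

/* Prepend the configured log_line_prefix to the message before shipping it */
static void
demo_emit_log_hook(ErrorData *edata)
{
	StringInfoData buf;

	initStringInfo(&buf);
	log_line_prefix(&buf, edata);	/* exported by the patch above */
	if (edata->message)
		appendStringInfoString(&buf, edata->message);

	/* hand buf.data to an external sink here (hypothetical) */

	pfree(buf.data);

	if (prev_emit_log_hook)
		prev_emit_log_hook(edata);
}

void
_PG_init(void)
{
	prev_emit_log_hook = emit_log_hook;
	emit_log_hook = demo_emit_log_hook;
}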



Re: Separate the attribute physical order from logical order

2022-06-28 Thread Tom Lane
Alvaro Herrera  writes:
> If you do not provide a column identity number or you use something else
> (e.g. attlognum) to cross-reference attributes from other catalogs,
> then you'll have to edit pg_attrdef when a column moves; and any other
> reference to a column number will have to change.  Or think about
> pg_depend.  You don't want that.  This is why you need three columns,
> not two.

In previous go-rounds on this topic (of which there have been many),
we understood the need for attidnum as being equivalent to the familiar
notion that tables should have an immutable primary key, with anything
that users might wish to change *not* being the primary key.  This
side-steps the need to propagate changes of the pkey into referencing
tables, which is essentially what Alvaro is pointing out you don't
want to have to deal with.

FWIW, I'd lean to the idea that using three new column names would
be a good thing, because it'll force you to look at every single
reference in the code and figure out which meaning is needed at that
spot.  There will still be a large number of wrong-meaning bugs, but
that disciplined step will hopefully result in "large" being "tolerable".

>> I think that supporting at least a way to specify the logical order
>> during the table creation should be easy to implement

> As long as it is really simple (just some stuff in CREATE TABLE, nothing
> at all in ALTER TABLE) then that sounds good.  I just suggest not to
> complicate things too much to avoid the risk of failing the project
> altogether.

I think that any user-reachable knobs for controlling this should be
designed and built later.  The initial split-up of attnum meanings
is already going to be a huge lift, and anything at all that you can
do to reduce the size of that first patch is advisable.  If you don't
realize what a large chance there is that you'll utterly fail on that
first step, then you have failed to learn anything from the history
of this topic.

Now you do need something that will make the three meanings different
in order to test that step.  But I'd suggest some bit of throwaway code
that just assigns randomly different logical and physical orders.

regards, tom lane




Re: CREATE INDEX CONCURRENTLY on partitioned index

2022-06-28 Thread Justin Pryzby
Hi,

On Thu, Feb 10, 2022 at 06:07:08PM +0300, Alexander Pyhalov wrote:
> I've rebased patches and tried to fix issues I've seen. I've fixed reference
> after table_close() in the first patch (can be seen while building with
> CPPFLAGS='-DRELCACHE_FORCE_RELEASE').

Thanks for finding that.

The patches other than 0001 are more experimental, and need someone to check if
it's even a good approach to use, so I kept them separate from the essential
patch.

Your latest 0005 patch (mark intermediate partitioned indexes as valid) is
probably fixing a bug in my SKIPVALID patch, right ?  I'm not sure whether the
SKIPVALID patch should be merged into 0001, and I've been awaiting feedback on
the main patch before handling progress reporting.

Sorry for not responding sooner.  The patch saw no activity for ~11 months so I
wasn't prepared to pick it up in March, at least not without guidance from a
committer.

Would you want to take over this patch ?  I wrote it following someone's
question, but don't expect that I'd use the feature myself.  I can help review
it or try to clarify the organization of my existing patches (but still haven't
managed to work my way through your amendments to my patches).

Thanks for caring about partitioned DDL ;)

-- 
Justin




Re: Hardening PostgreSQL via (optional) ban on local file system access

2022-06-28 Thread Robert Haas
On Mon, Jun 27, 2022 at 5:37 PM Hannu Krosing  wrote:
> My current thinking is (based on more insights from Andres) that we
> should also have a startup flag to disable superuser altogether to
> avoid bypasses via direct manipulation of pg_proc.
>
> Experience shows that 99% of the time one can run PostgreSQL just fine
> without a superuser, so having a superuser available all the time is
> kind of like leaving a loaded gun on the kitchen table because you
> sometimes need to go hunting.
>
> I am especially waiting for Andres' feedback on viability this approach.

Well, I'm not Andres but I don't think not having a superuser at all
is in any way a viable approach. It's necessary to be able to
administer the database system, and the bootstrap superuser can't be
removed outright in any case because it owns a ton of objects.

There are basically two ways of trying to solve this problem. On the
one hand we could try to create a mode in which the privileges of the
superuser are restricted enough that the superuser can't break out to
the operating system. The list of things that would need to be blocked
is, I think, more extensive than any list you've given so far. The
other is to stick with the idea of an unrestricted superuser but come
up with ways of giving a controlled subset of the superuser's
privileges to a non-superuser. I believe this is the more promising
approach, and there have been multiple discussion threads about it in
the last six months.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Separate the attribute physical order from logical order

2022-06-28 Thread Alvaro Herrera
On 2022-Jun-28, Julien Rouhaud wrote:


> On Tue, Jun 28, 2022 at 10:53:14AM +0200, Alvaro Herrera wrote:

> > My feeling is that every aspect of user interaction should show
> > columns ordered in logical order.
> 
> I'm not entirely sure of what you meant.  Assuming tables a(a, z) and b(b, z),
> what do you think those queries should return?
> 
> SELECT * FROM a JOIN b on a.z = b.z
> Currently it returns (a.a, a.z, b.b, b.z)
> 
> SELECT * FROM a JOIN b USING (z)
> Currently it returns a.z, a.a, b.b.
> 
> Should it now return (a.a, z, b.b) as long as the tables have that logical
> order, whether or not any other position (attnum / attphysnum) is different or
> stay the same as now?

For all user-visible intents and purposes, the column order is whatever
the logical order is (attlognum), regardless of attnum and attphysnum.
If the logical order is changed, then the order of the output columns of
a join will change to match.  The attnum and attphysnum are completely
irrelevant to all these purposes.  So, to answer your question, if the
join expands in this way at present, then it should continue to expand
that way if you define a table that has different attnum/attphysnum but
the same attlognum for those columns.


> I'm not following.  If we keep attnum as the official identity position and
> use attlognum as the position that should be used in any interactive command,
> wouldn't that risk to break every single client?

Yeah, it might break a lot of tools, but other things break tools too
and the world just moves on.

But if you don't want to break tools, I can think of two alternatives:

1. make the immutable column identity something like attidnum and
   keep attnum as the logical column order.
   This keeps tools happy, but if they try to match pg_attrdef by attnum
   bad things will happen.

2. in order to avoid possible silent breakage, remove attnum altogether
   and just have attidnum, attlognum, attphysnum; then every tool is
   forced to undergo an update.  Any cross-catalog relationships are now
   correct.

> Imagine you have some framework that automatically generates queries based on
> the catalog, if it sees table abc with:
> c: attnum 1, attphysnum 1, attlognum 3
> b: attnum 2, attphysnum 2, attlognum 2
> a: attnum 3, attphysnum 3, attlognum 1

Hopefully the framework will add a column list,
  INSERT INTO abc (c,b,a) VALUES ('c', 'b', 'a');
to avoid this problem.  But if it doesn't, then yeah it will misbehave,
and I don't think you should try to make it not misbehave.

> Also, about the default values evaluation (see [2]), should it be tied to your
> attnum, attphysnum or attlognum?

Default is tied to column identity.  If you change column order, the
defaults don't need to change at all.  Similarly, if the server decides
to repack the columns in a different way to save alignment padding, the
defaults don't need to change.

If you do not provide a column identity number or you use something else
(e.g. attlognum) to cross-reference attributes from other catalogs,
then you'll have to edit pg_attrdef when a column moves; and any other
reference to a column number will have to change.  Or think about
pg_depend.  You don't want that.  This is why you need three columns,
not two.
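
To make the role of each number concrete, here is a purely illustrative
sketch (using the names floated in this thread; this is not an actual
catalog change proposal) of how pg_attribute could carry the three values:

#include "postgres.h"

/* Illustrative fragment only: not the real FormData_pg_attribute */
typedef struct IllustrativeAttribute
{
	NameData	attname;
	int16		attidnum;		/* immutable identity; what pg_attrdef,
								 * pg_depend, etc. keep pointing at */
	int16		attlognum;		/* logical (display/expansion) position;
								 * the only one users would ever move */
	int16		attphysnum;		/* physical storage position; the server
								 * could reorder it, e.g. to save padding */
} IllustrativeAttribute;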

> I think that supporting at least a way to specify the logical order
> during the table creation should be easy to implement

As long as it is really simple (just some stuff in CREATE TABLE, nothing
at all in ALTER TABLE) then that sounds good.  I just suggest not to
complicate things too much to avoid the risk of failing the project
altogether.

For testability, here's a crazy idea: have some test mode (maybe #ifdef
USE_ASSERT_CHECKING) that randomizes attlognum to start at some N >> 1,
and only attidnum starts at 1.  Then they never match and all tools need
to ensure they handle weird cases correctly.

> (there shouldn't be any question on whether it needs to invalidate any
> cache or what lock level to use), and could also be added in the
> initial submission without much extra efforts, which could help with
> the testing.

Famous last words :-)

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Nadie está tan esclavizado como el que se cree libre no siéndolo" (Goethe)




Re: making relfilenodes 56 bits

2022-06-28 Thread Matthias van de Meent
On Tue, 28 Jun 2022 at 13:45, Simon Riggs  wrote:
> but since the number of combinations is usually 1, but even then, low,
> it can be cached easily in a smgr array and included in the checkpoint
> record (or nearby) for ease of use.

I was reading the thread to keep up with storage-related prototypes
and patches, and this specifically doesn't sound quite right to me. I
do not know what values you considered to be 'low' or what 'can be
cached easily', so here's some field data:

I have seen PostgreSQL clusters that utilized the relative isolation
of separate databases within the same cluster (instance / postmaster)
to provide higher guarantees of data access isolation while still
being able to share a resource pool, which resulted in several
clusters containing upwards of 100 databases.

I will be the first to admit that it is quite unlikely to be common
practice, but this workload increases the number of dbOid+spcOid
combinations to 100s (even while using only a single tablespace),
which in my opinion requires some more thought than just handwaving it
into an smgr array and/or checkpoint records.


Kind regards,

Matthias van de Meent




Re: margay fails assertion in stats/dsa/dsm code

2022-06-28 Thread Robert Haas
On Thu, Jun 2, 2022 at 8:06 PM Thomas Munro  wrote:
> I know that on Solaris we use dynamic_shared_memory=posix.  The other
> Solaris/Sparc system is wrasse, and it's not doing this.  I don't see
> it yet, but figured I'd report this much to the list in case someone
> else does.

My first thought was that the return value of the call to
dsm_impl_op() at the end of dsm_attach() is not checked and that maybe
it was returning NULL, but it seems like whoever wrote
dsm_impl_posix() was pretty careful to ereport(elevel, ...) in every
failure path, and elevel is ERROR here, so I don't see any issue. My
second thought was that maybe control had escaped from dsm_attach()
due to an error before we got to the step where we actually map the
segment, but then the dsm_segment * would be returned to the caller.
Maybe they could retrieve it later using dsm_find_mapping(), but that
function has no callers in core.

So I'm kind of stumped too, but did you by any chance check whether
there are any DSM-related messages in the logs before the assertion
failure?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Hardening PostgreSQL via (optional) ban on local file system access

2022-06-28 Thread Jeff Davis
On Mon, 2022-06-27 at 23:36 +0200, Hannu Krosing wrote:
> My current thinking is (based on more insights from Andres) that we
> should also have a startup flag to disable superuser altogether to
> avoid bypasses via direct manipulation of pg_proc.

What do you mean by "disable superuser altogether"? What about SECURITY
DEFINER functions, or extension install scripts, or background workers
created by extensions?

Do you mean prevent logging in as superuser and prevent SET ROLE to
superuser? What if a user *becomes* superuser in the middle of a
session?

If we go down this road, I wonder if we should reconsider the idea of
changing superuser status of an existing role. Changing superuser
status already creates some weirdness. We could go so far as to say you
can only create/drop superusers via a tool or config file when the
server is shut down.

I don't think I've ever used more than a couple superusers, and I don't
think I've had a good reason to change superuser status of an existing
user before, except as a hack to have non-superuser-owned
subscriptions. There are probably better solutions to that problem. [CC
Mark as he may be interested in this discussion.]

Regards,
Jeff Davis






Re: making relfilenodes 56 bits

2022-06-28 Thread Robert Haas
On Tue, Jun 28, 2022 at 11:25 AM Robert Haas  wrote:
> But the primary problem we're trying to solve here is that right now
> we sometimes reuse the same filename for a whole new file, and that
> results in bugs that only manifest themselves in obscure
> circumstances, e.g. see 4eb2176318d0561846c1f9fb3c68bede799d640f.
> There are residual failure modes even now related to the "tombstone"
> files that are created when you drop a relation: remove everything but
> the first file from the main fork but then keep that file (only)
> around until after the next checkpoint. OID wraparound is another
> annoyance that has influenced the design of quite a bit of code over
> the years and where we probably still have bugs. If we don't reuse
> relfilenodes, we can avoid a lot of that pain. Combining the DB OID
> and TS OID fields doesn't solve that problem.

Oh wait, I'm being stupid. You were going to combine those fields but
then also widen the relfilenode, so that would solve this problem
after all. Oops, I'm dumb.

I still think this is a lot more complicated though, to the point
where I'm not sure we can really make it work at all.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Support logical replication of DDLs

2022-06-28 Thread vignesh C
On Tue, Jun 21, 2022 at 5:49 PM houzj.f...@fujitsu.com
 wrote:
>
> On Monday, June 20, 2022 11:32 AM houzj.f...@fujitsu.com 
>  wrote:
> >
> > On Saturday, June 18, 2022 3:38 AM Zheng Li  wrote:
> > > On Wed, Jun 15, 2022 at 12:00 AM Amit Kapila 
> > > wrote:
> > > >
> > > > On Wed, Jun 15, 2022 at 5:44 AM Zheng Li  wrote:
> > > > >
> > > > >
> > > > > While I agree that the deparser is needed to handle the potential
> > > > > syntax differences between the pub/sub, I think it's only relevant
> > > > > for the use cases where only a subset of tables in the database are
> > > > > replicated. For other use cases where all tables, functions and
> > > > > other objects need to be replicated, (for example, creating a
> > > > > logical replica for major version upgrade) there won't be any syntax
> > > > > difference to handle and the schemas are supposed to match exactly
> > > > > between the pub/sub. In other words the user seeks to create an
> > > > > identical replica of the source database and the DDLs should be
> > > > > replicated as is in this case.
> > > > >
> > > >
> > > > I think even for database-level replication we can't assume that
> > > > source and target will always have the same data in which case "Create
> > > > Table As ..", "Alter Table .. " kind of statements can't be replicated
> > > > as it is because that can lead to different results.
> > > "Create Table As .." is already handled by setting the skipData flag of 
> > > the
> > > statement parse tree before replay:
> > >
> > > /*
> > > * Force skipping data population to avoid data inconsistency.
> > > * Data should be replicated from the publisher instead.
> > > */
> > > castmt->into->skipData = true;
> > >
> > > "Alter Table .. " that rewrites with volatile expressions can also be 
> > > handled
> > > without any syntax change, by enabling the table rewrite replication and
> > > converting the rewrite inserts to updates. ZJ's patch introduced this 
> > > solution.
> > > I've also adopted this approach in my latest patch
> > > 0012-Support-replication-of-ALTER-TABLE-commands-that-rew.patch
> > >
> > > > The other point
> > > > is tomorrow we can extend the database level option/syntax to exclude
> > > > a few objects (something like [1]) as well in which case we again need
> > > > to filter at the publisher level
> > >
> > > I think for such cases it's not full database replication and we could 
> > > treat it as
> > > table level DDL replication, i.e. use the the deparser format.
> >
> > Hi,
> >
> > Here are some points in my mind about the two approaches discussed here.
> >
> > 1) search_patch vs schema qualify
> >
> > Again, I still think it will bring more flexibility and security by schema 
> > qualify the
> > objects in DDL command as mentioned before[1].
> >
> > Besides, a schema qualified DDL is also more appropriate for other use
> > cases(e.g. a table-level replication). As it's possible the schema is 
> > different
> > between pub/sub and it's easy to cause unexpected and undetectable failure 
> > if
> > we just log the search_path.
> >
> > It makes more sense to me to have the same style WAL log(schema qualified)
> > for
> > both database level or table level replication as it will bring more
> > flexibility.
> >
> >
> > > "Create Table As .." is already handled by setting the skipData flag of 
> > > the
> > > statement parse tree before replay:
> >
> > 2) About the handling of CREATE TABLE AS:
> >
> > I think it's not an appropriate approach to set the skipdata flag on 
> > subscriber
> > as it cannot handle EXECUTE command in CTAS.
> >
> > CREATE TABLE q5_prep_results AS EXECUTE q5(200, 'DT');
> >
> > The Prepared statement is a temporary object which we don't replicate. So if
> > you directly execute the original SQL on subscriber, even if you set 
> > skipdata
> > it will fail.
> >
> > I think it's difficult to make this work as you need to handle the create/drop of
> > this prepared statement. And even if we extended subscriber's code to make 
> > it
> > work, it doesn't seems like a standard and elegant approach.
> >
> >
> > > "Alter Table .. " that rewrites with volatile expressions can also be 
> > > handled
> > > without any syntax change, by enabling the table rewrite replication and
> > > converting the rewrite inserts to updates. ZJ's patch introduced this 
> > > solution.
> >
> > 3) About the handling of ALTER TABLE rewrite.
> >
> > The approach I proposed before is based on the event trigger + deparser
> > approach. We were able to improve that approach as we don't need to
> > replicate
> > the rewrite in many cases. For example: we don't need to replicate rewrite 
> > dml
> > if there is no volatile/mutable function. We should check and filter these 
> > case
> > at publisher (e.g. via deparser) instead of checking that at subscriber.
> >
> > Besides, as discussed, we need to give warning or error for the cases when 
> > DDL
> > contains volatile function which would be executed[2]. We should check this 
> > at
>

Re: making relfilenodes 56 bits

2022-06-28 Thread Robert Haas
On Tue, Jun 28, 2022 at 7:45 AM Simon Riggs
 wrote:
> Another approach would be to condense spcOid and dbOid into a single
> 4-byte Oid-like number, since in most cases they are associated with
> each other, and not often many of them anyway. So this new number
> would indicate both the database and the tablespace. I know that we
> want to be able to make file changes without doing catalog lookups,
> but since the number of combinations is usually 1, but even then, low,
> it can be cached easily in a smgr array and included in the checkpoint
> record (or nearby) for ease of use.
>
> typedef struct buftag
> {
>  Oid db_spcOid;
>  ForkNumber  forkNum;
>  RelFileNumber   relNumber;
> } BufferTag;

I've thought about this before too, because it does seem like the DB
OID and tablespace OID are a poor use of bit space. You might not even
need to keep the db_spcOid value in any persistent place, because it
could just be an alias for buffer mapping lookups that might change on
every restart. That does have the problem that you now need a
secondary hash table - in theory of unbounded size - to store mappings
from (dbOid, spcOid) to db_spcOid, and that seems complicated and hard
to get right. It might be possible, though. Alternatively, you could
imagine a durable mapping that also affects the on-disk structure, but
I don't quite see how to make that work: for example, pg_basebackup
wants to produce a tar file for each tablespace directory, and if the
pathnames no longer contain the tablespace OID but only the db_spcOid,
then that doesn't work any more.

But the primary problem we're trying to solve here is that right now
we sometimes reuse the same filename for a whole new file, and that
results in bugs that only manifest themselves in obscure
circumstances, e.g. see 4eb2176318d0561846c1f9fb3c68bede799d640f.
There are residual failure modes even now related to the "tombstone"
files that are created when you drop a relation: remove everything but
the first file from the main fork but then keep that file (only)
around until after the next checkpoint. OID wraparound is another
annoyance that has influenced the design of quite a bit of code over
the years and where we probably still have bugs. If we don't reuse
relfilenodes, we can avoid a lot of that pain. Combining the DB OID
and TS OID fields doesn't solve that problem.

> That way we could just have a simple 64-bit RelFileNumber, without
> restriction, and probably some spare bytes on the ForkNumber, if we
> needed them later.

In my personal opinion, the ForkNumber system is an ugly wart which
has nothing to recommend it except that the VM and FSM forks are
awesome. But if we could have those things without needing forks, I
think that would be way better. Forks add code complexity in tons of
places, and it's barely possible to scale it to the 4 forks we have
already, let alone any larger number. Furthermore, there are really
negative performance effects from creating 3 files per small relation
rather than 1, and we sure can't afford to have that number get any
bigger. I'd rather kill the ForkNumber system with fire than expand it
further, but even if we do expand it, we're not close to being able to
cope with more than 256 forks per relation.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Unify DLSUFFIX on Darwin

2022-06-28 Thread Andrew Dunstan


On 2022-06-24 Fr 10:13, Tom Lane wrote:
> Peter Eisentraut  writes:
>> On 22.06.22 15:45, Tom Lane wrote:
>>> Doesn't this amount to a fundamental ABI break for extensions?
>>> Yesterday they had to ship foo.so, today they have to ship foo.dylib.
>> Extensions generally only load the module files using the extension-free 
>> base name.  And if they do specify the extension, they should use the 
>> provided DLSUFFIX variable and not hardcode it.  So I don't see how this 
>> would be a problem.
> Hm.  Since we force people to recompile extensions for new major versions
> anyway, maybe it'd be all right.  I'm sure there is *somebody* out there
> who will have to adjust their build scripts, but it does seem like it
> shouldn't be much worse than other routine API changes.
>
> [ thinks for a bit... ]  Might be worth double-checking that pg_upgrade
> doesn't get confused in a cross-version upgrade.  A quick grep doesn't
> find that it refers to DLSUFFIX anywhere, but it definitely does pay
> attention to extensions' shared library names.
>
>   


The buildfarm client uses `make show_dl_suffix` to determine filenames
to look for when seeing if an installation is complete. It looks like
that will continue to work.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Temporary tables versus wraparound... again

2022-06-28 Thread Greg Stark
Simple Rebase
From 8dfed1a64308a84cc15679e09af69ca6989b608b Mon Sep 17 00:00:00 2001
From: Greg Stark 
Date: Thu, 31 Mar 2022 15:50:02 -0400
Subject: [PATCH v7 3/3] Add test for truncating temp tables advancing
 relfrozenxid

This test depends on other transactions not running at the same time
so that relfrozenxid can advance so it has to be moved to its own
schedule.
---
 src/test/regress/expected/temp.out | 37 ++
 src/test/regress/parallel_schedule | 10 +---
 src/test/regress/sql/temp.sql  | 30 
 3 files changed, 74 insertions(+), 3 deletions(-)

diff --git a/src/test/regress/expected/temp.out b/src/test/regress/expected/temp.out
index a5b3ed34a3..244b868ef7 100644
--- a/src/test/regress/expected/temp.out
+++ b/src/test/regress/expected/temp.out
@@ -82,6 +82,43 @@ SELECT * FROM temptest;
 -
 (0 rows)
 
+DROP TABLE temptest;
+-- Test that ON COMMIT DELETE ROWS resets the relfrozenxid when the
+-- table is truncated. This requires this test not be run in parallel
+-- with other tests as concurrent transactions will hold back the
+-- globalxmin
+CREATE TEMP TABLE temptest(col text) ON COMMIT DELETE ROWS;
+SELECT reltoastrelid, reltoastrelid::regclass AS relname FROM pg_class where oid = 'temptest'::regclass \gset toast_
+SELECT relpages, reltuples, relfrozenxid FROM pg_class where oid = 'temptest'::regclass \gset old_
+SELECT relpages, reltuples, relfrozenxid FROM pg_class where oid = :toast_reltoastrelid \gset toast_old_
+BEGIN;
+INSERT INTO temptest (select repeat('foobar',generate_series(1,1000)));
+ANALYZE temptest; -- update relpages, reltuples
+SELECT relpages, reltuples, relfrozenxid FROM pg_class where oid = 'temptest'::regclass \gset temp_
+SELECT relpages, reltuples, relfrozenxid FROM pg_class where oid = :toast_reltoastrelid \gset toast_temp_
+COMMIT;
+SELECT relpages, reltuples, relfrozenxid FROM pg_class where oid = 'temptest'::regclass \gset new_
+SELECT relpages, reltuples, relfrozenxid FROM pg_class where oid = :toast_reltoastrelid \gset toast_new_
+-- make sure relpages and reltuple match a newly created table and
+-- relfrozenxid is advanced
+SELECT :old_relpages <> :temp_relpages AS pages_analyzed,
+   :old_relpages  = :new_relpages  AS pages_reset,
+   :old_reltuples<> :temp_reltuplesAS tuples_analyzed,
+   :old_reltuples = :new_reltuples AS tuples_reset,
+   :old_relfrozenxid <> :new_relfrozenxid  AS frozenxid_advanced;
+ pages_analyzed | pages_reset | tuples_analyzed | tuples_reset | frozenxid_advanced 
++-+-+--+
+ t  | t   | t   | t| t
+(1 row)
+
+-- The toast table can't be analyzed so relpages and reltuples can't
+-- be tested easily make sure frozenxid is advanced
+SELECT :toast_old_relfrozenxid <> :toast_new_relfrozenxid  AS frozenxid_advanced;
+ frozenxid_advanced 
+
+ t
+(1 row)
+
 DROP TABLE temptest;
 -- Test ON COMMIT DROP
 BEGIN;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 103e11483d..99f28c405e 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -116,10 +116,14 @@ test: json jsonb json_encoding jsonpath jsonpath_encoding jsonb_jsonpath sqljson
 # --
 # Another group of parallel tests
 # with depends on create_misc
-# NB: temp.sql does a reconnect which transiently uses 2 connections,
-# so keep this parallel group to at most 19 tests
 # --
-test: plancache limit plpgsql copy2 temp domain rangefuncs prepare conversion truncate alter_table sequence polymorphism rowtypes returning largeobject with xml
+test: plancache limit plpgsql copy2 domain rangefuncs prepare conversion truncate alter_table sequence polymorphism rowtypes returning largeobject with xml
+
+# --
+# Run this alone because it transiently uses 2 connections and also
+# tests relfrozenxid advances when truncating temp tables
+# --
+test: temp
 
 # --
 # Another group of parallel tests
diff --git a/src/test/regress/sql/temp.sql b/src/test/regress/sql/temp.sql
index 424d12b283..5f8647a8aa 100644
--- a/src/test/regress/sql/temp.sql
+++ b/src/test/regress/sql/temp.sql
@@ -79,6 +79,36 @@ SELECT * FROM temptest;
 
 DROP TABLE temptest;
 
+-- Test that ON COMMIT DELETE ROWS resets the relfrozenxid when the
+-- table is truncated. This requires this test not be run in parallel
+-- with other tests as concurrent transactions will hold back the
+-- globalxmin
+CREATE TEMP TABLE temptest(col text) ON COMMIT DELETE ROWS;
+
+SELECT reltoastrelid, reltoastrelid::regclass AS relname FROM pg_class where oid = 'temptest'::regclass \gset toast_
+SELECT relpages, reltuples, relfrozenxid FROM pg_class where oid = 'temptest'::regclass \gset old_
+SELECT relpages, reltuples, relfrozenxid FROM pg_class where oid = :toast_reltoastrelid \gset toast_old_
+BEGIN;
+INS

Re: Separate the attribute physical order from logical order

2022-06-28 Thread Julien Rouhaud
Hi,

On Tue, Jun 28, 2022 at 08:38:56AM -0500, Justin Pryzby wrote:
> On Tue, Jun 28, 2022 at 04:32:30PM +0800, Julien Rouhaud wrote:
> > psql displays a table's column information using the logical order rather than the
> > physical order, and if verbose emits an additional "Physical order" footer if 
> > the
> > logical layout is different from the physical one.
> 
> FYI: the footer would work really poorly for us, since we use hundreds of
> columns and sometimes over 1000 (historically up to 1600).

Yeah :)  As I mentioned originally at [1]: "I also changed psql to display the
column in logical position, and emit an extra line with the physical position
in the verbose mode, but that's a clearly poor design which would need a lot
more thoughts."

> I think it'd be
> better to show the physical position as an additional column, or a \d option 
> to
> sort by physical attnum.  (I'm not sure if it'd be useful for our case to see
> the extra columns, but at least it won't create a "footer" which is multiple
> pages long.

Yes, I was also thinking something like that could work.  I just did it with
the extra footer for now because I needed a quick way to check in which order
my tables were supposed to be displayed / stored during development.  As soon
as I get a clearer picture of what approach should be used I will clearly work
on this, and all other things that still need some care.

> Actually, I've sometimes wished for a "\d-" quiet mode which would
> show everything *except* the list of column names, or perhaps only show those
> columns which are referenced by the list of indexes/constraints/stats
> objects/etc).

I never had to work on crazy wide relations like that myself but I can easily
imagine how annoying it can get.  No objection from me, although it would be
good to start a new thread to attract more attention and see what other are
thinking.

> BTW, since 2 years ago, when rewriting partitions to promote a column type, we
> recreate the parent table sorted by attlen, to minimize alignment overhead in
> new children.  AFAICT your patch is about adding a logical column order, not
> about updating tables with a new physical order.

Indeed, the only thing it could do in such a case is to allow you to create the
columns in an optimal order in the first place, without messing with the output.

But if the people who originally create the table don't think about alignment
and things like that, there's still nothing that can be done with this feature.

That being said, in theory if such a feature existed, and if we also had a DDL
to allow specifying a different logical order at creation time, it would be
easy to create a module that automatically reorders the columns before the table
is created to make sure that the columns are physically stored in an optimal
way.

[1] 
https://www.postgresql.org/message-id/20220623101155.3dljtwradu7eik6g@jrouhaud




Re: Separate the attribute physical order from logical order

2022-06-28 Thread Justin Pryzby
On Tue, Jun 28, 2022 at 04:32:30PM +0800, Julien Rouhaud wrote:
> psql displays a table's column information using the logical order rather than the
> physical order, and if verbose emits an additional "Physical order" footer if 
> the
> logical layout is different from the physical one.

FYI: the footer would work really poorly for us, since we use hundreds of
columns and sometimes over 1000 (historically up to 1600).  I think it'd be
better to show the physical position as an additional column, or a \d option to
sort by physical attnum.  (I'm not sure if it'd be useful for our case to see
the extra columns, but at least it won't create a "footer" which is multiple
pages long.  Actually, I've sometimes wished for a "\d-" quiet mode which would
show everything *except* the list of column names, or perhaps only show those
columns which are referenced by the list of indexes/constraints/stats
objects/etc).

BTW, since 2 years ago, when rewriting partitions to promote a column type, we
recreate the parent table sorted by attlen, to minimize alignment overhead in
new children.  AFAICT your patch is about adding a logical column order, not
about updating tables with a new physical order.

-- 
Justin




Re: Separate the attribute physical order from logical order

2022-06-28 Thread Julien Rouhaud
Hi,

On Tue, Jun 28, 2022 at 09:00:05AM -0400, Isaac Morland wrote:
> On Tue, 28 Jun 2022 at 05:32, Julien Rouhaud  wrote:
>
> > I think that supporting at least a way to specify the logical order during
> > the
> > table creation should be easy to implement (there shouldn't be any
> > question on whether it needs to invalidate any cache or what lock level to
> > use), and could also be added in the initial submission without much extra
> > efforts, which could help with the testing.
> >
>
> I think the meaning of “logical order” (well, the meaning it has for me, at
> least) implies that the logical order of a table after CREATE TABLE is the
> order in which the columns were given in the table creation statement.
>
> If there needs to be a way of specifying the physical order separately,
> that is a different matter.

Well, the way I see it is that the logical order is something that can be
changed, and therefore is the one that needs to be spelled out explicitly if
you want it to differ from the physical order.

But whether the physical or logical order is the one that needs explicit
additional syntax, it would still be nice to provide in a first iteration.  And
both versions would be the same to implement, difficulty wise.
>
> ALTER TABLE ADD … is another matter. Syntax there to be able to say BEFORE
> or AFTER an existing column would be nice to have. Presumably it would
> physically add the column at the end but set the logical position as
> specified.

Yes, but it raises some questions about lock level, cache invalidation and such
so I chose to ignore that for the moment.




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-06-28 Thread John Naylor
On Tue, Jun 28, 2022 at 1:24 PM Masahiko Sawada  wrote:
>
> > I
> > suspect other optimizations would be worth a lot more than using AVX2:
> > - collapsing inner nodes
> > - taking care when constructing the key (more on this when we
> > integrate with VACUUM)
> > ...and a couple Andres mentioned:
> > - memory management: in
> > https://www.postgresql.org/message-id/flat/20210717194333.mr5io3zup3kxahfm%40alap3.anarazel.de
> > - node dispatch:
> > https://www.postgresql.org/message-id/20210728184139.qhvx6nbwdcvo63m6%40alap3.anarazel.de
> >
> > Therefore, I would suggest that we use SSE2 only, because:
> > - portability is very easy
> > - to avoid a performance hit from indirecting through a function pointer
>
> Okay, I'll try these optimizations and see if the performance becomes better.

FWIW, I think it's fine if we delay these until after committing a
good-enough version. The exception is key construction and I think
that deserves some attention now (more on this below).

> I've done benchmark tests while changing the node types. The code base
> is v3 patch that doesn't have the optimization you mentioned below
> (memory management and node dispatch) but I added the code to use SSE2
> for node-16 and node-32.

Great, this is helpful to visualize what's going on!

> * sse2_4_16_48_256
> * nkeys = 9091, height = 3, n4 = 0, n16 = 0, n48 = 512, n256 = 916433
> * nkeys = 2, height = 3, n4 = 2, n16 = 0, n48 = 207, n256 = 50
>
> * sse2_4_32_128_256
> * nkeys = 9091, height = 3, n4 = 0, n32 = 285, n128 = 916629, n256 = 
> 31
> * nkeys = 2, height = 3, n4 = 2, n32 = 48, n128 = 208, n256 = 1

> Observations are:
>
> In both test cases, there is not much difference between using AVX2
> and SSE2. The more node types, the more time it takes for loading the
> data (see sse2_4_16_32_128_256).

Good to know. And as Andres mentioned in his PoC, more node types
would be a barrier for pointer tagging, since 32-bit platforms only
have two spare bits in the pointer.

> In dense case, since most nodes have around 100 children, the radix
> tree that has node-128 had a good figure in terms of memory usage. On

Looking at the node stats, and then your benchmark code, I think key
construction is a major influence, maybe more than node type. The
key/value scheme tested now makes sense:

blockhi || blocklo || 9 bits of item offset

(with the leaf nodes containing a bit map of the lowest few bits of
this whole thing)

We want the lower fanout nodes at the top of the tree and higher
fanout ones at the bottom.

Note some consequences: If the table has enough columns such that much
fewer than 100 tuples fit on a page (maybe 30 or 40), then in the
dense case the nodes above the leaves will have lower fanout (maybe
they will fit in a node32). Also, the bitmap values in the leaves will
be more empty. In other words, many tables in the wild *resemble* the
sparse case a bit, even if truly all tuples on the page are dead.

Note also that the dense case in the benchmark above has ~4500 times
more keys than the sparse case, and uses about ~1000 times more
memory. But the runtime is only 2-3 times longer. That's interesting
to me.

To optimize for the sparse case, it seems to me that the key/value would be

blockhi || 9 bits of item offset || blocklo

I believe that would make the leaf nodes more dense, with fewer inner
nodes, and could drastically speed up the sparse case, and maybe many
realistic dense cases. I'm curious to hear your thoughts.
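
For concreteness, a rough sketch of the two layouts. The bit positions are
illustrative assumptions (a 32-bit block number split into 16-bit halves and
a 9-bit offset), not taken from the patch:

#include <stdint.h>

/* current: blockhi || blocklo || offset (block-major ordering) */
static inline uint64_t
key_block_major(uint32_t block, uint16_t offset)
{
	return ((uint64_t) block << 9) | (offset & 0x1FF);
}

/*
 * proposed: blockhi || offset || blocklo, so the same item offset on
 * consecutive blocks lands under a common subtree and shares leaves,
 * which should densify the sparse case.
 */
static inline uint64_t
key_offset_in_middle(uint32_t block, uint16_t offset)
{
	uint64_t	blockhi = block >> 16;
	uint64_t	blocklo = block & 0xFFFF;

	return (blockhi << 25) | ((uint64_t) (offset & 0x1FF) << 16) | blocklo;
}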

> the other hand, the radix tree that doesn't have node-128 has a better
> number in terms of insertion performance. This is probably because we
> need to iterate over 'isset' flags from the beginning of the array in
> order to find an empty slot when inserting new data. We do the same
> thing also for node-48 but it was better than node-128 as it's up to
> 48.

I mentioned this in my diff, but for those following along: I think we can
improve that by iterating over the bytes; if a byte is 0xFF, all 8 bits
are set already, so keep looking...
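
A tiny sketch of that idea, assuming a plain byte array "isset" as in the
discussion (this is not code from the patch):

#include <stdint.h>

/* Return the index of the first clear bit, or -1 if every slot is taken. */
static int
find_unset_slot(const uint8_t *isset, int nbytes)
{
	for (int i = 0; i < nbytes; i++)
	{
		if (isset[i] == 0xFF)
			continue;			/* all 8 slots in this byte already set */

		for (int bit = 0; bit < 8; bit++)
		{
			if ((isset[i] & (1 << bit)) == 0)
				return i * 8 + bit;
		}
	}
	return -1;
}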

> In terms of lookup performance, the results vary but I could not find
> any common pattern that makes the performance better or worse. Getting
> more statistics such as the number of each node type per tree level
> might help me.

I think that's a sign that the choice of node types might not be
terribly important for these two cases. That's good if that's true in
general -- a future performance-critical use of this code might tweak
things for itself without upsetting vacuum.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: Making the subquery alias optional in the FROM clause

2022-06-28 Thread Isaac Morland
On Tue, 28 Jun 2022 at 00:32, Julien Rouhaud  wrote:

> As to forcing SQL-compliant queries, that ship sailed a long time ago:
> > Postgres allows but does not enforce the use of SQL-compliant queries,
> and
> > many of its important features are extensions anyway, so forcing SQL
> > compliant queries is out of the question (although I could see the
> utility
> > of a mode where it warns or errors on non-compliant queries, at least in
> > principle).
>
> Sure, but it doesn't mean that we should support even more non-compliant
> syntax
> without any restraint.  In this case, I don't see much benefit as it's not
> solving performance problem or something like that.
>

It's improving developer performance by eliminating the need to make up
utterly useless names. I don't care if behind the scenes names are
assigned, although it would be even better if the names didn't exist at
all. I just want the computer to do stuff for me that requires absolutely
no human judgement whatsoever.

> As to bad habits, I'm having trouble understanding. Why do you think
> > leaving the alias off a subquery is a bad habit (assuming it were
> allowed)?
>
> I think It's a bad habit because as far as I can see it's not supported on
> mysql or sqlserver.
>

So it’s a bad habit to use features of Postgres that aren’t available on
MySQL or SQL Server?

For myself, I don’t care one bit about whether my code will run on those
systems, or Oracle: as far as I’m concerned I write Postgres applications,
not SQL applications. Of course, many people have a need to support other
systems, so I appreciate the care we take to document the differences from
the standard, and I hope we will continue to support standard queries. But
if it’s a bad habit to use Postgres-specific features, why do we create any
of those features?

> If the name is never used, why are we required to supply it?
>
> But similarly, I have many times relied on the fact that writable CTEs are
> executed
> even if not explicitly referenced.  So by the same argument shouldn't we
> allow
> something like this?
>
> WITH (INSERT INTO t SELECT * FROM pending WHERE ts < now())
> SELECT now() AS last_processing_time;
>

I’m not necessarily opposed to allowing this too. But the part which causes
me annoyance is normal subquery naming.


Re: Separate the attribute physical order from logical order

2022-06-28 Thread Isaac Morland
On Tue, 28 Jun 2022 at 05:32, Julien Rouhaud  wrote:

> I think that supporting at least a way to specify the logical order during
> the
> table creation should be easy to implement (there shouldn't be any
> question on whether it needs to invalidate any cache or what lock level to
> use), and could also be added in the initial submission without much extra
> efforts, which could help with the testing.
>

I think the meaning of “logical order” (well, the meaning it has for me, at
least) implies that the logical order of a table after CREATE TABLE is the
order in which the columns were given in the table creation statement.

If there needs to be a way of specifying the physical order separately,
that is a different matter.

ALTER TABLE ADD … is another matter. Syntax there to be able to say BEFORE
or AFTER an existing column would be nice to have. Presumably it would
physically add the column at the end but set the logical position as
specified.


Re: [PATCH] Compression dictionaries for JSONB

2022-06-28 Thread Aleksander Alekseev
Hi Simon,

Many thanks for your feedback!

I'm going to submit an updated version of the patch in a bit. I just
wanted to reply to some of your questions / comments.

> Dictionaries have no versioning. [...]

> Does the order of entries in the dictionary allow us to express a priority? 
> i.e. to allow Huffman coding. [...]

This is something we discussed in the RFC thread. I got the impression
that a consensus was reached:

1. To simply use 32-bit codes in the compressed documents, instead of
16-bit ones as it was done in ZSON;
2. Not to use any sort of variable-length coding;
3. Not to use dictionary versions. New codes can be added to the
existing dictionaries by executing ALTER TYPE mydict ADD ENTRY. (This
also may answer your comment regarding a limit on SQL statement size.)
4. The compression scheme can be altered in the future if needed.
Every compressed document stores algorithm_version (1 byte).

Does this plan of action sound OK to you? At this point it is not too
difficult to make design changes.

-- 
Best regards,
Aleksander Alekseev




better error description on logical replication

2022-06-28 Thread Marcos Pegoraro
I don't know how to create a patch, maybe someday, but for now I'm just
sending this little problem if somebody can solve it.

In a multi-schema environment where several tables have the same structure, it is
a little bit hard to know which one already has that primary key.

In the log I now see on the replica server:
Message:duplicate key value violates unique constraint "pkcustomer"
Detail: Key (customer_id)=(530540) already exists.

So, I know what table it is but I don't know what schema it belongs to.

Thanks
Marcos


Re: Support logical replication of DDLs

2022-06-28 Thread Amit Kapila
On Tue, Jun 21, 2022 at 5:49 PM houzj.f...@fujitsu.com
 wrote:
>
> On Monday, June 20, 2022 11:32 AM houzj.f...@fujitsu.com 
>  wrote:
> >
>
> Attach the new version patch set which added support for CREATE/DROP/ATER
> Sequence and CREATE/DROP Schema ddl commands which are provided by Ajin
> Cherian off list.
>

Few questions/comments on v9-0001-Functions-to-deparse-DDL-commands
===
1.
+/*
+ * Similar to format_type_internal, except we return each bit of information
+ * separately:
...
...
+static void
+format_type_detailed(Oid type_oid, int32 typemod,
+ Oid *nspid, char **typname, char **typemodstr,
+ bool *typarray)

The function mentioned in the comments seems to be changed to
format_type_extended in commit a26116c6. If so, change it accordingly.

2. It is not clear to me why format_type_detailed needs to use the
'peculiar_typmod' label and then a goto statement. In
format_type_extended, we have similar code but without using the goto
statement, so can't we use a similar way here as well?

3.
format_type_detailed()
{
...
+ typeform->typstorage != 'p')

It is better to replace the constant 'p' with TYPSTORAGE_PLAIN.

4. It seems to me that the handling of some of the built-in types is
different between format_type_detailed and format_type_extended. Can
we add some comments to explain the same?

5.
+static ObjTree *
+deparse_CreateStmt(Oid objectId, Node *parsetree)
{
...
+ tmp = new_objtree_VA("TABLESPACE %{tablespace}I", 0);
+ if (node->tablespacename)
+ append_string_object(tmp, "tablespace", node->tablespacename);
+ else
+ {
+ append_null_object(tmp, "tablespace");
+ append_bool_object(tmp, "present", false);
+ }
+ append_object_object(createStmt, "tablespace", tmp);
...
}

Why do we need to append the objects (tablespace, with clause, etc.)
when they are not present in the actual CREATE TABLE statement? The
reason to ask this is that this makes the string that we want to send
downstream much longer than the actual statement given by the user on
the publisher.

-- 
With Regards,
Amit Kapila.




Re: BufferAlloc: don't take two simultaneous locks

2022-06-28 Thread Yura Sokolov
On Tue, 28/06/2022 at 14:26 +0300, Yura Sokolov wrote:
> On Tue, 28/06/2022 at 14:13 +0300, Yura Sokolov wrote:
> 
> > Tests:
> > - tests done on 2 socket Xeon 5220 2.20GHz with turbo boost disabled
> >   (ie max frequency is 2.20GHz)
> 
> Forgot to mention:
> - this time it was Centos7.9.2009 (Core) with Linux mn10 
> 3.10.0-1160.el7.x86_64
> 
> Perhaps the older kernel explains master's poor performance on 2 sockets
> compared to my previous results (when this server had Linux 5.10.103-1 
> Debian).
> 
> Or there is a degradation in PostgreSQL's master branch in between.
> I'll try to check today.

No, an old master commit (7e12256b47, Sat Mar 12 14:21:40 2022) behaves the same.
So it is clearly an old-kernel issue. Perhaps futex was much slower in those
days.





Re: Comments referring to pg_start/stop_backup

2022-06-28 Thread David Steele

On 6/28/22 01:00, Kyotaro Horiguchi wrote:

At Tue, 28 Jun 2022 13:41:58 +0900, Michael Paquier  wrote 
in

Hi all,

While browsing through the recent changes with the base backup APIs, I
have noticed that a couple of comments did not get the renaming of the
SQL functions to pg_backup_start/stop, as of the attached.

That's not a big deal, but let's be right.


+1 and I don't find other instances of the same mistake.


Yes, these also look good to me. They are a bit tricky to search for so 
I can see how we missed them.


Regards,
-David




Testing Index Skip scan

2022-06-28 Thread Alexandre Felipe
I have been interested in a query that returns a batch of results filtered
by a subset of the first column of an index and ordered by the second.

I created a simple (hopefully) reproducible example of the issue, the two
queries describe the same data but have very different costs (explain
output included in the attached file).
server_version 12.8

On the Slack #pgsql-hackers channel, @sfrost suggested that what I
described is achieved by an index skip scan. How can I get a development build
to test this feature?


create table ordered_skip_index_scan_test (id serial not null
constraint ordered_skip_index_scan_test_pkey primary key, col1 INT4, col2 
INT4);

insert into ordered_skip_index_scan_test (col1, col2)
  select trunc(sqrt(random()*1e6)) as col1, (random()*1e6)::int as col2
from generate_series(1, 11e6);

create index ordered_skip_index_col2 on ordered_skip_index_scan_test using
btree (col2);
create index ordered_skip_index_col1_col2 on ordered_skip_index_scan_test using
btree (col1, col2);
analyse ordered_skip_index_scan_test;

select count(*) from ordered_skip_index_scan_test;

explain (analyse, buffers)
select * from ordered_skip_index_scan_test
where col2 > 5
  and col1 in (1, 999)
order by col2
limit 10;
-- Limit  (cost=0.43..263.30 rows=10 width=12) (actual time=0.083..5.668 
rows=10 loops=1)
--   Buffers: shared hit=5362
--   ->  Index Scan using ordered_skip_index_col2 on 
ordered_skip_index_scan_test  (cost=0.43..492524.49 rows=18737 width=12) 
(actual time=0.082..5.664 rows=10 loops=1)
-- Index Cond: (col2 > 5)
-- Filter: (col1 = ANY ('{1,999}'::integer[]))
-- Rows Removed by Filter: 5343
-- Buffers: shared hit=5362
-- Planning Time: 0.135 ms
-- Execution Time: 5.693 ms


explain (analyse, buffers)
with t as ((select *
   from ordered_skip_index_scan_test
   where col2 between 5 and 50
 and col1 = 13
   limit 10)
union (select *
   from ordered_skip_index_scan_test
   -- Not taking advantage of the partial
   -- results to adjust the bounds of the
   -- subsequent queries
   where col2 between 5 and 50
 and col1 = 8
   limit 10)
union (select *
   from ordered_skip_index_scan_test
   where col2 between 5 and 50
 and col1 = 5
   limit 10)
union (select *
   from ordered_skip_index_scan_test
   where col2 between 5 and 50
 and col1 = 845
   limit 10))
select * from t
order by col2
limit 10;
-- Limit  (cost=159.95..159.97 rows=10 width=12) (actual time=0.225..0.230 
rows=10 loops=1)
--   Buffers: shared hit=52
--   ->  Sort  (cost=159.95..160.05 rows=40 width=12) (actual time=0.224..0.227 
rows=10 loops=1)
-- Sort Key: ordered_skip_index_scan_test.col2
-- Sort Method: top-N heapsort  Memory: 25kB
-- Buffers: shared hit=52
-- ->  HashAggregate  (cost=158.28..158.68 rows=40 width=12) (actual 
time=0.167..0.178 rows=40 loops=1)
--   Group Key: ordered_skip_index_scan_test.id, 
ordered_skip_index_scan_test.col1, ordered_skip_index_scan_test.col2
--   Batches: 1  Memory Usage: 24kB
--   Buffers: shared hit=52
--   ->  Append  (cost=0.43..157.98 rows=40 width=12) (actual 
time=0.026..0.144 rows=40 loops=1)
-- Buffers: shared hit=52
-- ->  Limit  (cost=0.43..39.35 rows=10 width=12) (actual 
time=0.026..0.044 rows=10 loops=1)
--   Buffers: shared hit=13
--   ->  Index Scan using ordered_skip_index_col1_col2 
on ordered_skip_index_scan_test  (cost=0.43..17214.39 rows=4424 width=12) 
(actual time=0.025..0.042 rows=10 loops=1)
-- Index Cond: ((col1 = 13) AND (col2 >= 5) 
AND (col2 <= 50))
-- Buffers: shared hit=13
-- ->  Limit  (cost=0.43..39.35 rows=10 width=12) (actual 
time=0.011..0.028 rows=10 loops=1)
--   Buffers: shared hit=13
--   ->  Index Scan using ordered_skip_index_col1_col2 
on ordered_skip_index_scan_test ordered_skip_index_scan_test_1  
(cost=0.43..17214.39 rows=4424 width=12) (actual time=0.011..0.026 rows=10 
loops=1)
-- Index Cond: ((col1 = 8) AND (col2 >= 5) 
AND (col2 <= 50))
-- Buffers: shared hit=13
-- ->  Limit  (cost=0.43..39.35 rows=10 width=12) (actual 
time=0.010..0.025 rows=10 loops=1)
--   Buffers: shared hit=13
--   ->  Index Scan using ordered_skip_index_col1_col2 
on ordered_skip_index_scan_test ordered_skip_index_scan_test_2  
(cost=0.43..17214.39 rows=4424 width=12) (actual time=0.010..0.023 rows=10 
loops=1)
-- Index Cond: ((col1 = 5) AND (col2 >= 5) 
AND (col2 <= 50))
-- Buffers: shared hit=
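
For comparison, the manual UNION above can also be written as a LATERAL join
over the IN-list values, so that each branch uses the (col1, col2) index in
order and only the merged handful of rows needs sorting. A rough sketch against
the same table and indexes (untested here, and not something from a skip-scan
build):

-- one ordered index scan per col1 value, then merge and re-apply the limit
select t.*
from (values (1), (999)) as v(col1)
cross join lateral (
    select *
    from ordered_skip_index_scan_test s
    where s.col1 = v.col1
      and s.col2 > 5
    order by s.col2
    limit 10
) as t
order by t.col2
limit 10;

Each LATERAL branch is an ordered scan of ordered_skip_index_col1_col2 capped
at 10 rows, so the outer sort only ever sees at most 20 rows.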

Re: making relfilenodes 56 bits

2022-06-28 Thread Simon Riggs
On Sat, 25 Jun 2022 at 02:30, Andres Freund  wrote:

> > And then like this in 0003:
> >
> > typedef struct buftag
> > {
> > Oid spcOid;
> > Oid dbOid;
> > RelFileNumber   fileNumber:56;
> > ForkNumber  forkNum:8;
> > } BufferTag;
>
> Probably worth checking the generated code / the performance effects of using
> bitfields (vs manual maskery). I've seen some awful cases, but here it's at a
> byte boundary, so it might be ok.

Another approach would be to condense spcOid and dbOid into a single
4-byte Oid-like number, since in most cases they are associated with
each other, and there are not many of them anyway. This new number
would indicate both the database and the tablespace. I know that we
want to be able to make file changes without doing catalog lookups,
but since the number of combinations is usually 1, and even then low,
it can be cached easily in an smgr array and included in the checkpoint
record (or nearby) for ease of use.

typedef struct buftag
{
    Oid             db_spcOid;
    ForkNumber      forkNum;        /* 32 bits */
    RelFileNumber   fileNumber;     /* 64 bits */
} BufferTag;

That way we could just have a simple 64-bit RelFileNumber, without
restriction, and probably some spare bytes on the ForkNumber, if we
needed them later.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: BufferAlloc: don't take two simultaneous locks

2022-06-28 Thread Yura Sokolov
On Tue, 28/06/2022 at 14:13 +0300, Yura Sokolov wrote:

> Tests:
> - tests done on 2 socket Xeon 5220 2.20GHz with turbo boost disabled
>   (ie max frequency is 2.20GHz)

Forgot to mention:
- this time it was Centos7.9.2009 (Core) with Linux mn10 3.10.0-1160.el7.x86_64

Perhaps the older kernel explains master's poor performance on 2 sockets
compared to my previous results (when this server had Linux 5.10.103-1 Debian).

Or there is a degradation in PostgreSQL's master branch in between.
I'll try to check today.

regards

---

Yura Sokolov





Re: Allowing REINDEX to have an optional name

2022-06-28 Thread Simon Riggs
On Tue, 28 Jun 2022 at 08:29, Simon Riggs  wrote:
>
> On Thu, 2 Jun 2022 at 01:02, Michael Paquier  wrote:
> >
> > On Tue, May 31, 2022 at 02:30:32PM +0200, Alvaro Herrera wrote:
> > > I was thinking the opposite: REINDEX DATABASE with or without a database
> > > name should always process the user relations and skip system catalogs.
> > > If the user wants to do both, then they can use REINDEX SYSTEM in
> > > addition.
> > >
> > > The reason for doing it like this is that there is no way to process
> > > only user tables and skip catalogs.  So this is better for
> > > composability.
> >
> > No objections from me to keep this distinction at the end, as long as
> > the database name in the command has no impact on the chosen
> > behavior.
>
> OK, that's clear. Will progress.

Attached patch is tested, documented and imho ready to be committed,
so I will mark it so in CFapp.
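
For reference, the behaviour discussed upthread comes down to something like
this (a rough sketch of the intended usage, not an excerpt from the patch or
its documentation):

-- reindex user relations of the current database; system catalogs are skipped
REINDEX DATABASE;
-- reindex system catalogs separately, only if that is wanted too
REINDEX SYSTEM;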

-- 
Simon Riggs    http://www.EnterpriseDB.com/


reindex_not_require_database_name.v3.patch
Description: Binary data


Re: Separate the attribute physical order from logical order

2022-06-28 Thread Julien Rouhaud
Hi,

On Tue, Jun 28, 2022 at 10:53:14AM +0200, Alvaro Herrera wrote:
> On 2022-Jun-28, Julien Rouhaud wrote:
>
> > So, assuming that the current JOIN expansion order shouldn't be
> > changed, I implemented the last approach I mentioned.
>
> Yeah, I'm not sure that this is a good assumption.  I mean, if logical
> order is the order in which users see the table columns, then why
> shouldn't JOIN expand in the same way?  My feeling is that every aspect
> of user interaction should show columns ordered in logical order.  When
> I said that "only star expansion changes" upthread, what I meant is that
> there was no need to support any additional functionality such as
> letting the column order be changed or the server changing things
> underneath to avoid alignment padding, etc.

I'm not entirely sure of what you meant.  Assuming tables a(a, z) and b(b, z),
what do you think those queries should return?

SELECT * FROM a JOIN b on a.z = b.z
Currently it returns (a.a, a.z, b.b, b.z)

SELECT * FROM a JOIN b USING (z)
Currently it returns a.z, a.a, b.b.

Should it now return (a.a, z, b.b) as long as the tables have that logical
order, whether or not any other position (attnum / attphysnum) is different, or
should it stay the same as now?

> Anyway, I think your 0001 is not a good first step.

FWIW this is just what you were previously suggesting at [1].

> I think a better
> first step is a patch that adds two more columns to pg_attribute:
> attphysnum and attlognum (or something like that.  This is the name I
> used years ago, but if you want to choose different, it's okay.)  In
> 0001, these columns would all be always identical, and there's no
> functionality to handle the case where they differ (probably even add
> some Assert that they are identical).  The idea behind these three
> columns is: attnum is a column identity and it never changes from the
> first value that is assigned to the column.  attphysnum represents the
> physical position of the table.  attlognum is the position where the
> column appears for user interaction.

I'm not following.  If we keep attnum as the official identity position and
use attlognum as the position that should be used in any interactive command,
wouldn't that risk breaking every single client?

Imagine you have some framework that automatically generates queries based on
the catalog. If it sees table abc with:
c: attnum 1, attphysnum 1, attlognum 3
b: attnum 2, attphysnum 2, attlognum 2
a: attnum 3, attphysnum 3, attlognum 1

and you ask that layer to generate an insert with something like {'a': 'a',
'b': 'b', 'c': 'c'}, what would prevent it from generating:

INSERT INTO abc VALUES ('c', 'b', 'a');

while attlognum says it should have been

INSERT INTO abc VALUES ('a', 'b', 'c');
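
(A client that always spells out an explicit column list is unaffected either
way, e.g.

INSERT INTO abc (a, b, c) VALUES ('a', 'b', 'c');

but plenty of generated SQL relies on positional VALUES lists, as above.)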

> In a 0002 patch, you would introduce backend support for the case where
> attlognum differs from the other two; but the other two are always the
> same and it's okay if the server misbehaves or crashes if attphysnum is
> different from attnum (best: keep the asserts that they are always the
> same).  Doing it this way limits the number of cases that you have to
> deal with, because there will be enough difficulty already.  You need to
> change RTE expansion everywhere: *-expansion, COPY, JOIN, expansion of
> SQL function results, etc ...  even psql \d ;-)  But, again: the
> physical position is always column identity and there's no way to
> reorder the columns physically for storage efficiency.

Just to clarify my understanding: apart from the fact that I'm only using
attphysnum (for your attnum and attphysnum) and attnum (for your attlognum), is
there any difference in behavior between what I started to implement (had it
been finished, of course) and what you're saying here?

Also, about the default values evaluation (see [2]), should it be tied to your
attnum, attphysnum or attlognum?

> You could put ALTER TABLE support for moving columns as 0003.  (So
> testing for 0002 would just be some UPDATE sentences or some hack that
> lets you test various cases.)
>
> In a 0004 patch, you would introduce backend support for attphysnum to
> be different.  Probably no DDL support yet, since maybe we don't want
> that, but instead we would like the server to figure out the best
> possible packing based on alignment padding, nullability varlenability.
> So testing for this part is again just some UPDATEs.
>
> I think 0001+0002 are already a submittable patchset.

I think that supporting at least a way to specify the logical order during
table creation should be easy to implement (there shouldn't be any
question about whether it needs to invalidate any cache or what lock level to
use), and could also be added in the initial submission without much extra
effort, which would help with testing.

[1] 
https://www.postgresql.org/message-id/202108181639.xjuovrpwgkr2@alvherre.pgsql
[2] 
https://www.postgresql.org/message-id/20220626024824.qnlpp6vikzjvuxs3%40jrouhaud




Re: Logging query parmeters in auto_explain

2022-06-28 Thread Dagfinn Ilmari Mannsåker
Michael Paquier  writes:

> On Mon, Jun 27, 2022 at 12:27:57PM +0100, Dagfinn Ilmari Mannsåker wrote:
>> I forgot to mention, I also changed the order of the query and
>> parameters, so that they can more naturally be left out when no changes
>> are needed.
>
> I can see that, and I have added $node as an extra parameter of the
> routine.  After putting my hands on that, it also felt a bit unnatural
> to do the refactoring of the tests and the addition of the new GUC in
> the same patch, so I have split things as the attached.  The amount of
> coverage is still the same but it makes the integration of the feature
> easier to follow. 

That makes sense, but I still think the query_log() function definition
should go at the end (after done_testing()), so the machinery doesn't
distract from what's actually being tested.

Also, the second paragraph of the second commit message now belongs in
the first commit (without the word "Also").

Thanks,
ilmari




Re: Separate the attribute physical order from logical order

2022-06-28 Thread Alvaro Herrera
On 2022-Jun-28, Julien Rouhaud wrote:

> So, assuming that the current JOIN expansion order shouldn't be
> changed, I implemented the last approach I mentioned.

Yeah, I'm not sure that this is a good assumption.  I mean, if logical
order is the order in which users see the table columns, then why
shouldn't JOIN expand in the same way?  My feeling is that every aspect
of user interaction should show columns ordered in logical order.  When
I said that "only star expansion changes" upthread, what I meant is that
there was no need to support any additional functionality such as
letting the column order be changed or the server changing things
underneath to avoid alignment padding, etc.


Anyway, I think your 0001 is not a good first step.  I think a better
first step is a patch that adds two more columns to pg_attribute:
attphysnum and attlognum (or something like that; these are the names I
used years ago, but if you want to choose differently, it's okay).  In
0001, these columns would always be identical, and there's no
functionality to handle the case where they differ (probably even add
some Assert that they are identical).  The idea behind these three
columns is: attnum is a column identity and it never changes from the
first value that is assigned to the column.  attphysnum represents the
physical position of the column in the table.  attlognum is the position where the
column appears for user interaction.

In a 0002 patch, you would introduce backend support for the case where
attlognum differs from the other two; but the other two are always the
same and it's okay if the server misbehaves or crashes if attphysnum is
different from attnum (best: keep the asserts that they are always the
same).  Doing it this way limits the number of cases that you have to
deal with, because there will be enough difficulty already.  You need to
change RTE expansion everywhere: *-expansion, COPY, JOIN, expansion of
SQL function results, etc ...  even psql \d ;-)  But, again: the
physical position is always column identity and there's no way to
reorder the columns physically for storage efficiency.

You could put ALTER TABLE support for moving columns as 0003.  (So
testing for 0002 would just be some UPDATE sentences or some hack that
lets you test various cases.)

In a 0004 patch, you would introduce backend support for attphysnum to
be different.  Probably no DDL support yet, since maybe we don't want
that, but instead we would like the server to figure out the best
possible packing based on alignment padding, nullability, varlenability.
So testing for this part is again just some UPDATEs.

I think 0001+0002 are already a submittable patchset.

-- 
Álvaro Herrera         Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Si quieres ser creativo, aprende el arte de perder el tiempo"




Re: Add LZ4 compression in pg_dump

2022-06-28 Thread gkokolatos






--- Original Message ---
On Sunday, June 26th, 2022 at 5:55 PM, Justin Pryzby  
wrote:


>
>
> Hi,
>
> Will you be able to send a rebased patch for the next CF ?

Thank you for taking an interest in the PR. The plan is indeed to send
a new version.

> If you update for the review comments I sent in March, I'll plan to do another
> round of review.

Thank you.

>
> On Sat, Mar 26, 2022 at 11:21:56AM -0500, Justin Pryzby wrote:
>
> > LZ4F_HEADER_SIZE_MAX isn't defined in old LZ4.
> >
> > I ran into that on an ubuntu LTS, so I don't think it's so old that it
> > shouldn't be handled more gracefully. LZ4 should either have an explicit
> > version check, or else shouldn't depend on that feature (or should define a
> > safe fallback version if the library header doesn't define it).
> >
> > https://packages.ubuntu.com/liblz4-1
> >
> > 0003: typo: of legacy => or legacy
> >
> > There are a large number of ifdefs being added here - it'd be nice to 
> > minimize
> > that. basebackup was organized to use separate files, which is one way.
> >
> > $ git grep -c 'ifdef .*LZ4' src/bin/pg_dump/compress_io.c
> > src/bin/pg_dump/compress_io.c:19
> >
> > In last year's CF entry, I had made a union within CompressorState. LZ4
> > doesn't need z_streamp (and ztsd will need ZSTD_outBuffer, ZSTD_inBuffer,
> > ZSTD_CStream).
> >
> > 0002: I wonder if you're able to re-use any of the basebackup parsing stuff
> > from commit ffd53659c. You're passing both the compression method and level.
> > I think there should be a structure which includes both. In the future, that
> > can also handle additional options. I hope to re-use these same things for
> > wal_compression=method:level.
> >
> > You renamed this:
> >
> > |- COMPR_ALG_LIBZ
> > |-} CompressionAlgorithm;
> > |+ COMPRESSION_GZIP,
> > |+} CompressionMethod;
> >
> > ..But I don't think that's an improvement. If you were to change it, it 
> > should
> > say something like PGDUMP_COMPRESS_ZLIB, since there are other compression
> > structs and typedefs. zlib is not identical to gzip, which uses a different
> > header, so in WriteDataToArchive(), LIBZ is correct, and GZIP is incorrect.
> >
> > The cf* changes in pg_backup_archiver could be split out into a separate
> > commit. It's strictly a code simplification - not just preparation for more
> > compression algorithms. The commit message should "See also:
> > bf9aa490db24b2334b3595ee33653bf2fe39208c".
> >
> > The changes in 0002 for cfopen_write seem insufficient:
> > |+ if (compressionMethod == COMPRESSION_NONE)
> > |+ fp = cfopen(path, mode, compressionMethod, 0);
> > | else
> > | {
> > | #ifdef HAVE_LIBZ
> > | char *fname;
> > |
> > | fname = psprintf("%s.gz", path);
> > |- fp = cfopen(fname, mode, compression);
> > |+ fp = cfopen(fname, mode, compressionMethod, compressionLevel);
> > | free_keep_errno(fname);
> > | #else
> >
> > The only difference between the LIBZ and uncompressed case is the file
> > extension, and it'll be the only difference with LZ4 too. So I suggest to
> > first handle the file extension, and the rest of the code path is not
> > conditional on the compression method. I don't think cfopen_write even needs
> > HAVE_LIBZ - can't you handle that in cfopen_internal() ?
> >
> > This patch rejects -Z0, which ought to be accepted:
> > ./src/bin/pg_dump/pg_dump -h /tmp regression -Fc -Z0 |wc
> > pg_dump: error: can only specify -Z/--compress [LEVEL] when method is set
> >
> > Your 0003 patch shouldn't reference LZ4:
> > +#ifndef HAVE_LIBLZ4
> > + if (*compressionMethod == COMPRESSION_LZ4)
> > + supports_compression = false;
> > +#endif
> >
> > The 0004 patch renames zlibOutSize to outsize - I think the patch series 
> > should
> > be constructed such as to minimize the size of the method-specific patches. 
> > I
> > say this anticipating also adding support for zstd. The preliminary patches
> > should have all the boring stuff. It would help for reviewing to keep the
> > patches split up, or to enumerate all the boring things that are being 
> > renamed
> > (like change OutputContext to cfp, rename zlibOutSize, ...).
> >
> > 0004: The include should use  and not "lz4.h"
> >
> > freebsd/cfbot is failing.
> >
> > I suggested off-list to add an 0099 patch to change LZ4 to the default, to
> > exercise it more on CI.
>
>
> On Sat, Mar 26, 2022 at 01:33:36PM -0500, Justin Pryzby wrote:
>
> > On Sat, Mar 26, 2022 at 11:21:56AM -0500, Justin Pryzby wrote:
> >
> > > You're passing both the compression method and level. I think there should
> > > be a structure which includes both. In the future, that can also handle
> > > additional options.
> >
> > I'm not sure if there's anything worth saving, but I did that last year with
> > 0003-Support-multiple-compression-algs-levels-opts.patch
> > I sent a rebased copy off-list.
> > https://www.postgresql.org/message-id/flat/20210104025321.ga9...@telsasoft.com#ca1b9f9d3552c87fa874731cad9d8391
> >
> > | fatal("not built with LZ4 support");
> > | fatal("not built with l

Re: Logging query parmeters in auto_explain

2022-06-28 Thread Michael Paquier
On Mon, Jun 27, 2022 at 12:27:57PM +0100, Dagfinn Ilmari Mannsåker wrote:
> I forgot to mention, I also changed the order of the query and
> parameters, so that they can more naturally be left out when no changes
> are needed.

I can see that, and I have added $node as an extra parameter of the
routine.  After putting my hands on that, it also felt a bit unnatural
to do the refactoring of the tests and the addition of the new GUC in
the same patch, so I have split things as the attached.  The amount of
coverage is still the same but it makes the integration of the feature
easier to follow. 
--
Michael
From 17899218c72b43794832bb55c0e785bbdf3e97ea Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Tue, 28 Jun 2022 16:37:29 +0900
Subject: [PATCH v5 1/2] Refactor TAP test of auto_explain

---
 contrib/auto_explain/t/001_auto_explain.pl | 90 ++
 1 file changed, 74 insertions(+), 16 deletions(-)

diff --git a/contrib/auto_explain/t/001_auto_explain.pl b/contrib/auto_explain/t/001_auto_explain.pl
index 82e4d9d15c..1523ac2d46 100644
--- a/contrib/auto_explain/t/001_auto_explain.pl
+++ b/contrib/auto_explain/t/001_auto_explain.pl
@@ -8,6 +8,42 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 
+# Runs the specified query and returns the emitted server log.
+# If any parameters are specified, these are set in postgresql.conf,
+# and reset after the query is run.
+sub query_log
+{
+	my ($node, $sql, $params) = @_;
+	$params ||= {};
+
+	if (keys %$params)
+	{
+		for my $key (keys %$params)
+		{
+			$node->append_conf('postgresql.conf', "$key = $params->{$key}\n");
+		}
+		$node->reload;
+	}
+
+	my $log= $node->logfile();
+	my $offset = -s $log;
+
+	$node->safe_psql("postgres", $sql);
+
+	my $log_contents = slurp_file($log, $offset);
+
+	if (keys %$params)
+	{
+		for my $key (keys %$params)
+		{
+			$node->adjust_conf('postgresql.conf', $key, undef);
+		}
+		$node->reload;
+	}
+
+	return $log_contents;
+}
+
 my $node = PostgreSQL::Test::Cluster->new('main');
 $node->init;
 $node->append_conf('postgresql.conf',
@@ -16,39 +52,61 @@ $node->append_conf('postgresql.conf', "auto_explain.log_min_duration = 0");
 $node->append_conf('postgresql.conf', "auto_explain.log_analyze = on");
 $node->start;
 
-# run a couple of queries
-$node->safe_psql("postgres", "SELECT * FROM pg_class;");
-$node->safe_psql("postgres",
-	"SELECT * FROM pg_proc WHERE proname = 'int4pl';");
+# Simple query.
+my $log_contents = query_log($node, "SELECT * FROM pg_class;");
 
-# emit some json too
-$node->append_conf('postgresql.conf', "auto_explain.log_format = json");
-$node->reload;
-$node->safe_psql("postgres", "SELECT * FROM pg_proc;");
-$node->safe_psql("postgres",
-	"SELECT * FROM pg_class WHERE relname = 'pg_class';");
-
-$node->stop('fast');
-
-my $log = $node->logfile();
-
-my $log_contents = slurp_file($log);
+like(
+	$log_contents,
+	qr/Query Text: SELECT \* FROM pg_class;/,
+	"query text logged, text mode");
 
 like(
 	$log_contents,
 	qr/Seq Scan on pg_class/,
 	"sequential scan logged, text mode");
 
+# Prepared query.
+$log_contents = query_log($node,
+	q{PREPARE get_proc(name) AS SELECT * FROM pg_proc WHERE proname = $1; EXECUTE get_proc('int4pl');}
+);
+
+like(
+	$log_contents,
+	qr/Query Text: PREPARE get_proc\(name\) AS SELECT \* FROM pg_proc WHERE proname = \$1;/,
+	"prepared query text logged, text mode");
+
 like(
 	$log_contents,
 	qr/Index Scan using pg_proc_proname_args_nsp_index on pg_proc/,
 	"index scan logged, text mode");
 
+# JSON format.
+$log_contents = query_log(
+	$node,
+	"SELECT * FROM pg_proc;",
+	{ "auto_explain.log_format" => "json" });
+
+like(
+	$log_contents,
+	qr/"Query Text": "SELECT \* FROM pg_proc;"/,
+	"query text logged, json mode");
+
 like(
 	$log_contents,
 	qr/"Node Type": "Seq Scan"[^}]*"Relation Name": "pg_proc"/s,
 	"sequential scan logged, json mode");
 
+# Prepared query in JSON format.
+$log_contents = query_log(
+	$node,
+	q{PREPARE get_class(name) AS SELECT * FROM pg_class WHERE relname = $1; EXECUTE get_class('pg_class');},
+	{ "auto_explain.log_format" => "json" });
+
+like(
+	$log_contents,
+	qr/"Query Text": "PREPARE get_class\(name\) AS SELECT \* FROM pg_class WHERE relname = \$1;"/,
+	"prepared query text logged, json mode");
+
 like(
 	$log_contents,
 	qr/"Node Type": "Index Scan"[^}]*"Index Name": "pg_class_relname_nsp_index"/s,
-- 
2.36.1

From 2d9e496f8ccd9a1b14c785609f27dd94c4cfd5a9 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Tue, 28 Jun 2022 16:49:26 +0900
Subject: [PATCH v5 2/2] Log query parameters in auto_explain

Add an auto_explain.log_parameter_max_length config setting, similar
to the corresponding core setting, that controls the inclusion of
query parameters in the logged explain output.

Also adjust the tests to only look at the relevant parts of the log
for each query, and reset the GUCs after each test.
---
 src/include/commands/explain.h |  1 +
 src/backend/commands/explain.c |

Re: Allowing REINDEX to have an optional name

2022-06-28 Thread Simon Riggs
On Thu, 2 Jun 2022 at 01:02, Michael Paquier  wrote:
>
> On Tue, May 31, 2022 at 02:30:32PM +0200, Alvaro Herrera wrote:
> > I was thinking the opposite: REINDEX DATABASE with or without a database
> > name should always process the user relations and skip system catalogs.
> > If the user wants to do both, then they can use REINDEX SYSTEM in
> > addition.
> >
> > The reason for doing it like this is that there is no way to process
> > only user tables and skip catalogs.  So this is better for
> > composability.
>
> No objections from me to keep this distinction at the end, as long as
> the database name in the command has no impact on the chosen
> behavior.

OK, that's clear. Will progress.

> Could there be a point in having a REINDEX ALL though that
> would process both the user relations and the catalogs, doing the same
> thing as REINDEX DATABASE today?

A key point is that REINDEX SYSTEM has problems, so it should be avoided.
Hence, including both database and system together in a new command
would not be a great idea at this time.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: margay fails assertion in stats/dsa/dsm code

2022-06-28 Thread Marcel Hofstetter

Am 28.06.2022 um 08:27 schrieb Thomas Munro:

On Fri, Jun 3, 2022 at 12:05 PM Thomas Munro  wrote:

BF animal margay (a newly started Solaris 11.4/Sparc/GCC 11.2 box) is
sometimes failing with:

TRAP: FailedAssertion("seg->mapped_address != NULL", File: "dsm.c",
Line: 1069, PID: 9038)


I spent some time on the GCC farm machine gcc211 (Sol 11.3, GCC 5.5),
but could not repro this.  It's also not happening on wrasse (Sol
11.3, Sun Studio compiler).  I don't have access to a Sol 11.4
CBE/Sparc system like margay, but I have learned that CBE is the name
of a very recently announced rolling release intended for open source
developers[1].  I still have no idea if the active thing here is
Sparc, Sol 11.4, "CBE", GCC 11.2 or just timing conditions that reveal
bugs in our dsm/dsa/dshash/pgstat code that show up here in about 1/4
of make check runs on this stack, but miraculously nowhere else.
Perhaps margay's owner could shed some light, or has a way to provide
ssh access to a similar zone with a debugger etc installed?

[1] 
https://blogs.oracle.com/solaris/post/announcing-the-first-oracle-solaris-114-cbe



Looks like a timing issue to me, because it happens only sometimes.
No problems with versions 14 and 13.

I can provide ssh access to this system.





RE: Perform streaming logical transactions by background workers and parallel apply

2022-06-28 Thread wangw.f...@fujitsu.com
On Tues, Jun 28, 2022 at 12:15 PM Amit Kapila  wrote:
> On Tue, Jun 28, 2022 at 8:51 AM wangw.f...@fujitsu.com
>  wrote:
> >
> > On Thu, Jun 23, 2022 at 16:44 PM Amit Kapila 
> wrote:
> > > On Thu, Jun 23, 2022 at 12:51 PM wangw.f...@fujitsu.com
> > >  wrote:
> > > >
> > > > On Mon, Jun 20, 2022 at 11:00 AM Amit Kapila 
> > > wrote:
> > > > > I have improved the comments in this and other related sections of the
> > > > > patch. See attached.
> > > > Thanks for your comments and patch!
> > > > Improved the comments as you suggested.
> > > >
> > > > > > > 3.
> > > > > > > +
> > > > > > > +  
> > > > > > > +   Setting streaming mode to apply could 
> > > > > > > export
> invalid
> > > > > LSN
> > > > > > > +   as finish LSN of failed transaction. Changing the streaming 
> > > > > > > mode
> and
> > > > > making
> > > > > > > +   the same conflict writes the finish LSN of the failed 
> > > > > > > transaction in
> the
> > > > > > > +   server log if required.
> > > > > > > +  
> > > > > > >
> > > > > > > How will the user identify that this is an invalid LSN value and 
> > > > > > > she
> > > > > > > shouldn't use it to SKIP the transaction? Can we change the second
> > > > > > > sentence to: "User should change the streaming mode to 'on' if 
> > > > > > > they
> > > > > > > would instead wish to see the finish LSN on error. Users can use
> > > > > > > finish LSN to SKIP applying the transaction." I think we can give
> > > > > > > reference to docs where the SKIP feature is explained.
> > > > > > Improved the sentence as suggested.
> > > > > >
> > > > >
> > > > > You haven't answered first part of the comment: "How will the user
> > > > > identify that this is an invalid LSN value and she shouldn't use it to
> > > > > SKIP the transaction?". Have you checked what value it displays? For
> > > > > example, in one of the case in apply_error_callback as shown in below
> > > > > code, we don't even display finish LSN if it is invalid.
> > > > > else if (XLogRecPtrIsInvalid(errarg->finish_lsn))
> > > > > errcontext("processing remote data for replication origin \"%s\"
> > > > > during \"%s\" in transaction %u",
> > > > >errarg->origin_name,
> > > > >logicalrep_message_type(errarg->command),
> > > > >errarg->remote_xid);
> > > > I am sorry that I missed something in my previous reply.
> > > > The invalid LSN value here is to say InvalidXLogRecPtr (0/0).
> > > > Here is an example :
> > > > ```
> > > > 2022-06-23 14:30:11.343 CST [822333] logical replication worker CONTEXT:
> > > processing remote data for replication origin "pg_16389" during "INSERT" 
> > > for
> > > replication target relation "public.tab" in transaction 727 finished at 
> > > 0/0
> > > > ```
> > > >
> > >
> > > I don't think it is a good idea to display invalid values. We can mask
> > > this as we are doing in other cases in function apply_error_callback.
> > > The ideal way is that we provide a view/system table for users to
> > > check these errors but that is a matter of another patch. So users
> > > probably need to check Logs to see if the error is from a background
> > > apply worker to decide whether or not to switch streaming mode.
> >
> > Thanks for your comments.
> > I improved it as you suggested. I mask the LSN if it is an invalid LSN (0/0).
> > Also, I improved the related pg-doc as follows:
> > ```
> >When the streaming mode is parallel, the finish LSN of
> >failed transactions may not be logged. In that case, it may be necessary 
> > to
> >change the streaming mode to on and cause the same
> >conflicts again so the finish LSN of the failed transaction will be 
> > written
> >to the server log. For the usage of finish LSN, please refer to  >linkend="sql-altersubscription">ALTER SUBSCRIPTION ...
> >SKIP.
> > ```
> > After improving this (masking the invalid LSN), I found that this improvement
> > and the parallel apply patch do not seem to be strongly related. Would it be
> > better to make this improvement and commit it as a separate patch?
> >
> 
> Is there any other case where we can hit this code path (mask
> invalidLSN) without this patch?

I realized that there is no normal case that could hit this code path in HEAD.
To hit this code path, we would have to set apply_error_callback_arg.rel to a
valid relation while the finish LSN is still InvalidXLogRecPtr.
But currently in HEAD, we only set apply_error_callback_arg.rel to a valid
relation after setting the finish LSN to a valid LSN.
So it seems fine to change this along with the parallel apply patch.
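
For reference, once the finish LSN is visible in the server log, skipping the
failed transaction would look roughly like this (a sketch; the subscription
name and the LSN here are made up):

ALTER SUBSCRIPTION mysub SKIP (lsn = '0/1560998');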

Regards,
Wang wei



Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

2022-06-28 Thread Kyotaro Horiguchi
I'd like to look into the WAL segments related to the failure.

Mmm... With the patch, xlogreader->abortedRecPtr is valid exactly when
the last failed read was of an aborted contrecord. If recovery ends
here, the first inserted record is an "aborted contrec" record.  I
still think the only way an aborted contrecord can be followed by a
non-"aborted contrec" record is that recovery somehow fetches two
consecutive WAL segments that are inconsistent at the boundary.


I found the reason the TAP test doesn't respond to the first
proposed patch (shown below).

-   if (!StandbyMode &&
+   if (!StandbyModeRequested &&
!XLogRecPtrIsInvalid(xlogreader->abortedRecPtr))

The cause was that I had disabled standby mode in the test (to make the
test reliable and simpler), while the change only takes effect while
standby mode is on. The first attached patch detects the same error (in a
somewhat possibly-unstable way) and responds to the fix above, and also
responds to aborted_contrec_reset_3.patch.

So, aborted_contrec_reset_3 looks closer to the issue than before.

Would you mind trying the second attached patch to obtain a detailed log on
your testing environment? With the patch, the modified TAP test yields
log lines like the ones below.

2022-06-28 15:49:20.661 JST [165472] LOG:  ### [A] @0/1FFD338: 
abort=(0/1FFD338)0/0, miss=(0/200)0/0, SbyMode=0, SbyModeReq=1
...
2022-06-28 15:49:20.681 JST [165472] LOG:  ### [F] @0/2094610: 
abort=(0/0)0/1FFD338, miss=(0/0)0/200, SbyMode=1, SbyModeReq=1
...
2022-06-28 15:49:20.767 JST [165472] LOG:  ### [S] @0/2094610: 
abort=(0/0)0/1FFD338, miss=(0/0)0/200, SbyMode=0, SbyModeReq=1
...
2022-06-28 15:49:20.777 JST [165470] PANIC:  xlog flush request 0/2094610 is 
not satisfied --- flushed only to 0/288

In this example, abortedRecPtr is set at the first line and recovery
continued to 2094610, but abortedRecPtr was not reset and we then PANICed. ([A]
means an aborted-contrec failure; [F] and [S] are failed and successful
reads respectively.)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/test/recovery/t/026_overwrite_contrecord.pl b/src/test/recovery/t/026_overwrite_contrecord.pl
index 78feccd9aa..043eff86ec 100644
--- a/src/test/recovery/t/026_overwrite_contrecord.pl
+++ b/src/test/recovery/t/026_overwrite_contrecord.pl
@@ -59,6 +59,7 @@ $node->safe_psql('postgres',
 	qq{SELECT pg_logical_emit_message(true, 'test 026', repeat('xyzxz', 123456))}
 );
 #$node->safe_psql('postgres', qq{create table foo ()});
+my $endlsn = $node->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn()');
 my $endfile = $node->safe_psql('postgres',
 	'SELECT pg_walfile_name(pg_current_wal_insert_lsn())');
 ok($initfile ne $endfile, "$initfile differs from $endfile");
@@ -68,9 +69,8 @@ ok($initfile ne $endfile, "$initfile differs from $endfile");
 # contents
 $node->stop('immediate');
 
-unlink $node->basedir . "/pgdata/pg_wal/$endfile"
-  or die "could not unlink " . $node->basedir . "/pgdata/pg_wal/$endfile: $!";
-
+system(sprintf("mv %s/pgdata/pg_wal/$endfile %s/archives/", $node->basedir, $node->basedir));
+ 
 # OK, create a standby at this spot.
 $node->backup_fs_cold('backup');
 my $node_standby = PostgreSQL::Test::Cluster->new('standby');
@@ -106,4 +106,17 @@ $node_standby->promote;
 $node->stop;
 $node_standby->stop;
 
+my $node_primary_2  = PostgreSQL::Test::Cluster->new('primary_2');
+$node_primary_2->init_from_backup($node, 'backup', has_restoring => 1, standby => 1);
+
+$node_primary_2->append_conf(
+	'postgresql.conf', qq(
+log_min_messages = debug1
+));
+$node_primary_2->start;
+$node_primary_2->poll_query_until('postgres',
+	  "SELECT '$endlsn'::pg_lsn <= pg_last_wal_replay_lsn()");
+$node_primary_2->promote;
+$node_primary_2->poll_query_until('postgres', 'SELECT NOT pg_is_in_recovery()');
+$node_primary_2->safe_psql('postgres', 'CREATE TABLE bar AS select generate_series(0, 9) a');
 done_testing();
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 6eba626420..cd02e6d8dd 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -2967,8 +2967,17 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 	for (;;)
 	{
 		char	   *errormsg;
+		static int log = 2;
 
 		record = XLogPrefetcherReadRecord(xlogprefetcher, &errormsg);
+
+		if (!record)
+			log = 2;
+		else if (log > 0)
+			log--;
+
+		if (log)
+			elog(LOG, "### [%s] @%X/%X: abort=(%X/%X)%X/%X, miss=(%X/%X)%X/%X, SbyMode=%d, SbyModeReq=%d", record ? "S": (XLogRecPtrIsInvalid(xlogreader->abortedRecPtr) ? "F" : "A"), LSN_FORMAT_ARGS(xlogreader->EndRecPtr), LSN_FORMAT_ARGS(xlogreader->abortedRecPtr), LSN_FORMAT_ARGS(abortedRecPtr), LSN_FORMAT_ARGS(xlogreader->missingContrecPtr), LSN_FORMAT_ARGS(missingContrecPtr), StandbyMode, StandbyModeRequested); 
 		if (record == NULL)
 		{
 			/*