An inverted index using roaring bitmaps
I've been doing some preliminary prep work to see how an inverted index using roaring bitmaps (https://roaringbitmap.org/) would perform. I'm presenting some early work using SQL code with the roaring bitmap Postgres extension (https://github.com/ChenHuajun/pg_roaringbitmap) to simulate a hypothetical index using this approach. I'd like to solicit feedback from the community to see if this is something worth pursuing, or if there are potential issues that I'm not aware of (or if this approach has been considered in the past and discarded for whatever reason). For context, my experience with Postgres is primarily as a user and I'm not at all familiar with the code base, so please be gentle :). That said, here's a quick and dirty demo.

I have a table "cities":

               Table "public.cities"
 Column  | Type | Collation | Nullable | Default
---------+------+-----------+----------+---------
 city    | text |           |          |
 country | text |           |          |
Indexes:
    "cities_country_idx" btree (country)

select count(*) from cities;
 count
--------
 139739
(1 row)

And just some sample rows:

select * from cities order by random() limit 10;
          city          |           country
------------------------+------------------------------
 Alcalá de la Selva     | Spain
 Bekirhan               | Turkey
 Ceggia                 | Italy
 Châtillon-en-Vendelais | France
 Hohenfelde             | Germany
 Boedo                  | Argentina
 Saint-Vith             | Belgium
 Gaggenau               | Germany
 Lake Ozark             | United States
 Igunga                 | Tanzania, United Republic of
(10 rows)

Since a bitmap requires you to convert your inputs into integers, I created a function as a hack to convert our TIDs to integers. It's ugly as hell, but it serves. 2048 is 2^11, which according to the GIN index source code is a safe assumption for the highest possible MaxHeapTuplesPerPage.
create function ctid_to_int(ctid tid) returns integer as $$
    select (ctid::text::point)[0] * 2048 + (ctid::text::point)[1];
$$ language sql returns null on null input;

And the reverse:

create function int_to_ctid(i integer) returns tid as $$
    select point(i/2048, i%2048)::text::tid;
$$ language sql returns null on null input;

In addition, I created a table "cities_rb" to roughly represent an "index" on the country column:

create table cities_rb as (
    select country,
           roaringbitmap(array_agg(ctid_to_int(ctid))::text) idx
    from cities
    group by country
);

              Table "public.cities_rb"
 Column  |     Type      | Collation | Nullable | Default
---------+---------------+-----------+----------+---------
 country | text          |           |          |
 idx     | roaringbitmap |           |          |

Now for the fun stuff - to simulate the "index" I will be running some queries against the cities_rb table using bitmap aggregations and comparing them to functionally the same queries using the BTree index on cities.

explain analyze select ctid from cities where country = 'Japan';
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on cities  (cost=18.77..971.83 rows=1351 width=6) (actual time=0.041..0.187 rows=1322 loops=1)
   Recheck Cond: (country = 'Japan'::text)
   Heap Blocks: exact=65
   ->  Bitmap Index Scan on cities_country_idx  (cost=0.00..18.43 rows=1351 width=0) (actual time=0.031..0.031 rows=1322 loops=1)
         Index Cond: (country = 'Japan'::text)
 Planning Time: 0.055 ms
 Execution Time: 0.233 ms
(7 rows)

explain analyze select rb_to_array(idx) from cities_rb where country = 'Japan';
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Seq Scan on cities_rb  (cost=0.00..14.88 rows=1 width=32) (actual time=0.050..0.067 rows=1 loops=1)
   Filter: (country = 'Japan'::text)
   Rows Removed by Filter: 229
 Planning Time: 0.033 ms
 Execution Time: 0.076 ms
(5 rows)

explain analyze select count(*) from cities where country = 'Japan';
                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=35.31..35.32 rows=1 width=8) (actual time=0.151..0.151 rows=1 loops=1)
   ->  Index Only Scan using cities_country_idx on cities  (cost=0.29..31.94 rows=1351 width=0) (actual time=0.026..0.103 rows=1322 loops=1)
         Index Cond: (country = 'Japan'::text)
         Heap Fetches: 0
 Planning Time: 0.056 ms
 Execution Time: 0.180 ms
(6 rows)

explain analyze select rb_cardinality(idx) from cities_rb where country = 'Japan';
                                              QUERY PLAN
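As a side note on the encoding used above: the ctid↔integer mapping is plain base-2048 packing of (block number, offset). A minimal Python model of the two SQL helpers (hypothetical function names mirroring them; assumes, as the post does, that 2048 = 2^11 exceeds MaxHeapTuplesPerPage so the mapping is injective):

```python
# Model of the SQL helpers ctid_to_int / int_to_ctid from the demo above.
# A TID is (block number, offset within the page); 2048 = 2^11 is assumed
# to be larger than any possible per-page offset, so encoding is lossless.

def ctid_to_int(block: int, offset: int) -> int:
    return block * 2048 + offset

def int_to_ctid(i: int) -> tuple[int, int]:
    return (i // 2048, i % 2048)

# Round-trip check: encoding then decoding recovers the original TID.
print(int_to_ctid(ctid_to_int(12, 7)))  # -> (12, 7)
```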
Re: proposal - psql - use pager for \watch command
On Tue, Jun 7, 2022 at 6:50 AM Thomas Munro wrote:
> On Tue, Jun 7, 2022 at 3:23 PM Tom Lane wrote:
> > The code needs a comment about why it's emitting a newline, though.
> > In particular, it had better explain why that should be conditional
> > on !pagerpipe, because that makes no sense to me.
>
> Yeah. OK, here's my take:
>
> + /*
> +  * If the terminal driver echoed "^C", libedit/libreadline might be
> +  * confused about the cursor position. Therefore, inject a newline
> +  * before the next prompt is displayed. We only do this when not using
> +  * a pager, because pagers are expected to restore the screen to a sane
> +  * state on exit.
> +  */
>
> AFAIK pagers conventionally use something like termcap ti/te[1] to
> restore the screen, or equivalents in tinfo etc (likely via curses).
> If we were to inject an extra newline we'd just have a blank line for
> nothing. I suppose there could be a hypothetical pager that doesn't
> follow that convention, and in fact both less and pspg have a -X
> option to preserve last output, but in any case I expect that pagers
> disable echoing, so I don't think the ^C will make it to the screen,
> and furthermore ^C isn't used for exit anyway. Rather than speculate
> about the precise details, I just said "... sane state on exit".
> Pavel, do you agree?

Applications designed to be used as a pager are usually careful about the
final cursor position. Without that care there can be unwanted artefacts.
pspg has to work in pgcli, which is a more sensitive environment than
psql. I think modern pagers like less or pspg will work correctly in all
modes. There can be some legacy pagers like "pg" or very old
implementations of "more", but we probably don't need to consider those
(except perhaps in a comment).

Regards

Pavel

> Here's how it looks after I enter and then exit Pavel's streaming pager:
>
> $ PSQL_WATCH_PAGER='pspg --stream' ~/install/bin/psql postgres
> psql (15beta1)
> Type "help" for help.
>
> postgres=# select;
> --
> (1 row)
>
> postgres=# \watch 1
> postgres=#
>
> FWIW it's the same with PSQL_WATCH_PAGER='less'.
>
> [1] https://www.gnu.org/software/termutils/manual/termcap-1.3/html_node/termcap_39.html
Re: proposal - psql - use pager for \watch command
On Tue, Jun 7, 2022 at 3:23 PM Tom Lane wrote:
> The code needs a comment about why it's emitting a newline, though.
> In particular, it had better explain why that should be conditional
> on !pagerpipe, because that makes no sense to me.

Yeah. OK, here's my take:

+ /*
+  * If the terminal driver echoed "^C", libedit/libreadline might be
+  * confused about the cursor position. Therefore, inject a newline
+  * before the next prompt is displayed. We only do this when not using
+  * a pager, because pagers are expected to restore the screen to a sane
+  * state on exit.
+  */

AFAIK pagers conventionally use something like termcap ti/te[1] to
restore the screen, or equivalents in tinfo etc (likely via curses).
If we were to inject an extra newline we'd just have a blank line for
nothing. I suppose there could be a hypothetical pager that doesn't
follow that convention, and in fact both less and pspg have a -X
option to preserve last output, but in any case I expect that pagers
disable echoing, so I don't think the ^C will make it to the screen,
and furthermore ^C isn't used for exit anyway. Rather than speculate
about the precise details, I just said "... sane state on exit".
Pavel, do you agree?

Here's how it looks after I enter and then exit Pavel's streaming pager:

$ PSQL_WATCH_PAGER='pspg --stream' ~/install/bin/psql postgres
psql (15beta1)
Type "help" for help.

postgres=# select;
--
(1 row)

postgres=# \watch 1
postgres=#

FWIW it's the same with PSQL_WATCH_PAGER='less'.

[1] https://www.gnu.org/software/termutils/manual/termcap-1.3/html_node/termcap_39.html

From a9389a8755379d4df2c52eb760d355aa5cf0ff96 Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Tue, 7 Jun 2022 14:18:51 +1200
Subject: [PATCH v2] Fix \watch's interaction with libedit/libreadline.

When you hit ^C, the terminal driver in Unix-like systems echos "^C"
as well as sending an interrupt signal (depending on stty settings).
The line editing library (libedit/libreadline) is then confused about
the current cursor location, and might corrupt the display. Fix, by
moving to a new line before the next prompt is displayed.

Author: Pavel Stehule
Reported-by: Tom Lane
Discussion: https://postgr.es/m/3278793.1626198638%40sss.pgh.pa.us
---
 src/bin/psql/command.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index b51d28780b..7f366c80e9 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -5169,6 +5169,18 @@ do_watch(PQExpBuffer query_buf, double sleep)
 		pclose(pagerpipe);
 		restore_sigpipe_trap();
 	}
+	else
+	{
+		/*
+		 * If the terminal driver echoed "^C", libedit/libreadline might be
+		 * confused about the cursor position. Therefore, inject a newline
+		 * before the next prompt is displayed. We only do this when not using
+		 * a pager, because pagers are expected to restore the screen to a sane
+		 * state on exit.
+		 */
+		fprintf(stdout, "\n");
+		fflush(stdout);
+	}

 #ifdef HAVE_POSIX_DECL_SIGWAIT

 	/* Disable the interval timer. */
--
2.35.1
tablesync copy ignores publication actions
The logical replication tablesync ignores the publication 'publish' operations during the initial data copy. This is current/known PG behaviour (e.g. as recently mentioned [1]) but it was not documented anywhere. This patch just documents the existing behaviour and gives some examples. -- [1] https://www.postgresql.org/message-id/CAA4eK1L_98LF7Db4yFY1PhKKRzoT83xtN41jTS5X%2B8OeGrAkLw%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia v1-0001-PGDOCS-tablesync-ignores-publish-operation.patch Description: Binary data
Re: pg_rewind: warn when checkpoint hasn't happened after promotion
At Mon, 6 Jun 2022 08:32:01 -0400, James Coleman wrote in > On Mon, Jun 6, 2022 at 1:26 AM Kyotaro Horiguchi > wrote: > > > > At Sat, 4 Jun 2022 19:09:41 +0530, Bharath Rupireddy > > wrote in > > > On Sat, Jun 4, 2022 at 6:29 PM James Coleman wrote: > > > > > > > > A few weeks back I sent a bug report [1] directly to the -bugs mailing > > > > list, and I haven't seen any activity on it (maybe this is because I > > > > emailed directly instead of using the form?), but I got some time to > > > > take a look and concluded that a first-level fix is pretty simple. > > > > > > > > A quick background refresher: after promoting a standby rewinding the > > > > former primary requires that a checkpoint have been completed on the > > > > new primary after promotion. This is correctly documented. However > > > > pg_rewind incorrectly reports to the user that a rewind isn't > > > > necessary because the source and target are on the same timeline. > > ... > > > > Attached is a patch that detects this condition and reports it as an > > > > error to the user. > > > > I have some random thoughts on this. > > > > There could be a problem in the case of gracefully shutdowned > > old-primary, so I think it is worth doing something if it can be in a > > simple way. > > > > However, I don't think we can simply rely on minRecoveryPoint to > > detect that situation, since it won't be reset on a standby. A standby > > also still can be the upstream of a cascading standby. So, as > > discussed in the thread for the comment [2], what we can do here would be > > simply waiting for the timelineID to advance, maybe having a timeout. > > To confirm I'm following you correctly, you're envisioning a situation like: > > - Primary A > - Replica B replicating from primary > - Replica C replicating from replica B > > then on failover from A to B you end up with: > > - Primary B > - Replica C replication from primary > - [needs rewind] A > > and you try to rewind A from C as the source? Yes. 
I think it is a legit use case. That being said, like the other points, it might be acceptable.

> > In a case of single-step replication set, a checkpoint request to the
> > primary makes the end-of-recovery checkpoint fast. It won't work as
> > expected in cascading replicas, but it might be acceptable.
>
> "Won't work as expected" because there's no way to guarantee
> replication is caught up or even advancing?

Maybe no. I meant that restartpoints don't run more frequently than the
intervals of checkpoint_timeout even if checkpoint records come more
frequently.

> > > > In the spirit of the new-ish "ensure shutdown" functionality I could
> > > > imagine extending this to automatically issue a checkpoint when this
> > > > situation is detected. I haven't started to code that up, however,
> > > > wanting to first get buy-in on that.
> > > >
> > > > 1: https://www.postgresql.org/message-id/CAAaqYe8b2DBbooTprY4v=bized9qbqvlq+fd9j617eqfjk1...@mail.gmail.com
> > >
> > > Thanks. I had a quick look over the issue and patch - just a thought -
> > > can't we let pg_rewind issue a checkpoint on the new primary instead
> > > of erroring out, maybe optionally? It might sound too much, but helps
> > > pg_rewind to be self-reliant i.e. avoiding an external actor to detect
> > > the error and issue a checkpoint on the new primary to be able to
> > > successfully run pg_rewind on the old primary and repair it to use it
> > > as a new standby.
> >
> > At the time of the discussion [2] it was a hindrance that
> > that requires superuser privileges. Now that has been narrowed down
> > to the pg_checkpointer privileges.
> >
> > If we know that the timeline IDs are different, we don't need to wait
> > for a checkpoint.
>
> Correct.

> > It seems to me that the exit status is significant. pg_rewind exits
> > with 1 when an invalid option is given. I don't think it is great if
> > we report this state by the same code.
>
> I'm happy to change that; I only chose "1" as a placeholder for
> "non-zero exit status".
>
> > I don't think we always want to request a non-spreading checkpoint.
>
> I'm not familiar with the terminology "non-spreading checkpoint".

Does "immediate checkpoint" work? That is, a checkpoint that runs at
full speed (i.e. with no delays between writes).

> > [2] https://www.postgresql.org/message-id/flat/CABUevEz5bpvbwVsYCaSMV80CBZ5-82nkMzbb%2BBu%3Dh1m%3DrLdn%3Dg%40mail.gmail.com
>
> I read through that thread, and one interesting idea stuck out to me:
> making "timeline IDs are the same" an error exit status. On the one
> hand that makes a certain amount of sense because it's unexpected. But
> on the other hand there are entirely legitimate situations where upon
> failover the timeline IDs happen to match (e.g., for us it happens
> some percentage of the time naturally as we are using sync replication
> and failovers often involve STONITHing the original primary, so it's
> entirely possible that the promoted replica
Re: proposal - psql - use pager for \watch command
Thomas Munro writes:
> On Mon, May 9, 2022 at 7:07 PM Pavel Stehule wrote:
>> On Tue, Jul 13, 2021 at 7:50 PM Tom Lane wrote:
>>> ^C\watch cancelled
>>> regression=#

> Do we really need the extra text? What about just \n, so you get:
> postgres=# \watch 1
> ...blah blah...
> ^C
> postgres=#

Fine by me.

> This affects all release branches too. Should we bother to fix this
> there? For them, I think the fix is just:

If we're doing something as nonintrusive as just adding a newline,
it'd probably be OK to backpatch.

The code needs a comment about why it's emitting a newline, though.
In particular, it had better explain why that should be conditional
on !pagerpipe, because that makes no sense to me.

			regards, tom lane
Re: proposal - psql - use pager for \watch command
On Mon, May 9, 2022 at 7:07 PM Pavel Stehule wrote:
> On Tue, Jul 13, 2021 at 7:50 PM Tom Lane wrote:
>> ^Cregression=#
>>
>> then as you can see I get nothing but the "^C" echo before the next
>> psql prompt. The problem with this is that now libreadline is
>> misinformed about the cursor position, messing up any editing I
>> might try to do on the next line of input. So I think it would
>> be a good idea to have some explicit final output when the \watch
>> command terminates, along the line of
>>
>> ...
>> Tue Jul 13 13:44:46 2021 (every 2s)
>>
>>  now
>> -------------------------------
>>  2021-07-13 13:44:46.396572-04
>> (1 row)
>>
>> ^C\watch cancelled
>> regression=#
>>
>> This strikes me as a usability improvement even without the
>> readline-confusion angle.
>
> here is a patch

I played with this. On a libedit build (tested on my Mac), an easy way
to see corruption is to run e.g. SELECT;, then \watch 1, then ^C, then
up-arrow to see the previous command clobber the wrong columns. On a
libreadline build (tested on my Debian box), that simple test doesn't
fail in the same way. Though there may be some other way to make it
misbehave that would take me longer to find, it's enough for me that
libedit is befuddled by what we're doing.

Do we really need the extra text? What about just \n, so you get:

postgres=# \watch 1
...blah blah...
^C
postgres=#

This affects all release branches too. Should we bother to fix this there?
For them, I think the fix is just:

diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index d1ee795cb6..3a88d5d6c4 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -4992,6 +4992,9 @@ do_watch(PQExpBuffer query_buf, double sleep)
 		sigint_interrupt_enabled = false;
 	}

+	fprintf(pset.queryFout, "\n");
+	fflush(pset.queryFout);
+
 	pg_free(title);
 	return (res >= 0);
 }

From 409a5ce2d81c6fd976ff7cc0ed4de50197d4ce55 Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Tue, 7 Jun 2022 14:18:51 +1200
Subject: [PATCH] Fix \watch's interaction with libedit/libreadline.

When you hit ^C, the terminal driver in Unix-like systems echos "^C"
as well as sending an interrupt signal (depending on stty settings).
The line editing library (libedit/libreadline) is then confused about
the current cursor location, and might corrupt the display. Fix, by
moving to a new line before the next prompt is displayed.

Author: Pavel Stehule
Reported-by: Tom Lane
Discussion: https://postgr.es/m/3278793.1626198638%40sss.pgh.pa.us
---
 src/bin/psql/command.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index b51d28780b..5a974a5ad9 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -5169,6 +5169,11 @@ do_watch(PQExpBuffer query_buf, double sleep)
 		pclose(pagerpipe);
 		restore_sigpipe_trap();
 	}
+	else
+	{
+		fprintf(stdout, "\n");
+		fflush(stdout);
+	}

 #ifdef HAVE_POSIX_DECL_SIGWAIT

 	/* Disable the interval timer. */
--
2.35.1
Re: [v15 beta] pg_upgrade failed if earlier executed with -c switch
tather => rather
is charge => in charge

On Mon, Jun 06, 2022 at 01:17:52PM +0900, Michael Paquier wrote:
> but something like %Y%m%d_%H%M%S.%03d (3-digit ms for last part) would

On Tue, Jun 07, 2022 at 08:30:47AM +0900, Michael Paquier wrote:
> I would reduce the format to be MMDDTHHMMSS_ms (we could also use

I think it's better with a dot (HHMMSS.ms) rather than an underscore
(HHMMSS_ms).

On Tue, Jun 07, 2022 at 11:42:37AM +0900, Michael Paquier wrote:
> + /* append milliseconds */
> + snprintf(timebuf, sizeof(timebuf), "%s_%03d",
> +          timebuf, (int) (time.tv_usec / 1000));

> + with a timestamp formatted as per ISO 8601
> + (%Y%m%dT%H%M%S) appended by an underscore and
> + the timestamp's milliseconds, where all the generated files are stored.

The ISO timestamp can include milliseconds (or apparently fractional
parts of the "lowest-order" unit), so the "appended by" part doesn't
need to be explained here.

+ snprintf(timebuf, sizeof(timebuf), "%s_%03d",
+          timebuf, (int) (time.tv_usec / 1000));

Is it really allowed to snprintf a buffer onto itself? I can't find any
existing cases doing that. It seems useless in any case - you could
instead snprintf(timebuf+strlen(timebuf), ...), or increment
len+=snprintf(), or append the milliseconds here:

+ len = snprintf(log_opts.basedir, MAXPGPATH, "%s/%s", log_opts.rootdir,
+                timebuf);

--
Justin
Re: Error from the foreign RDBMS on a foreign table I have no privilege on
On Sat, 2022-06-04 at 21:18 +, Phil Florent wrote:
> I opened an issue with attached code on the oracle_fdw git page:
> https://github.com/laurenz/oracle_fdw/issues/534
> Basically I expected to obtain a "no privilege" error from PostgreSQL when I have no read privilege
> on the postgres foreign table, but I obtained an Oracle error instead.
> Laurenz investigated and closed the issue, but he suggested perhaps I should post that on
> the hackers list since it also occurs with postgres_fdw on some occasions (I have investigated some more,
> and postgres_fdw does the same thing when you turn on use_remote_estimate).
> Hence I do...

To add more details: permissions are checked at query execution time,
but if "use_remote_estimate" is used, the planner already accesses the
remote table, even if the user has no permissions on the foreign table.

I feel that that is no bug, but I'd be curious to know if others disagree.

Yours,
Laurenz Albe
Re: [v15 beta] pg_upgrade failed if earlier executed with -c switch
On Tue, Jun 07, 2022 at 08:30:47AM +0900, Michael Paquier wrote:
> If we don't split by the millisecond, we would come back to the
> problems of the original report. On my laptop, the --check phase
> that passes takes more than 1s, but the one that fails takes 0.1s, so
> a follow-up run would complain with the path conflicts. So at the end
> I would reduce the format to be MMDDTHHMMSS_ms (we could also use
> a logic that checks for conflicts and appends an extra number if
> needed, though the addition of the extra ms is a bit shorter).

So, attached is the patch I would like to apply for all that (commit
message included). One issue I missed previously is that the TAP test
missed the log files on failure, so I had to tweak that with a find
routine. I have fixed a few comments, and improved the docs to
describe the directory structure.

We still need a refresh of the buildfarm client for the case where
pg_upgrade is tested without TAP, like that I guess:

--- a/PGBuild/Modules/TestUpgrade.pm
+++ b/PGBuild/Modules/TestUpgrade.pm
@@ -140,6 +140,7 @@ sub check
 		  $self->{pgsql}/src/bin/pg_upgrade/log/*
 		  $self->{pgsql}/src/bin/pg_upgrade/tmp_check/*/*.diffs
 		  $self->{pgsql}/src/bin/pg_upgrade/tmp_check/data/pg_upgrade_output.d/log/*
+		  $self->{pgsql}/src/bin/pg_upgrade/tmp_check/data/pg_upgrade_output.d/*/log/*
 		  $self->{pgsql}/src/test/regress/*.diffs"
 	);
 	$log->add_log($_) foreach (@logfiles);
--
Michael

From a865e736951814d0b7aeee2b93ac4e87d355af0f Mon Sep 17 00:00:00 2001
From: Michael Paquier
Date: Tue, 7 Jun 2022 11:29:31 +0900
Subject: [PATCH v3] Restructure pg_upgrade output directories for better
 idempotence

38bfae3 has moved the contents written to files by pg_upgrade under a
new directory called pg_upgrade_output.d/ located in the new cluster's
data folder, and it used a simple structure made of two subdirectories
leading to a fixed structure: log/ and dump/.
This design has made weaker pg_upgrade on repeated calls, as we could get failures when creating one or more of those directories, while potentially losing the logs of a previous run (logs are retained automatically on failure, and cleaned up on success unless --retain is specified). So a user would need to clean up pg_upgrade_output.d/ as an extra step for any repeated calls of pg_upgrade. The most common scenario here is --check followed by the actual upgrade, but one could see the failure when specifying an incorrect argument value. Removing entirely the logs would have the disadvantage of removing all the past information, even if --retain was specified at some past step. This result is annoying for a lot of users and automated upgrade flows. So, tather than requiring a manual removal of pg_upgrade_output.d/, this redesigns the set of output directories in a more dynamic way, based on a suggestion from Tom Lane and Daniel Gustafsson. pg_upgrade_output.d/ is still the base path, but a second directory level is added, mostly named after an ISO-8601-formatted timestamp (in short human-readable, with milliseconds appended to the name to avoid any conflicts). The logs and dumps are saved within the same subdirectories as previously, as of log/ and dump/, but these are located inside the subdirectory named after the timestamp. The logs of a given run are removed only after a successful run if --retain is not used, and pg_upgrade_output.d/ is kept if there are any logs from a previous run. Note that previously, pg_upgrade would have kept the logs even after a successful --check but that was inconsistent compared to the case without --check. The code is charge of the removal of the output directories is now refactored into a single routine. Two TAP tests are added with a --check command (one failure and one success), to look after the case fixed here. 
Note that the test had to be tweaked a bit to fit with the new
directory structure so as it can find any logs generated on failure.
This is still going to require a change in the buildfarm client for
the case where pg_upgrade is tested without the TAP test, though.

Reported-by: Tushar Ahuja
Author: Michael Paquier
Reviewed-by: Daniel Gustafsson, Justin Pryzby
Discussion: https://postgr.es/m/77e6ecaa-2785-97aa-f229-4b6e047cb...@enterprisedb.com
---
 src/bin/pg_upgrade/check.c             |  2 +
 src/bin/pg_upgrade/pg_upgrade.c        | 65 +-
 src/bin/pg_upgrade/pg_upgrade.h        | 14 --
 src/bin/pg_upgrade/t/002_pg_upgrade.pl | 49 ++-
 src/bin/pg_upgrade/util.c              | 42 +
 doc/src/sgml/ref/pgupgrade.sgml        |  5 +-
 6 files changed, 148 insertions(+), 29 deletions(-)

diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 6114303b52..ace7387eda 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -210,6 +210,8 @@ report_clusters_compatible(void)
 		pg_log(PG_REPORT, "\n*Clusters are compatible*\n");
Re: pg_buffercache: add sql test
I removed it on your advice. Thanks.

On Tue, Jun 7, 2022 at 2:04 AM Stephen Frost wrote:
>
> Greetings,
>
> * Daniel Gustafsson (dan...@yesql.se) wrote:
> > > On 6 Jun 2022, at 15:30, Dong Wook Lee wrote:
> > > I just wrote a test code for the `pg_buffercache` extension which
> > > doesn't have test code.
> >
> > Please add this patch to the next commitfest to make sure it's not lost
> > before then.
> >
> > https://commitfest.postgresql.org/38/
>
> Seems to be there now, at least:
>
> https://commitfest.postgresql.org/38/3674/
>
> However, I don't think we should have a 'target version' set for this
> (and in particular it shouldn't be 15). I'd suggest removing that.
>
> Thanks,
>
> Stephen
Re: Reducing Memory Consumption (aset and generation)
On Mon, Jun 6, 2022 at 9:14 PM Ranier Vilela wrote:
> On Mon, Jun 6, 2022 at 8:37 PM David Rowley wrote:
>> On Mon, 6 Jun 2022 at 07:28, Ranier Vilela wrote:
>> > 4) 004-generation-reduces-memory-consumption-BUG.patch
>> > Same as (2), but with a BUG.
>> > It only takes a few tweaks to completely remove the block field.
>> > This fails with make check.
>> > I couldn't figure out why it doesn't work with 16 bytes (struct GenerationChunk).
>
> Hi David, thanks for taking a look at this.
>
>> I think you're misunderstanding how blocks and chunks work here. A
>> block can have many chunks. You can't find the block that a chunk is
>> on by subtracting Generation_BLOCKHDRSZ from the pointer given to
>> GenerationFree(). That would only work if the chunk happened to be the
>> first chunk on a block. If it's anything apart from that then you'll
>> be making adjustments to the memory of some prior chunk on the block.
>> I imagine this is the reason you can't get the tests to pass.
>
> Ok, I am still learning about this.
> Can you explain why subtracting Generation_BLOCKHDRSZ from the pointer
> works when sizeof(struct GenerationChunk) = 24 bytes,
> when all references to the block field have been removed?
> This passes check-world.
>
>> Can you also explain why you think moving code around randomly or
>> adding unlikely() macros helps reduce the memory consumption overheads
>> of generation contexts?
>
> Of course, those changes do not reduce memory consumption.
> But, IMO, I think those changes improve access to memory regions,
> because of the locality of the data.
>
> About the "unlikely" macros, these help branch prediction, since most
> of the time malloc and related functions will not fail.
>
>> I imagine you think that's helping to further
>> improve performance, but you've not offered any evidence of that
>> separately from the other changes you've made. If you think those are
>> useful changes then I recommend you run individual benchmarks and
>> offer those as proof that those changes are worthwhile.
>
> Ok, I understand, those changes are unrelated.
> Let's restart this, to simplify the review and commit work.

Regarding the patches now, we have:

1) v1-001-aset-reduces-memory-consumption.patch
Reduces the memory used by struct AllocBlockData by 8 bytes, bringing
the total size down to 32 bytes, which leads to "fitting" two structs
into one 64-byte cache line.
Removes the tests elog(ERROR, "could not find block containing chunk %p")
and elog(ERROR, "could not find block containing chunk %p"), moving them
to the MEMORY_CONTEXT_CHECKING context. Since the 8.2 releases, nobody
has complained about these tests. But if that is not acceptable, there is
option (3), v1-003-aset-reduces-memory-consumption.patch.

2) v1-002-generation-reduces-memory-consumption.patch
Reduces the memory used by struct GenerationBlock by 8 bytes, bringing
the total size down to 32 bytes, which leads to "fitting" two structs
into one 64-byte cache line.

3) v1-003-aset-reduces-memory-consumption.patch
Same as (1), but without removing the tests
elog(ERROR, "could not find block containing chunk %p") and
elog(ERROR, "could not find block containing chunk %p") outright,
at the cost of moving only a tiny part of the tests to the
MEMORY_CONTEXT_CHECKING context. Since the 8.2 releases, nobody has
complained about these tests.
regards,
Ranier Vilela

diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c
index ec3e264a73..901a954508 100644
--- a/src/backend/utils/mmgr/aset.c
+++ b/src/backend/utils/mmgr/aset.c
@@ -150,11 +150,13 @@ typedef AllocSetContext *AllocSet;
  */
 typedef struct AllocBlockData
 {
-	AllocSet	aset;			/* aset that owns this block */
 	AllocBlock	prev;			/* prev block in aset's blocks list, if any */
 	AllocBlock	next;			/* next block in aset's blocks list, if any */
 	char	   *freeptr;		/* start of free space in this block */
 	char	   *endptr;			/* end of space in this block */
+#ifdef MEMORY_CONTEXT_CHECKING
+	AllocSet	aset;			/* aset that owns this block */
+#endif
 } AllocBlockData;

 /*
@@ -485,11 +487,13 @@ AllocSetContextCreateInternal(MemoryContext parent,
 	/* Fill in the initial block's block header */
 	block = (AllocBlock) (((char *) set) + MAXALIGN(sizeof(AllocSetContext)));
-	block->aset = set;
 	block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
 	block->endptr = ((char *) set) + firstBlockSize;
 	block->prev = NULL;
 	block->next = NULL;
+#ifdef MEMORY_CONTEXT_CHECKING
+	block->aset = set;
+#endif

 	/* Mark unallocated space NOACCESS; leave the block header alone. */
 	VALGRIND_MAKE_MEM_NOACCESS(block->freeptr, block->endptr - block->freeptr);
@@ -743,8 +747,10 @@ AllocSetAlloc(MemoryContext context, Size size)
 		context->mem_allocated += blksize;

-		block->aset = set;
 		block->freeptr = block->endptr = ((char *) block) + blksize;
+#ifdef MEMORY_CONTEXT_CHECKING
+		block->aset = set;
+#endif

 		chunk = (AllocChunk) (((char *) block) + ALLOC_BLOCKHDRSZ);
Re: Collation version tracking for macOS
On Tue, Jun 7, 2022 at 12:10 PM Jim Nasby wrote: > On 6/3/22 3:58 PM, Tom Lane wrote > > Thomas Munro writes: > >> On Sat, Jun 4, 2022 at 7:13 AM Jeremy Schneider > >> wrote: > >>> It feels to me like we're still not really thinking clearly about this > >>> within the PG community, and that the seriousness of this issue is not > >>> fully understood. > >> FWIW A couple of us tried quite hard to make smarter warnings, and > >> that thread and others discussed a lot of those topics, like the > >> relevance to constraints and so forth. > > I think the real problem here is that the underlying software mostly > > doesn't take this issue seriously. Unfortunately, that leads one to > > the conclusion that we need to maintain our own collation code and > > data (e.g., our own fork of ICU), and that isn't happening. Unlike > > say Oracle, we do not have the manpower; nor do we want to bloat our > > code base that much. > > > > Short of maintaining our own fork, ranting about the imperfections > > of the situation is a waste of time. > The first step to a solution is admitting that the problem exists. We've been discussing this topic for years and I don't think anyone thinks the case is closed... > Ignoring broken backups, segfaults and data corruption as a "rant" > implies that we simply throw in the towel and tell users to suck it up > or switch engines. There are other ways to address this short of the > community doing all the work itself. One simple example would be to > refuse to start if the collation provider has changed since initdb > (which we'd need to allow users to override). Yeah, it's been discussed, but never proposed. The problem is that you need to start up to fix the problem. Another option is not to use affected indexes, but that doesn't help with other forms of the problem (partition constraints, etc). > A more sophisticated > option would be to provide the machinery for supporting multiple > collation libraries. 
Earlier I mentioned distinct "providers" but I take that back, that's too complicated. Reprising an old idea that comes up each time we talk about this, this time with some more straw-man detail: what about teaching our ICU support to understand "libicui18n.so.71:en" to mean that it should dlopen() that library and use its functions? Or some cleverer, shorter notation. Then it's the user's problem to make sure the right libraries are installed, and it'll fail if they're not. For example, on Debian bookworm right now you can install libicu63, libicu67, libicu71 (though only the "current" one has a -dev package, which I'm sure we can cope with). You're at the mercy of the distro or add-on package repos to keep a lot of versions around, but that seems OK. Maintaining our own fork(s) of ICU would seem like massive overkill and I don't think anyone has suggested that; the question on my mind is whether we could rely on existing packages. Then you'd be exposed only to changes that happen within (say) the ICU 63 package's lifetime... I recall looking into whether that can happen but ... I don't recall the answer.
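Mechanically, the dlopen() side of this is small. Here is a hedged sketch in C — not PostgreSQL code; the helper name is invented, and the key assumption is ICU's version-suffixed symbol names (e.g. ucol_open_71), which is what would let several ICU majors coexist in one process:

```c
#include <dlfcn.h>
#include <stddef.h>

/*
 * Hypothetical sketch of resolving a collation function from a specific
 * ICU major version at runtime.  ICU exports version-suffixed symbols
 * (e.g. ucol_open_71), so loading libicui18n.so.71 and libicui18n.so.63
 * side by side would not clash.  Returns NULL if the library or symbol
 * is missing -- the "user's problem" failure mode described above.
 */
static void *
load_icu_symbol(const char *libname, const char *symbol)
{
    void *handle = dlopen(libname, RTLD_NOW | RTLD_LOCAL);

    if (handle == NULL)
        return NULL;
    return dlsym(handle, symbol);
}
```

A collation string like "libicui18n.so.71:en" would then be split on the ':', the left half handed to dlopen() and the locale to the versioned ucol_open().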
Re: Collation version tracking for macOS
> On Jun 6, 2022, at 17:10, Jim Nasby wrote:
> Ignoring broken backups, segfaults and data corruption as a "rant" implies
> that we simply throw in the towel and tell users to suck it up or switch
> engines.

Well now, let’s be clear, I was the one who called my email a “rant”. And I do apologize for that - it was grumpy and impulsive and Tom isn’t wrong that rants don’t usually help move things forward.

Thomas - thanks for the link back to one of the threads. I spent some time reading through that and it’s a lot of material; I haven’t read the whole thread yet. If you have some others that would also be particularly good background, let me know. I’m doing a chunk of this in my spare time at the moment, but I do want to keep getting more up to speed. I was pulled into a bunch of various things related to PostgreSQL and ICU and collation and OS’s over the past couple years, so I learned a lot from on-the-ground experience and I am interested in trying to get a little more involved in the conversation here.

Personally, I really do think there should at least be an *option* to tell the DB to fully error rather than just warn on version mismatch. Correctness matters to many users, and being able to *trust* that string comparisons are correct is pretty damn fundamental all throughout a database. It really doesn’t get any more basic, and the potential for bad things to happen is pretty astronomical if you can’t trust those. I understand the consternation about dealing with upgrades of large & busy databases, but I’m still surprised that the community consensus arrived at the present behavior, and I have a lot of reading to do to really understand how that happened and where the dialogue is today.
Multiple versions of ICU sounds nice for users who need real linguistic collation (like what Oracle and DB2 offer), but I still feel like there needs to be a super simple basic “pseudo-linguistic” collation baked in, that’s “good enough” for 99% of users and that is guaranteed to be the same everywhere on every platform and just won’t ever change. I think glibc needs to be phased out somehow. At a minimum, not the default for new users… to stop the bleeding. If MySQL wasn’t GPL then I’d say to just copy their collations. I’d be reluctant to spend too much time on a POC now though, it feels like my idea is the outlier and the general PG hacker consensus would be to reject this idea. (But maybe I’m wrong?) Anyway, again, apologies for my pants-on-fire email last week. I hope I can enjoy a few beers someday - or coffee for the non-drinkers - with a few other PG collation nerds (which I never set out to be, but it may have befallen me ). -Jeremy Sent from my TI-83
Re: Collation version tracking for macOS
Jim Nasby writes:
>> I think the real problem here is that the underlying software mostly
>> doesn't take this issue seriously.

> The first step to a solution is admitting that the problem exists.
> Ignoring broken backups, segfaults and data corruption as a "rant"
> implies that we simply throw in the towel and tell users to suck it up
> or switch engines. There are other ways to address this short of the
> community doing all the work itself. One simple example would be to
> refuse to start if the collation provider has changed since initdb
> (which we'd need to allow users to override).

You're conveniently skipping over the hard part, which is to tell whether the collation provider has changed behavior (which we'd better do with pretty darn high accuracy, if we're going to refuse to start on the basis of thinking it has). Unfortunately, giving a reliable indication of collation behavioral changes is *exactly* the thing that the providers aren't taking seriously.

regards, tom lane
Re: Reducing Memory Consumption (aset and generation)
On Mon, Jun 6, 2022 at 8:37 PM, David Rowley wrote:
> On Mon, 6 Jun 2022 at 07:28, Ranier Vilela wrote:
> > 4) 004-generation-reduces-memory-consumption-BUG.patch
> > Same to the (2), but with BUG.
> > It only takes a few tweaks to completely remove the field block.
> > This fails with make check.
> > I couldn't figure out why it doesn't work with 16 bytes (struct
> > GenerationChunk).

Hi David, thanks for taking a look at this.

> I think you're misunderstanding how blocks and chunks work here. A
> block can have many chunks. You can't find the block that a chunk is
> on by subtracting Generation_BLOCKHDRSZ from the pointer given to
> GenerationFree(). That would only work if the chunk happened to be the
> first chunk on a block. If it's anything apart from that then you'll
> be making adjustments to the memory of some prior chunk on the block.
> I imagine this is the reason you can't get the tests to pass.

Ok, I am still learning about this. Can you explain why subtracting Generation_BLOCKHDRSZ from the pointer works when sizeof(struct GenerationChunk) = 24 bytes and all references to the block field have been removed? That passes check-world.

> Can you also explain why you think moving code around randomly or
> adding unlikely() macros helps reduce the memory consumption overheads
> of generation contexts?

Of course, those changes do not reduce memory consumption. But, IMO, those changes improve access to memory regions because of data locality. About the unlikely() macros: they help branch prediction, since most of the time malloc and related functions will not fail.

> I imagine you think that's helping to further
> improve performance, but you've not offered any evidence of that
> separately from the other changes you've made. If you think those are
> useful changes then I recommend you run individual benchmarks and
> offer those as proof that those changes are worthwhile.

Ok, I understand; those are unrelated changes.
regards, Ranier Vilela
Re: pg_auth_members.grantor is bunk
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Thu, Jun 2, 2022 at 3:51 PM Tom Lane wrote:
> > > I sort of thought http://postgr.es/m/3981966.1646429...@sss.pgh.pa.us
> > > constituted a completed investigation of this sort. No?
> >
> > I didn't think so. It's clear that the spec expects us to track the
> > grantor, but I didn't chase down what it expects us to *do* with that
> > information, nor what it thinks the rules are for merging multiple
> > authorizations.
>
> Hmm, OK. Well, one problem is that I've never had any luck
> interpreting what the spec says about anything, and I've sort of given
> up. But even if that were not so, I'm a little unclear what other
> conclusion is possible here. The spec either wants the same behavior
> that we already have for other object types, which is what I am here
> proposing that we do, or it wants something different. If it wants
> something different, it probably wants that for all object types, not
> just roles. Since I doubt we would want the behavior for roles to be
> inconsistent with what we do for all other object types, in that case
> we would probably either change the behavior for all other object
> types to something new, and then clean up the role stuff afterwards,
> or else first do what I proposed here and then later change it all at
> once. In which case the proposal that I've made is as good a way to
> start as any.
>
> Now, if it happens to be the case that the spec proposes a different
> behavior for roles than for non-role objects, and if the behavior for
> roles is something other than the one we currently have for non-role
> objects, then I'd agree that the plan I propose here needs revision. I
> suspect that's unlikely but I can't make anything of the spec so
> maybe?

Thankfully, at least from my reading, the spec isn't all that complicated on this particular point.
The spec talks about "role authorization descriptor"s and those are "created with role name, grantee, and grantor", and it further says "redundant duplicate role authorization descriptors are destroyed", presumably meaning that the entire thing has to be identical. In other words, yeah, the PK should include the grantor. There's a further comment that the 'set of involved grantees' is the union of all the 'grantees', clearly indicating that you can have multiple GRANT 'foo' TO 'bar's with distinct grantors.

In terms of how that's then used: yeah, it's during REVOKE, because a REVOKE is only able to 'find' role authorization descriptors which match the triple of role revoked, grantee, grantor (though there's a caveat in that the 'grantor' role could be the current role, or the current user). Interestingly, at least in my looking it over today, it doesn't seem that the 'grantor' could be 'any applicable role' (which is what's usually used to indicate that it could be any role that the current role inherits), meaning you have to include the GRANTED BY in the REVOKE statement or do a SET ROLE first when doing a REVOKE if it's for a role that you aren't currently running as (but which you are a member of).

Anyhow, in other words, I do think Robert's got it right here. Happy to discuss further though if there are doubts.

Thanks,

Stephen
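As a toy illustration of that matching rule (the struct and function names are invented for illustration, not PostgreSQL internals): a REVOKE can only "find" a descriptor when the whole (role, grantee, grantor) triple matches, which is why GRANTED BY — or a prior SET ROLE — is needed to revoke a grant made by someone else.

```c
#include <string.h>

/* Toy model of the spec's "role authorization descriptor". */
typedef struct RoleAuthDesc
{
    const char *role;     /* role being granted */
    const char *grantee;  /* role receiving it */
    const char *grantor;  /* role that performed the GRANT */
} RoleAuthDesc;

/* A REVOKE finds a descriptor only when the full triple matches. */
static int
revoke_finds(const RoleAuthDesc *d, const char *role,
             const char *grantee, const char *grantor)
{
    return strcmp(d->role, role) == 0 &&
           strcmp(d->grantee, grantee) == 0 &&
           strcmp(d->grantor, grantor) == 0;
}
```

So a descriptor created by GRANT foo TO bar GRANTED BY alice is invisible to a REVOKE executed as bob unless bob names alice as the grantor.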
Re: Reducing Memory Consumption (aset and generation)
On Mon, 6 Jun 2022 at 07:28, Ranier Vilela wrote:
> 4) 004-generation-reduces-memory-consumption-BUG.patch
> Same to the (2), but with BUG.
> It only takes a few tweaks to completely remove the field block.
> This fails with make check.
> I couldn't figure out why it doesn't work with 16 bytes (struct
> GenerationChunk).

I think you're misunderstanding how blocks and chunks work here. A block can have many chunks. You can't find the block that a chunk is on by subtracting Generation_BLOCKHDRSZ from the pointer given to GenerationFree(). That would only work if the chunk happened to be the first chunk on a block. If it's anything apart from that then you'll be making adjustments to the memory of some prior chunk on the block. I imagine this is the reason you can't get the tests to pass.

Can you also explain why you think moving code around randomly or adding unlikely() macros helps reduce the memory consumption overheads of generation contexts? I imagine you think that's helping to further improve performance, but you've not offered any evidence of that separately from the other changes you've made. If you think those are useful changes then I recommend you run individual benchmarks and offer those as proof that those changes are worthwhile.

David
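To make the block/chunk layout concrete, here is a toy allocator in the same spirit (invented names, not PostgreSQL's actual GenerationBlock/GenerationChunk structs): chunks are packed back to back inside a block, so only the first chunk sits immediately after the block header, and the back-pointer stored in each chunk header is the only reliable way to get from a chunk back to its block.

```c
#include <stddef.h>

/* Toy versions of a block header and a per-chunk header. */
typedef struct ToyBlock
{
    size_t used;        /* bytes handed out so far within data[] */
    char   data[1024];  /* chunk storage */
} ToyBlock;

typedef struct ToyChunk
{
    ToyBlock *block;    /* back-pointer to the owning block */
    size_t    size;     /* requested allocation size */
} ToyChunk;

/* Allocate 'size' bytes from 'blk'; chunks are packed back to back. */
static void *
toy_alloc(ToyBlock *blk, size_t size)
{
    ToyChunk *chunk = (ToyChunk *) (blk->data + blk->used);

    chunk->block = blk;
    chunk->size = size;
    blk->used += sizeof(ToyChunk) + size;
    return (void *) (chunk + 1);
}

/* Recover the block from a user pointer, as a free() path must. */
static ToyBlock *
toy_get_block(void *ptr)
{
    return ((ToyChunk *) ptr)[-1].block;
}
```

Subtracting the header size from a chunk pointer only recovers the block for the very first chunk; for any later chunk that arithmetic lands in the middle of a neighbouring chunk, which is exactly the corruption described above.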
Re: [v15 beta] pg_upgrade failed if earlier executed with -c switch
On Mon, Jun 06, 2022 at 01:53:35PM -0500, Justin Pryzby wrote:
> It seems important to use a format in most-significant-parts-first which sorts
> nicely by filename, but otherwise anything could be okay.

Agreed.

> Apparently, that's ISO 8601, which can optionally use separators
> (YYYY-MM-DDTHH:MM:SS).

OK, let's use a T; with the basic format and a minimal number of separators we get 20220603T082255.

> I was thinking this would not include fractional seconds. Maybe that would
> mean that the TAP tests would need to sleep(1) at some points...

If we don't split by the millisecond, we would come back to the problems of the original report. On my laptop, the --check phase that passes takes more than 1s, but the one that fails takes 0.1s, so a follow-up run would complain with the path conflicts. So at the end I would reduce the format to be YYYYMMDDTHHMMSS_ms (we could also use a logic that checks for conflicts and appends an extra number if needed, though the addition of the extra ms is a bit shorter).
--
Michael
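For illustration, the agreed-on name could be produced with plain strftime() plus a millisecond suffix. This is a sketch only — the function name is invented and pg_upgrade's actual implementation may differ:

```c
#include <stdio.h>
#include <time.h>

/*
 * Format a timestamp as basic ISO 8601 (YYYYMMDD"T"HHMMSS) plus a
 * millisecond suffix, so directory names sort correctly by filename
 * and two runs within the same second still get distinct names.
 */
static void
format_output_dir(char *buf, size_t buflen, time_t t, long msec)
{
    struct tm tmbuf;
    char stamp[32];

    gmtime_r(&t, &tmbuf);
    strftime(stamp, sizeof(stamp), "%Y%m%dT%H%M%S", &tmbuf);
    snprintf(buf, buflen, "%s_%03ld", stamp, msec);
}
```

For example, 2022-06-03 08:22:55.123 UTC would come out as 20220603T082255_123.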
Re: replacing role-level NOINHERIT with a grant-level option
Greetings, * Robert Haas (robertmh...@gmail.com) wrote: > On Mon, Feb 7, 2022 at 11:13 AM Joe Conway wrote: > > > It seems to me that the INHERIT role flag isn't very well-considered. > > > Inheritance, or the lack of it, ought to be decided separately for > > > each inherited role. However, that would be a major architectural > > > change. > > > > Agreed -- that would be useful. > > Is this a kind of change people would support? Here's a quick sketch: On the whole, moving things to pg_auth_members when they're about relationships between roles makes more sense to me too (ISTR mentioning that somewhere or at least thinking about it but not sure where and it doesn't really matter). > It's been proposed elsewhere by Stephen and others that we ought to > have the ability to grant ADMIN OPTION on a role without granting > membership in that role. There's some overlap between these two > proposals. If your concern is just about accident prevention, you will > be happy to use this instead of that. If you want real access > restrictions, you will want that, not this. I think that's OK. Doing > this first doesn't mean we can't do that later. What about the > reverse? Could we add ADMIN-without-membership first, and then decide > whether to do this later? Maybe so, but I have come to feel that > NOINHERIT is such an ugly wart that we'll be happier the sooner we > kill it. It seems to make every discussion that we have about any > other potential change in this area more difficult, and I venture that > ADMIN-without-membership wouldn't turn out to be an exception. Being able to segregate the control over privileges from the control over objects is definitely helpful in some environments. I don't see any strong reason that one must happen before the other and am happy to defer to whomever has cycles to work on these things to sort that out. 
> Based on previous discussions I think that, long term, we're probably > headed toward a future where role grants have a bunch of different > flags, each of which controls a single behavior: whether you can > implicitly use the privileges of the role, whether you can access them > via SET ROLE, whether you can grant the role to others, or revoke it > from them, etc. I don't know exactly what the final list will look > like, and hopefully it won't be so long that it makes us all wish we > were dead, but there doesn't seem to be any possibility that implicit > inheritance of permissions won't be in that list. I spent a few > minutes considering whether I ought to instead propose that we just > nuke NOINHERIT from orbit and replace it with absolutely nothing, and > concluded that such a proposal had no hope of attracting consensus. I agree that a future where there's more granularity in terms of role grants would be a better place than where we are now. I'd be -1 on just removing 'NOINHERIT', to no one's surprise, I'm sure. > Perhaps the idea of replacing it with that is more powerful and at > least IMHO cleaner will. -EPARSEFAIL (tho I'm guessing you were intending to say that replacing the role-attribute noinherit with something more powerful would be a generally agreeable way forward, which I would generally support) * Robert Haas (robertmh...@gmail.com) wrote: > On Thu, Jun 2, 2022 at 12:26 PM Robert Haas wrote: > > 1. Extend the GRANT role_name TO role_name [ WITH ADMIN OPTION ] with > > a new, optional clause, something like WITH NO INHERIT or WITH > > NOINHERIT or WITHOUT INHERIT. > > I just realized that, with ADMIN OPTION, you can change your mind > after the initial grant, and we probably would want to allow the same > kind of thing here. The existing syntax is: Yes. > 1. GRANT foo TO bar [WITH ADMIN OPTION]; > 2. REVOKE foo FROM bar; > 3. 
REVOKE ADMIN OPTION FOR foo FROM bar; > > To grant the admin option later, you just use (1) again and this time > include WITH ADMIN OPTION. To revoke it without removing the grant > entirely, you use (3). This could easily be generalized to any number > of options all of which are Boolean properties and all of which have a > default value of false, but INHERIT is a Boolean property with a > default value of true, and calling the property NOINHERIT to dodge > that issue seems dumb. I'd like to find a way to extend the syntax > that can accommodate not only INHERIT as an option, but any other > options we might want to add in the future, and maybe even leave room > for non-Boolean properties, just in case. The question of which > options we think it valuable to add is separate from what the syntax > ought to be, so for syntax discussion purposes let's suppose we want > to add three new options: FRUIT, which can be strawberry, banana, or > kiwi; CHOCOLATE, a Boolean whose default value is true; and SARDINES, > another Boolean whose default value is false. Then: > > GRANT foo TO bar WITH FRUIT STRAWBERRY; > GRANT foo TO bar WITH CHOCOLATE FALSE; > GRANT foo TO bar WITH SARDINES TRUE; > GRANT foo TO bar WITH SARDINES; --
Re: pgcon unconference / impact of block size on performance
On Sun, 5 Jun 2022 at 11:23, Tomas Vondra wrote:
> At one of the pgcon unconference sessions a couple days ago, I presented
> a bunch of benchmark results comparing performance with different
> data/WAL block size. Most of the OLTP results showed significant gains
> (up to 50%) with smaller (4k) data pages.

A few years ago when you and I were doing analysis into the TPC-H benchmark, we found that larger page sizes helped various queries, especially Q1. It would be good to see how the block size changes the performance of a query such as: SELECT sum(value) FROM table_with_tuples_of_several_hundred_bytes;. I don't recall the reason why 32k pages helped there, but it seems reasonable that doing more work for each lookup in shared buffers might be one reason.

Maybe some deeper analysis into various workloads might convince us that it might be worth having an initdb option to specify the block size. There'd be various hurdles to get over in the code to make that work. I doubt we could ever make the default smaller than it is today, as nobody would be able to insert rows larger than about 4 kilobytes into a table anymore [1]. Plus pg_upgrade issues.

David

[1] https://www.postgresql.org/docs/current/limits.html
Re: Sudden database error with COUNT(*) making Query Planner crashes: variable not found in subplan target list
On Mon, 6 Jun 2022 at 21:34, Jean Landercy - BEEODIVERSITY wrote:
> SELECT COUNT(*) FROM items;
> -- ERROR: variable not found in subplan target list
> -- SQL state: XX000

Can you share some more details about what "items" is? psql's "\d items" output would be useful. From what you've reported we can't tell if this is a table or a view.

> The bug is tricky to reproduce, I could not succeed to replicate elsewhere
> (dump/restore does not preserve it).

Can you share the output of:

select attname, atttypid::regtype, attnum, atthasdef, atthasmissing,
       attgenerated, attisdropped
from pg_attribute
where attrelid = 'items'::regclass
order by attnum;

This will let us see if there's something strange going on with dropped columns or columns with missing (atthasmissing) values. There may be some sequence of ALTER TABLE ADD COLUMN ... DEFAULT / DROP COLUMN that is causing this. The output of that might help us see if that could be a factor.

David
Re: oat_post_create expected behavior
Robert Haas writes: > On Mon, Jun 6, 2022 at 1:35 PM Jeff Davis wrote: >> Out of curiosity, why not? The proposed patch only runs it if the >> object access hook is set. Do you see a situation where it would be >> confusing that an earlier DDL change is visible? And if so, would it >> make more sense to call CCI unconditionally? > Well, I think that a fair amount of work has been done previously to > cut down on unnecessary CCIs. I suspect Tom in particular is likely to > object to adding a whole bunch more of them, and I think that > objection would have some merit. We've gotten things to the point where a no-op CCI is pretty cheap, so I'm not sure there is a performance concern here. I do wonder though if there are semantic or bug-hazard considerations. A CCI that occurs only if a particular hook is loaded seems pretty scary from a testability standpoint. > I definitely think if we were going to do it, it would need to be > unconditional. Otherwise I think we'll end up with bugs, because > nobody's going to go test all of that code with and without an object > access hook installed every time they tweak some DDL-related code. Right, same thing I'm saying. I also think we should discourage people from doing cowboy CCIs inside their OAT hooks, because that makes the testability problem even worse. Maybe that means we need to uniformly move the CREATE hooks to after a system-provided CCI, but I've not thought hard about the implications of that. regards, tom lane
Re: How about a psql backslash command to show GUCs?
Robert Haas writes: > I think part of the problem here, though, is that one can imagine a > variety of charters that might be useful. A user could want to see the > parameters that have values in their session that differ from the > system defaults, or parameters that have values which differ from the > compiled-in defaults, or parameters that can be changed without a > restart, or all the pre-computed parameters, or all the parameters > that contain "vacuum" in the name, or all the query-planner-related > parameters, or all the parameters on which any privileges have been > granted. And it's just a judgement call which of those things we ought > to try to accommodate in the psql syntax and which ones ought to be > done by writing an ad-hoc query against pg_settings. Sure. Nonetheless, having decided to introduce this command, we have to make that judgement call. psql-ref.sgml currently explains that If pattern is specified, only parameters whose names match the pattern are listed. Without a pattern, only parameters that are set to non-default values are listed. (Use \dconfig * to see all parameters.) so we have the "all of them" and "ones whose names match a pattern" cases covered. And the definition of the default behavior as "only ones that are set to non-default values" seems reasonable enough, but the question is what counts as a "non-default value", or for that matter what counts as "setting". I think a reasonable case can be made for excluding "internal" GUCs on the grounds that (a) they cannot be set, and therefore (b) whatever value they have might as well be considered the default. regards, tom lane
Re: pgcon unconference / impact of block size on performance
Hello Tomas,

> At one of the pgcon unconference sessions a couple days ago, I presented
> a bunch of benchmark results comparing performance with different
> data/WAL block size. Most of the OLTP results showed significant gains
> (up to 50%) with smaller (4k) data pages.

You wrote something about SSDs a long time ago, but the link is now dead: http://www.fuzzy.cz/en/articles/ssd-benchmark-results-read-write-pgbench/

See also:
http://www.cybertec.at/postgresql-block-sizes-getting-started/
http://blog.coelho.net/database/2014/08/08/postgresql-page-size-for-SSD.html

[...]

> The other important factor is the native SSD page, which is similar to
> sectors on HDD. SSDs however don't allow in-place updates, and have to
> reset/rewrite the whole native page. It's actually more complicated,
> because the reset happens at a much larger scale (~8MB block), so it
> does matter how quickly we "dirty" the data. The consequence is that
> using data pages smaller than the native page (depends on the device,
> but seems 4K is the common value) either does not help or actually
> hurts the write performance. All the SSD results show this behavior -
> the Optane and Samsung nicely show that 4K is much better (in random
> write IOPS) than 8K, but 1-2K pages make it worse.

Yep. ISTM that you should also consider the underlying FS block size. Ext4 uses 4 KiB by default, so if you write 2 KiB it will write 4 KiB anyway.

There is not much doubt that with SSDs we should reduce the default page size. There are some negative impacts (e.g. more space is lost to headers, and fewer tuples fit per page), but I guess there should be an overall benefit. It would help a lot if it were possible to initdb with a different block size, without recompiling.

-- Fabien.
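Fabien's filesystem-block-size point is easy to check programmatically. A small POSIX sketch (the helper name is invented) — writes smaller than this size just get rounded up by the filesystem anyway:

```c
#include <sys/statvfs.h>

/*
 * Return the filesystem's preferred block size for the given path,
 * or -1 on error.  On a default ext4 filesystem this is 4096, which
 * is why 2 KiB database pages cannot reduce physical write volume.
 */
static long
fs_block_size(const char *path)
{
    struct statvfs sv;

    if (statvfs(path, &sv) != 0)
        return -1;
    return (long) sv.f_bsize;
}
```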
Re: oat_post_create expected behavior
On Mon, Jun 6, 2022 at 3:46 PM Jeff Davis wrote: > On Mon, 2022-06-06 at 13:43 -0400, Robert Haas wrote: > > Yeah, that comment could be made more clear. > > I still don't understand what the rule is. > > Is the rule that OAT_POST_CREATE must always use SnapshotSelf for any > catalog access? And if so, do we need to update code in contrib > extensions to follow that rule? I don't think there is a rule in the sense that you want there to be one. We sometimes call the object access hook before the CCI, and sometimes after, and the sepgsql code knows which cases are handled which ways and proceeds differently on that basis. If we went and changed the sepgsql code that uses system catalog lookups to use SnapshotSelf instead, I think it would still work, but it would be less efficient, so that doesn't seem like a desirable change to me. If it's possible to make the hook placement always happen after a CCI, then we could change the sepgsql code to always use catalog lookups, which would probably be more efficient but likely require adding some CCI calls, which may attract objections from Tom --- or maybe it won't. Absent either of those things, I'm inclined to just make the comment clearly state that we're not consistent about it. That's not great, but it may be the best we can do. -- Robert Haas EDB: http://www.enterprisedb.com
Re: oat_post_create expected behavior
On Mon, 2022-06-06 at 13:43 -0400, Robert Haas wrote: > Yeah, that comment could be made more clear. I still don't understand what the rule is. Is the rule that OAT_POST_CREATE must always use SnapshotSelf for any catalog access? And if so, do we need to update code in contrib extensions to follow that rule? Regards, Jeff Davis
Re: replacing role-level NOINHERIT with a grant-level option
On Thu, Jun 2, 2022 at 12:26 PM Robert Haas wrote:
> 1. Extend the GRANT role_name TO role_name [ WITH ADMIN OPTION ] with
> a new, optional clause, something like WITH NO INHERIT or WITH
> NOINHERIT or WITHOUT INHERIT.

I just realized that, with ADMIN OPTION, you can change your mind after the initial grant, and we probably would want to allow the same kind of thing here. The existing syntax is:

1. GRANT foo TO bar [WITH ADMIN OPTION];
2. REVOKE foo FROM bar;
3. REVOKE ADMIN OPTION FOR foo FROM bar;

To grant the admin option later, you just use (1) again and this time include WITH ADMIN OPTION. To revoke it without removing the grant entirely, you use (3). This could easily be generalized to any number of options all of which are Boolean properties and all of which have a default value of false, but INHERIT is a Boolean property with a default value of true, and calling the property NOINHERIT to dodge that issue seems dumb. I'd like to find a way to extend the syntax that can accommodate not only INHERIT as an option, but any other options we might want to add in the future, and maybe even leave room for non-Boolean properties, just in case. The question of which options we think it valuable to add is separate from what the syntax ought to be, so for syntax discussion purposes let's suppose we want to add three new options: FRUIT, which can be strawberry, banana, or kiwi; CHOCOLATE, a Boolean whose default value is true; and SARDINES, another Boolean whose default value is false. Then:

GRANT foo TO bar WITH FRUIT STRAWBERRY;
GRANT foo TO bar WITH CHOCOLATE FALSE;
GRANT foo TO bar WITH SARDINES TRUE;
GRANT foo TO bar WITH SARDINES; -- same as previous
GRANT foo TO bar WITH SARDINES OPTION; -- also same as previous
GRANT foo TO bar WITH FRUIT KIWI, SARDINES; -- multiple options
GRANT foo TO bar WITH CHOCOLATE OPTION, SARDINES OPTION; -- dubious combination

In other words, you write a comma-separated list of option names.
Each option name can be followed by an option value or by the word OPTION. If it is followed by the word OPTION or by nothing, the option value is taken to be true. This would mean that, going forward, any of the following would work: (a) GRANT foo TO bar WITH ADMIN OPTION, (b) GRANT foo TO BAR WITH ADMIN, (c) GRANT foo TO BAR WITH ADMIN TRUE. To revoke a grant entirely, you just say REVOKE foo FROM bar, as now. To change an option for an existing grant, you can re-execute the grant statement with a different WITH clause. Any options that are explicitly mentioned will be changed to have the associated values; unmentioned options will retain their existing values. If you want to change the value of a Boolean option to false, you have a second option, which is to write "REVOKE option_name OPTION FOR foo FROM bar," which means exactly the same thing as "GRANT foo TO bar WITH option_name FALSE". In terms of what options to offer, the most obvious idea is to just add INHERIT as a Boolean option which is true by default. We could go further and also add a SET option, with the idea that INHERIT OPTION controls whether you can exercise the privileges of the role without SET ROLE, and SET OPTION controls whether you can switch to that role using the SET ROLE command. Those two things together would give us a way to get to the admin-without-membership concept that we have previously discussed: GRANT foo TO BAR WITH ADMIN TRUE, INHERIT FALSE, SET FALSE sounds like it should do the trick. I briefly considered suggesting that the way to set a Boolean-valued option to false ought to be to write "NO option_name" rather than "option_name FALSE", since it reads more naturally, but I proposed this instead because it's more like what we do for other options lists (cf. EXPLAIN, VACUUM, COPY). Thoughts? -- Robert Haas EDB: http://www.enterprisedb.com
Re: Sudden database error with COUNT(*) making Query Planner crashes: variable not found in subplan target list
On Mon, Jun 06, 2022 at 04:50:55PM +, Jean Landercy - BEEODIVERSITY wrote: > Dear Justin, > > Thank you for your quick reply. > Unfortunately, the server having this issue is an Azure Flexible Server. > Upgrades are managed by Azure, I will have to wait until they release the > version 13.7. I don't know what to suggest other than to open a support ticket with the vendor. -- Justin
Re: [v15 beta] pg_upgrade failed if earlier executed with -c switch
On Mon, Jun 06, 2022 at 07:43:53PM +0200, Daniel Gustafsson wrote:
> > On 6 Jun 2022, at 06:17, Michael Paquier wrote:
> > On Mon, Jun 06, 2022 at 02:38:03AM +0200, Daniel Gustafsson wrote:
> >> On 5 Jun 2022, at 11:19, Michael Paquier wrote:
> >>> I have been toying with the idea of a sub-directory named with a
> >>> timestamp (Unix time, like log_line_prefix's %n but this could be
> >>> any format) under pg_upgrade_output.d/ and finished with the
> >>> attached.
> >>
> >> I was thinking more along the lines of %m to make it (more) human
> >> readable, but I'm certainly not wedded to any format.

It seems important to use a format in most-significant-parts-first which sorts nicely by filename, but otherwise anything could be okay.

> > Neither am I. I would not map exactly to %m as it uses whitespaces,
> > but something like %Y%m%d_%H%M%S.%03d (3-digit ms for last part) would
> > be fine? If there are other ideas for the format, just let me know.
>
> I think this makes more sense from an end-user perspective.

Is it better to use "T" instead of "_"? Apparently, that's ISO 8601, which can optionally use separators (YYYY-MM-DDTHH:MM:SS). https://en.wikipedia.org/wiki/ISO_8601#Combined_date_and_time_representations

I was thinking this would not include fractional seconds. Maybe that would mean that the TAP tests would need to sleep(1) at some points...
--
Justin
Re: pgsql: Use pre-fetching for ANALYZE
Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> On 2022-06-02 19:30:16 -0700, Andres Freund wrote:
> > On 2021-03-16 18:48:08 +, Stephen Frost wrote:
> > > Use pre-fetching for ANALYZE
> > >
> > > When we have posix_fadvise() available, we can improve the performance
> > > of an ANALYZE by quite a bit by using it to inform the kernel of the
> > > blocks that we're going to be asking for. Similar to bitmap index
> > > scans, the number of buffers pre-fetched is based off of the
> > > maintenance_io_concurrency setting (for the particular tablespace or,
> > > if not set, globally, via get_tablespace_maintenance_io_concurrency()).
> >
> > I just looked at this as part of debugging a crash / hang in the AIO patch.
> >
> > The code does:
> >
> > 	block_accepted = table_scan_analyze_next_block(scan, targblock,
> > 	                                               vac_strategy);
> >
> > #ifdef USE_PREFETCH
> >
> > 	/*
> > 	 * When pre-fetching, after we get a block, tell the kernel about the
> > 	 * next one we will want, if there's any left.
> > 	 *
> > 	 * We want to do this even if the table_scan_analyze_next_block() call
> > 	 * above decides against analyzing the block it picked.
> > 	 */
> > 	if (prefetch_maximum && prefetch_targblock != InvalidBlockNumber)
> > 		PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_targblock);
> > #endif
> >
> > I.e. we lock a buffer and *then* we prefetch another buffer. That seems
> > like a quite bad idea to me. Why are we doing IO while holding a content
> > lock, if we can avoid it?

At the end, we're doing a posix_fadvise() which is a kernel call but
hopefully wouldn't do actual IO when we call it. Still, agreed that it'd be
better to do that without holding locks and no objection to making such a
change.

> It also seems decidedly not great from a layering POV to do the IO in
> analyze.c. There's no guarantee that the tableam maps blocks in a way
> that's compatible with PrefetchBuffer().
> Yes, the bitmap heap scan code does something similar, but a) that is opt
> in by the AM, b) there's a comment saying it's quite crufty and should be
> fixed.

Certainly open to suggestions. Are you thinking it'd make sense to add a
'prefetch_block' method to TableAmRoutine? Or did you have another thought?

Thanks!

Stephen
Re: [PATCH] Expose port->authn_id to extensions and triggers
On Fri, Jun 3, 2022 at 10:04 AM Tom Lane wrote: > I agree with Robert's complaint that Parallel is far too generic > a term here. Also, the fact that this data is currently in struct > Port seems like an artifact. Why do we call this thing a Port, anyway? I think I'd feel more comfortable here if we were defining what went into which struct on some semantic basis rather than being like, OK, so all the stuff we want to serialize goes into struct #1, and the stuff we don't want to serialize goes into struct #2. I suppose if it's just based on whether or not we want to serialize it, then the placement of future additions will just be based on how people happen to feel about the thing they're adding right at that moment, and there won't be any consistency. One could imagine dividing the Port struct into a couple of different structs, e.g. AuthenticationState: stuff that is needed only during authentication and can be discarded thereafter (e.g. the HBA line, at least if the comment is to be believed) ClientCommunicationState: stuff that is used to communicate with the client but doesn't need to be or can't be shared (e.g. the SSL object itself) ClientConnectionInfo: stuff that someone might want to look at for information purposes at any time (e.g. authn_id, apparently) Then we could serialize the third of these, keep the second around but not serialize it, and free the first once connection setup is complete. -- Robert Haas EDB: http://www.enterprisedb.com
Re: [v15 beta] pg_upgrade failed if earlier executed with -c switch
> On 6 Jun 2022, at 06:17, Michael Paquier wrote: > On Mon, Jun 06, 2022 at 02:38:03AM +0200, Daniel Gustafsson wrote: >> On 5 Jun 2022, at 11:19, Michael Paquier wrote: >>> I have been toying with the idea of a sub-directory named with a >>> timestamp (Unix time, like log_line_prefix's %n but this could be >>> any format) under pg_upgrade_output.d/ and finished with the >>> attached. >> >> I was thinking more along the lines of %m to make it (more) human readable, >> but >> I'm certainly not wedded to any format. > > Neither am I. I would not map exactly to %m as it uses whitespaces, > but something like %Y%m%d_%H%M%S.%03d (3-digit ms for last part) would > be fine? If there are other ideas for the format, just let me know. I think this makes more sense from an end-user perspective. >> As a user I would expect the logs from this current invocation to be removed >> without --retain, and any other older log entries be kept. I think we should >> remove log_opts.logdir and only remove log_opts.rootdir if it is left empty >> after .logdir is removed. > > Okay, however I think you mean log_opts.basedir rather than logdir? > That's simple enough to switch around as pg_check_dir() does this > job. Correct, I mistyped. The cleanup in this version of the patch looks sane to me. -- Daniel Gustafsson https://vmware.com/
Re: oat_post_create expected behavior
On Mon, Jun 6, 2022 at 1:35 PM Jeff Davis wrote: > Out of curiosity, why not? The proposed patch only runs it if the > object access hook is set. Do you see a situation where it would be > confusing that an earlier DDL change is visible? And if so, would it > make more sense to call CCI unconditionally? Well, I think that a fair amount of work has been done previously to cut down on unnecessary CCIs. I suspect Tom in particular is likely to object to adding a whole bunch more of them, and I think that objection would have some merit. I definitely think if we were going to do it, it would need to be unconditional. Otherwise I think we'll end up with bugs, because nobody's going to go test all of that code with and without an object access hook installed every time they tweak some DDL-related code. > Also, would it ever be reasonable for such a hook to call CCI itself? > As you say, it could use SnapshotSelf, but sometimes you might want to > call routines that assume they can use an MVCC snapshot. This question > applies to the OAT_POST_ALTER hook as well as OAT_POST_CREATE. I definitely wouldn't want to warrant that it doesn't break anything. I think the extension can do it at its own risk, but I wouldn't recommend it. OAT_POST_ALTER is unlike OAT_POST_CREATE in that OAT_POST_ALTER documents that it should be called after a CCI, whereas OAT_POST_CREATE does not make a representation either way. > > Possibly there is some work that could be done to ensure > > consistent placement of the calls to post-create hooks so that either > > all of them happen before, or all of them happen after, a CCI has > > occurred, but I'm not sure whether or not that is feasible. > > I like the idea of having a test in place so that we at least know when > something changes. Otherwise it would be pretty easy to break an > extension by adjusting some code. Sure. 
I find writing meaningful tests for this kind of stuff hard, but there are plenty of people around here who are better at figuring out how to test obscure scenarios than I. > > Currently, > > I don't think we promise anything about whether a post-create hook > > call will occur before or after a CCI, which is why > > sepgsql_schema_post_create(), sepgsql_schema_post_create(), and > > sepgsql_attribute_post_create() perform a catalog scan using > > SnapshotSelf, while sepgsql_database_post_create() uses > > get_database_oid(). You might want to adopt a similar technique. > > It would be good to document this a little better though: > > * OAT_POST_CREATE should be invoked just after the object is created. > * Typically, this is done after inserting the primary catalog records > and > * associated dependencies. > > doesn't really give any guidance, while the comment for alter does: > > * OAT_POST_ALTER should be invoked just after the object is altered, > * but before the command counter is incremented. An extension using > the > * hook can use a current MVCC snapshot to get the old version of the > tuple, > * and can use SnapshotSelf to get the new version of the tuple. Yeah, that comment could be made more clear. -- Robert Haas EDB: http://www.enterprisedb.com
Re: oat_post_create expected behavior
On Mon, 2022-06-06 at 10:51 -0400, Robert Haas wrote: > I don't think a proposal to add CommandCounterIncrement() calls just > for the convenience of object access hooks has much chance of being > accepted. Out of curiosity, why not? The proposed patch only runs it if the object access hook is set. Do you see a situation where it would be confusing that an earlier DDL change is visible? And if so, would it make more sense to call CCI unconditionally? Also, would it ever be reasonable for such a hook to call CCI itself? As you say, it could use SnapshotSelf, but sometimes you might want to call routines that assume they can use an MVCC snapshot. This question applies to the OAT_POST_ALTER hook as well as OAT_POST_CREATE. > Possibly there is some work that could be done to ensure > consistent placement of the calls to post-create hooks so that either > all of them happen before, or all of them happen after, a CCI has > occurred, but I'm not sure whether or not that is feasible. I like the idea of having a test in place so that we at least know when something changes. Otherwise it would be pretty easy to break an extension by adjusting some code. > Currently, > I don't think we promise anything about whether a post-create hook > call will occur before or after a CCI, which is why > sepgsql_schema_post_create(), sepgsql_schema_post_create(), and > sepgsql_attribute_post_create() perform a catalog scan using > SnapshotSelf, while sepgsql_database_post_create() uses > get_database_oid(). You might want to adopt a similar technique. It would be good to document this a little better though: * OAT_POST_CREATE should be invoked just after the object is created. * Typically, this is done after inserting the primary catalog records and * associated dependencies. doesn't really give any guidance, while the comment for alter does: * OAT_POST_ALTER should be invoked just after the object is altered, * but before the command counter is incremented. 
An extension using the * hook can use a current MVCC snapshot to get the old version of the tuple, * and can use SnapshotSelf to get the new version of the tuple. Regards, Jeff Davis PS: I added this to the July CF: https://commitfest.postgresql.org/38/3676/
Re: Logging query parameters in auto_explain
Dagfinn Ilmari Mannsåker writes:
> Hi hackers,
>
> Inspired by a question on IRC, I noticed that while the core statement
> logging system gained the option to log statement parameters in PG 13,
> auto_explain was left out.
>
> Here's a patch that adds a corresponding
> auto_explain.log_parameter_max_length config setting, which controls the
> "Query Parameters" node in the logged plan. Just like in core, the
> default is -1, which logs the parameters in full, and 0 disables
> parameter logging, while any other value truncates each parameter to
> that many bytes.

I've added it to the upcoming commitfest:
https://commitfest.postgresql.org/38/3660/

- ilmari
Re: How about a psql backslash command to show GUCs?
> On Jun 6, 2022, at 9:02 AM, Tom Lane wrote: > > Thoughts? I think it depends on your mental model of what \dconfig is showing you. Is it showing you the list of configs that you can SET, or just the list of all configs?
Re: pg_buffercache: add sql test
Greetings, * Daniel Gustafsson (dan...@yesql.se) wrote: > > On 6 Jun 2022, at 15:30, Dong Wook Lee wrote: > > > I just wrote a test code for the `pg_buffercache` extension which > > doesn't not have test code. > > Please add this patch to the next commitfest to make sure it's not lost before > then. > > https://commitfest.postgresql.org/38/ Seems to be there now, at least: https://commitfest.postgresql.org/38/3674/ However, I don't think we should have a 'target version' set for this (and in particular it shouldn't be 15). I'd suggest removing that. Thanks, Stephen
Re: How about a psql backslash command to show GUCs?
On Mon, Jun 6, 2022 at 12:02 PM Tom Lane wrote: > Thoughts? This all seems pretty subjective. As someone who sometimes supports customers, I usually kind of want the customer to give me all the settings, just in case something that didn't seem important to them (or to whoever coded up the \dconfig command) turns out to be relevant. It's easier to ask once and then look for the information you need than to go back and forth asking for more data, and I don't want to have to try to infer things based on what version they are running or how a certain set of binaries was built. But if I already know that the configuration on a given system is basically sane, I probably only care about the parameters with non-default values, and a computed parameter can't be set, so I guess it has a default value by definition. So if the charter for this command is to show only non-default GUCs, then it seems reasonable to leave these out. I think part of the problem here, though, is that one can imagine a variety of charters that might be useful. A user could want to see the parameters that have values in their session that differ from the system defaults, or parameters that have values which differ from the compiled-in defaults, or parameters that can be changed without a restart, or all the pre-computed parameters, or all the parameters that contain "vacuum" in the name, or all the query-planner-related parameters, or all the parameters on which any privileges have been granted. And it's just a judgement call which of those things we ought to try to accommodate in the psql syntax and which ones ought to be done by writing an ad-hoc query against pg_settings. -- Robert Haas EDB: http://www.enterprisedb.com
RE: Sudden database error with COUNT(*) making Query Planner crashes: variable not found in subplan target list
Dear Justin,

Thank you for your quick reply. Unfortunately, the server having this issue
is an Azure Flexible Server. Upgrades are managed by Azure, so I will have
to wait until they release version 13.7.

Is there a procedure to replicate the database and preserve the bug? My
attempts with pg_dump/psql failed (the bug vanishes). If so, then I can
clone the faulty database and upgrade it on a newer version.

Best regards,
Landercy Jean
Re: pg_buffercache: add sql test
> On 6 Jun 2022, at 15:30, Dong Wook Lee wrote: > I just wrote a test code for the `pg_buffercache` extension which > doesn't not have test code. Please add this patch to the next commitfest to make sure it's not lost before then. https://commitfest.postgresql.org/38/ -- Daniel Gustafsson https://vmware.com/
Re: How about a psql backslash command to show GUCs?
Justin Pryzby writes:
> I noticed this is showing "pre-computed" gucs, like:
> shared_memory_size               | 149MB
> shared_memory_size_in_huge_pages | 75
> I'm not opposed to that, but I wonder if that's what's intended / best.

I had suggested upthread that we might want to hide items with source =
'override', but that idea didn't seem to be getting traction.  A different
idea is to hide items with context = 'internal'.  Looking at the items
selected by the current rule in a default installation:

postgres=# SELECT s.name, source, context FROM pg_catalog.pg_settings s
           WHERE s.source <> 'default' AND s.setting IS DISTINCT FROM s.boot_val
           ORDER BY 1;
               name               |        source        |  context
----------------------------------+----------------------+------------
 TimeZone                         | configuration file   | user
 application_name                 | client               | user
 client_encoding                  | client               | user
 config_file                      | override             | postmaster
 data_directory                   | override             | postmaster
 default_text_search_config       | configuration file   | user
 hba_file                         | override             | postmaster
 ident_file                       | override             | postmaster
 lc_messages                      | configuration file   | superuser
 log_timezone                     | configuration file   | sighup
 max_stack_depth                  | environment variable | superuser
 shared_memory_size               | override             | internal
 shared_memory_size_in_huge_pages | override             | internal
 wal_buffers                      | override             | postmaster
(14 rows)

So hiding internal-context items would hit exactly the two you mention, but
hiding override-source items would hit several more.  (I'm kind of wondering
why wal_buffers is showing as "override"; that seems like a quirk.)

Thoughts?

			regards, tom lane
Re: pgcon unconference / impact of block size on performance
On 6/6/22 17:00, Tomas Vondra wrote: > > On 6/6/22 16:27, Jakub Wartak wrote: >> Hi Tomas, >> >>> Hi, >>> >>> At on of the pgcon unconference sessions a couple days ago, I presented a >>> bunch of benchmark results comparing performance with different data/WAL >>> block size. Most of the OLTP results showed significant gains (up to 50%) >>> with >>> smaller (4k) data pages. >> >> Nice. I just saw this > https://wiki.postgresql.org/wiki/PgCon_2022_Developer_Unconference , do > you have any plans for publishing those other graphs too (e.g. WAL block > size impact)? >> > > Well, there's plenty of charts in the github repositories, including the > charts I think you're asking for: > > https://github.com/tvondra/pg-block-bench-pgbench/blob/master/process/heatmaps/xeon/20220406-fpw/16/heatmap-tps.png > > https://github.com/tvondra/pg-block-bench-pgbench/blob/master/process/heatmaps/i5/20220427-fpw/16/heatmap-io-tps.png > > > I admit the charts may not be documented very clearly :-( > >>> This opened a long discussion about possible explanations - I claimed one >>> of the >>> main factors is the adoption of flash storage, due to pretty fundamental >>> differences between HDD and SSD systems. But the discussion concluded with >>> an >>> agreement to continue investigating this, so here's an attempt to support >>> the >>> claim with some measurements/data. >>> >>> Let me present results of low-level fio benchmarks on a couple different HDD >>> and SSD drives. This should eliminate any postgres-related influence (e.g. >>> FPW), >>> and demonstrates inherent HDD/SSD differences. >>> All the SSD results show this behavior - the Optane and Samsung nicely show >>> that 4K is much better (in random write IOPS) than 8K, but 1-2K pages make >>> it >>> worse. >>> >> [..] >> Can you share what Linux kernel version, what filesystem , it's >> mount options and LVM setup were you using if any(?) 
>>
> The PostgreSQL benchmarks were with 5.14.x kernels, with either ext4 or
> xfs filesystems.

I realized I mentioned just two of the devices used for the postgres test,
but this thread is dealing mostly with fio results. So let me list info
about all the devices/filesystems:

i5
--

Intel SSD 320 120GB SATA (SSDSA2CW12)
/dev/sdh1 on /mnt/data type ext4 (rw,noatime)

6x Intel SSD DC S3700 100GB SATA (SSDSC2BA10), LVM RAID0
/dev/md0 on /mnt/raid type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,sunit=16,swidth=96,noquota)

xeon
----

Samsung SSD 860 EVO 2TB SATA (RVT04B6Q)
/dev/sde1 on /mnt/samsung type ext4 (rw,relatime)

Intel Optane 900P 280GB NVMe (SSDPED1D280GA)
/dev/nvme0n1p1 on /mnt/data type ext4 (rw,relatime)

3x Maxtor DiamondMax 21 500GB 7.2k SATA (STM350063), LVM RAID0
/dev/md0 on /mnt/raid type ext4 (rw,relatime,stripe=48)

# mdadm --detail /dev/md0
/dev/md0:
           Version : 1.2
     Creation Time : Fri Aug 31 21:11:48 2018
        Raid Level : raid0
        Array Size : 1464763392 (1396.91 GiB 1499.92 GB)
      Raid Devices : 3
     Total Devices : 3
       Persistence : Superblock is persistent

       Update Time : Fri Aug 31 21:11:48 2018
             State : clean
    Active Devices : 3
   Working Devices : 3
    Failed Devices : 0
     Spare Devices : 0

        Chunk Size : 64K

Consistency Policy : none

              Name : bench2:0  (local to host bench2)
              UUID : 72e48e7b:a75554ea:05952b34:810ed6bc
            Events : 0

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8       33        1      active sync   /dev/sdc1
       2       8       49        2      active sync   /dev/sdd1

Hopefully this is more complete ...

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [PATCH] Expose port->authn_id to extensions and triggers
On Fri, Jun 3, 2022 at 7:36 PM Michael Paquier wrote:
>
> On Fri, Jun 03, 2022 at 10:04:12AM -0400, Tom Lane wrote:
> > I agree with Robert's complaint that Parallel is far too generic
> > a term here. Also, the fact that this data is currently in struct
> > Port seems like an artifact.
> >
> > Don't we have a term for the set of processes comprising a leader
> > plus parallel workers? If we called that set FooGroup, then
> > something like FooGroupSharedInfo would be on-point.
>
> As far as I know, proc.h includes the term "group members", which
> includes the leader and its workers (see CLOG and lock part)?

lmgr/README also refers to "gangs of related processes" and "parallel
groups". So

- GroupSharedInfo
- ParallelGroupSharedInfo
- GangSharedInfo
- SharedLeaderInfo

?

--Jacob
Re: Assertion failure with barriers in parallel hash join
Hi Thomas, Correct. We're running with disabled parallel leader participation and we have to do so, because another custom plan node we built relies on that. That would be great. Anything else I can help with to get this patch in? Thanks! -- David (ServiceNow) On Fri, 3 Jun 2022 at 00:06, Thomas Munro wrote: > On Thu, Jun 2, 2022 at 9:31 PM David Geier wrote: > > We recently encountered the same bug in the field. Oleksii Kozlov > managed to come up with reproduction steps, which reliably trigger it. > Interestingly, the bug does not only manifest as failing assertion, but > also as segmentation fault; in builds with disabled and with enabled (!) > assertions. So it can crash production environments. We applied the > proposed patch v3 from Melanie to the REL_14_3 branch and can confirm that > with the patch neither the assertion nor the segmentation fault still occur. > > Thanks for the report, testing, and for creating the CF entry. > > I assume you are using parallel_leader_participation=off, and the > reason we haven't heard more about this is because few people do that. > By coincidence I was just about to restart a bunch of hash join > projects and have been paging the topic area back into my brain, so > I'll start with another round of testing/analysis of this bug/patch > next week. >
Re: pgcon unconference / impact of block size on performance
On 6/6/22 16:27, Jakub Wartak wrote: > Hi Tomas, > >> Hi, >> >> At on of the pgcon unconference sessions a couple days ago, I presented a >> bunch of benchmark results comparing performance with different data/WAL >> block size. Most of the OLTP results showed significant gains (up to 50%) >> with >> smaller (4k) data pages. > > Nice. I just saw this https://wiki.postgresql.org/wiki/PgCon_2022_Developer_Unconference , do you have any plans for publishing those other graphs too (e.g. WAL block size impact)? > Well, there's plenty of charts in the github repositories, including the charts I think you're asking for: https://github.com/tvondra/pg-block-bench-pgbench/blob/master/process/heatmaps/xeon/20220406-fpw/16/heatmap-tps.png https://github.com/tvondra/pg-block-bench-pgbench/blob/master/process/heatmaps/i5/20220427-fpw/16/heatmap-io-tps.png I admit the charts may not be documented very clearly :-( >> This opened a long discussion about possible explanations - I claimed one of >> the >> main factors is the adoption of flash storage, due to pretty fundamental >> differences between HDD and SSD systems. But the discussion concluded with an >> agreement to continue investigating this, so here's an attempt to support the >> claim with some measurements/data. >> >> Let me present results of low-level fio benchmarks on a couple different HDD >> and SSD drives. This should eliminate any postgres-related influence (e.g. >> FPW), >> and demonstrates inherent HDD/SSD differences. >> All the SSD results show this behavior - the Optane and Samsung nicely show >> that 4K is much better (in random write IOPS) than 8K, but 1-2K pages make it >> worse. >> > [..] > Can you share what Linux kernel version, what filesystem , it's > mount options and LVM setup were you using if any(?) > The PostgreSQL benchmarks were with 5.14.x kernels, with either ext4 or xfs filesystems. 
i5 uses LVM on the 6x SATA SSD devices, with this config:

bench ~ # mdadm --detail /dev/md0
/dev/md0:
           Version : 0.90
     Creation Time : Thu Feb  8 15:05:49 2018
        Raid Level : raid0
        Array Size : 586106880 (558.96 GiB 600.17 GB)
      Raid Devices : 6
     Total Devices : 6
   Preferred Minor : 0
       Persistence : Superblock is persistent

       Update Time : Thu Feb  8 15:05:49 2018
             State : clean
    Active Devices : 6
   Working Devices : 6
    Failed Devices : 0
     Spare Devices : 0

        Chunk Size : 512K

Consistency Policy : none

              UUID : 24c6158c:36454b38:529cc8e5:b4b9cc9d (local to host bench)
            Events : 0.1

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
       2       8       33        2      active sync   /dev/sdc1
       3       8       49        3      active sync   /dev/sdd1
       4       8       65        4      active sync   /dev/sde1
       5       8       81        5      active sync   /dev/sdf1

bench ~ # mount | grep md0
/dev/md0 on /mnt/raid type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,sunit=16,swidth=96,noquota)

and the xeon just uses ext4 on the device directly:

/dev/nvme0n1p1 on /mnt/data type ext4 (rw,relatime)

> I've hastily tried your script on 4VCPU/32GB RAM/1xNVMe device @
> ~900GB (AWS i3.xlarge), kernel 5.x, ext4 defaults, no LVM, libaio
> only, fio deviations: runtime -> 1min, 64GB file, 1 iteration only.
> Results are attached, w/o graphs.
>
>> Now, compare this to the SSD. There are some differences between
>> the models, manufacturers, interface etc. but the impact of page
>> size on IOPS is pretty clear. On the Optane you can get +20-30% by
>> using 4K pages, on the Samsung it's even more, etc. This means that
>> workloads dominated by random I/O get significant benefit from
>> smaller pages.
> Yup, same here, reproduced, 1.42x faster on writes:
> [root@x ~]# cd libaio/nvme/randwrite/128/  # 128=queue depth
> [root@x 128]# grep -r "write:" * | awk '{print $1, $4, $5}' | sort -n
> 1k/1.txt: bw=24162KB/s, iops=24161,
> 2k/1.txt: bw=47164KB/s, iops=23582,
> 4k/1.txt: bw=280450KB/s, iops=70112, <<<
> 8k/1.txt: bw=393082KB/s, iops=49135,
> 16k/1.txt: bw=393103KB/s, iops=24568,
> 32k/1.txt: bw=393283KB/s, iops=12290,
>
> BTW it's interesting to compare to your's Optane 900P result (same
> two high bars for IOPS @ 4,8kB), but in my case it's even more import
> to select 4kB so it behaves more like Samsung 860 in your case

Thanks. Interesting!

> # 1.41x on randreads
> [root@x ~]# cd libaio/nvme/randread/128/  # 128=queue depth
> [root@x 128]# grep -r "read :" | awk '{print $1, $5, $6}' | sort -n
> 1k/1.txt: bw=169938KB/s, iops=169937,
> 2k/1.txt: bw=376653KB/s, iops=188326,
> 4k/1.txt: bw=691529KB/s, iops=172882, <<<
> 8k/1.txt: bw=976916KB/s, iops=122114,
> 16k/1.txt: bw=990524KB/s, iops=61907,
> 32k/1.txt: bw=974318KB/s, iops=30447,
>
> I think that the above just a demonstration of
Re: oat_post_create expected behavior
On Thu, Jun 2, 2022 at 6:37 PM Mary Xu wrote: > I was using an object access hook for oat_post_create access while creating > an extension and expected that I would be able to query for the newly created > extension with get_extension_oid(), but it was returning InvalidOid. However, > the same process works for triggers, so I was wondering what the expected > behavior is? > From the documentation in objectaccess.h, it doesn't mention anything about > CommandCounterIncrement() for POST_CREATE which implied to me that it wasn't > something I would need to worry about. > One option I thought of was this patch where CCI is called before the access > hook so that the new tuple is visible in the hook. Another option would be to > revise the documentation to reflect the expected behavior. I don't think a proposal to add CommandCounterIncrement() calls just for the convenience of object access hooks has much chance of being accepted. Possibly there is some work that could be done to ensure consistent placement of the calls to post-create hooks so that either all of them happen before, or all of them happen after, a CCI has occurred, but I'm not sure whether or not that is feasible. Currently, I don't think we promise anything about whether a post-create hook call will occur before or after a CCI, which is why sepgsql_schema_post_create(), sepgsql_schema_post_create(), and sepgsql_attribute_post_create() perform a catalog scan using SnapshotSelf, while sepgsql_database_post_create() uses get_database_oid(). You might want to adopt a similar technique. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Add index scan progress to pg_stat_progress_vacuum
On Thu, May 26, 2022 at 11:43 AM Masahiko Sawada wrote:
> Another idea I came up with is that we can wait for all index vacuums
> to finish while checking and updating the progress information, and
> then calls WaitForParallelWorkersToFinish after confirming all index
> status became COMPLETED. That way, we don’t need to change the
> parallel query infrastructure. What do you think?

+1 from me. It doesn't seem to me that we should need to add something like
parallel_vacuum_progress_callback in order to solve this problem, because
the parallel index vacuum code could just do the waiting itself, as you
propose here. The question Sami asks in his reply is a good one, though --
who is to say that the leader only needs to update progress at the end,
once it's finished the index it's handling locally? There will need to be a
callback system of some kind to allow the leader to update progress as
other workers finish, even if the leader is still working.

I am not too sure that the idea of using the vacuum delay points is the
best plan. I think we should try to avoid piggybacking on such general
infrastructure if we can, and instead look for a way to tie this to
something that is specific to parallel vacuum. However, I haven't studied
the problem so I'm not sure whether there's a reasonable way to do that.

--
Robert Haas
EDB: http://www.enterprisedb.com
RE: pgcon unconference / impact of block size on performance
Hi Tomas,

> Hi,
>
> At on of the pgcon unconference sessions a couple days ago, I presented a
> bunch of benchmark results comparing performance with different data/WAL
> block size. Most of the OLTP results showed significant gains (up to 50%)
> with smaller (4k) data pages.

Nice. I just saw this
https://wiki.postgresql.org/wiki/PgCon_2022_Developer_Unconference , do you
have any plans for publishing those other graphs too (e.g. WAL block size
impact)?

> This opened a long discussion about possible explanations - I claimed one
> of the main factors is the adoption of flash storage, due to pretty
> fundamental differences between HDD and SSD systems. But the discussion
> concluded with an agreement to continue investigating this, so here's an
> attempt to support the claim with some measurements/data.
>
> Let me present results of low-level fio benchmarks on a couple different
> HDD and SSD drives. This should eliminate any postgres-related influence
> (e.g. FPW), and demonstrates inherent HDD/SSD differences.
> All the SSD results show this behavior - the Optane and Samsung nicely
> show that 4K is much better (in random write IOPS) than 8K, but 1-2K
> pages make it worse.
>
[..]

Can you share what Linux kernel version, what filesystem, its mount options
and LVM setup you were using, if any?

I've hastily tried your script on 4VCPU/32GB RAM/1xNVMe device @ ~900GB
(AWS i3.xlarge), kernel 5.x, ext4 defaults, no LVM, libaio only, fio
deviations: runtime -> 1min, 64GB file, 1 iteration only. Results are
attached, w/o graphs.

> Now, compare this to the SSD. There are some differences between the
> models, manufacturers, interface etc. but the impact of page size on IOPS
> is pretty clear. On the Optane you can get +20-30% by using 4K pages, on
> the Samsung it's even more, etc. This means that workloads dominated by
> random I/O get significant benefit from smaller pages.
Yup, same here, reproduced, 1.42x faster on writes:

[root@x ~]# cd libaio/nvme/randwrite/128/ # 128=queue depth
[root@x 128]# grep -r "write:" * | awk '{print $1, $4, $5}' | sort -n
1k/1.txt: bw=24162KB/s, iops=24161,
2k/1.txt: bw=47164KB/s, iops=23582,
4k/1.txt: bw=280450KB/s, iops=70112, <<<
8k/1.txt: bw=393082KB/s, iops=49135,
16k/1.txt: bw=393103KB/s, iops=24568,
32k/1.txt: bw=393283KB/s, iops=12290,

BTW it's interesting to compare to your Optane 900P result (same two high bars for IOPS @ 4,8kB), but in my case it's even more important to select 4kB, so it behaves more like the Samsung 860 in your case.

# 1.41x on randreads
[root@x ~]# cd libaio/nvme/randread/128/ # 128=queue depth
[root@x 128]# grep -r "read :" | awk '{print $1, $5, $6}' | sort -n
1k/1.txt: bw=169938KB/s, iops=169937,
2k/1.txt: bw=376653KB/s, iops=188326,
4k/1.txt: bw=691529KB/s, iops=172882, <<<
8k/1.txt: bw=976916KB/s, iops=122114,
16k/1.txt: bw=990524KB/s, iops=61907,
32k/1.txt: bw=974318KB/s, iops=30447,

I think the above is just a demonstration of device bandwidth saturation: 32k * 30k IOPS =~ 1GB/s random reads. Given that the DB would be tuned @ 4kB for the app (OLTP), wouldn't the occasional Parallel Seq Scans for "critical reports" then only achieve ~70% of what they could achieve on 8kB, correct? (I'm assuming most real systems are really OLTP but with some reporting/data exporting needs.)

One way or another it would be very nice to be able to select the tradeoff using initdb(1) without the need to recompile, which then begs for some initdb --calibrate /mnt/nvme (effective_io_concurrency, DB page size, ...). Do you envision any plans for this, or are we still in need of gathering more info on exactly why this happens? (perf reports?)

Also, have you guys discussed at that meeting any long-term future plans for the storage layer, by any chance? If sticking to 4kB pages on DB page size / hardware sector size, wouldn't it also be possible to win by disabling FPWs in the longer run using io_uring (assuming O_DIRECT | O_ATOMIC one day)?
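The bandwidth-saturation claim above can be checked arithmetically: past the knee, IOPS halves each time the block size doubles, so bandwidth (IOPS x block size) plateaus at the device ceiling. A quick sketch using the randread numbers quoted above:

```python
# Randread figures quoted above: block size in KiB -> IOPS.
randread_iops = {1: 169937, 2: 188326, 4: 172882, 8: 122114, 16: 61907, 32: 30447}

for kib, iops in sorted(randread_iops.items()):
    mib_per_s = kib * iops / 1024  # bw in KiB/s, converted to MiB/s
    print(f"{kib:2d}k: {mib_per_s:7.0f} MiB/s")

# 32 KiB * 30447 IOPS ~= 0.93 GiB/s -- roughly the ~1 GB/s device ceiling,
# so past 8k larger blocks trade away IOPS without gaining bandwidth,
# while 4k still delivers far more IOPS at lower total bandwidth.
```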
I recall that Thomas M. was researching O_ATOMIC, I think he wrote some of that up pretty nicely in [1]

[1] - https://wiki.postgresql.org/wiki/FreeBSD/AtomicIO
Re: Sudden database error with COUNT(*) making the Query Planner crash: variable not found in subplan target list
On Mon, Jun 06, 2022 at 09:34:24AM +, Jean Landercy - BEEODIVERSITY wrote: > Faulty setup is about: > > SELECT version(); > > -- PostgreSQL 13.6 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu > 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit Please check if the problem occurs in v13.7 https://www.postgresql.org/message-id/2197859.1644623...@sss.pgh.pa.us https://www.postgresql.org/message-id/flat/2121219.1644607692%40sss.pgh.pa.us#190c43702a91dbd0509ba545dbbab58d
Re: How about a psql backslash command to show GUCs?
On Tue, Apr 12, 2022 at 11:19:40AM -0400, Tom Lane wrote: > "Jonathan S. Katz" writes: > > On 4/11/22 4:11 PM, Tom Lane wrote: > >> This idea does somewhat address my unhappiness upthread about printing > >> values with source = 'internal', but I see that it gets confused by > >> some GUCs with custom show hooks, like unix_socket_permissions. > >> Maybe it needs to be "source != 'default' AND setting != boot_val"? > > > Running through a few GUCs, that seems reasonable. Happy to test the > > patch out prior to commit to see if it renders better. > > It'd just look like this, I think. I see from looking at guc.c that > boot_val can be NULL, so we'd better use IS DISTINCT FROM. I noticed this is showing "pre-computed" gucs, like: shared_memory_size| 149MB shared_memory_size_in_huge_pages | 75 I'm not opposed to that, but I wonder if that's what's intended / best. -- Justin
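The reason Tom reaches for IS DISTINCT FROM is that a plain `<>` comparison yields NULL (not false) when `boot_val` is NULL, so those rows silently drop out of the WHERE clause. A quick illustration of the NULL-safe comparison semantics, with Python's `None` standing in for SQL NULL:

```python
def neq(a, b):
    """SQL '<>': yields NULL (None) if either operand is NULL."""
    if a is None or b is None:
        return None
    return a != b

def is_distinct_from(a, b):
    """SQL 'IS DISTINCT FROM': NULL-safe inequality, never yields NULL."""
    if a is None and b is None:
        return False
    if a is None or b is None:
        return True
    return a != b

# setting vs boot_val, where boot_val can be NULL in guc.c:
print(neq("on", None))               # None -> row filtered out by WHERE
print(is_distinct_from("on", None))  # True -> row kept
print(is_distinct_from(None, None))  # False -> identical, row excluded
```

This is why `setting IS DISTINCT FROM boot_val` keeps GUCs whose boot value is NULL, whereas `setting != boot_val` would hide them.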
pg_buffercache: add sql test
Hi hackers,

I just wrote test code for the `pg_buffercache` extension, which currently doesn't have any tests. The SQL queries check that the buffer cache results are the same when `make installcheck` is performed.

---
regards, Lee Dong Wook.
Re: pg_rewind: warn when checkpoint hasn't happened after promotion
On Mon, Jun 6, 2022 at 1:26 AM Kyotaro Horiguchi wrote:
> At Sat, 4 Jun 2022 19:09:41 +0530, Bharath Rupireddy wrote in
> > On Sat, Jun 4, 2022 at 6:29 PM James Coleman wrote:
> > > A few weeks back I sent a bug report [1] directly to the -bugs mailing
> > > list, and I haven't seen any activity on it (maybe this is because I
> > > emailed directly instead of using the form?), but I got some time to
> > > take a look and concluded that a first-level fix is pretty simple.
> > >
> > > A quick background refresher: after promoting a standby, rewinding the
> > > former primary requires that a checkpoint have been completed on the
> > > new primary after promotion. This is correctly documented. However
> > > pg_rewind incorrectly reports to the user that a rewind isn't
> > > necessary because the source and target are on the same timeline.
> ...
> > > Attached is a patch that detects this condition and reports it as an
> > > error to the user.
>
> I have some random thoughts on this.
>
> There could be a problem in the case of a gracefully shut-down
> old primary, so I think it is worth doing something if it can be done in a
> simple way.
>
> However, I don't think we can simply rely on minRecoveryPoint to
> detect that situation, since it won't be reset on a standby. A standby
> also still can be the upstream of a cascading standby. So, as
> discussed in the thread for the comment [2], what we can do here would be
> simply waiting for the timelineID to advance, maybe having a timeout.

To confirm I'm following you correctly, you're envisioning a situation like:

- Primary A
- Replica B replicating from primary
- Replica C replicating from replica B

then on failover from A to B you end up with:

- Primary B
- Replica C replicating from primary
- [needs rewind] A

and you try to rewind A from C as the source?

> In a case of single-step replication set, a checkpoint request to the
> primary makes the end-of-recovery checkpoint fast.
> It won't work as expected in cascading replicas, but it might be acceptable.

"Won't work as expected" because there's no way to guarantee replication is caught up or even advancing?

> > > In the spirit of the new-ish "ensure shutdown" functionality I could
> > > imagine extending this to automatically issue a checkpoint when this
> > > situation is detected. I haven't started to code that up, however,
> > > wanting to first get buy-in on that.
> > >
> > > 1: https://www.postgresql.org/message-id/CAAaqYe8b2DBbooTprY4v=bized9qbqvlq+fd9j617eqfjk1...@mail.gmail.com
> >
> > Thanks. I had a quick look over the issue and patch - just a thought -
> > can't we let pg_rewind issue a checkpoint on the new primary instead
> > of erroring out, maybe optionally? It might sound too much, but it helps
> > pg_rewind to be self-reliant, i.e. avoiding an external actor having to detect
> > the error and issue a checkpoint on the new primary to be able to
> > successfully run pg_rewind on the old primary and repair it to use it
> > as a new standby.
>
> At the time of the discussion [2], the hindrance was that it required
> superuser privileges. Now that has been narrowed down
> to the pg_checkpointer privilege.
>
> If we know that the timeline IDs are different, we don't need to wait
> for a checkpoint.

Correct.

> It seems to me that the exit status is significant. pg_rewind exits
> with 1 when an invalid option is given. I don't think it is great if
> we report this state by the same code.

I'm happy to change that; I only chose "1" as a placeholder for "non-zero exit status".

> I don't think we always want to request a non-spreading checkpoint.

I'm not familiar with the terminology "non-spreading checkpoint".

> [2] https://www.postgresql.org/message-id/flat/CABUevEz5bpvbwVsYCaSMV80CBZ5-82nkMzbb%2BBu%3Dh1m%3DrLdn%3Dg%40mail.gmail.com

I read through that thread, and one interesting idea stuck out to me: making "timeline IDs are the same" an error exit status.
On the one hand that makes a certain amount of sense because it's unexpected. But on the other hand there are entirely legitimate situations where upon failover the timeline IDs happen to match (e.g., for us it happens some percentage of the time naturally, as we are using sync replication and failovers often involve STONITHing the original primary, so it's entirely possible that the promoted replica begins with exactly the same WAL ending LSN as the primary before it stopped).

Thanks, James Coleman
Re: pg_rewind: warn when checkpoint hasn't happened after promotion
On Sat, Jun 4, 2022 at 9:39 AM Bharath Rupireddy wrote:
> On Sat, Jun 4, 2022 at 6:29 PM James Coleman wrote:
> > A few weeks back I sent a bug report [1] directly to the -bugs mailing
> > list, and I haven't seen any activity on it (maybe this is because I
> > emailed directly instead of using the form?), but I got some time to
> > take a look and concluded that a first-level fix is pretty simple.
> >
> > A quick background refresher: after promoting a standby, rewinding the
> > former primary requires that a checkpoint have been completed on the
> > new primary after promotion. This is correctly documented. However
> > pg_rewind incorrectly reports to the user that a rewind isn't
> > necessary because the source and target are on the same timeline.
> >
> > Specifically, this happens when the control file on the newly promoted
> > server looks like:
> >
> > Latest checkpoint's TimeLineID: 4
> > Latest checkpoint's PrevTimeLineID: 4
> > ...
> > Min recovery ending loc's timeline: 5
> >
> > Attached is a patch that detects this condition and reports it as an
> > error to the user.
> >
> > In the spirit of the new-ish "ensure shutdown" functionality I could
> > imagine extending this to automatically issue a checkpoint when this
> > situation is detected. I haven't started to code that up, however,
> > wanting to first get buy-in on that.
> >
> > 1: https://www.postgresql.org/message-id/CAAaqYe8b2DBbooTprY4v=bized9qbqvlq+fd9j617eqfjk1...@mail.gmail.com
>
> Thanks. I had a quick look over the issue and patch - just a thought -
> can't we let pg_rewind issue a checkpoint on the new primary instead
> of erroring out, maybe optionally? It might sound too much, but it helps
> pg_rewind to be self-reliant, i.e. avoiding an external actor having to detect
> the error and issue a checkpoint on the new primary to be able to
> successfully run pg_rewind on the old primary and repair it to use it
> as a new standby.
That's what I had suggested as a "further improvement" option in the last paragraph :) But I think agreement on this more basic solution would still be good (even if I add the automatic checkpointing in this thread); given we currently explicitly mis-inform the user of pg_rewind, I think this is a bug that should be considered for backpatching, and the simpler "fail if detected" patch is probably the only thing we could backpatch. Thanks for taking a look, James Coleman
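The mis-detection under discussion can be made concrete with a small sketch. This is an illustrative Python model, not pg_rewind's actual code (the field names and return strings are assumptions): after promotion but before the first checkpoint, the control file's latest-checkpoint timeline still shows the old value while the minimum-recovery-point timeline already shows the new one, so a check based only on the checkpoint timeline wrongly concludes no rewind is needed.

```python
from dataclasses import dataclass

@dataclass
class ControlFile:
    checkpoint_tli: int    # "Latest checkpoint's TimeLineID"
    min_recovery_tli: int  # "Min recovery ending loc's timeline"

def rewind_decision(target: ControlFile, source: ControlFile) -> str:
    # Promotion bumps min_recovery_tli immediately, but checkpoint_tli
    # only advances at the first post-promotion checkpoint -- that is
    # the window the proposed patch detects and reports.
    if source.min_recovery_tli > source.checkpoint_tli:
        return "error: source was promoted but no checkpoint has completed yet"
    if source.checkpoint_tli == target.checkpoint_tli:
        return "no rewind required: source and target are on the same timeline"
    return "rewind required"

# New primary right after promotion (checkpoint TLI 4, min recovery TLI 5),
# matching the control-file excerpt in the report above.
new_primary = ControlFile(checkpoint_tli=4, min_recovery_tli=5)
old_primary = ControlFile(checkpoint_tli=4, min_recovery_tli=4)
print(rewind_decision(old_primary, new_primary))
```

Without the first condition, the sketch falls through to the equal-timeline branch and reports "no rewind required" — exactly the misleading answer the patch aims to replace with an error.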
RE: Multi-Master Logical Replication
Dear hackers,

I found another use-case for LRG. It might be helpful for migration.

LRG for migration
--
LRG may be helpful for machine migration, OS upgrades, or upgrades of PostgreSQL itself. Assume that users want to migrate a database to another environment, e.g., PG16 on RHEL7 to PG18 on RHEL8. Users must copy all data into the new server and catch up on all changes. In this case streaming replication cannot be used because it requires the same OS and the same PostgreSQL major version. Moreover, it is desirable to be able to return to the original environment at any time in case of application or other environmental deficiencies.

Operation steps with LRG
--
LRG is appropriate for this situation. The following workflow is what users must do:

1. Copy the table definitions to the newer node (PG18), via pg_dump/pg_restore
2. Execute lrg_create() in the older node (PG16)
3. Execute lrg_node_attach() in PG18
=== data will be shared here ===
4. Change the connection of the user application to PG18
5. Check whether any ERROR is raised. If some ERRORs are raised, users can change the connection back to PG16.
6. Remove the created node group if the application works well.

These operations may reduce system downtime due to incompatibilities associated with version upgrades.

Best Regards, Hayato Kuroda FUJITSU LIMITED
Sudden database error with COUNT(*) making the Query Planner crash: variable not found in subplan target list
Dear PostgreSQL Hacker Community,

I am facing a tricky bug which makes the Query Planner crash when using the COUNT(*) function. Without any upgrade, suddenly a table of a database instance could not be queried this way:

SELECT COUNT(*) FROM items;
-- ERROR: variable not found in subplan target list
-- SQL state: XX000

Message and behaviour seem related to the Query Planner:

EXPLAIN SELECT COUNT(*) FROM items;
-- ERROR: variable not found in subplan target list
-- SQL state: XX000

Looks like a column name could not be found (see https://github.com/postgres/postgres/blob/ce4f46fdc814eb1b704d81640f6d8f03625d0f53/src/backend/optimizer/plan/setrefs.c#L2967-L2972) in some specific context that is somehow hard to reproduce. Interesting facts:

SELECT COUNT(id) FROM items; -- 213
SELECT COUNT(*) FROM items WHERE id > 0; -- 213

work as expected. I can see that other people are recently facing a similar problem (https://www.postgresql.org/message-id/flat/4c347490-d734-5fdd-d613-1327601b4e7e%40mit.edu). If it is the same bug then it is not related to the PGroonga extension, as I don't use it at all.

Anyway, the bug is difficult to reproduce in my application. At the time of writing, I could only isolate it on a specific database and I could not draw an MCVE from it. I am looking for help to make it reproducible and feed your knowledge base. My first guess was to open a post on SO (see for details https://stackoverflow.com/questions/72498741/how-can-i-reproduce-a-database-context-to-debug-a-tricky-postgresql-error-vari), but digging deeper into the investigation it seems it will require people with strong insight into how PostgreSQL actually works under the hood. Therefore, I chose this specific mailing list.

The bug is tricky to reproduce; I could not manage to replicate it elsewhere (dump/restore does not preserve it). Anyway, it makes my database unusable and looks like a potential bug for your product and applications relying on it.
Faulty setup is about:

SELECT version();
-- PostgreSQL 13.6 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit

SELECT extname, extversion FROM pg_extension;
-- "plpgsql"  "1.0"
-- "postgis"  "3.1.1"

By now, the only workarounds I have found are:
* Dump the database and recreate a new instance (the problem seems to vanish, but there is no guarantee it is solved or that it will not happen later on);
* Add a dummy filter on all queries (more a trick than a solution).

I am writing to this mailing list to bring it to your attention. I'll be happy to help you investigate it deeper.

Best regards, Landercy Jean
RE: Handle infinite recursion in logical replication setup
On Mon, Jun 6, 2022 1:14 PM vignesh C wrote:
>
> The attached v18 patch has the fixes for the same.

Thanks for updating the patch, here are some comments.

0002 patch
==
1.
+
+origin (string)
+
+
It may be better if the type of the "origin" parameter were an enum, as it cannot be any string and only has two valid values.

2.
@@ -607,6 +626,11 @@ CreateSubscription(ParseState *pstate, CreateSubscriptionStmt *stmt,
 LOGICALREP_TWOPHASE_STATE_PENDING :
 LOGICALREP_TWOPHASE_STATE_DISABLED);
 values[Anum_pg_subscription_subdisableonerr - 1] = BoolGetDatum(opts.disableonerr);
+ if (opts.origin)
+ values[Anum_pg_subscription_suborigin - 1] =
+ CStringGetTextDatum(opts.origin);
+ else
+ nulls[Anum_pg_subscription_suborigin - 1] = true;
 values[Anum_pg_subscription_subconninfo - 1] = CStringGetTextDatum(conninfo);
 if (opts.slot_name)

The documentation for "CREATE SUBSCRIPTION" says the default value of "origin" is "any", so why not set suborigin to "any" when the user doesn't specify this parameter?

0003 patch
==
1.
@@ -300,6 +310,11 @@ CREATE SUBSCRIPTION subscription_namefalse.
+
+ There is some interaction between the origin
+ parameter and copy_data parameter. Refer to the
+ for details.
+

I think this change should be put together with the "origin" parameter, instead of "disable_on_error".

0004 patch
==
1.
+
+Now the bidirectional logical replication setup is complete between
+node1, node2 and
+node2. Any subsequent changes in one node will
+replicate the changes to the other nodes.
+

I think "node1, node2 and node2" should be "node1, node2 and node3".

Regards, Shi yu
Re: [RFC] building postgres with meson -v8
On Tue, May 24, 2022 at 08:08:26PM +0200, Peter Eisentraut wrote:
> On 18.05.22 21:48, Andres Freund wrote:
>> - GETTIMEOFDAY_1ARG - test doesn't exist - I suspect it might not be
>> necessary
>
> Might be obsolete, consider removing.

I just came across this one independently of what you are doing for meson, and based on a lookup of the buildfarm, I think that it can be removed. One reference about GETTIMEOFDAY_1ARG on the -hackers list comes from here, from 20 years ago:
https://www.postgresql.org/message-id/a1eeu5$1koe$1...@news.tht.net
-- Michael
Re: Ignoring BRIN for HOT updates seems broken
On Mon, Jun 06, 2022 at 09:08:08AM +0200, Tomas Vondra wrote:
> Attached is a patch reverting both commits (5753d4ee32 and fe60b67250).
> This changes the IndexAmRoutine struct, so it's an ABI break. That's not
> great post-beta :-( In principle we might also leave amhotblocking in
> the struct but ignore it in the code (and treat it as false), but that
> seems weird and it's going to be a pain when backpatching. Opinions?

I don't think that you need to worry about ABI breakages now in beta, because that's the period of time where we can still change things and shape the code in its best way for prime time. It depends on the change, of course, but what you are doing, by removing the field, looks right to me here.
-- Michael
Re: should check interrupts in BuildRelationExtStatistics ?
On Sat, Jun 04, 2022 at 08:42:33PM -0500, Justin Pryzby wrote:
> The fix seems to be to CHECK_FOR_INTERRUPTS() within multi_sort_compare().
> That would supercede the other two CHECK_FOR_INTERRUPTS I'd proposed, and
> handle mcv, depends, and ndistinct all at once.

Hmm. I have to admit that adding a CFI() in multi_sort_compare() stresses me a bit as it is dependent on the number of rows involved, and it can be used as a qsort routine.
-- Michael
Re: Ignoring BRIN for HOT updates seems broken
On 6/1/22 22:38, Robert Haas wrote: > On Sat, May 28, 2022 at 4:51 PM Tomas Vondra > wrote: >> Yeah, I think that might/should work. We could still create the HOT >> chain, but we'd have to update the BRIN indexes. But that seems like a >> fairly complicated change to be done this late for PG15. > > Yeah, I think a revert is better for now. But I agree that the basic > idea seems salvageable. I think that the commit message is correct > when it states that "When determining whether an index update may be > skipped by using HOT, we can ignore attributes indexed only by BRIN > indexes." However, that doesn't mean that we can ignore the need to > update those indexes. In that regard, the commit message makes it > sound like all is well, because it states that "the page range summary > will be updated anyway" which reads to me like the indexes are in fact > getting updated. Your example, however, seems to show that the indexes > are not getting updated. > Yeah, agreed :-( I agree we can probably salvage some of the idea, but it's far too late for major reworks in PG15. Attached is a patch reverting both commits (5753d4ee32 and fe60b67250). This changes the IndexAmRoutine struct, so it's an ABI break. That's not great post-beta :-( In principle we might also leave amhotblocking in the struct but ignore it in the code (and treat it as false), but that seems weird and it's going to be a pain when backpatching. Opinions? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL CompanyFrom 22911ff0df0284bd9c97d4971ab10332444c528d Mon Sep 17 00:00:00 2001 From: Tomas Vondra Date: Mon, 6 Jun 2022 08:21:36 +0200 Subject: [PATCH] Revert changes in HOT handling of BRIN indexes This reverts commits 5753d4ee32 and fe60b67250 that modified HOT to ignore BRIN indexes. The commit message for 5753d4ee32 claims that: When determining whether an index update may be skipped by using HOT, we can ignore attributes indexed only by BRIN indexes. 
There are no index pointers to individual tuples in BRIN, and the page range summary will be updated anyway as it relies on visibility info.

This is partially incorrect - it's true BRIN indexes don't point to individual tuples, so HOT chains are not an issue, but the visibility info is not sufficient to keep the index up to date. This can easily result in corrupted indexes, as demonstrated in the hackers thread.

This does not mean relaxing the HOT restrictions for BRIN is a lost cause, but it needs to handle the two aspects (allowing HOT chains and updating the page range summaries) separately. But that requires major changes, and it's too late for that in the current dev cycle.

Reported-by: Tomas Vondra
Discussion: https://postgr.es/m/05ebcb44-f383-86e3-4f31-0a97a5563...@enterprisedb.com
---
 doc/src/sgml/indexam.sgml                        |  11 -
 src/backend/access/brin/brin.c                   |   1 -
 src/backend/access/gin/ginutil.c                 |   1 -
 src/backend/access/gist/gist.c                   |   1 -
 src/backend/access/hash/hash.c                   |   1 -
 src/backend/access/heap/heapam.c                 |   2 +-
 src/backend/access/nbtree/nbtree.c               |   1 -
 src/backend/access/spgist/spgutils.c             |   1 -
 src/backend/utils/cache/relcache.c               |  53 ++--
 src/include/access/amapi.h                       |   2 -
 src/include/utils/rel.h                          |   3 +-
 src/include/utils/relcache.h                     |   4 +-
 .../modules/dummy_index_am/dummy_index_am.c      |   1 -
 src/test/regress/expected/brin.out               |  58 -
 src/test/regress/expected/stats.out              | 242 --
 src/test/regress/sql/brin.sql                    |  36 ---
 src/test/regress/sql/stats.sql                   | 111
 17 files changed, 27 insertions(+), 502 deletions(-)

diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index d4163c96e9f..cf359fa9ffd 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -126,8 +126,6 @@ typedef struct IndexAmRoutine
     bool        amcaninclude;
     /* does AM use maintenance_work_mem? */
     bool        amusemaintenanceworkmem;
-    /* does AM block HOT update? */
-    bool        amhotblocking;
     /* OR of parallel vacuum flags */
     uint8       amparallelvacuumoptions;
     /* type of data stored in index, or InvalidOid if variable */
@@ -248,15 +246,6 @@ typedef struct IndexAmRoutine
     null, independently of amoptionalkey.

-    The amhotblocking flag indicates whether the
-    access method blocks HOT when an indexed attribute is
-    updated. Access methods without pointers to individual tuples (like
-    BRIN) may allow HOT even in this
-    case. This does not apply to attributes referenced in index predicates;
-    an update of such attribute always disables HOT.
-
diff --git a/src/backend/access/brin/brin.c
index 0de1441dc6d..e88f7efa7e4 100644
---
Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)
Hi,

Here is the updated patch, which fixes the previous comments discussed in this thread. I am sorry for the long gap in the discussion. Kindly let me know if I have missed any of the comments or anything new.

Thanks & Regards, Nitin Jadhav

On Fri, Mar 18, 2022 at 4:52 PM Nitin Jadhav wrote:
> > > > > I don't get it. The checkpoint flags and the view flags (set by
> > > > pgstat_progrss_update*) are different, so why can't we add this flag to the
> > > > view flags? The fact that checkpointer.c doesn't update the passed flag and
> > > > instead look in the shmem to see if CHECKPOINT_IMMEDIATE has been set since is
> > > > an implementation detail, and the view shouldn't focus on which flags were
> > > > initially passed to the checkpointer but instead which flags the checkpointer
> > > > is actually enforcing, as that's what the user should be interested in. If you
> > > > want to store it in another field internally but display it in the view with
> > > > the rest of the flags, I'm fine with it.
> > >
> > > Just to be in sync with the way code behaves, it is better not to
> > > update the next checkpoint request's CHECKPOINT_IMMEDIATE with the
> > > current checkpoint 'flags' field. Because the current checkpoint
> > > starts with a different set of flags and when there is a new request
> > > (with CHECKPOINT_IMMEDIATE), it just processes the pending operations
> > > quickly to take up next requests. If we update this information in the
> > > 'flags' field of the view, it says that the current checkpoint is
> > > started with CHECKPOINT_IMMEDIATE which is not true.
> >
> > Which is why I suggested to only take into account CHECKPOINT_REQUESTED (to
> > be able to display that a new checkpoint was requested)
>
> I will take care in the next patch.
>
> > > Hence I had
> > > thought of adding a new field ('next flags' or 'upcoming flags') which
> > > contain all the flag values of new checkpoint requests.
> > > This field
> > > indicates whether the current checkpoint is throttled or not and also
> > > it indicates there are new requests.
> >
> > I'm not opposed to having such a field, I'm opposed to having a view with "the
> > current checkpoint is throttled but if there are some flags in the next
> > checkpoint flags and those flags contain checkpoint immediate then the current
> > checkpoint isn't actually throttled anymore" behavior.
>
> I understand your point and I also agree that it becomes difficult for
> the user to understand the context.
>
> > and
> > CHECKPOINT_IMMEDIATE, to be able to display that the current checkpoint isn't
> > throttled anymore if it were.
> >
> > I still don't understand why you want so much to display "how the checkpoint
> > was initially started" rather than "how the checkpoint is really behaving right
> > now". The whole point of having a progress view is to have something dynamic
> > that reflects the current activity.
>
> As of now I will not consider adding this information to the view. If
> required and nobody opposes having this included in the 'flags' field
> of the view, then I will consider adding.
>
> Thanks & Regards,
> Nitin Jadhav
>
> On Mon, Mar 14, 2022 at 5:16 PM Julien Rouhaud wrote:
> >
> > On Mon, Mar 14, 2022 at 03:16:50PM +0530, Nitin Jadhav wrote:
> > > > > I am not suggesting
> > > > > removing the existing 'flags' field of pg_stat_progress_checkpoint
> > > > > view and adding a new field 'throttled'. The content of the 'flags'
> > > > > field remains the same. I was suggesting replacing the 'next_flags'
> > > > > field with 'throttled' field since the new request with
> > > > > CHECKPOINT_IMMEDIATE flag enabled will affect the current checkpoint.
> > > >
> > > > Are you saying that this new throttled flag will only be set by the overloaded
> > > > flags in ckpt_flags?
> > >
> > > Yes, you are right.
> > >
> > > > So you can have a checkpoint with a CHECKPOINT_IMMEDIATE
> > > > flag that's throttled, and a checkpoint without the CHECKPOINT_IMMEDIATE flag
> > > > that's not throttled?
> > >
> > > I think it's the reverse. A checkpoint with a CHECKPOINT_IMMEDIATE
> > > flag that's not throttled (disables delays between writes) and a
> > > checkpoint without the CHECKPOINT_IMMEDIATE flag that's throttled
> > > (enables delays between writes)
> >
> > Yes that's how it's supposed to work, but my point was that your suggested
> > 'throttled' flag could say the opposite, which is bad.
> >
> > > > I don't get it. The checkpoint flags and the view flags (set by
> > > > pgstat_progrss_update*) are different, so why can't we add this flag to the
> > > > view flags? The fact that checkpointer.c doesn't update the passed flag and
> > > > instead look in the shmem to see if CHECKPOINT_IMMEDIATE has been set since is
> > > > an implementation detail, and the view shouldn't focus on which flags
> > > > were
> > > > initially
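The flag semantics being debated above can be pinned down with a tiny sketch. The constants here are illustrative stand-ins (the real flag bits live in PostgreSQL's xlog.h and may differ), and `effective_throttled` is a hypothetical name for the behavior Julien argues the view should expose:

```python
# Illustrative stand-ins for checkpoint request flag bits (NOT the real
# values from PostgreSQL's xlog.h).
CHECKPOINT_IMMEDIATE = 0x0004
CHECKPOINT_REQUESTED = 0x0040

def is_throttled(flags: int) -> bool:
    # A checkpoint WITHOUT CHECKPOINT_IMMEDIATE is throttled (delays
    # between writes); CHECKPOINT_IMMEDIATE disables those delays.
    return not (flags & CHECKPOINT_IMMEDIATE)

def effective_throttled(current_flags: int, pending_flags: int) -> bool:
    # The dynamic behavior discussed above: a newly queued request carrying
    # CHECKPOINT_IMMEDIATE makes the *current* checkpoint stop throttling,
    # so a view reporting only current_flags would be misleading.
    return is_throttled(current_flags) and not (pending_flags & CHECKPOINT_IMMEDIATE)

print(is_throttled(0))                               # started throttled
print(effective_throttled(0, CHECKPOINT_IMMEDIATE))  # no longer throttled
```

This is the crux of the disagreement: `is_throttled` describes how the checkpoint was started, while `effective_throttled` describes how it is behaving right now once a pending immediate request is taken into account.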