An inverted index using roaring bitmaps
I've been doing some preliminary prep work to see how an inverted index using roaring bitmaps (https://roaringbitmap.org/) would perform. I'm presenting some early work using SQL code with the roaring bitmap Postgres extension (https://github.com/ChenHuajun/pg_roaringbitmap) to simulate a hypothetical index using this approach. I'd like to solicit feedback from the community to see if this is something worth pursuing, or if there are potential issues that I'm not aware of (or if this approach has been considered in the past and discarded for whatever reason). For context, my experience with Postgres is primarily as a user and I'm not at all familiar with the code base, so please be gentle :). That said, here's a quick and dirty demo.

I have a table "cities":

               Table "public.cities"
 Column  | Type | Collation | Nullable | Default
---------+------+-----------+----------+---------
 city    | text |           |          |
 country | text |           |          |
Indexes:
    "cities_country_idx" btree (country)

select count(*) from cities;
 count
--------
 139739
(1 row)

And just some sample rows:

select * from cities order by random() limit 10;
          city          |           country
------------------------+------------------------------
 Alcalá de la Selva     | Spain
 Bekirhan               | Turkey
 Ceggia                 | Italy
 Châtillon-en-Vendelais | France
 Hohenfelde             | Germany
 Boedo                  | Argentina
 Saint-Vith             | Belgium
 Gaggenau               | Germany
 Lake Ozark             | United States
 Igunga                 | Tanzania, United Republic of
(10 rows)

Since a bitmap requires you to convert your inputs into integers, I created a function as a hack to convert our TIDs to integers. It's ugly as hell, but it serves. 2048 is 2^11, which according to the GIN index source code is a safe assumption for the highest possible MaxHeapTuplesPerPage.
create function ctid_to_int(ctid tid) returns integer as $$
    select (ctid::text::point)[0] * 2048 + (ctid::text::point)[1];
$$ language sql returns null on null input;

And the reverse:

create function int_to_ctid(i integer) returns tid as $$
    select point(i/2048, i%2048)::text::tid;
$$ language sql returns null on null input;

In addition, I created a table "cities_rb" to roughly represent an "index" on the country column:

create table cities_rb as (
    select country,
           roaringbitmap(array_agg(ctid_to_int(ctid))::text) idx
    from cities
    group by country
);

              Table "public.cities_rb"
 Column  |     Type      | Collation | Nullable | Default
---------+---------------+-----------+----------+---------
 country | text          |           |          |
 idx     | roaringbitmap |           |          |

Now for the fun stuff - to simulate the "index" I will be running some queries against the cities_rb table using bitmap aggregations and comparing them to functionally the same queries using the BTree index on cities.

explain analyze select ctid from cities where country = 'Japan';
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on cities  (cost=18.77..971.83 rows=1351 width=6) (actual time=0.041..0.187 rows=1322 loops=1)
   Recheck Cond: (country = 'Japan'::text)
   Heap Blocks: exact=65
   ->  Bitmap Index Scan on cities_country_idx  (cost=0.00..18.43 rows=1351 width=0) (actual time=0.031..0.031 rows=1322 loops=1)
         Index Cond: (country = 'Japan'::text)
 Planning Time: 0.055 ms
 Execution Time: 0.233 ms
(7 rows)

explain analyze select rb_to_array(idx) from cities_rb where country = 'Japan';
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Seq Scan on cities_rb  (cost=0.00..14.88 rows=1 width=32) (actual time=0.050..0.067 rows=1 loops=1)
   Filter: (country = 'Japan'::text)
   Rows Removed by Filter: 229
 Planning Time: 0.033 ms
 Execution Time: 0.076 ms
(5 rows)

explain analyze select count(*) from cities where country = 'Japan';
                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=35.31..35.32 rows=1 width=8) (actual time=0.151..0.151 rows=1 loops=1)
   ->  Index Only Scan using cities_country_idx on cities  (cost=0.29..31.94 rows=1351 width=0) (actual time=0.026..0.103 rows=1322 loops=1)
         Index Cond: (country = 'Japan'::text)
         Heap Fetches: 0
 Planning Time: 0.056 ms
 Execution Time: 0.180 ms
(6 rows)

explain analyze select rb_cardinality(idx) from cities_rb where country = 'Japan';
                                              QUERY PLAN
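As a side note on the encoding used above: the ctid↔integer mapping is plain base-2048 packing of (block number, offset). A minimal Python model of the two SQL helpers (hypothetical function names mirroring them; assumes, as the post does, that 2048 = 2^11 exceeds MaxHeapTuplesPerPage so the mapping is injective):

```python
# Model of the SQL helpers ctid_to_int / int_to_ctid from the demo above.
# A TID is (block number, offset within the page); 2048 = 2^11 is assumed
# to be larger than any possible per-page offset, so encoding is lossless.

def ctid_to_int(block: int, offset: int) -> int:
    return block * 2048 + offset

def int_to_ctid(i: int) -> tuple[int, int]:
    return (i // 2048, i % 2048)

# Round-trip check: encoding then decoding recovers the original TID.
print(int_to_ctid(ctid_to_int(12, 7)))  # -> (12, 7)
```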
Re: proposal - psql - use pager for \watch command
On Tue, Jun 7, 2022 at 6:50 AM Thomas Munro wrote:
> On Tue, Jun 7, 2022 at 3:23 PM Tom Lane wrote:
> > The code needs a comment about why it's emitting a newline, though.
> > In particular, it had better explain why that should be conditional
> > on !pagerpipe, because that makes no sense to me.
>
> Yeah. OK, here's my take:
>
> + /*
> +  * If the terminal driver echoed "^C", libedit/libreadline might be
> +  * confused about the cursor position. Therefore, inject a newline
> +  * before the next prompt is displayed. We only do this when not using
> +  * a pager, because pagers are expected to restore the screen to a sane
> +  * state on exit.
> +  */
>
> AFAIK pagers conventionally use something like termcap ti/te[1] to
> restore the screen, or equivalents in tinfo etc (likely via curses).
> If we were to inject an extra newline we'd just have a blank line for
> nothing. I suppose there could be a hypothetical pager that doesn't
> follow that convention, and in fact both less and pspg have a -X
> option to preserve last output, but in any case I expect that pagers
> disable echoing, so I don't think the ^C will make it to the screen,
> and furthermore ^C isn't used for exit anyway. Rather than speculate
> about the precise details, I just said "... sane state on exit".
> Pavel, do you agree?

Applications designed to be used as a pager are usually careful about the
final cursor position. Without that care there can be unwanted artefacts.
pspg has to work in pgcli, which is a more sensitive environment than
psql. I think modern pagers like less or pspg will work correctly in all
modes. There can be some legacy pagers like "pg" or very old
implementations of "more", but we probably don't need to consider those
(except perhaps in a comment).

Regards

Pavel

> Here's how it looks after I enter and then exit Pavel's streaming pager:
>
> $ PSQL_WATCH_PAGER='pspg --stream' ~/install/bin/psql postgres
> psql (15beta1)
> Type "help" for help.
>
> postgres=# select;
> --
> (1 row)
>
> postgres=# \watch 1
> postgres=#
>
> FWIW it's the same with PSQL_WATCH_PAGER='less'.
>
> [1] https://www.gnu.org/software/termutils/manual/termcap-1.3/html_node/termcap_39.html
Re: proposal - psql - use pager for \watch command
On Tue, Jun 7, 2022 at 3:23 PM Tom Lane wrote:
> The code needs a comment about why it's emitting a newline, though.
> In particular, it had better explain why that should be conditional
> on !pagerpipe, because that makes no sense to me.

Yeah. OK, here's my take:

+ /*
+  * If the terminal driver echoed "^C", libedit/libreadline might be
+  * confused about the cursor position. Therefore, inject a newline
+  * before the next prompt is displayed. We only do this when not using
+  * a pager, because pagers are expected to restore the screen to a sane
+  * state on exit.
+  */

AFAIK pagers conventionally use something like termcap ti/te[1] to
restore the screen, or equivalents in tinfo etc (likely via curses).
If we were to inject an extra newline we'd just have a blank line for
nothing. I suppose there could be a hypothetical pager that doesn't
follow that convention, and in fact both less and pspg have a -X
option to preserve last output, but in any case I expect that pagers
disable echoing, so I don't think the ^C will make it to the screen,
and furthermore ^C isn't used for exit anyway. Rather than speculate
about the precise details, I just said "... sane state on exit".
Pavel, do you agree?

Here's how it looks after I enter and then exit Pavel's streaming pager:

$ PSQL_WATCH_PAGER='pspg --stream' ~/install/bin/psql postgres
psql (15beta1)
Type "help" for help.

postgres=# select;
--
(1 row)

postgres=# \watch 1
postgres=#

FWIW it's the same with PSQL_WATCH_PAGER='less'.

[1] https://www.gnu.org/software/termutils/manual/termcap-1.3/html_node/termcap_39.html

From a9389a8755379d4df2c52eb760d355aa5cf0ff96 Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Tue, 7 Jun 2022 14:18:51 +1200
Subject: [PATCH v2] Fix \watch's interaction with libedit/libreadline.

When you hit ^C, the terminal driver in Unix-like systems echos "^C"
as well as sending an interrupt signal (depending on stty settings).
The line editing library (libedit/libreadline) is then confused about
the current cursor location, and might corrupt the display. Fix, by
moving to a new line before the next prompt is displayed.

Author: Pavel Stehule
Reported-by: Tom Lane
Discussion: https://postgr.es/m/3278793.1626198638%40sss.pgh.pa.us
---
 src/bin/psql/command.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index b51d28780b..7f366c80e9 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -5169,6 +5169,18 @@ do_watch(PQExpBuffer query_buf, double sleep)
 		pclose(pagerpipe);
 		restore_sigpipe_trap();
 	}
+	else
+	{
+		/*
+		 * If the terminal driver echoed "^C", libedit/libreadline might be
+		 * confused about the cursor position. Therefore, inject a newline
+		 * before the next prompt is displayed. We only do this when not using
+		 * a pager, because pagers are expected to restore the screen to a sane
+		 * state on exit.
+		 */
+		fprintf(stdout, "\n");
+		fflush(stdout);
+	}

 #ifdef HAVE_POSIX_DECL_SIGWAIT

 	/* Disable the interval timer. */
--
2.35.1
tablesync copy ignores publication actions
The logical replication tablesync ignores the publication 'publish' operations during the initial data copy. This is current/known PG behaviour (e.g. as recently mentioned [1]) but it was not documented anywhere. This patch just documents the existing behaviour and gives some examples. -- [1] https://www.postgresql.org/message-id/CAA4eK1L_98LF7Db4yFY1PhKKRzoT83xtN41jTS5X%2B8OeGrAkLw%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia v1-0001-PGDOCS-tablesync-ignores-publish-operation.patch Description: Binary data
Re: pg_rewind: warn when checkpoint hasn't happened after promotion
At Mon, 6 Jun 2022 08:32:01 -0400, James Coleman wrote in > On Mon, Jun 6, 2022 at 1:26 AM Kyotaro Horiguchi > wrote: > > > > At Sat, 4 Jun 2022 19:09:41 +0530, Bharath Rupireddy > > wrote in > > > On Sat, Jun 4, 2022 at 6:29 PM James Coleman wrote: > > > > > > > > A few weeks back I sent a bug report [1] directly to the -bugs mailing > > > > list, and I haven't seen any activity on it (maybe this is because I > > > > emailed directly instead of using the form?), but I got some time to > > > > take a look and concluded that a first-level fix is pretty simple. > > > > > > > > A quick background refresher: after promoting a standby rewinding the > > > > former primary requires that a checkpoint have been completed on the > > > > new primary after promotion. This is correctly documented. However > > > > pg_rewind incorrectly reports to the user that a rewind isn't > > > > necessary because the source and target are on the same timeline. > > ... > > > > Attached is a patch that detects this condition and reports it as an > > > > error to the user. > > > > I have some random thoughts on this. > > > > There could be a problem in the case of gracefully shutdowned > > old-primary, so I think it is worth doing something if it can be in a > > simple way. > > > > However, I don't think we can simply rely on minRecoveryPoint to > > detect that situation, since it won't be reset on a standby. A standby > > also still can be the upstream of a cascading standby. So, as > > discussed in the thread for the comment [2], what we can do here would be > > simply waiting for the timelineID to advance, maybe having a timeout. > > To confirm I'm following you correctly, you're envisioning a situation like: > > - Primary A > - Replica B replicating from primary > - Replica C replicating from replica B > > then on failover from A to B you end up with: > > - Primary B > - Replica C replication from primary > - [needs rewind] A > > and you try to rewind A from C as the source? Yes. 
I think it is a legit use case. That being said, like the other points, it might be acceptable.

> > In a case of single-step replication set, a checkpoint request to the
> > primary makes the end-of-recovery checkpoint fast. It won't work as
> > expected in cascading replicas, but it might be acceptable.
>
> "Won't work as expected" because there's no way to guarantee
> replication is caught up or even advancing?

Maybe no. I meant that restartpoints don't run more frequently than the
intervals of checkpoint_timeout even if checkpoint records come more
frequently.

> > > > In the spirit of the new-ish "ensure shutdown" functionality I could
> > > > imagine extending this to automatically issue a checkpoint when this
> > > > situation is detected. I haven't started to code that up, however,
> > > > wanting to first get buy-in on that.
> > > >
> > > > 1: https://www.postgresql.org/message-id/CAAaqYe8b2DBbooTprY4v=bized9qbqvlq+fd9j617eqfjk1...@mail.gmail.com
> > >
> > > Thanks. I had a quick look over the issue and patch - just a thought -
> > > can't we let pg_rewind issue a checkpoint on the new primary instead
> > > of erroring out, maybe optionally? It might sound too much, but helps
> > > pg_rewind to be self-reliant i.e. avoiding an external actor to detect
> > > the error and issue a checkpoint on the new primary to be able to
> > > successfully run pg_rewind on the old primary and repair it to use it
> > > as a new standby.
> >
> > At the time of the discussion [2] it was a hindrance that
> > that requires superuser privileges. Now that has been narrowed down
> > to the pg_checkpointer privileges.
> >
> > If we know that the timeline IDs are different, we don't need to wait
> > for a checkpoint.
>
> Correct.

> > It seems to me that the exit status is significant. pg_rewind exits
> > with 1 when an invalid option is given. I don't think it is great if
> > we report this state by the same code.
>
> I'm happy to change that; I only chose "1" as a placeholder for
> "non-zero exit status".
>
> > I don't think we always want to request a non-spreading checkpoint.
>
> I'm not familiar with the terminology "non-spreading checkpoint".

Does "immediate checkpoint" work? That is, a checkpoint that runs at
full speed (i.e. with no delays between writes).

> > [2] https://www.postgresql.org/message-id/flat/CABUevEz5bpvbwVsYCaSMV80CBZ5-82nkMzbb%2BBu%3Dh1m%3DrLdn%3Dg%40mail.gmail.com
>
> I read through that thread, and one interesting idea stuck out to me:
> making "timeline IDs are the same" an error exit status. On the one
> hand that makes a certain amount of sense because it's unexpected. But
> on the other hand there are entirely legitimate situations where upon
> failover the timeline IDs happen to match (e.g., for us it happens
> some percentage of the time naturally as we are using sync replication
> and failovers often involve STONITHing the original primary, so it's
> entirely possible that the promoted replica
Re: proposal - psql - use pager for \watch command
Thomas Munro writes:
> On Mon, May 9, 2022 at 7:07 PM Pavel Stehule wrote:
>> On Tue, Jul 13, 2021 at 7:50 PM Tom Lane wrote:
>>> ^C\watch cancelled
>>> regression=#

> Do we really need the extra text? What about just \n, so you get:
> postgres=# \watch 1
> ...blah blah...
> ^C
> postgres=#

Fine by me.

> This affects all release branches too. Should we bother to fix this
> there? For them, I think the fix is just:

If we're doing something as nonintrusive as just adding a newline,
it'd probably be OK to backpatch.

The code needs a comment about why it's emitting a newline, though.
In particular, it had better explain why that should be conditional
on !pagerpipe, because that makes no sense to me.

			regards, tom lane
Re: proposal - psql - use pager for \watch command
On Mon, May 9, 2022 at 7:07 PM Pavel Stehule wrote:
> On Tue, Jul 13, 2021 at 7:50 PM Tom Lane wrote:
>> ^Cregression=#
>>
>> then as you can see I get nothing but the "^C" echo before the next
>> psql prompt. The problem with this is that now libreadline is
>> misinformed about the cursor position, messing up any editing I
>> might try to do on the next line of input. So I think it would
>> be a good idea to have some explicit final output when the \watch
>> command terminates, along the line of
>>
>> ...
>> Tue Jul 13 13:44:46 2021 (every 2s)
>>
>>  now
>> -------------------------------
>>  2021-07-13 13:44:46.396572-04
>> (1 row)
>>
>> ^C\watch cancelled
>> regression=#
>>
>> This strikes me as a usability improvement even without the
>> readline-confusion angle.
>
> here is a patch

I played with this. On a libedit build (tested on my Mac), an easy way
to see corruption is to run e.g. SELECT;, then \watch 1, then ^C, then
up-arrow to see the previous command clobber the wrong columns. On a
libreadline build (tested on my Debian box), that simple test doesn't
fail in the same way. Though there may be some other way to make it
misbehave that would take me longer to find, it's enough for me that
libedit is befuddled by what we're doing.

Do we really need the extra text? What about just \n, so you get:

postgres=# \watch 1
...blah blah...
^C
postgres=#

This affects all release branches too. Should we bother to fix this there?
For them, I think the fix is just:

diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index d1ee795cb6..3a88d5d6c4 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -4992,6 +4992,9 @@ do_watch(PQExpBuffer query_buf, double sleep)
 		sigint_interrupt_enabled = false;
 	}

+	fprintf(pset.queryFout, "\n");
+	fflush(pset.queryFout);
+
 	pg_free(title);
 	return (res >= 0);
 }

From 409a5ce2d81c6fd976ff7cc0ed4de50197d4ce55 Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Tue, 7 Jun 2022 14:18:51 +1200
Subject: [PATCH] Fix \watch's interaction with libedit/libreadline.

When you hit ^C, the terminal driver in Unix-like systems echos "^C"
as well as sending an interrupt signal (depending on stty settings).
The line editing library (libedit/libreadline) is then confused about
the current cursor location, and might corrupt the display. Fix, by
moving to a new line before the next prompt is displayed.

Author: Pavel Stehule
Reported-by: Tom Lane
Discussion: https://postgr.es/m/3278793.1626198638%40sss.pgh.pa.us
---
 src/bin/psql/command.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index b51d28780b..5a974a5ad9 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -5169,6 +5169,11 @@ do_watch(PQExpBuffer query_buf, double sleep)
 		pclose(pagerpipe);
 		restore_sigpipe_trap();
 	}
+	else
+	{
+		fprintf(stdout, "\n");
+		fflush(stdout);
+	}

 #ifdef HAVE_POSIX_DECL_SIGWAIT

 	/* Disable the interval timer. */
--
2.35.1
Re: [v15 beta] pg_upgrade failed if earlier executed with -c switch
tather => rather
is charge => in charge

On Mon, Jun 06, 2022 at 01:17:52PM +0900, Michael Paquier wrote:
> but something like %Y%m%d_%H%M%S.%03d (3-digit ms for last part) would

On Tue, Jun 07, 2022 at 08:30:47AM +0900, Michael Paquier wrote:
> I would reduce the format to be MMDDTHHMMSS_ms (we could also use

I think it's better with a dot (HHMMSS.ms) rather than an underscore
(HHMMSS_ms).

On Tue, Jun 07, 2022 at 11:42:37AM +0900, Michael Paquier wrote:
> + /* append milliseconds */
> + snprintf(timebuf, sizeof(timebuf), "%s_%03d",
> +          timebuf, (int) (time.tv_usec / 1000));

> + with a timestamp formatted as per ISO 8601
> + (%Y%m%dT%H%M%S) appended by an underscore and
> + the timestamp's milliseconds, where all the generated files are stored.

The ISO timestamp can include milliseconds (or apparently fractional
parts of the "lowest-order" unit), so the "appended by" part doesn't
need to be explained here.

+ snprintf(timebuf, sizeof(timebuf), "%s_%03d",
+          timebuf, (int) (time.tv_usec / 1000));

Is it really allowed to snprintf a buffer onto itself? I can't find any
existing cases doing that. It seems useless in any case - you could
instead snprintf(timebuf+strlen(timebuf), ...), or increment
len+=snprintf(), or append the milliseconds here:

+ len = snprintf(log_opts.basedir, MAXPGPATH, "%s/%s", log_opts.rootdir,
+                timebuf);

--
Justin
Re: Error from the foreign RDBMS on a foreign table I have no privilege on
On Sat, 2022-06-04 at 21:18 +, Phil Florent wrote:
> I opened an issue with attached code on the oracle_fdw git page:
> https://github.com/laurenz/oracle_fdw/issues/534
> Basically I expected to obtain a "no privilege" error from PostgreSQL when I have no read privilege
> on the postgres foreign table, but I obtained an Oracle error instead.
> Laurenz investigated and closed the issue, but he suggested perhaps I should post that on
> the hackers list since it also occurs with postgres_fdw on some occasions (I have investigated some more,
> and postgres_fdw does the same thing when you turn on use_remote_estimate).
> Hence I do...

To add more details: permissions are checked at query execution time,
but if "use_remote_estimate" is used, the planner already accesses the
remote table, even if the user has no permissions on the foreign table.

I feel that that is no bug, but I'd be curious to know if others disagree.

Yours,
Laurenz Albe
Re: [v15 beta] pg_upgrade failed if earlier executed with -c switch
On Tue, Jun 07, 2022 at 08:30:47AM +0900, Michael Paquier wrote:
> If we don't split by the millisecond, we would come back to the
> problems of the original report. On my laptop, the --check phase
> that passes takes more than 1s, but the one that fails takes 0.1s, so
> a follow-up run would complain with the path conflicts. So at the end
> I would reduce the format to be MMDDTHHMMSS_ms (we could also use
> a logic that checks for conflicts and appends an extra number if
> needed, though the addition of the extra ms is a bit shorter).

So, attached is the patch I would like to apply for all that (commit
message included). One issue I missed previously is that the TAP test
missed the log files on failure, so I had to tweak that with a find
routine. I have fixed a few comments, and improved the docs to
describe the directory structure.

We still need a refresh of the buildfarm client for the case where
pg_upgrade is tested without TAP, like that I guess:

--- a/PGBuild/Modules/TestUpgrade.pm
+++ b/PGBuild/Modules/TestUpgrade.pm
@@ -140,6 +140,7 @@ sub check
 		  $self->{pgsql}/src/bin/pg_upgrade/log/*
 		  $self->{pgsql}/src/bin/pg_upgrade/tmp_check/*/*.diffs
 		  $self->{pgsql}/src/bin/pg_upgrade/tmp_check/data/pg_upgrade_output.d/log/*
+		  $self->{pgsql}/src/bin/pg_upgrade/tmp_check/data/pg_upgrade_output.d/*/log/*
 		  $self->{pgsql}/src/test/regress/*.diffs"
 	);
 	$log->add_log($_) foreach (@logfiles);
--
Michael

From a865e736951814d0b7aeee2b93ac4e87d355af0f Mon Sep 17 00:00:00 2001
From: Michael Paquier
Date: Tue, 7 Jun 2022 11:29:31 +0900
Subject: [PATCH v3] Restructure pg_upgrade output directories for better
 idempotence

38bfae3 has moved the contents written to files by pg_upgrade under a
new directory called pg_upgrade_output.d/ located in the new cluster's
data folder, and it used a simple structure made of two subdirectories
leading to a fixed structure: log/ and dump/.
This design has made weaker pg_upgrade on repeated calls, as we could get failures when creating one or more of those directories, while potentially losing the logs of a previous run (logs are retained automatically on failure, and cleaned up on success unless --retain is specified). So a user would need to clean up pg_upgrade_output.d/ as an extra step for any repeated calls of pg_upgrade. The most common scenario here is --check followed by the actual upgrade, but one could see the failure when specifying an incorrect argument value. Removing entirely the logs would have the disadvantage of removing all the past information, even if --retain was specified at some past step. This result is annoying for a lot of users and automated upgrade flows. So, tather than requiring a manual removal of pg_upgrade_output.d/, this redesigns the set of output directories in a more dynamic way, based on a suggestion from Tom Lane and Daniel Gustafsson. pg_upgrade_output.d/ is still the base path, but a second directory level is added, mostly named after an ISO-8601-formatted timestamp (in short human-readable, with milliseconds appended to the name to avoid any conflicts). The logs and dumps are saved within the same subdirectories as previously, as of log/ and dump/, but these are located inside the subdirectory named after the timestamp. The logs of a given run are removed only after a successful run if --retain is not used, and pg_upgrade_output.d/ is kept if there are any logs from a previous run. Note that previously, pg_upgrade would have kept the logs even after a successful --check but that was inconsistent compared to the case without --check. The code is charge of the removal of the output directories is now refactored into a single routine. Two TAP tests are added with a --check command (one failure and one success), to look after the case fixed here. 
Note that the test had to be tweaked a bit to fit with the new
directory structure so as it can find any logs generated on failure.
This is still going to require a change in the buildfarm client for
the case where pg_upgrade is tested without the TAP test, though.

Reported-by: Tushar Ahuja
Author: Michael Paquier
Reviewed-by: Daniel Gustafsson, Justin Pryzby
Discussion: https://postgr.es/m/77e6ecaa-2785-97aa-f229-4b6e047cb...@enterprisedb.com
---
 src/bin/pg_upgrade/check.c             |  2 +
 src/bin/pg_upgrade/pg_upgrade.c        | 65 +-
 src/bin/pg_upgrade/pg_upgrade.h        | 14 --
 src/bin/pg_upgrade/t/002_pg_upgrade.pl | 49 ++-
 src/bin/pg_upgrade/util.c              | 42 +
 doc/src/sgml/ref/pgupgrade.sgml        |  5 +-
 6 files changed, 148 insertions(+), 29 deletions(-)

diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 6114303b52..ace7387eda 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -210,6 +210,8 @@ report_clusters_compatible(void)
 		pg_log(PG_REPORT, "\n*Clusters are compatible*\n");
Re: pg_buffercache: add sql test
I removed it on your advice. Thanks.

On Tue, Jun 7, 2022 at 2:04 AM Stephen Frost wrote:
>
> Greetings,
>
> * Daniel Gustafsson (dan...@yesql.se) wrote:
> > > On 6 Jun 2022, at 15:30, Dong Wook Lee wrote:
> > > I just wrote a test code for the `pg_buffercache` extension which
> > > doesn't have test code.
> >
> > Please add this patch to the next commitfest to make sure it's not lost
> > before then.
> >
> > https://commitfest.postgresql.org/38/
>
> Seems to be there now, at least:
>
> https://commitfest.postgresql.org/38/3674/
>
> However, I don't think we should have a 'target version' set for this
> (and in particular it shouldn't be 15). I'd suggest removing that.
>
> Thanks,
>
> Stephen
Re: Reducing Memory Consumption (aset and generation)
On Mon, Jun 6, 2022 at 9:14 PM Ranier Vilela wrote:
> On Mon, Jun 6, 2022 at 8:37 PM David Rowley wrote:
>> On Mon, 6 Jun 2022 at 07:28, Ranier Vilela wrote:
>> > 4) 004-generation-reduces-memory-consumption-BUG.patch
>> > Same as (2), but with a BUG.
>> > It only takes a few tweaks to completely remove the block field.
>> > This fails with make check.
>> > I couldn't figure out why it doesn't work with 16 bytes (struct GenerationChunk).
>
> Hi David, thanks for taking a look at this.
>
>> I think you're misunderstanding how blocks and chunks work here. A
>> block can have many chunks. You can't find the block that a chunk is
>> on by subtracting Generation_BLOCKHDRSZ from the pointer given to
>> GenerationFree(). That would only work if the chunk happened to be the
>> first chunk on a block. If it's anything apart from that then you'll
>> be making adjustments to the memory of some prior chunk on the block.
>> I imagine this is the reason you can't get the tests to pass.
>
> Ok, I am still learning about this.
> Can you explain why subtracting Generation_BLOCKHDRSZ from the pointer
> works when sizeof(struct GenerationChunk) = 24 bytes,
> when all references to the block field have been removed?
> This passes check-world.
>
>> Can you also explain why you think moving code around randomly or
>> adding unlikely() macros helps reduce the memory consumption overheads
>> of generation contexts?
>
> Of course, those changes do not reduce memory consumption.
> But, IMO, I think those changes improve access to memory regions,
> because of the locality of the data.
>
> About the "unlikely" macros, these help branch prediction, since most
> of the time malloc and related functions will not fail.
>
>> I imagine you think that's helping to further
>> improve performance, but you've not offered any evidence of that
>> separately from the other changes you've made. If you think those are
>> useful changes then I recommend you run individual benchmarks and
>> offer those as proof that those changes are worthwhile.
>
> Ok, I understand, those changes are unrelated.
> Let's restart this, to simplify the review and commit work.

Regarding the patches now, we have:

1) v1-001-aset-reduces-memory-consumption.patch
Reduces the memory used by struct AllocBlockData by 8 bytes, bringing
the total size down to 32 bytes, which leads to "fitting" two structs
into one 64-byte cache line.
Removes the tests elog(ERROR, "could not find block containing chunk %p")
and elog(ERROR, "could not find block containing chunk %p"), moving them
to the MEMORY_CONTEXT_CHECKING context. Since the 8.2 releases, nobody
has complained about these tests. But if that is not acceptable, there is
option (3), v1-003-aset-reduces-memory-consumption.patch.

2) v1-002-generation-reduces-memory-consumption.patch
Reduces the memory used by struct GenerationBlock by 8 bytes, bringing
the total size down to 32 bytes, which leads to "fitting" two structs
into one 64-byte cache line.

3) v1-003-aset-reduces-memory-consumption.patch
Same as (1), but without removing the tests
elog(ERROR, "could not find block containing chunk %p") and
elog(ERROR, "could not find block containing chunk %p") outright,
at the cost of moving only a tiny part of the tests to the
MEMORY_CONTEXT_CHECKING context. Since the 8.2 releases, nobody has
complained about these tests.
regards,
Ranier Vilela

diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c
index ec3e264a73..901a954508 100644
--- a/src/backend/utils/mmgr/aset.c
+++ b/src/backend/utils/mmgr/aset.c
@@ -150,11 +150,13 @@ typedef AllocSetContext *AllocSet;
  */
 typedef struct AllocBlockData
 {
-	AllocSet	aset;			/* aset that owns this block */
 	AllocBlock	prev;			/* prev block in aset's blocks list, if any */
 	AllocBlock	next;			/* next block in aset's blocks list, if any */
 	char	   *freeptr;		/* start of free space in this block */
 	char	   *endptr;			/* end of space in this block */
+#ifdef MEMORY_CONTEXT_CHECKING
+	AllocSet	aset;			/* aset that owns this block */
+#endif
 } AllocBlockData;

 /*
@@ -485,11 +487,13 @@ AllocSetContextCreateInternal(MemoryContext parent,
 	/* Fill in the initial block's block header */
 	block = (AllocBlock) (((char *) set) + MAXALIGN(sizeof(AllocSetContext)));
-	block->aset = set;
 	block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
 	block->endptr = ((char *) set) + firstBlockSize;
 	block->prev = NULL;
 	block->next = NULL;
+#ifdef MEMORY_CONTEXT_CHECKING
+	block->aset = set;
+#endif

 	/* Mark unallocated space NOACCESS; leave the block header alone. */
 	VALGRIND_MAKE_MEM_NOACCESS(block->freeptr, block->endptr - block->freeptr);
@@ -743,8 +747,10 @@ AllocSetAlloc(MemoryContext context, Size size)
 		context->mem_allocated += blksize;

-		block->aset = set;
 		block->freeptr = block->endptr = ((char *) block) + blksize;
+#ifdef MEMORY_CONTEXT_CHECKING
+		block->aset = set;
+#endif

 		chunk = (AllocChunk) (((char *) block) + ALLOC_BLOCKHDRSZ);
Re: Collation version tracking for macOS
On Tue, Jun 7, 2022 at 12:10 PM Jim Nasby wrote: > On 6/3/22 3:58 PM, Tom Lane wrote > > Thomas Munro writes: > >> On Sat, Jun 4, 2022 at 7:13 AM Jeremy Schneider > >> wrote: > >>> It feels to me like we're still not really thinking clearly about this > >>> within the PG community, and that the seriousness of this issue is not > >>> fully understood. > >> FWIW A couple of us tried quite hard to make smarter warnings, and > >> that thread and others discussed a lot of those topics, like the > >> relevance to constraints and so forth. > > I think the real problem here is that the underlying software mostly > > doesn't take this issue seriously. Unfortunately, that leads one to > > the conclusion that we need to maintain our own collation code and > > data (e.g., our own fork of ICU), and that isn't happening. Unlike > > say Oracle, we do not have the manpower; nor do we want to bloat our > > code base that much. > > > > Short of maintaining our own fork, ranting about the imperfections > > of the situation is a waste of time. > The first step to a solution is admitting that the problem exists. We've been discussing this topic for years and I don't think anyone thinks the case is closed... > Ignoring broken backups, segfaults and data corruption as a "rant" > implies that we simply throw in the towel and tell users to suck it up > or switch engines. There are other ways to address this short of the > community doing all the work itself. One simple example would be to > refuse to start if the collation provider has changed since initdb > (which we'd need to allow users to override). Yeah, it's been discussed, but never proposed. The problem is that you need to start up to fix the problem. Another option is not to use affected indexes, but that doesn't help with other forms of the problem (partition constraints, etc). > A more sophisticated > option would be to provide the machinery for supporting multiple > collation libraries. 
Earlier I mentioned distinct "providers" but I take that back, that's too complicated. Reprising an old idea that comes up each time we talk about this, this time with some more straw-man detail: what about teaching our ICU support to understand "libicui18n.so.71:en" to mean that it should dlopen() that library and use its functions? Or some cleverer, shorter notation. Then it's the user's problem to make sure the right libraries are installed, and it'll fail if they're not. For example, on Debian bookworm right now you can install libicu63, libicu67, libicu71 (though only the "current" one has a -dev package, which I'm sure we can cope with). You're at the mercy of the distro or add-on package repos to keep a lot of versions around, but that seems OK. Maintaining our own fork(s) of ICU would seem like massive overkill and I don't think anyone has suggested that; the question on my mind is whether we could rely on existing packages. Then you'd be exposed only to changes that happen within (say) the ICU 63 package's lifetime... I recall looking into whether that can happen but ... I don't recall the answer.
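Mechanically, the dlopen() side of this is small. Here is a hedged sketch in C — not PostgreSQL code; the helper name is invented, and the key assumption is ICU's version-suffixed symbol names (e.g. ucol_open_71), which is what would let several ICU majors coexist in one process:

```c
#include <dlfcn.h>
#include <stddef.h>

/*
 * Hypothetical sketch of resolving a collation function from a specific
 * ICU major version at runtime.  ICU exports version-suffixed symbols
 * (e.g. ucol_open_71), so loading libicui18n.so.71 and libicui18n.so.63
 * side by side would not clash.  Returns NULL if the library or symbol
 * is missing -- the "user's problem" failure mode described above.
 */
static void *
load_icu_symbol(const char *libname, const char *symbol)
{
    void *handle = dlopen(libname, RTLD_NOW | RTLD_LOCAL);

    if (handle == NULL)
        return NULL;
    return dlsym(handle, symbol);
}
```

A collation string like "libicui18n.so.71:en" would then be split on the ':', the left half handed to dlopen() and the locale to the versioned ucol_open().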
Re: Collation version tracking for macOS
> On Jun 6, 2022, at 17:10, Jim Nasby wrote:
> Ignoring broken backups, segfaults and data corruption as a "rant" implies
> that we simply throw in the towel and tell users to suck it up or switch
> engines.

Well now, let’s be clear, I was the one who called my email a “rant”. And I do apologize for that - it was grumpy and impulsive and Tom isn’t wrong that rants don’t usually help move things forward.

Thomas - thanks for the link back to one of the threads. I spent some time reading through that and it’s a lot of material; I haven’t read the whole thread yet. If you have some others that would also be particularly good background, let me know. I’m doing a chunk of this in my spare time at the moment, but I do want to keep getting more up to speed. I was pulled into a bunch of various things related to PostgreSQL and ICU and collation and OS’s over the past couple years, so I learned a lot from on-the-ground experience and I am interested in trying to get a little more involved in the conversation here.

Personally, I really do think there should at least be an *option* to tell the DB to fully error rather than just warn on version mismatch. Correctness matters to many users, and being able to *trust* that string comparisons are correct is pretty damn fundamental all throughout a database. It really doesn’t get any more basic, and the potential for bad things to happen is pretty astronomical if you can’t trust those. I understand the consternation about dealing with upgrades of large & busy databases, but I’m still surprised that the community consensus arrived at the present behavior, and I have a lot of reading to do to really understand how that happened and where the dialogue is today.
Multiple versions of ICU sounds nice for users who need real linguistic collation (like what Oracle and DB2 offer), but I still feel like there needs to be a super simple basic “pseudo-linguistic” collation baked in, that’s “good enough” for 99% of users and that is guaranteed to be the same everywhere on every platform and just won’t ever change. I think glibc needs to be phased out somehow. At a minimum, not the default for new users… to stop the bleeding. If MySQL wasn’t GPL then I’d say to just copy their collations. I’d be reluctant to spend too much time on a POC now though, it feels like my idea is the outlier and the general PG hacker consensus would be to reject this idea. (But maybe I’m wrong?) Anyway, again, apologies for my pants-on-fire email last week. I hope I can enjoy a few beers someday - or coffee for the non-drinkers - with a few other PG collation nerds (which I never set out to be, but it may have befallen me ). -Jeremy Sent from my TI-83
Re: Collation version tracking for macOS
Jim Nasby writes:
>> I think the real problem here is that the underlying software mostly
>> doesn't take this issue seriously.

> The first step to a solution is admitting that the problem exists.
> Ignoring broken backups, segfaults and data corruption as a "rant"
> implies that we simply throw in the towel and tell users to suck it up
> or switch engines. There are other ways to address this short of the
> community doing all the work itself. One simple example would be to
> refuse to start if the collation provider has changed since initdb
> (which we'd need to allow users to override).

You're conveniently skipping over the hard part, which is to tell whether the collation provider has changed behavior (which we'd better do with pretty darn high accuracy, if we're going to refuse to start on the basis of thinking it has). Unfortunately, giving a reliable indication of collation behavioral changes is *exactly* the thing that the providers aren't taking seriously.

regards, tom lane
Re: Reducing Memory Consumption (aset and generation)
On Mon, Jun 6, 2022 at 8:37 PM, David Rowley wrote:
> On Mon, 6 Jun 2022 at 07:28, Ranier Vilela wrote:
> > 4) 004-generation-reduces-memory-consumption-BUG.patch
> > Same to the (2), but with BUG.
> > It only takes a few tweaks to completely remove the field block.
> > This fails with make check.
> > I couldn't figure out why it doesn't work with 16 bytes (struct
> > GenerationChunk).

Hi David, thanks for taking a look at this.

> I think you're misunderstanding how blocks and chunks work here. A
> block can have many chunks. You can't find the block that a chunk is
> on by subtracting Generation_BLOCKHDRSZ from the pointer given to
> GenerationFree(). That would only work if the chunk happened to be the
> first chunk on a block. If it's anything apart from that then you'll
> be making adjustments to the memory of some prior chunk on the block.
> I imagine this is the reason you can't get the tests to pass.

Ok, I am still learning about this. Can you explain why subtracting Generation_BLOCKHDRSZ from the pointer works when sizeof(struct GenerationChunk) = 24 bytes and all references to the block field have been removed? That passes check-world.

> Can you also explain why you think moving code around randomly or
> adding unlikely() macros helps reduce the memory consumption overheads
> of generation contexts?

Of course, those changes do not reduce memory consumption. But, IMO, those changes improve access to memory regions because of data locality. About the unlikely() macros: they help branch prediction, since most of the time malloc and related functions will not fail.

> I imagine you think that's helping to further
> improve performance, but you've not offered any evidence of that
> separately from the other changes you've made. If you think those are
> useful changes then I recommend you run individual benchmarks and
> offer those as proof that those changes are worthwhile.

Ok, I understand; those are unrelated changes.
regards, Ranier Vilela
Re: pg_auth_members.grantor is bunk
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Thu, Jun 2, 2022 at 3:51 PM Tom Lane wrote:
> > > I sort of thought http://postgr.es/m/3981966.1646429...@sss.pgh.pa.us
> > > constituted a completed investigation of this sort. No?
> >
> > I didn't think so. It's clear that the spec expects us to track the
> > grantor, but I didn't chase down what it expects us to *do* with that
> > information, nor what it thinks the rules are for merging multiple
> > authorizations.
>
> Hmm, OK. Well, one problem is that I've never had any luck
> interpreting what the spec says about anything, and I've sort of given
> up. But even if that were not so, I'm a little unclear what other
> conclusion is possible here. The spec either wants the same behavior
> that we already have for other object types, which is what I am here
> proposing that we do, or it wants something different. If it wants
> something different, it probably wants that for all object types, not
> just roles. Since I doubt we would want the behavior for roles to be
> inconsistent with what we do for all other object types, in that case
> we would probably either change the behavior for all other object
> types to something new, and then clean up the role stuff afterwards,
> or else first do what I proposed here and then later change it all at
> once. In which case the proposal that I've made is as good a way to
> start as any.
>
> Now, if it happens to be the case that the spec proposes a different
> behavior for roles than for non-role objects, and if the behavior for
> roles is something other than the one we currently have for non-role
> objects, then I'd agree that the plan I propose here needs revision. I
> suspect that's unlikely but I can't make anything of the spec so
> maybe?

Thankfully, at least from my reading, the spec isn't all that complicated on this particular point.
The spec talks about "role authorization descriptor"s and those are "created with role name, grantee, and grantor", and it further says "redundant duplicate role authorization descriptors are destroyed", presumably meaning that the entire thing has to be identical. In other words, yeah, the PK should include the grantor. There's a further comment that the 'set of involved grantees' is the union of all the 'grantees', clearly indicating that you can have multiple GRANT 'foo' TO 'bar's with distinct grantors.

In terms of how that's then used: yeah, it's during REVOKE, because a REVOKE is only able to 'find' role authorization descriptors which match the triple of role revoked, grantee, grantor (though there's a caveat in that the 'grantor' role could be the current role, or the current user). Interestingly, at least in my looking it over today, it doesn't seem that the 'grantor' could be 'any applicable role' (which is what's usually used to indicate that it could be any role that the current role inherits), meaning you have to include the GRANTED BY in the REVOKE statement or do a SET ROLE first when doing a REVOKE if it's for a role that you aren't currently running as (but which you are a member of).

Anyhow, in other words, I do think Robert's got it right here. Happy to discuss further though if there are doubts.

Thanks,

Stephen
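As a toy illustration of that matching rule (the struct and function names are invented for illustration, not PostgreSQL internals): a REVOKE can only "find" a descriptor when the whole (role, grantee, grantor) triple matches, which is why GRANTED BY — or a prior SET ROLE — is needed to revoke a grant made by someone else.

```c
#include <string.h>

/* Toy model of the spec's "role authorization descriptor". */
typedef struct RoleAuthDesc
{
    const char *role;     /* role being granted */
    const char *grantee;  /* role receiving it */
    const char *grantor;  /* role that performed the GRANT */
} RoleAuthDesc;

/* A REVOKE finds a descriptor only when the full triple matches. */
static int
revoke_finds(const RoleAuthDesc *d, const char *role,
             const char *grantee, const char *grantor)
{
    return strcmp(d->role, role) == 0 &&
           strcmp(d->grantee, grantee) == 0 &&
           strcmp(d->grantor, grantor) == 0;
}
```

So a descriptor created by GRANT foo TO bar GRANTED BY alice is invisible to a REVOKE executed as bob unless bob names alice as the grantor.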
Re: Reducing Memory Consumption (aset and generation)
On Mon, 6 Jun 2022 at 07:28, Ranier Vilela wrote:
> 4) 004-generation-reduces-memory-consumption-BUG.patch
> Same to the (2), but with BUG.
> It only takes a few tweaks to completely remove the field block.
> This fails with make check.
> I couldn't figure out why it doesn't work with 16 bytes (struct
> GenerationChunk).

I think you're misunderstanding how blocks and chunks work here. A block can have many chunks. You can't find the block that a chunk is on by subtracting Generation_BLOCKHDRSZ from the pointer given to GenerationFree(). That would only work if the chunk happened to be the first chunk on a block. If it's anything apart from that then you'll be making adjustments to the memory of some prior chunk on the block. I imagine this is the reason you can't get the tests to pass.

Can you also explain why you think moving code around randomly or adding unlikely() macros helps reduce the memory consumption overheads of generation contexts? I imagine you think that's helping to further improve performance, but you've not offered any evidence of that separately from the other changes you've made. If you think those are useful changes then I recommend you run individual benchmarks and offer those as proof that those changes are worthwhile.

David
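To make the block/chunk layout concrete, here is a toy allocator in the same spirit (invented names, not PostgreSQL's actual GenerationBlock/GenerationChunk structs): chunks are packed back to back inside a block, so only the first chunk sits immediately after the block header, and the back-pointer stored in each chunk header is the only reliable way to get from a chunk back to its block.

```c
#include <stddef.h>

/* Toy versions of a block header and a per-chunk header. */
typedef struct ToyBlock
{
    size_t used;        /* bytes handed out so far within data[] */
    char   data[1024];  /* chunk storage */
} ToyBlock;

typedef struct ToyChunk
{
    ToyBlock *block;    /* back-pointer to the owning block */
    size_t    size;     /* requested allocation size */
} ToyChunk;

/* Allocate 'size' bytes from 'blk'; chunks are packed back to back. */
static void *
toy_alloc(ToyBlock *blk, size_t size)
{
    ToyChunk *chunk = (ToyChunk *) (blk->data + blk->used);

    chunk->block = blk;
    chunk->size = size;
    blk->used += sizeof(ToyChunk) + size;
    return (void *) (chunk + 1);
}

/* Recover the block from a user pointer, as a free() path must. */
static ToyBlock *
toy_get_block(void *ptr)
{
    return ((ToyChunk *) ptr)[-1].block;
}
```

Subtracting the header size from a chunk pointer only recovers the block for the very first chunk; for any later chunk that arithmetic lands in the middle of a neighbouring chunk, which is exactly the corruption described above.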
Re: [v15 beta] pg_upgrade failed if earlier executed with -c switch
On Mon, Jun 06, 2022 at 01:53:35PM -0500, Justin Pryzby wrote:
> It seems important to use a format in most-significant-parts-first which sorts
> nicely by filename, but otherwise anything could be okay.

Agreed.

> Apparently, that's ISO 8601, which can optionally use separators
> (YYYY-MM-DDTHH:MM:SS).

OK, let's use a T; with the basic format and a minimal number of separators we get 20220603T082255.

> I was thinking this would not include fractional seconds. Maybe that would
> mean that the TAP tests would need to sleep(1) at some points...

If we don't split by the millisecond, we would come back to the problems of the original report. On my laptop, the --check phase that passes takes more than 1s, but the one that fails takes 0.1s, so a follow-up run would complain with the path conflicts. So at the end I would reduce the format to be YYYYMMDDTHHMMSS_ms (we could also use a logic that checks for conflicts and appends an extra number if needed, though the addition of the extra ms is a bit shorter).
--
Michael
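For illustration, the agreed-on name could be produced with plain strftime() plus a millisecond suffix. This is a sketch only — the function name is invented and pg_upgrade's actual implementation may differ:

```c
#include <stdio.h>
#include <time.h>

/*
 * Format a timestamp as basic ISO 8601 (YYYYMMDD"T"HHMMSS) plus a
 * millisecond suffix, so directory names sort correctly by filename
 * and two runs within the same second still get distinct names.
 */
static void
format_output_dir(char *buf, size_t buflen, time_t t, long msec)
{
    struct tm tmbuf;
    char stamp[32];

    gmtime_r(&t, &tmbuf);
    strftime(stamp, sizeof(stamp), "%Y%m%dT%H%M%S", &tmbuf);
    snprintf(buf, buflen, "%s_%03ld", stamp, msec);
}
```

For example, 2022-06-03 08:22:55.123 UTC would come out as 20220603T082255_123.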
Re: replacing role-level NOINHERIT with a grant-level option
Greetings, * Robert Haas (robertmh...@gmail.com) wrote: > On Mon, Feb 7, 2022 at 11:13 AM Joe Conway wrote: > > > It seems to me that the INHERIT role flag isn't very well-considered. > > > Inheritance, or the lack of it, ought to be decided separately for > > > each inherited role. However, that would be a major architectural > > > change. > > > > Agreed -- that would be useful. > > Is this a kind of change people would support? Here's a quick sketch: On the whole, moving things to pg_auth_members when they're about relationships between roles makes more sense to me too (ISTR mentioning that somewhere or at least thinking about it but not sure where and it doesn't really matter). > It's been proposed elsewhere by Stephen and others that we ought to > have the ability to grant ADMIN OPTION on a role without granting > membership in that role. There's some overlap between these two > proposals. If your concern is just about accident prevention, you will > be happy to use this instead of that. If you want real access > restrictions, you will want that, not this. I think that's OK. Doing > this first doesn't mean we can't do that later. What about the > reverse? Could we add ADMIN-without-membership first, and then decide > whether to do this later? Maybe so, but I have come to feel that > NOINHERIT is such an ugly wart that we'll be happier the sooner we > kill it. It seems to make every discussion that we have about any > other potential change in this area more difficult, and I venture that > ADMIN-without-membership wouldn't turn out to be an exception. Being able to segregate the control over privileges from the control over objects is definitely helpful in some environments. I don't see any strong reason that one must happen before the other and am happy to defer to whomever has cycles to work on these things to sort that out. 
> Based on previous discussions I think that, long term, we're probably > headed toward a future where role grants have a bunch of different > flags, each of which controls a single behavior: whether you can > implicitly use the privileges of the role, whether you can access them > via SET ROLE, whether you can grant the role to others, or revoke it > from them, etc. I don't know exactly what the final list will look > like, and hopefully it won't be so long that it makes us all wish we > were dead, but there doesn't seem to be any possibility that implicit > inheritance of permissions won't be in that list. I spent a few > minutes considering whether I ought to instead propose that we just > nuke NOINHERIT from orbit and replace it with absolutely nothing, and > concluded that such a proposal had no hope of attracting consensus. I agree that a future where there's more granularity in terms of role grants would be a better place than where we are now. I'd be -1 on just removing 'NOINHERIT', to no one's surprise, I'm sure. > Perhaps the idea of replacing it with that is more powerful and at > least IMHO cleaner will. -EPARSEFAIL (tho I'm guessing you were intending to say that replacing the role-attribute noinherit with something more powerful would be a generally agreeable way forward, which I would generally support) * Robert Haas (robertmh...@gmail.com) wrote: > On Thu, Jun 2, 2022 at 12:26 PM Robert Haas wrote: > > 1. Extend the GRANT role_name TO role_name [ WITH ADMIN OPTION ] with > > a new, optional clause, something like WITH NO INHERIT or WITH > > NOINHERIT or WITHOUT INHERIT. > > I just realized that, with ADMIN OPTION, you can change your mind > after the initial grant, and we probably would want to allow the same > kind of thing here. The existing syntax is: Yes. > 1. GRANT foo TO bar [WITH ADMIN OPTION]; > 2. REVOKE foo FROM bar; > 3. 
REVOKE ADMIN OPTION FOR foo FROM bar; > > To grant the admin option later, you just use (1) again and this time > include WITH ADMIN OPTION. To revoke it without removing the grant > entirely, you use (3). This could easily be generalized to any number > of options all of which are Boolean properties and all of which have a > default value of false, but INHERIT is a Boolean property with a > default value of true, and calling the property NOINHERIT to dodge > that issue seems dumb. I'd like to find a way to extend the syntax > that can accommodate not only INHERIT as an option, but any other > options we might want to add in the future, and maybe even leave room > for non-Boolean properties, just in case. The question of which > options we think it valuable to add is separate from what the syntax > ought to be, so for syntax discussion purposes let's suppose we want > to add three new options: FRUIT, which can be strawberry, banana, or > kiwi; CHOCOLATE, a Boolean whose default value is true; and SARDINES, > another Boolean whose default value is false. Then: > > GRANT foo TO bar WITH FRUIT STRAWBERRY; > GRANT foo TO bar WITH CHOCOLATE FALSE; > GRANT foo TO bar WITH SARDINES TRUE; > GRANT foo TO bar WITH SARDINES; --
Re: pgcon unconference / impact of block size on performance
On Sun, 5 Jun 2022 at 11:23, Tomas Vondra wrote:
> At one of the pgcon unconference sessions a couple days ago, I presented
> a bunch of benchmark results comparing performance with different
> data/WAL block size. Most of the OLTP results showed significant gains
> (up to 50%) with smaller (4k) data pages.

A few years ago when you and I were doing analysis into the TPC-H benchmark, we found that larger page sizes helped various queries, especially Q1. It would be good to see how the block size changes the performance of a query such as: SELECT sum(value) FROM table_with_tuples_of_several_hundred_bytes;. I don't recall the reason why 32k pages helped there, but it seems reasonable that doing more work for each lookup in shared buffers might be one reason.

Maybe some deeper analysis into various workloads might convince us that it might be worth having an initdb option to specify the block size. There'd be various hurdles to get over in the code to make that work. I doubt we could ever make the default smaller than it is today, as nobody would be able to insert rows larger than about 4 kilobytes into a table anymore [1]. Plus pg_upgrade issues.

David

[1] https://www.postgresql.org/docs/current/limits.html
Re: Sudden database error with COUNT(*) making Query Planner crashes: variable not found in subplan target list
On Mon, 6 Jun 2022 at 21:34, Jean Landercy - BEEODIVERSITY wrote:
> SELECT COUNT(*) FROM items;
> -- ERROR: variable not found in subplan target list
> -- SQL state: XX000

Can you share some more details about what "items" is? psql's "\d items" output would be useful. From what you've reported we can't tell if this is a table or a view.

> The bug is tricky to reproduce, I could not succeed to replicate elsewhere
> (dump/restore does not preserve it).

Can you share the output of:

select attname, atttypid::regtype, attnum, atthasdef, atthasmissing,
       attgenerated, attisdropped
from pg_attribute
where attrelid = 'items'::regclass
order by attnum;

This will let us see if there's something strange going on with dropped columns or columns with missing (atthasmissing) values. There may be some sequence of ALTER TABLE ADD COLUMN ... DEFAULT / DROP COLUMN that is causing this. The output of that might help us see if that could be a factor.

David
Re: oat_post_create expected behavior
Robert Haas writes: > On Mon, Jun 6, 2022 at 1:35 PM Jeff Davis wrote: >> Out of curiosity, why not? The proposed patch only runs it if the >> object access hook is set. Do you see a situation where it would be >> confusing that an earlier DDL change is visible? And if so, would it >> make more sense to call CCI unconditionally? > Well, I think that a fair amount of work has been done previously to > cut down on unnecessary CCIs. I suspect Tom in particular is likely to > object to adding a whole bunch more of them, and I think that > objection would have some merit. We've gotten things to the point where a no-op CCI is pretty cheap, so I'm not sure there is a performance concern here. I do wonder though if there are semantic or bug-hazard considerations. A CCI that occurs only if a particular hook is loaded seems pretty scary from a testability standpoint. > I definitely think if we were going to do it, it would need to be > unconditional. Otherwise I think we'll end up with bugs, because > nobody's going to go test all of that code with and without an object > access hook installed every time they tweak some DDL-related code. Right, same thing I'm saying. I also think we should discourage people from doing cowboy CCIs inside their OAT hooks, because that makes the testability problem even worse. Maybe that means we need to uniformly move the CREATE hooks to after a system-provided CCI, but I've not thought hard about the implications of that. regards, tom lane
Re: How about a psql backslash command to show GUCs?
Robert Haas writes: > I think part of the problem here, though, is that one can imagine a > variety of charters that might be useful. A user could want to see the > parameters that have values in their session that differ from the > system defaults, or parameters that have values which differ from the > compiled-in defaults, or parameters that can be changed without a > restart, or all the pre-computed parameters, or all the parameters > that contain "vacuum" in the name, or all the query-planner-related > parameters, or all the parameters on which any privileges have been > granted. And it's just a judgement call which of those things we ought > to try to accommodate in the psql syntax and which ones ought to be > done by writing an ad-hoc query against pg_settings. Sure. Nonetheless, having decided to introduce this command, we have to make that judgement call. psql-ref.sgml currently explains that If pattern is specified, only parameters whose names match the pattern are listed. Without a pattern, only parameters that are set to non-default values are listed. (Use \dconfig * to see all parameters.) so we have the "all of them" and "ones whose names match a pattern" cases covered. And the definition of the default behavior as "only ones that are set to non-default values" seems reasonable enough, but the question is what counts as a "non-default value", or for that matter what counts as "setting". I think a reasonable case can be made for excluding "internal" GUCs on the grounds that (a) they cannot be set, and therefore (b) whatever value they have might as well be considered the default. regards, tom lane
Re: pgcon unconference / impact of block size on performance
Hello Tomas,

> At one of the pgcon unconference sessions a couple days ago, I presented
> a bunch of benchmark results comparing performance with different
> data/WAL block size. Most of the OLTP results showed significant gains
> (up to 50%) with smaller (4k) data pages.

You wrote something about SSDs a long time ago, but the link is now dead: http://www.fuzzy.cz/en/articles/ssd-benchmark-results-read-write-pgbench/

See also:
http://www.cybertec.at/postgresql-block-sizes-getting-started/
http://blog.coelho.net/database/2014/08/08/postgresql-page-size-for-SSD.html

[...]

> The other important factor is the native SSD page, which is similar to
> sectors on HDD. SSDs however don't allow in-place updates, and have to
> reset/rewrite the whole native page. It's actually more complicated,
> because the reset happens at a much larger scale (~8MB block), so it
> does matter how quickly we "dirty" the data. The consequence is that
> using data pages smaller than the native page (depends on the device,
> but seems 4K is the common value) either does not help or actually
> hurts the write performance. All the SSD results show this behavior -
> the Optane and Samsung nicely show that 4K is much better (in random
> write IOPS) than 8K, but 1-2K pages make it worse.

Yep. ISTM that you should also consider the underlying FS block size. Ext4 uses 4 KiB by default, so if you write 2 KiB it will write 4 KiB anyway.

There is not much doubt that with SSDs we should reduce the default page size. There are some negative impacts (e.g. more space is lost to headers, and fewer tuples fit per page), but I guess there should be an overall benefit. It would help a lot if it were possible to initdb with a different block size, without recompiling.

-- Fabien.
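Fabien's filesystem-block-size point is easy to check programmatically. A small POSIX sketch (the helper name is invented) — writes smaller than this size just get rounded up by the filesystem anyway:

```c
#include <sys/statvfs.h>

/*
 * Return the filesystem's preferred block size for the given path,
 * or -1 on error.  On a default ext4 filesystem this is 4096, which
 * is why 2 KiB database pages cannot reduce physical write volume.
 */
static long
fs_block_size(const char *path)
{
    struct statvfs sv;

    if (statvfs(path, &sv) != 0)
        return -1;
    return (long) sv.f_bsize;
}
```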
Re: oat_post_create expected behavior
On Mon, Jun 6, 2022 at 3:46 PM Jeff Davis wrote: > On Mon, 2022-06-06 at 13:43 -0400, Robert Haas wrote: > > Yeah, that comment could be made more clear. > > I still don't understand what the rule is. > > Is the rule that OAT_POST_CREATE must always use SnapshotSelf for any > catalog access? And if so, do we need to update code in contrib > extensions to follow that rule? I don't think there is a rule in the sense that you want there to be one. We sometimes call the object access hook before the CCI, and sometimes after, and the sepgsql code knows which cases are handled which ways and proceeds differently on that basis. If we went and changed the sepgsql code that uses system catalog lookups to use SnapshotSelf instead, I think it would still work, but it would be less efficient, so that doesn't seem like a desirable change to me. If it's possible to make the hook placement always happen after a CCI, then we could change the sepgsql code to always use catalog lookups, which would probably be more efficient but likely require adding some CCI calls, which may attract objections from Tom --- or maybe it won't. Absent either of those things, I'm inclined to just make the comment clearly state that we're not consistent about it. That's not great, but it may be the best we can do. -- Robert Haas EDB: http://www.enterprisedb.com
Re: oat_post_create expected behavior
On Mon, 2022-06-06 at 13:43 -0400, Robert Haas wrote: > Yeah, that comment could be made more clear. I still don't understand what the rule is. Is the rule that OAT_POST_CREATE must always use SnapshotSelf for any catalog access? And if so, do we need to update code in contrib extensions to follow that rule? Regards, Jeff Davis
Re: replacing role-level NOINHERIT with a grant-level option
On Thu, Jun 2, 2022 at 12:26 PM Robert Haas wrote:
> 1. Extend the GRANT role_name TO role_name [ WITH ADMIN OPTION ] with
> a new, optional clause, something like WITH NO INHERIT or WITH
> NOINHERIT or WITHOUT INHERIT.

I just realized that, with ADMIN OPTION, you can change your mind after the initial grant, and we probably would want to allow the same kind of thing here. The existing syntax is:

1. GRANT foo TO bar [WITH ADMIN OPTION];
2. REVOKE foo FROM bar;
3. REVOKE ADMIN OPTION FOR foo FROM bar;

To grant the admin option later, you just use (1) again and this time include WITH ADMIN OPTION. To revoke it without removing the grant entirely, you use (3). This could easily be generalized to any number of options all of which are Boolean properties and all of which have a default value of false, but INHERIT is a Boolean property with a default value of true, and calling the property NOINHERIT to dodge that issue seems dumb. I'd like to find a way to extend the syntax that can accommodate not only INHERIT as an option, but any other options we might want to add in the future, and maybe even leave room for non-Boolean properties, just in case. The question of which options we think it valuable to add is separate from what the syntax ought to be, so for syntax discussion purposes let's suppose we want to add three new options: FRUIT, which can be strawberry, banana, or kiwi; CHOCOLATE, a Boolean whose default value is true; and SARDINES, another Boolean whose default value is false. Then:

GRANT foo TO bar WITH FRUIT STRAWBERRY;
GRANT foo TO bar WITH CHOCOLATE FALSE;
GRANT foo TO bar WITH SARDINES TRUE;
GRANT foo TO bar WITH SARDINES; -- same as previous
GRANT foo TO bar WITH SARDINES OPTION; -- also same as previous
GRANT foo TO bar WITH FRUIT KIWI, SARDINES; -- multiple options
GRANT foo TO bar WITH CHOCOLATE OPTION, SARDINES OPTION; -- dubious combination

In other words, you write a comma-separated list of option names.
Each option name can be followed by an option value or by the word OPTION. If it is followed by the word OPTION or by nothing, the option value is taken to be true. This would mean that, going forward, any of the following would work: (a) GRANT foo TO bar WITH ADMIN OPTION, (b) GRANT foo TO BAR WITH ADMIN, (c) GRANT foo TO BAR WITH ADMIN TRUE. To revoke a grant entirely, you just say REVOKE foo FROM bar, as now. To change an option for an existing grant, you can re-execute the grant statement with a different WITH clause. Any options that are explicitly mentioned will be changed to have the associated values; unmentioned options will retain their existing values. If you want to change the value of a Boolean option to false, you have a second option, which is to write "REVOKE option_name OPTION FOR foo FROM bar," which means exactly the same thing as "GRANT foo TO bar WITH option_name FALSE". In terms of what options to offer, the most obvious idea is to just add INHERIT as a Boolean option which is true by default. We could go further and also add a SET option, with the idea that INHERIT OPTION controls whether you can exercise the privileges of the role without SET ROLE, and SET OPTION controls whether you can switch to that role using the SET ROLE command. Those two things together would give us a way to get to the admin-without-membership concept that we have previously discussed: GRANT foo TO BAR WITH ADMIN TRUE, INHERIT FALSE, SET FALSE sounds like it should do the trick. I briefly considered suggesting that the way to set a Boolean-valued option to false ought to be to write "NO option_name" rather than "option_name FALSE", since it reads more naturally, but I proposed this instead because it's more like what we do for other options lists (cf. EXPLAIN, VACUUM, COPY). Thoughts? -- Robert Haas EDB: http://www.enterprisedb.com
Re: Sudden database error with COUNT(*) making Query Planner crashes: variable not found in subplan target list
On Mon, Jun 06, 2022 at 04:50:55PM +, Jean Landercy - BEEODIVERSITY wrote: > Dear Justin, > > Thank you for your quick reply. > Unfortunately, the server having this issue is an Azure Flexible Server. > Upgrades are managed by Azure, I will have to wait until they release the > version 13.7. I don't know what to suggest other than to open a support ticket with the vendor. -- Justin
Re: [v15 beta] pg_upgrade failed if earlier executed with -c switch
On Mon, Jun 06, 2022 at 07:43:53PM +0200, Daniel Gustafsson wrote:
> > On 6 Jun 2022, at 06:17, Michael Paquier wrote:
> > On Mon, Jun 06, 2022 at 02:38:03AM +0200, Daniel Gustafsson wrote:
> >> On 5 Jun 2022, at 11:19, Michael Paquier wrote:
> >>> I have been toying with the idea of a sub-directory named with a
> >>> timestamp (Unix time, like log_line_prefix's %n but this could be
> >>> any format) under pg_upgrade_output.d/ and finished with the
> >>> attached.
> >>
> >> I was thinking more along the lines of %m to make it (more) human
> >> readable, but I'm certainly not wedded to any format.

It seems important to use a format in most-significant-parts-first which sorts nicely by filename, but otherwise anything could be okay.

> > Neither am I. I would not map exactly to %m as it uses whitespaces,
> > but something like %Y%m%d_%H%M%S.%03d (3-digit ms for last part) would
> > be fine? If there are other ideas for the format, just let me know.
>
> I think this makes more sense from an end-user perspective.

Is it better to use "T" instead of "_"? Apparently, that's ISO 8601, which can optionally use separators (YYYY-MM-DDTHH:MM:SS). https://en.wikipedia.org/wiki/ISO_8601#Combined_date_and_time_representations

I was thinking this would not include fractional seconds. Maybe that would mean that the TAP tests would need to sleep(1) at some points...
--
Justin
Re: pgsql: Use pre-fetching for ANALYZE
Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> On 2022-06-02 19:30:16 -0700, Andres Freund wrote:
> > On 2021-03-16 18:48:08 +, Stephen Frost wrote:
> > > Use pre-fetching for ANALYZE
> > >
> > > When we have posix_fadvise() available, we can improve the performance
> > > of an ANALYZE by quite a bit by using it to inform the kernel of the
> > > blocks that we're going to be asking for. Similar to bitmap index
> > > scans, the number of buffers pre-fetched is based off of the
> > > maintenance_io_concurrency setting (for the particular tablespace or,
> > > if not set, globally, via get_tablespace_maintenance_io_concurrency()).
> >
> > I just looked at this as part of debugging a crash / hang in the AIO patch.
> >
> > The code does:
> >
> > 	block_accepted = table_scan_analyze_next_block(scan, targblock,
> > 	                                               vac_strategy);
> >
> > #ifdef USE_PREFETCH
> >
> > 	/*
> > 	 * When pre-fetching, after we get a block, tell the kernel about the
> > 	 * next one we will want, if there's any left.
> > 	 *
> > 	 * We want to do this even if the table_scan_analyze_next_block() call
> > 	 * above decides against analyzing the block it picked.
> > 	 */
> > 	if (prefetch_maximum && prefetch_targblock != InvalidBlockNumber)
> > 		PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_targblock);
> > #endif
> >
> > I.e. we lock a buffer and *then* we prefetch another buffer. That seems
> > like a quite bad idea to me. Why are we doing IO while holding a content
> > lock, if we can avoid it?

At the end, we're doing a posix_fadvise() which is a kernel call but
hopefully wouldn't do actual IO when we call it. Still, agreed that it'd be
better to do that without holding locks and no objection to making such a
change.

> It also seems decidedly not great from a layering POV to do the IO in
> analyze.c. There's no guarantee that the tableam maps blocks in a way
> that's compatible with PrefetchBuffer().
> Yes, the bitmap heap scan code does something similar, but a) that is opt
> in by the AM, b) there's a comment saying it's quite crufty and should be
> fixed.

Certainly open to suggestions. Are you thinking it'd make sense to add a
'prefetch_block' method to TableAmRoutine? Or did you have another thought?

Thanks!

Stephen
Re: [PATCH] Expose port->authn_id to extensions and triggers
On Fri, Jun 3, 2022 at 10:04 AM Tom Lane wrote: > I agree with Robert's complaint that Parallel is far too generic > a term here. Also, the fact that this data is currently in struct > Port seems like an artifact. Why do we call this thing a Port, anyway? I think I'd feel more comfortable here if we were defining what went into which struct on some semantic basis rather than being like, OK, so all the stuff we want to serialize goes into struct #1, and the stuff we don't want to serialize goes into struct #2. I suppose if it's just based on whether or not we want to serialize it, then the placement of future additions will just be based on how people happen to feel about the thing they're adding right at that moment, and there won't be any consistency. One could imagine dividing the Port struct into a couple of different structs, e.g. AuthenticationState: stuff that is needed only during authentication and can be discarded thereafter (e.g. the HBA line, at least if the comment is to be believed) ClientCommunicationState: stuff that is used to communicate with the client but doesn't need to be or can't be shared (e.g. the SSL object itself) ClientConnectionInfo: stuff that someone might want to look at for information purposes at any time (e.g. authn_id, apparently) Then we could serialize the third of these, keep the second around but not serialize it, and free the first once connection setup is complete. -- Robert Haas EDB: http://www.enterprisedb.com
Re: [v15 beta] pg_upgrade failed if earlier executed with -c switch
> On 6 Jun 2022, at 06:17, Michael Paquier wrote: > On Mon, Jun 06, 2022 at 02:38:03AM +0200, Daniel Gustafsson wrote: >> On 5 Jun 2022, at 11:19, Michael Paquier wrote: >>> I have been toying with the idea of a sub-directory named with a >>> timestamp (Unix time, like log_line_prefix's %n but this could be >>> any format) under pg_upgrade_output.d/ and finished with the >>> attached. >> >> I was thinking more along the lines of %m to make it (more) human readable, >> but >> I'm certainly not wedded to any format. > > Neither am I. I would not map exactly to %m as it uses whitespaces, > but something like %Y%m%d_%H%M%S.%03d (3-digit ms for last part) would > be fine? If there are other ideas for the format, just let me know. I think this makes more sense from an end-user perspective. >> As a user I would expect the logs from this current invocation to be removed >> without --retain, and any other older log entries be kept. I think we should >> remove log_opts.logdir and only remove log_opts.rootdir if it is left empty >> after .logdir is removed. > > Okay, however I think you mean log_opts.basedir rather than logdir? > That's simple enough to switch around as pg_check_dir() does this > job. Correct, I mistyped. The cleanup in this version of the patch looks sane to me. -- Daniel Gustafsson https://vmware.com/
Re: oat_post_create expected behavior
On Mon, Jun 6, 2022 at 1:35 PM Jeff Davis wrote: > Out of curiosity, why not? The proposed patch only runs it if the > object access hook is set. Do you see a situation where it would be > confusing that an earlier DDL change is visible? And if so, would it > make more sense to call CCI unconditionally? Well, I think that a fair amount of work has been done previously to cut down on unnecessary CCIs. I suspect Tom in particular is likely to object to adding a whole bunch more of them, and I think that objection would have some merit. I definitely think if we were going to do it, it would need to be unconditional. Otherwise I think we'll end up with bugs, because nobody's going to go test all of that code with and without an object access hook installed every time they tweak some DDL-related code. > Also, would it ever be reasonable for such a hook to call CCI itself? > As you say, it could use SnapshotSelf, but sometimes you might want to > call routines that assume they can use an MVCC snapshot. This question > applies to the OAT_POST_ALTER hook as well as OAT_POST_CREATE. I definitely wouldn't want to warrant that it doesn't break anything. I think the extension can do it at its own risk, but I wouldn't recommend it. OAT_POST_ALTER is unlike OAT_POST_CREATE in that OAT_POST_ALTER documents that it should be called after a CCI, whereas OAT_POST_CREATE does not make a representation either way. > > Possibly there is some work that could be done to ensure > > consistent placement of the calls to post-create hooks so that either > > all of them happen before, or all of them happen after, a CCI has > > occurred, but I'm not sure whether or not that is feasible. > > I like the idea of having a test in place so that we at least know when > something changes. Otherwise it would be pretty easy to break an > extension by adjusting some code. Sure. 
I find writing meaningful tests for this kind of stuff hard, but there are plenty of people around here who are better at figuring out how to test obscure scenarios than I. > > Currently, > > I don't think we promise anything about whether a post-create hook > > call will occur before or after a CCI, which is why > > sepgsql_schema_post_create(), sepgsql_schema_post_create(), and > > sepgsql_attribute_post_create() perform a catalog scan using > > SnapshotSelf, while sepgsql_database_post_create() uses > > get_database_oid(). You might want to adopt a similar technique. > > It would be good to document this a little better though: > > * OAT_POST_CREATE should be invoked just after the object is created. > * Typically, this is done after inserting the primary catalog records > and > * associated dependencies. > > doesn't really give any guidance, while the comment for alter does: > > * OAT_POST_ALTER should be invoked just after the object is altered, > * but before the command counter is incremented. An extension using > the > * hook can use a current MVCC snapshot to get the old version of the > tuple, > * and can use SnapshotSelf to get the new version of the tuple. Yeah, that comment could be made more clear. -- Robert Haas EDB: http://www.enterprisedb.com
Re: oat_post_create expected behavior
On Mon, 2022-06-06 at 10:51 -0400, Robert Haas wrote: > I don't think a proposal to add CommandCounterIncrement() calls just > for the convenience of object access hooks has much chance of being > accepted. Out of curiosity, why not? The proposed patch only runs it if the object access hook is set. Do you see a situation where it would be confusing that an earlier DDL change is visible? And if so, would it make more sense to call CCI unconditionally? Also, would it ever be reasonable for such a hook to call CCI itself? As you say, it could use SnapshotSelf, but sometimes you might want to call routines that assume they can use an MVCC snapshot. This question applies to the OAT_POST_ALTER hook as well as OAT_POST_CREATE. > Possibly there is some work that could be done to ensure > consistent placement of the calls to post-create hooks so that either > all of them happen before, or all of them happen after, a CCI has > occurred, but I'm not sure whether or not that is feasible. I like the idea of having a test in place so that we at least know when something changes. Otherwise it would be pretty easy to break an extension by adjusting some code. > Currently, > I don't think we promise anything about whether a post-create hook > call will occur before or after a CCI, which is why > sepgsql_schema_post_create(), sepgsql_schema_post_create(), and > sepgsql_attribute_post_create() perform a catalog scan using > SnapshotSelf, while sepgsql_database_post_create() uses > get_database_oid(). You might want to adopt a similar technique. It would be good to document this a little better though: * OAT_POST_CREATE should be invoked just after the object is created. * Typically, this is done after inserting the primary catalog records and * associated dependencies. doesn't really give any guidance, while the comment for alter does: * OAT_POST_ALTER should be invoked just after the object is altered, * but before the command counter is incremented. 
An extension using the * hook can use a current MVCC snapshot to get the old version of the tuple, * and can use SnapshotSelf to get the new version of the tuple. Regards, Jeff Davis PS: I added this to the July CF: https://commitfest.postgresql.org/38/3676/
Re: Logging query parameters in auto_explain
Dagfinn Ilmari Mannsåker writes:
> Hi hackers,
>
> Inspired by a question on IRC, I noticed that while the core statement
> logging system gained the option to log statement parameters in PG 13,
> auto_explain was left out.
>
> Here's a patch that adds a corresponding
> auto_explain.log_parameter_max_length config setting, which controls the
> "Query Parameters" node in the logged plan. Just like in core, the
> default is -1, which logs the parameters in full, and 0 disables
> parameter logging, while any other value truncates each parameter to
> that many bytes.

I've added it to the upcoming commitfest:
https://commitfest.postgresql.org/38/3660/

- ilmari
Re: How about a psql backslash command to show GUCs?
> On Jun 6, 2022, at 9:02 AM, Tom Lane wrote: > > Thoughts? I think it depends on your mental model of what \dconfig is showing you. Is it showing you the list of configs that you can SET, or just the list of all configs?
Re: pg_buffercache: add sql test
Greetings, * Daniel Gustafsson (dan...@yesql.se) wrote: > > On 6 Jun 2022, at 15:30, Dong Wook Lee wrote: > > > I just wrote a test code for the `pg_buffercache` extension which > > doesn't not have test code. > > Please add this patch to the next commitfest to make sure it's not lost before > then. > > https://commitfest.postgresql.org/38/ Seems to be there now, at least: https://commitfest.postgresql.org/38/3674/ However, I don't think we should have a 'target version' set for this (and in particular it shouldn't be 15). I'd suggest removing that. Thanks, Stephen
Re: How about a psql backslash command to show GUCs?
On Mon, Jun 6, 2022 at 12:02 PM Tom Lane wrote: > Thoughts? This all seems pretty subjective. As someone who sometimes supports customers, I usually kind of want the customer to give me all the settings, just in case something that didn't seem important to them (or to whoever coded up the \dconfig command) turns out to be relevant. It's easier to ask once and then look for the information you need than to go back and forth asking for more data, and I don't want to have to try to infer things based on what version they are running or how a certain set of binaries was built. But if I already know that the configuration on a given system is basically sane, I probably only care about the parameters with non-default values, and a computed parameter can't be set, so I guess it has a default value by definition. So if the charter for this command is to show only non-default GUCs, then it seems reasonable to leave these out. I think part of the problem here, though, is that one can imagine a variety of charters that might be useful. A user could want to see the parameters that have values in their session that differ from the system defaults, or parameters that have values which differ from the compiled-in defaults, or parameters that can be changed without a restart, or all the pre-computed parameters, or all the parameters that contain "vacuum" in the name, or all the query-planner-related parameters, or all the parameters on which any privileges have been granted. And it's just a judgement call which of those things we ought to try to accommodate in the psql syntax and which ones ought to be done by writing an ad-hoc query against pg_settings. -- Robert Haas EDB: http://www.enterprisedb.com
RE: Sudden database error with COUNT(*) making Query Planner crashes: variable not found in subplan target list
Dear Justin,

Thank you for your quick reply. Unfortunately, the server having this issue
is an Azure Flexible Server. Upgrades are managed by Azure, so I will have
to wait until they release version 13.7.

Is there a procedure to replicate the database and preserve the bug? My
attempts with pg_dump/psql failed (the bug vanishes). If so, then I can
clone the faulty database and upgrade it on a newer version.

Best regards,
Landercy Jean
Re: pg_buffercache: add sql test
> On 6 Jun 2022, at 15:30, Dong Wook Lee wrote: > I just wrote a test code for the `pg_buffercache` extension which > doesn't not have test code. Please add this patch to the next commitfest to make sure it's not lost before then. https://commitfest.postgresql.org/38/ -- Daniel Gustafsson https://vmware.com/
Re: How about a psql backslash command to show GUCs?
Justin Pryzby writes:
> I noticed this is showing "pre-computed" gucs, like:
> shared_memory_size               | 149MB
> shared_memory_size_in_huge_pages | 75
> I'm not opposed to that, but I wonder if that's what's intended / best.

I had suggested upthread that we might want to hide items with source =
'override', but that idea didn't seem to be getting traction.  A different
idea is to hide items with context = 'internal'.  Looking at the items
selected by the current rule in a default installation:

postgres=# SELECT s.name, source, context FROM pg_catalog.pg_settings s
           WHERE s.source <> 'default' AND s.setting IS DISTINCT FROM s.boot_val
           ORDER BY 1;
               name               |        source        |  context
----------------------------------+----------------------+------------
 TimeZone                         | configuration file   | user
 application_name                 | client               | user
 client_encoding                  | client               | user
 config_file                      | override             | postmaster
 data_directory                   | override             | postmaster
 default_text_search_config       | configuration file   | user
 hba_file                         | override             | postmaster
 ident_file                       | override             | postmaster
 lc_messages                      | configuration file   | superuser
 log_timezone                     | configuration file   | sighup
 max_stack_depth                  | environment variable | superuser
 shared_memory_size               | override             | internal
 shared_memory_size_in_huge_pages | override             | internal
 wal_buffers                      | override             | postmaster
(14 rows)

So hiding internal-context items would hit exactly the two you mention, but
hiding override-source items would hit several more.  (I'm kind of wondering
why wal_buffers is showing as "override"; that seems like a quirk.)

Thoughts?

			regards, tom lane
Re: pgcon unconference / impact of block size on performance
On 6/6/22 17:00, Tomas Vondra wrote: > > On 6/6/22 16:27, Jakub Wartak wrote: >> Hi Tomas, >> >>> Hi, >>> >>> At on of the pgcon unconference sessions a couple days ago, I presented a >>> bunch of benchmark results comparing performance with different data/WAL >>> block size. Most of the OLTP results showed significant gains (up to 50%) >>> with >>> smaller (4k) data pages. >> >> Nice. I just saw this > https://wiki.postgresql.org/wiki/PgCon_2022_Developer_Unconference , do > you have any plans for publishing those other graphs too (e.g. WAL block > size impact)? >> > > Well, there's plenty of charts in the github repositories, including the > charts I think you're asking for: > > https://github.com/tvondra/pg-block-bench-pgbench/blob/master/process/heatmaps/xeon/20220406-fpw/16/heatmap-tps.png > > https://github.com/tvondra/pg-block-bench-pgbench/blob/master/process/heatmaps/i5/20220427-fpw/16/heatmap-io-tps.png > > > I admit the charts may not be documented very clearly :-( > >>> This opened a long discussion about possible explanations - I claimed one >>> of the >>> main factors is the adoption of flash storage, due to pretty fundamental >>> differences between HDD and SSD systems. But the discussion concluded with >>> an >>> agreement to continue investigating this, so here's an attempt to support >>> the >>> claim with some measurements/data. >>> >>> Let me present results of low-level fio benchmarks on a couple different HDD >>> and SSD drives. This should eliminate any postgres-related influence (e.g. >>> FPW), >>> and demonstrates inherent HDD/SSD differences. >>> All the SSD results show this behavior - the Optane and Samsung nicely show >>> that 4K is much better (in random write IOPS) than 8K, but 1-2K pages make >>> it >>> worse. >>> >> [..] >> Can you share what Linux kernel version, what filesystem , it's >> mount options and LVM setup were you using if any(?) 
>>
> The PostgreSQL benchmarks were with 5.14.x kernels, with either ext4 or
> xfs filesystems.

I realized I mentioned just two of the devices used for the postgres test,
but this thread is dealing mostly with fio results. So let me list info
about all the devices/filesystems:

i5
--

Intel SSD 320 120GB SATA (SSDSA2CW12)
/dev/sdh1 on /mnt/data type ext4 (rw,noatime)

6x Intel SSD DC S3700 100GB SATA (SSDSC2BA10), LVM RAID0
/dev/md0 on /mnt/raid type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,sunit=16,swidth=96,noquota)

xeon
----

Samsung SSD 860 EVO 2TB SATA (RVT04B6Q)
/dev/sde1 on /mnt/samsung type ext4 (rw,relatime)

Intel Optane 900P 280GB NVMe (SSDPED1D280GA)
/dev/nvme0n1p1 on /mnt/data type ext4 (rw,relatime)

3x Maxtor DiamondMax 21 500GB 7.2k SATA (STM350063), LVM RAID0
/dev/md0 on /mnt/raid type ext4 (rw,relatime,stripe=48)

# mdadm --detail /dev/md0
/dev/md0:
           Version : 1.2
     Creation Time : Fri Aug 31 21:11:48 2018
        Raid Level : raid0
        Array Size : 1464763392 (1396.91 GiB 1499.92 GB)
      Raid Devices : 3
     Total Devices : 3
       Persistence : Superblock is persistent

       Update Time : Fri Aug 31 21:11:48 2018
             State : clean
    Active Devices : 3
   Working Devices : 3
    Failed Devices : 0
     Spare Devices : 0

        Chunk Size : 64K

Consistency Policy : none

              Name : bench2:0  (local to host bench2)
              UUID : 72e48e7b:a75554ea:05952b34:810ed6bc
            Events : 0

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8       33        1      active sync   /dev/sdc1
       2       8       49        2      active sync   /dev/sdd1

Hopefully this is more complete ...

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [PATCH] Expose port->authn_id to extensions and triggers
On Fri, Jun 3, 2022 at 7:36 PM Michael Paquier wrote:
>
> On Fri, Jun 03, 2022 at 10:04:12AM -0400, Tom Lane wrote:
> > I agree with Robert's complaint that Parallel is far too generic
> > a term here. Also, the fact that this data is currently in struct
> > Port seems like an artifact.
> >
> > Don't we have a term for the set of processes comprising a leader
> > plus parallel workers? If we called that set FooGroup, then
> > something like FooGroupSharedInfo would be on-point.
>
> As far as I know, proc.h includes the term "group members", which
> includes the leader and its workers (see CLOG and lock part)?

lmgr/README also refers to "gangs of related processes" and "parallel
groups". So

- GroupSharedInfo
- ParallelGroupSharedInfo
- GangSharedInfo
- SharedLeaderInfo

?

--Jacob
Re: Assertion failure with barriers in parallel hash join
Hi Thomas, Correct. We're running with disabled parallel leader participation and we have to do so, because another custom plan node we built relies on that. That would be great. Anything else I can help with to get this patch in? Thanks! -- David (ServiceNow) On Fri, 3 Jun 2022 at 00:06, Thomas Munro wrote: > On Thu, Jun 2, 2022 at 9:31 PM David Geier wrote: > > We recently encountered the same bug in the field. Oleksii Kozlov > managed to come up with reproduction steps, which reliably trigger it. > Interestingly, the bug does not only manifest as failing assertion, but > also as segmentation fault; in builds with disabled and with enabled (!) > assertions. So it can crash production environments. We applied the > proposed patch v3 from Melanie to the REL_14_3 branch and can confirm that > with the patch neither the assertion nor the segmentation fault still occur. > > Thanks for the report, testing, and for creating the CF entry. > > I assume you are using parallel_leader_participation=off, and the > reason we haven't heard more about this is because few people do that. > By coincidence I was just about to restart a bunch of hash join > projects and have been paging the topic area back into my brain, so > I'll start with another round of testing/analysis of this bug/patch > next week. >
Re: pgcon unconference / impact of block size on performance
On 6/6/22 16:27, Jakub Wartak wrote: > Hi Tomas, > >> Hi, >> >> At on of the pgcon unconference sessions a couple days ago, I presented a >> bunch of benchmark results comparing performance with different data/WAL >> block size. Most of the OLTP results showed significant gains (up to 50%) >> with >> smaller (4k) data pages. > > Nice. I just saw this https://wiki.postgresql.org/wiki/PgCon_2022_Developer_Unconference , do you have any plans for publishing those other graphs too (e.g. WAL block size impact)? > Well, there's plenty of charts in the github repositories, including the charts I think you're asking for: https://github.com/tvondra/pg-block-bench-pgbench/blob/master/process/heatmaps/xeon/20220406-fpw/16/heatmap-tps.png https://github.com/tvondra/pg-block-bench-pgbench/blob/master/process/heatmaps/i5/20220427-fpw/16/heatmap-io-tps.png I admit the charts may not be documented very clearly :-( >> This opened a long discussion about possible explanations - I claimed one of >> the >> main factors is the adoption of flash storage, due to pretty fundamental >> differences between HDD and SSD systems. But the discussion concluded with an >> agreement to continue investigating this, so here's an attempt to support the >> claim with some measurements/data. >> >> Let me present results of low-level fio benchmarks on a couple different HDD >> and SSD drives. This should eliminate any postgres-related influence (e.g. >> FPW), >> and demonstrates inherent HDD/SSD differences. >> All the SSD results show this behavior - the Optane and Samsung nicely show >> that 4K is much better (in random write IOPS) than 8K, but 1-2K pages make it >> worse. >> > [..] > Can you share what Linux kernel version, what filesystem , it's > mount options and LVM setup were you using if any(?) > The PostgreSQL benchmarks were with 5.14.x kernels, with either ext4 or xfs filesystems. 
i5 uses LVM on the 6x SATA SSD devices, with this config:

bench ~ # mdadm --detail /dev/md0
/dev/md0:
           Version : 0.90
     Creation Time : Thu Feb  8 15:05:49 2018
        Raid Level : raid0
        Array Size : 586106880 (558.96 GiB 600.17 GB)
      Raid Devices : 6
     Total Devices : 6
   Preferred Minor : 0
       Persistence : Superblock is persistent

       Update Time : Thu Feb  8 15:05:49 2018
             State : clean
    Active Devices : 6
   Working Devices : 6
    Failed Devices : 0
     Spare Devices : 0

        Chunk Size : 512K

Consistency Policy : none

              UUID : 24c6158c:36454b38:529cc8e5:b4b9cc9d (local to host bench)
            Events : 0.1

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
       2       8       33        2      active sync   /dev/sdc1
       3       8       49        3      active sync   /dev/sdd1
       4       8       65        4      active sync   /dev/sde1
       5       8       81        5      active sync   /dev/sdf1

bench ~ # mount | grep md0
/dev/md0 on /mnt/raid type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,sunit=16,swidth=96,noquota)

and the xeon just uses ext4 on the device directly:

/dev/nvme0n1p1 on /mnt/data type ext4 (rw,relatime)

> I've hastily tried your script on 4VCPU/32GB RAM/1xNVMe device @
> ~900GB (AWS i3.xlarge), kernel 5.x, ext4 defaults, no LVM, libaio
> only, fio deviations: runtime -> 1min, 64GB file, 1 iteration only.
> Results are attached, w/o graphs.
>
>> Now, compare this to the SSD. There are some differences between
>> the models, manufacturers, interface etc. but the impact of page
>> size on IOPS is pretty clear. On the Optane you can get +20-30% by
>> using 4K pages, on the Samsung it's even more, etc. This means that
>> workloads dominated by random I/O get significant benefit from
>> smaller pages.
> Yup, same here, reproduced, 1.42x faster on writes:
> [root@x ~]# cd libaio/nvme/randwrite/128/  # 128=queue depth
> [root@x 128]# grep -r "write:" * | awk '{print $1, $4, $5}' | sort -n
> 1k/1.txt: bw=24162KB/s, iops=24161,
> 2k/1.txt: bw=47164KB/s, iops=23582,
> 4k/1.txt: bw=280450KB/s, iops=70112, <<<
> 8k/1.txt: bw=393082KB/s, iops=49135,
> 16k/1.txt: bw=393103KB/s, iops=24568,
> 32k/1.txt: bw=393283KB/s, iops=12290,
>
> BTW it's interesting to compare to your's Optane 900P result (same
> two high bars for IOPS @ 4,8kB), but in my case it's even more import
> to select 4kB so it behaves more like Samsung 860 in your case

Thanks. Interesting!

> # 1.41x on randreads
> [root@x ~]# cd libaio/nvme/randread/128/  # 128=queue depth
> [root@x 128]# grep -r "read :" | awk '{print $1, $5, $6}' | sort -n
> 1k/1.txt: bw=169938KB/s, iops=169937,
> 2k/1.txt: bw=376653KB/s, iops=188326,
> 4k/1.txt: bw=691529KB/s, iops=172882, <<<
> 8k/1.txt: bw=976916KB/s, iops=122114,
> 16k/1.txt: bw=990524KB/s, iops=61907,
> 32k/1.txt: bw=974318KB/s, iops=30447,
>
> I think that the above just a demonstration of
Re: oat_post_create expected behavior
On Thu, Jun 2, 2022 at 6:37 PM Mary Xu wrote: > I was using an object access hook for oat_post_create access while creating > an extension and expected that I would be able to query for the newly created > extension with get_extension_oid(), but it was returning InvalidOid. However, > the same process works for triggers, so I was wondering what the expected > behavior is? > From the documentation in objectaccess.h, it doesn't mention anything about > CommandCounterIncrement() for POST_CREATE which implied to me that it wasn't > something I would need to worry about. > One option I thought of was this patch where CCI is called before the access > hook so that the new tuple is visible in the hook. Another option would be to > revise the documentation to reflect the expected behavior. I don't think a proposal to add CommandCounterIncrement() calls just for the convenience of object access hooks has much chance of being accepted. Possibly there is some work that could be done to ensure consistent placement of the calls to post-create hooks so that either all of them happen before, or all of them happen after, a CCI has occurred, but I'm not sure whether or not that is feasible. Currently, I don't think we promise anything about whether a post-create hook call will occur before or after a CCI, which is why sepgsql_schema_post_create(), sepgsql_schema_post_create(), and sepgsql_attribute_post_create() perform a catalog scan using SnapshotSelf, while sepgsql_database_post_create() uses get_database_oid(). You might want to adopt a similar technique. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Add index scan progress to pg_stat_progress_vacuum
On Thu, May 26, 2022 at 11:43 AM Masahiko Sawada wrote:
> Another idea I came up with is that we can wait for all index vacuums
> to finish while checking and updating the progress information, and
> then calls WaitForParallelWorkersToFinish after confirming all index
> status became COMPLETED. That way, we don’t need to change the
> parallel query infrastructure. What do you think?

+1 from me. It doesn't seem to me that we should need to add something like
parallel_vacuum_progress_callback in order to solve this problem, because
the parallel index vacuum code could just do the waiting itself, as you
propose here. The question Sami asks in his reply is a good one, though --
who is to say that the leader only needs to update progress at the end,
once it's finished the index it's handling locally? There will need to be a
callback system of some kind to allow the leader to update progress as
other workers finish, even if the leader is still working.

I am not too sure that the idea of using the vacuum delay points is the
best plan. I think we should try to avoid piggybacking on such general
infrastructure if we can, and instead look for a way to tie this to
something that is specific to parallel vacuum. However, I haven't studied
the problem so I'm not sure whether there's a reasonable way to do that.

--
Robert Haas
EDB: http://www.enterprisedb.com
RE: pgcon unconference / impact of block size on performance
Hi Tomas,

> Hi,
>
> At on of the pgcon unconference sessions a couple days ago, I presented a
> bunch of benchmark results comparing performance with different data/WAL
> block size. Most of the OLTP results showed significant gains (up to 50%)
> with smaller (4k) data pages.

Nice. I just saw this
https://wiki.postgresql.org/wiki/PgCon_2022_Developer_Unconference , do you
have any plans for publishing those other graphs too (e.g. WAL block size
impact)?

> This opened a long discussion about possible explanations - I claimed one
> of the main factors is the adoption of flash storage, due to pretty
> fundamental differences between HDD and SSD systems. But the discussion
> concluded with an agreement to continue investigating this, so here's an
> attempt to support the claim with some measurements/data.
>
> Let me present results of low-level fio benchmarks on a couple different
> HDD and SSD drives. This should eliminate any postgres-related influence
> (e.g. FPW), and demonstrates inherent HDD/SSD differences.
> All the SSD results show this behavior - the Optane and Samsung nicely
> show that 4K is much better (in random write IOPS) than 8K, but 1-2K
> pages make it worse.
>
[..]

Can you share what Linux kernel version, what filesystem, its mount options
and LVM setup you were using, if any?

I've hastily tried your script on 4VCPU/32GB RAM/1xNVMe device @ ~900GB
(AWS i3.xlarge), kernel 5.x, ext4 defaults, no LVM, libaio only, fio
deviations: runtime -> 1min, 64GB file, 1 iteration only. Results are
attached, w/o graphs.

> Now, compare this to the SSD. There are some differences between the
> models, manufacturers, interface etc. but the impact of page size on IOPS
> is pretty clear. On the Optane you can get +20-30% by using 4K pages, on
> the Samsung it's even more, etc. This means that workloads dominated by
> random I/O get significant benefit from smaller pages.
Yup, same here, reproduced, 1.42x faster on writes:

[root@x ~]# cd libaio/nvme/randwrite/128/ # 128=queue depth
[root@x 128]# grep -r "write:" * | awk '{print $1, $4, $5}' | sort -n
1k/1.txt: bw=24162KB/s, iops=24161,
2k/1.txt: bw=47164KB/s, iops=23582,
4k/1.txt: bw=280450KB/s, iops=70112, <<<
8k/1.txt: bw=393082KB/s, iops=49135,
16k/1.txt: bw=393103KB/s, iops=24568,
32k/1.txt: bw=393283KB/s, iops=12290,

BTW it's interesting to compare to your Optane 900P result (same two high bars for IOPS @ 4,8kB), but in my case it's even more important to select 4kB, so it behaves more like the Samsung 860 in your case.

# 1.41x on randreads
[root@x ~]# cd libaio/nvme/randread/128/ # 128=queue depth
[root@x 128]# grep -r "read :" | awk '{print $1, $5, $6}' | sort -n
1k/1.txt: bw=169938KB/s, iops=169937,
2k/1.txt: bw=376653KB/s, iops=188326,
4k/1.txt: bw=691529KB/s, iops=172882, <<<
8k/1.txt: bw=976916KB/s, iops=122114,
16k/1.txt: bw=990524KB/s, iops=61907,
32k/1.txt: bw=974318KB/s, iops=30447,

I think the above is just a demonstration of device bandwidth saturation: 32k * 30k IOPS =~ 1GB/s random reads. Given that the DB would be tuned @ 4kB for the app (OLTP), wouldn't the occasional Parallel Seq Scans for "critical reports" then only achieve ~70% of what they could achieve on 8kB, correct? (I'm assuming most real systems are really OLTP but with some reporting/data exporting needs.)

One way or another it would be very nice to be able to select the tradeoff using initdb(1) without the need to recompile, which then begs for some initdb --calibrate /mnt/nvme (effective_io_concurrency, DB page size, ...). Do you envision any plans for this, or are we still in need of gathering more info on exactly why this happens? (perf reports?)

Also, have you guys discussed at that meeting any long-term future plans for the storage layer, by any chance? If sticking to 4kB pages on DB page size / hardware sector size, wouldn't it also be possible to win by disabling FPWs in the longer run using io_uring (assuming O_DIRECT | O_ATOMIC one day)?
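The bandwidth-saturation claim above can be checked arithmetically: past the knee, IOPS halves each time the block size doubles, so bandwidth (IOPS x block size) plateaus at the device ceiling. A quick sketch using the randread numbers quoted above:

```python
# Randread figures quoted above: block size in KiB -> IOPS.
randread_iops = {1: 169937, 2: 188326, 4: 172882, 8: 122114, 16: 61907, 32: 30447}

for kib, iops in sorted(randread_iops.items()):
    mib_per_s = kib * iops / 1024  # bw in KiB/s, converted to MiB/s
    print(f"{kib:2d}k: {mib_per_s:7.0f} MiB/s")

# 32 KiB * 30447 IOPS ~= 0.93 GiB/s -- roughly the ~1 GB/s device ceiling,
# so past 8k larger blocks trade away IOPS without gaining bandwidth,
# while 4k still delivers far more IOPS at lower total bandwidth.
```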
I recall that Thomas M. was researching O_ATOMIC, I think he wrote some of that up pretty nicely in [1]

[1] - https://wiki.postgresql.org/wiki/FreeBSD/AtomicIO
Re: Sudden database error with COUNT(*) making the Query Planner crash: variable not found in subplan target list
On Mon, Jun 06, 2022 at 09:34:24AM +, Jean Landercy - BEEODIVERSITY wrote: > Faulty setup is about: > > SELECT version(); > > -- PostgreSQL 13.6 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu > 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit Please check if the problem occurs in v13.7 https://www.postgresql.org/message-id/2197859.1644623...@sss.pgh.pa.us https://www.postgresql.org/message-id/flat/2121219.1644607692%40sss.pgh.pa.us#190c43702a91dbd0509ba545dbbab58d
Re: How about a psql backslash command to show GUCs?
On Tue, Apr 12, 2022 at 11:19:40AM -0400, Tom Lane wrote: > "Jonathan S. Katz" writes: > > On 4/11/22 4:11 PM, Tom Lane wrote: > >> This idea does somewhat address my unhappiness upthread about printing > >> values with source = 'internal', but I see that it gets confused by > >> some GUCs with custom show hooks, like unix_socket_permissions. > >> Maybe it needs to be "source != 'default' AND setting != boot_val"? > > > Running through a few GUCs, that seems reasonable. Happy to test the > > patch out prior to commit to see if it renders better. > > It'd just look like this, I think. I see from looking at guc.c that > boot_val can be NULL, so we'd better use IS DISTINCT FROM. I noticed this is showing "pre-computed" gucs, like: shared_memory_size| 149MB shared_memory_size_in_huge_pages | 75 I'm not opposed to that, but I wonder if that's what's intended / best. -- Justin
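The reason Tom reaches for IS DISTINCT FROM is that a plain `<>` comparison yields NULL (not false) when `boot_val` is NULL, so those rows silently drop out of the WHERE clause. A quick illustration of the NULL-safe comparison semantics, with Python's `None` standing in for SQL NULL:

```python
def neq(a, b):
    """SQL '<>': yields NULL (None) if either operand is NULL."""
    if a is None or b is None:
        return None
    return a != b

def is_distinct_from(a, b):
    """SQL 'IS DISTINCT FROM': NULL-safe inequality, never yields NULL."""
    if a is None and b is None:
        return False
    if a is None or b is None:
        return True
    return a != b

# setting vs boot_val, where boot_val can be NULL in guc.c:
print(neq("on", None))               # None -> row filtered out by WHERE
print(is_distinct_from("on", None))  # True -> row kept
print(is_distinct_from(None, None))  # False -> identical, row excluded
```

This is why `setting IS DISTINCT FROM boot_val` keeps GUCs whose boot value is NULL, whereas `setting != boot_val` would hide them.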
pg_buffercache: add sql test
Hi hackers,

I just wrote test code for the `pg_buffercache` extension, which currently doesn't have any tests. The SQL queries check that the buffer cache results are the same when `make installcheck` is performed.

---
regards, Lee Dong Wook.
Re: pg_rewind: warn when checkpoint hasn't happened after promotion
On Mon, Jun 6, 2022 at 1:26 AM Kyotaro Horiguchi wrote:
> At Sat, 4 Jun 2022 19:09:41 +0530, Bharath Rupireddy wrote in
> > On Sat, Jun 4, 2022 at 6:29 PM James Coleman wrote:
> > > A few weeks back I sent a bug report [1] directly to the -bugs mailing
> > > list, and I haven't seen any activity on it (maybe this is because I
> > > emailed directly instead of using the form?), but I got some time to
> > > take a look and concluded that a first-level fix is pretty simple.
> > >
> > > A quick background refresher: after promoting a standby, rewinding the
> > > former primary requires that a checkpoint have been completed on the
> > > new primary after promotion. This is correctly documented. However
> > > pg_rewind incorrectly reports to the user that a rewind isn't
> > > necessary because the source and target are on the same timeline.
> ...
> > > Attached is a patch that detects this condition and reports it as an
> > > error to the user.
>
> I have some random thoughts on this.
>
> There could be a problem in the case of a gracefully shut-down
> old primary, so I think it is worth doing something if it can be done in a
> simple way.
>
> However, I don't think we can simply rely on minRecoveryPoint to
> detect that situation, since it won't be reset on a standby. A standby
> also still can be the upstream of a cascading standby. So, as
> discussed in the thread for the comment [2], what we can do here would be
> simply waiting for the timelineID to advance, maybe having a timeout.

To confirm I'm following you correctly, you're envisioning a situation like:

- Primary A
- Replica B replicating from primary
- Replica C replicating from replica B

then on failover from A to B you end up with:

- Primary B
- Replica C replicating from primary
- [needs rewind] A

and you try to rewind A from C as the source?

> In a case of single-step replication set, a checkpoint request to the
> primary makes the end-of-recovery checkpoint fast.
> It won't work as expected in cascading replicas, but it might be acceptable.

"Won't work as expected" because there's no way to guarantee replication is caught up or even advancing?

> > > In the spirit of the new-ish "ensure shutdown" functionality I could
> > > imagine extending this to automatically issue a checkpoint when this
> > > situation is detected. I haven't started to code that up, however,
> > > wanting to first get buy-in on that.
> > >
> > > 1: https://www.postgresql.org/message-id/CAAaqYe8b2DBbooTprY4v=bized9qbqvlq+fd9j617eqfjk1...@mail.gmail.com
> >
> > Thanks. I had a quick look over the issue and patch - just a thought -
> > can't we let pg_rewind issue a checkpoint on the new primary instead
> > of erroring out, maybe optionally? It might sound too much, but it helps
> > pg_rewind to be self-reliant, i.e. avoiding an external actor having to detect
> > the error and issue a checkpoint on the new primary to be able to
> > successfully run pg_rewind on the old primary and repair it to use it
> > as a new standby.
>
> At the time of the discussion [2], the hindrance was that it required
> superuser privileges. Now that has been narrowed down
> to the pg_checkpointer privilege.
>
> If we know that the timeline IDs are different, we don't need to wait
> for a checkpoint.

Correct.

> It seems to me that the exit status is significant. pg_rewind exits
> with 1 when an invalid option is given. I don't think it is great if
> we report this state by the same code.

I'm happy to change that; I only chose "1" as a placeholder for "non-zero exit status".

> I don't think we always want to request a non-spreading checkpoint.

I'm not familiar with the terminology "non-spreading checkpoint".

> [2] https://www.postgresql.org/message-id/flat/CABUevEz5bpvbwVsYCaSMV80CBZ5-82nkMzbb%2BBu%3Dh1m%3DrLdn%3Dg%40mail.gmail.com

I read through that thread, and one interesting idea stuck out to me: making "timeline IDs are the same" an error exit status.
On the one hand that makes a certain amount of sense because it's unexpected. But on the other hand there are entirely legitimate situations where upon failover the timeline IDs happen to match (e.g., for us it happens some percentage of the time naturally, as we are using sync replication and failovers often involve STONITHing the original primary, so it's entirely possible that the promoted replica begins with exactly the same WAL ending LSN as the primary before it stopped).

Thanks, James Coleman
Re: pg_rewind: warn when checkpoint hasn't happened after promotion
On Sat, Jun 4, 2022 at 9:39 AM Bharath Rupireddy wrote:
> On Sat, Jun 4, 2022 at 6:29 PM James Coleman wrote:
> > A few weeks back I sent a bug report [1] directly to the -bugs mailing
> > list, and I haven't seen any activity on it (maybe this is because I
> > emailed directly instead of using the form?), but I got some time to
> > take a look and concluded that a first-level fix is pretty simple.
> >
> > A quick background refresher: after promoting a standby, rewinding the
> > former primary requires that a checkpoint have been completed on the
> > new primary after promotion. This is correctly documented. However
> > pg_rewind incorrectly reports to the user that a rewind isn't
> > necessary because the source and target are on the same timeline.
> >
> > Specifically, this happens when the control file on the newly promoted
> > server looks like:
> >
> > Latest checkpoint's TimeLineID: 4
> > Latest checkpoint's PrevTimeLineID: 4
> > ...
> > Min recovery ending loc's timeline: 5
> >
> > Attached is a patch that detects this condition and reports it as an
> > error to the user.
> >
> > In the spirit of the new-ish "ensure shutdown" functionality I could
> > imagine extending this to automatically issue a checkpoint when this
> > situation is detected. I haven't started to code that up, however,
> > wanting to first get buy-in on that.
> >
> > 1: https://www.postgresql.org/message-id/CAAaqYe8b2DBbooTprY4v=bized9qbqvlq+fd9j617eqfjk1...@mail.gmail.com
>
> Thanks. I had a quick look over the issue and patch - just a thought -
> can't we let pg_rewind issue a checkpoint on the new primary instead
> of erroring out, maybe optionally? It might sound too much, but it helps
> pg_rewind to be self-reliant, i.e. avoiding an external actor having to detect
> the error and issue a checkpoint on the new primary to be able to
> successfully run pg_rewind on the old primary and repair it to use it
> as a new standby.
That's what I had suggested as a "further improvement" option in the last paragraph :) But I think agreement on this more basic solution would still be good (even if I add the automatic checkpointing in this thread); given we currently explicitly mis-inform the user of pg_rewind, I think this is a bug that should be considered for backpatching, and the simpler "fail if detected" patch is probably the only thing we could backpatch. Thanks for taking a look, James Coleman
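The mis-detection under discussion can be made concrete with a small sketch. This is an illustrative Python model, not pg_rewind's actual code (the field names and return strings are assumptions): after promotion but before the first checkpoint, the control file's latest-checkpoint timeline still shows the old value while the minimum-recovery-point timeline already shows the new one, so a check based only on the checkpoint timeline wrongly concludes no rewind is needed.

```python
from dataclasses import dataclass

@dataclass
class ControlFile:
    checkpoint_tli: int    # "Latest checkpoint's TimeLineID"
    min_recovery_tli: int  # "Min recovery ending loc's timeline"

def rewind_decision(target: ControlFile, source: ControlFile) -> str:
    # Promotion bumps min_recovery_tli immediately, but checkpoint_tli
    # only advances at the first post-promotion checkpoint -- that is
    # the window the proposed patch detects and reports.
    if source.min_recovery_tli > source.checkpoint_tli:
        return "error: source was promoted but no checkpoint has completed yet"
    if source.checkpoint_tli == target.checkpoint_tli:
        return "no rewind required: source and target are on the same timeline"
    return "rewind required"

# New primary right after promotion (checkpoint TLI 4, min recovery TLI 5),
# matching the control-file excerpt in the report above.
new_primary = ControlFile(checkpoint_tli=4, min_recovery_tli=5)
old_primary = ControlFile(checkpoint_tli=4, min_recovery_tli=4)
print(rewind_decision(old_primary, new_primary))
```

Without the first condition, the sketch falls through to the equal-timeline branch and reports "no rewind required" — exactly the misleading answer the patch aims to replace with an error.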
RE: Multi-Master Logical Replication
Dear hackers,

I found another use-case for LRG. It might be helpful for migration.

LRG for migration
--
LRG may be helpful for machine migration, OS upgrades, or upgrades of PostgreSQL itself. Assume that users want to migrate a database to another environment, e.g., PG16 on RHEL7 to PG18 on RHEL8. Users must copy all data into the new server and catch up on all changes. In this case streaming replication cannot be used because it requires the same OS and the same PostgreSQL major version. Moreover, it is desirable to be able to return to the original environment at any time in case of application or other environmental deficiencies.

Operation steps with LRG
--
LRG is appropriate for this situation. The following workflow is what users must do:

1. Copy the table definitions to the newer node (PG18), via pg_dump/pg_restore
2. Execute lrg_create() in the older node (PG16)
3. Execute lrg_node_attach() in PG18
=== data will be shared here ===
4. Change the connection of the user application to PG18
5. Check whether any ERROR is raised. If some ERRORs are raised, users can change the connection back to PG16.
6. Remove the created node group if the application works well.

These operations may reduce system downtime due to incompatibilities associated with version upgrades.

Best Regards, Hayato Kuroda FUJITSU LIMITED
Sudden database error with COUNT(*) making the Query Planner crash: variable not found in subplan target list
Dear PostgreSQL Hacker Community,

I am facing a tricky bug which makes the Query Planner crash when using the COUNT(*) function. Without any upgrade, suddenly a table of a database instance could not be queried this way:

SELECT COUNT(*) FROM items;
-- ERROR: variable not found in subplan target list
-- SQL state: XX000

Message and behaviour seem related to the Query Planner:

EXPLAIN SELECT COUNT(*) FROM items;
-- ERROR: variable not found in subplan target list
-- SQL state: XX000

Looks like a column name could not be found (see https://github.com/postgres/postgres/blob/ce4f46fdc814eb1b704d81640f6d8f03625d0f53/src/backend/optimizer/plan/setrefs.c#L2967-L2972) in some specific context that is somehow hard to reproduce. Interesting facts:

SELECT COUNT(id) FROM items; -- 213
SELECT COUNT(*) FROM items WHERE id > 0; -- 213

work as expected. I can see that other people are recently facing a similar problem (https://www.postgresql.org/message-id/flat/4c347490-d734-5fdd-d613-1327601b4e7e%40mit.edu). If it is the same bug then it is not related to the PGroonga extension, as I don't use it at all.

Anyway, the bug is difficult to reproduce in my application. At the time of writing, I could only isolate it on a specific database and I could not draw an MCVE from it. I am looking for help to make it reproducible and feed your knowledge base. My first guess was to open a post on SO (see for details https://stackoverflow.com/questions/72498741/how-can-i-reproduce-a-database-context-to-debug-a-tricky-postgresql-error-vari), but digging deeper into the investigation it seems it will require people with strong insight into how PostgreSQL actually works under the hood. Therefore, I chose this specific mailing list.

The bug is tricky to reproduce; I could not manage to replicate it elsewhere (dump/restore does not preserve it). Anyway, it makes my database unusable and looks like a potential bug for your product and applications relying on it.
Faulty setup is about:

SELECT version();
-- PostgreSQL 13.6 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit

SELECT extname, extversion FROM pg_extension;
-- "plpgsql"  "1.0"
-- "postgis"  "3.1.1"

By now, the only workarounds I have found are:
* Dump the database and recreate a new instance (the problem seems to vanish, but there is no guarantee it is solved or that it will not happen later on);
* Add a dummy filter on all queries (more a trick than a solution).

I am writing to this mailing list to bring it to your attention. I'll be happy to help you investigate it deeper.

Best regards, Landercy Jean
RE: Handle infinite recursion in logical replication setup
On Mon, Jun 6, 2022 1:14 PM vignesh C wrote:
>
> The attached v18 patch has the fixes for the same.

Thanks for updating the patch, here are some comments.

0002 patch
==
1.
+
+origin (string)
+
+
It may be better if the type of the "origin" parameter were an enum, as it cannot be any string and only has two valid values.

2.
@@ -607,6 +626,11 @@ CreateSubscription(ParseState *pstate, CreateSubscriptionStmt *stmt,
 LOGICALREP_TWOPHASE_STATE_PENDING :
 LOGICALREP_TWOPHASE_STATE_DISABLED);
 values[Anum_pg_subscription_subdisableonerr - 1] = BoolGetDatum(opts.disableonerr);
+ if (opts.origin)
+ values[Anum_pg_subscription_suborigin - 1] =
+ CStringGetTextDatum(opts.origin);
+ else
+ nulls[Anum_pg_subscription_suborigin - 1] = true;
 values[Anum_pg_subscription_subconninfo - 1] = CStringGetTextDatum(conninfo);
 if (opts.slot_name)

The documentation for "CREATE SUBSCRIPTION" says the default value of "origin" is "any", so why not set suborigin to "any" when the user doesn't specify this parameter?

0003 patch
==
1.
@@ -300,6 +310,11 @@ CREATE SUBSCRIPTION subscription_namefalse.
+
+ There is some interaction between the origin
+ parameter and copy_data parameter. Refer to the
+ for details.
+

I think this change should be put together with the "origin" parameter, instead of "disable_on_error".

0004 patch
==
1.
+
+Now the bidirectional logical replication setup is complete between
+node1, node2 and
+node2. Any subsequent changes in one node will
+replicate the changes to the other nodes.
+

I think "node1, node2 and node2" should be "node1, node2 and node3".

Regards, Shi yu
Re: [RFC] building postgres with meson -v8
On Tue, May 24, 2022 at 08:08:26PM +0200, Peter Eisentraut wrote:
> On 18.05.22 21:48, Andres Freund wrote:
>> - GETTIMEOFDAY_1ARG - test doesn't exist - I suspect it might not be
>> necessary
>
> Might be obsolete, consider removing.

I just came across this one independently of what you are doing for meson, and based on a lookup of the buildfarm, I think that it can be removed. One reference about GETTIMEOFDAY_1ARG on the -hackers list comes from here, from 20 years ago:
https://www.postgresql.org/message-id/a1eeu5$1koe$1...@news.tht.net
-- Michael
Re: Ignoring BRIN for HOT updates seems broken
On Mon, Jun 06, 2022 at 09:08:08AM +0200, Tomas Vondra wrote:
> Attached is a patch reverting both commits (5753d4ee32 and fe60b67250).
> This changes the IndexAmRoutine struct, so it's an ABI break. That's not
> great post-beta :-( In principle we might also leave amhotblocking in
> the struct but ignore it in the code (and treat it as false), but that
> seems weird and it's going to be a pain when backpatching. Opinions?

I don't think that you need to worry about ABI breakages now in beta, because that's the period of time where we can still change things and shape the code in its best way for prime time. It depends on the change, of course, but what you are doing, by removing the field, looks right to me here.
-- Michael
Re: should check interrupts in BuildRelationExtStatistics ?
On Sat, Jun 04, 2022 at 08:42:33PM -0500, Justin Pryzby wrote:
> The fix seems to be to CHECK_FOR_INTERRUPTS() within multi_sort_compare().
> That would supercede the other two CHECK_FOR_INTERRUPTS I'd proposed, and
> handle mcv, depends, and ndistinct all at once.

Hmm. I have to admit that adding a CFI() in multi_sort_compare() stresses me a bit as it is dependent on the number of rows involved, and it can be used as a qsort routine.
-- Michael
Re: Ignoring BRIN for HOT updates seems broken
On 6/1/22 22:38, Robert Haas wrote: > On Sat, May 28, 2022 at 4:51 PM Tomas Vondra > wrote: >> Yeah, I think that might/should work. We could still create the HOT >> chain, but we'd have to update the BRIN indexes. But that seems like a >> fairly complicated change to be done this late for PG15. > > Yeah, I think a revert is better for now. But I agree that the basic > idea seems salvageable. I think that the commit message is correct > when it states that "When determining whether an index update may be > skipped by using HOT, we can ignore attributes indexed only by BRIN > indexes." However, that doesn't mean that we can ignore the need to > update those indexes. In that regard, the commit message makes it > sound like all is well, because it states that "the page range summary > will be updated anyway" which reads to me like the indexes are in fact > getting updated. Your example, however, seems to show that the indexes > are not getting updated. > Yeah, agreed :-( I agree we can probably salvage some of the idea, but it's far too late for major reworks in PG15. Attached is a patch reverting both commits (5753d4ee32 and fe60b67250). This changes the IndexAmRoutine struct, so it's an ABI break. That's not great post-beta :-( In principle we might also leave amhotblocking in the struct but ignore it in the code (and treat it as false), but that seems weird and it's going to be a pain when backpatching. Opinions? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL CompanyFrom 22911ff0df0284bd9c97d4971ab10332444c528d Mon Sep 17 00:00:00 2001 From: Tomas Vondra Date: Mon, 6 Jun 2022 08:21:36 +0200 Subject: [PATCH] Revert changes in HOT handling of BRIN indexes This reverts commits 5753d4ee32 and fe60b67250 that modified HOT to ignore BRIN indexes. The commit message for 5753d4ee32 claims that: When determining whether an index update may be skipped by using HOT, we can ignore attributes indexed only by BRIN indexes. 
There are no index pointers to individual tuples in BRIN, and the page range summary will be updated anyway as it relies on visibility info.

This is partially incorrect - it's true BRIN indexes don't point to individual tuples, so HOT chains are not an issue, but the visibility info is not sufficient to keep the index up to date. This can easily result in corrupted indexes, as demonstrated in the hackers thread.

This does not mean relaxing the HOT restrictions for BRIN is a lost cause, but it needs to handle the two aspects (allowing HOT chains and updating the page range summaries) separately. But that requires major changes, and it's too late for that in the current dev cycle.

Reported-by: Tomas Vondra
Discussion: https://postgr.es/m/05ebcb44-f383-86e3-4f31-0a97a5563...@enterprisedb.com
---
 doc/src/sgml/indexam.sgml                        |  11 -
 src/backend/access/brin/brin.c                   |   1 -
 src/backend/access/gin/ginutil.c                 |   1 -
 src/backend/access/gist/gist.c                   |   1 -
 src/backend/access/hash/hash.c                   |   1 -
 src/backend/access/heap/heapam.c                 |   2 +-
 src/backend/access/nbtree/nbtree.c               |   1 -
 src/backend/access/spgist/spgutils.c             |   1 -
 src/backend/utils/cache/relcache.c               |  53 ++--
 src/include/access/amapi.h                       |   2 -
 src/include/utils/rel.h                          |   3 +-
 src/include/utils/relcache.h                     |   4 +-
 .../modules/dummy_index_am/dummy_index_am.c      |   1 -
 src/test/regress/expected/brin.out               |  58 -
 src/test/regress/expected/stats.out              | 242 --
 src/test/regress/sql/brin.sql                    |  36 ---
 src/test/regress/sql/stats.sql                   | 111
 17 files changed, 27 insertions(+), 502 deletions(-)

diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index d4163c96e9f..cf359fa9ffd 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -126,8 +126,6 @@ typedef struct IndexAmRoutine
     bool        amcaninclude;
     /* does AM use maintenance_work_mem? */
     bool        amusemaintenanceworkmem;
-    /* does AM block HOT update? */
-    bool        amhotblocking;
     /* OR of parallel vacuum flags */
     uint8       amparallelvacuumoptions;
     /* type of data stored in index, or InvalidOid if variable */
@@ -248,15 +246,6 @@ typedef struct IndexAmRoutine
     null, independently of amoptionalkey.

-    The amhotblocking flag indicates whether the
-    access method blocks HOT when an indexed attribute is
-    updated. Access methods without pointers to individual tuples (like
-    BRIN) may allow HOT even in this
-    case. This does not apply to attributes referenced in index predicates;
-    an update of such attribute always disables HOT.
-
diff --git a/src/backend/access/brin/brin.c
index 0de1441dc6d..e88f7efa7e4 100644
---
Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)
Hi,

Here is the updated patch, which fixes the previous comments discussed in this thread. I am sorry for the long gap in the discussion. Kindly let me know if I have missed any of the comments or anything new.

Thanks & Regards, Nitin Jadhav

On Fri, Mar 18, 2022 at 4:52 PM Nitin Jadhav wrote:
> > > > > I don't get it. The checkpoint flags and the view flags (set by
> > > > pgstat_progrss_update*) are different, so why can't we add this flag to the
> > > > view flags? The fact that checkpointer.c doesn't update the passed flag and
> > > > instead look in the shmem to see if CHECKPOINT_IMMEDIATE has been set since is
> > > > an implementation detail, and the view shouldn't focus on which flags were
> > > > initially passed to the checkpointer but instead which flags the checkpointer
> > > > is actually enforcing, as that's what the user should be interested in. If you
> > > > want to store it in another field internally but display it in the view with
> > > > the rest of the flags, I'm fine with it.
> > >
> > > Just to be in sync with the way code behaves, it is better not to
> > > update the next checkpoint request's CHECKPOINT_IMMEDIATE with the
> > > current checkpoint 'flags' field. Because the current checkpoint
> > > starts with a different set of flags and when there is a new request
> > > (with CHECKPOINT_IMMEDIATE), it just processes the pending operations
> > > quickly to take up next requests. If we update this information in the
> > > 'flags' field of the view, it says that the current checkpoint is
> > > started with CHECKPOINT_IMMEDIATE which is not true.
> >
> > Which is why I suggested to only take into account CHECKPOINT_REQUESTED (to
> > be able to display that a new checkpoint was requested)
>
> I will take care in the next patch.
>
> > > Hence I had
> > > thought of adding a new field ('next flags' or 'upcoming flags') which
> > > contain all the flag values of new checkpoint requests.
> > > This field
> > > indicates whether the current checkpoint is throttled or not and also
> > > it indicates there are new requests.
> >
> > I'm not opposed to having such a field, I'm opposed to having a view with "the
> > current checkpoint is throttled but if there are some flags in the next
> > checkpoint flags and those flags contain checkpoint immediate then the current
> > checkpoint isn't actually throttled anymore" behavior.
>
> I understand your point and I also agree that it becomes difficult for
> the user to understand the context.
>
> > and
> > CHECKPOINT_IMMEDIATE, to be able to display that the current checkpoint isn't
> > throttled anymore if it were.
> >
> > I still don't understand why you want so much to display "how the checkpoint
> > was initially started" rather than "how the checkpoint is really behaving right
> > now". The whole point of having a progress view is to have something dynamic
> > that reflects the current activity.
>
> As of now I will not consider adding this information to the view. If
> required and nobody opposes having this included in the 'flags' field
> of the view, then I will consider adding.
>
> Thanks & Regards,
> Nitin Jadhav
>
> On Mon, Mar 14, 2022 at 5:16 PM Julien Rouhaud wrote:
> >
> > On Mon, Mar 14, 2022 at 03:16:50PM +0530, Nitin Jadhav wrote:
> > > > > I am not suggesting
> > > > > removing the existing 'flags' field of pg_stat_progress_checkpoint
> > > > > view and adding a new field 'throttled'. The content of the 'flags'
> > > > > field remains the same. I was suggesting replacing the 'next_flags'
> > > > > field with 'throttled' field since the new request with
> > > > > CHECKPOINT_IMMEDIATE flag enabled will affect the current checkpoint.
> > > >
> > > > Are you saying that this new throttled flag will only be set by the overloaded
> > > > flags in ckpt_flags?
> > >
> > > Yes, you are right.
> > >
> > > > So you can have a checkpoint with a CHECKPOINT_IMMEDIATE
> > > > flag that's throttled, and a checkpoint without the CHECKPOINT_IMMEDIATE flag
> > > > that's not throttled?
> > >
> > > I think it's the reverse. A checkpoint with a CHECKPOINT_IMMEDIATE
> > > flag that's not throttled (disables delays between writes) and a
> > > checkpoint without the CHECKPOINT_IMMEDIATE flag that's throttled
> > > (enables delays between writes)
> >
> > Yes that's how it's supposed to work, but my point was that your suggested
> > 'throttled' flag could say the opposite, which is bad.
> >
> > > > I don't get it. The checkpoint flags and the view flags (set by
> > > > pgstat_progrss_update*) are different, so why can't we add this flag to the
> > > > view flags? The fact that checkpointer.c doesn't update the passed flag and
> > > > instead look in the shmem to see if CHECKPOINT_IMMEDIATE has been set since is
> > > > an implementation detail, and the view shouldn't focus on which flags
> > > > were
> > > > initially
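The flag semantics being debated above can be pinned down with a tiny sketch. The constants here are illustrative stand-ins (the real flag bits live in PostgreSQL's xlog.h and may differ), and `effective_throttled` is a hypothetical name for the behavior Julien argues the view should expose:

```python
# Illustrative stand-ins for checkpoint request flag bits (NOT the real
# values from PostgreSQL's xlog.h).
CHECKPOINT_IMMEDIATE = 0x0004
CHECKPOINT_REQUESTED = 0x0040

def is_throttled(flags: int) -> bool:
    # A checkpoint WITHOUT CHECKPOINT_IMMEDIATE is throttled (delays
    # between writes); CHECKPOINT_IMMEDIATE disables those delays.
    return not (flags & CHECKPOINT_IMMEDIATE)

def effective_throttled(current_flags: int, pending_flags: int) -> bool:
    # The dynamic behavior discussed above: a newly queued request carrying
    # CHECKPOINT_IMMEDIATE makes the *current* checkpoint stop throttling,
    # so a view reporting only current_flags would be misleading.
    return is_throttled(current_flags) and not (pending_flags & CHECKPOINT_IMMEDIATE)

print(is_throttled(0))                               # started throttled
print(effective_throttled(0, CHECKPOINT_IMMEDIATE))  # no longer throttled
```

This is the crux of the disagreement: `is_throttled` describes how the checkpoint was started, while `effective_throttled` describes how it is behaving right now once a pending immediate request is taken into account.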