Re: Document NULL

2024-05-02 Thread Kashif Zeeshan
Hi David

I reviewed the documentation and it's very detailed.

Thanks
Kashif Zeeshan
Bitnine Global

On Thu, May 2, 2024 at 8:24 PM David G. Johnston 
wrote:

> On Wed, May 1, 2024 at 9:47 PM Tom Lane  wrote:
>
>> David Rowley  writes:
>> > Let's bash it into shape a bit more before going any further on actual
>> wording.
>>
>> FWIW, I want to push back on the idea of making it a tutorial section.
>> I too considered that, but in the end I think it's a better idea to
>> put it into the "main" docs, for two reasons:
>>
>>
> Version 2 attached.  Still a draft, focused on topic picking and overall
> structure.  Examples and links planned plus the usual semantic markup stuff.
>
> I chose to add a new sect1 in the user guide (The SQL Language) chapter,
> "Data".  Don't tell Robert.
>
> The "Data Basics" sub-section lets us readily slide this Chapter into the
> main flow and here the NULL discussion feels like a natural fit.  In
> hindsight, the lack of a Data chapter in a Database manual seems like an
> oversight.  One easily made because we assume if you are here you "know"
> what data is, but there is still stuff to be discussed, if nothing else to
> establish a common understanding between us and our users.
>
> David J.
>
>
>


Incorrect Assert in BufFileSize()?

2024-05-02 Thread David Rowley
I'm trying to figure out why BufFileSize() Asserts that file->fileset
isn't NULL, per 1a990b207.

The discussion for that commit is in [1], but I don't see any
explanation of the Assert in the discussion or commit message and
there's no comment explaining why it's there.

The code that comes after the Assert does not look at the fileset
field.  With the code as it is, it doesn't seem possible to get the
file size of a non-shared BufFile and I don't see any reason for that.

Should the Assert be checking file->files != NULL?

David

[1] 
https://postgr.es/m/CAH2-Wzn0ZNLZs3DhCYdLMv4xn1fnM8ugVHPvWz67dSUh1s_%3D2Q%40mail.gmail.com




Re: cataloguing NOT NULL constraints

2024-05-02 Thread Alexander Lakhin

02.05.2024 19:21, Alvaro Herrera wrote:


Now, you could claim that the standard doesn't mention
INCLUDING/EXCLUDING CONSTRAINTS, therefore since we have come up with
its definition then we should make it affect not-null constraints.
However, there's also this note:

   NOTE 520 — <table constraint definition>s, except for NOT NULL, are not
   included in CDi; <column constraint definition>s are effectively transformed
   to <table constraint definition>s and are thereby also excluded.

which is explicitly saying that not-null constraints are treated
differently; in essence, with INCLUDING CONSTRAINTS we choose to affect
the constraints that the standard says to ignore.


Thank you for the very detailed and convincing explanation!

Now I see what the last sentence here (from [1]) means:
INCLUDING CONSTRAINTS

    CHECK constraints will be copied. No distinction is made between
    column constraints and table constraints. _Not-null constraints are
    always copied to the new table._

(I hadn't paid enough attention to it, because this exact paragraph is
also present in previous versions...)

[1] https://www.postgresql.org/docs/devel/sql-createtable.html

Best regards,
Alexander




Re: Removing unneeded self joins

2024-05-02 Thread Andrei Lepikhov

On 5/3/24 06:19, Tom Lane wrote:

Alexander Korotkov  writes:

I would appreciate your review of this patchset, and review from Andrei as well.


I hate to say this ... but if we're still finding bugs this
basic in SJE, isn't it time to give up on it for v17?

I might feel better about it if there were any reason to think
these were the last major bugs.  But you have already committed
around twenty separate fixes for the original SJE patch, and
now here you come with several more; so it doesn't seem like
the defect rate has slowed materially.  There can be no doubt
whatever that the original patch was far from commit-ready.

I think we should revert SJE for v17 and do a thorough design
review before trying again in v18.

I need to say I don't see any evidence of bad design.
I think this feature follows the example of 2489d76 [1], 1349d27 [2],
and the partitionwise join features — we get some issues from time to
time, but their severity and frequency are significantly reduced.
First and foremost, this feature is highly isolated: like the PWJ
feature, you can simply disable (or just not enable) SJE, which
guarantees you will avoid the problems.
Secondly, this feature reflects the design decisions the optimiser has 
made before. It raises some questions: Do we really control the 
consistency of our paths and the plan tree? Maybe we hide our 
misunderstanding of its logic by extensively copying expression trees, 
sometimes without fundamental necessity. Perhaps the optimiser needs 
some abstraction layers or reconstruction to reduce the quickly growing 
complexity.
A good example here is [1]. IMO, the new promising feature it has 
introduced isn't worth the complexity it added to the planner.
SJE, much like OR <-> ANY transformation, introduces a fresh perspective 
into the planner: if we encounter a complex, redundant query, it may be 
more beneficial to invest in simplifying the internal query 
representation rather than adding new optimisations that will grapple 
with this complexity.
Also, SJE raised questions I've never seen before, like: Could we 
control the consistency of the PlannerInfo by changing something in the 
logic?
Considering the current state, I don't see any concrete outcomes or 
evidence that a redesign of the feature will lead us to a new path. 
However, I believe the presence of SJE in the core could potentially 
trigger improvements in the planner. As a result, I vote to stay with 
the feature. But remember, as an author, I'm not entirely objective, so 
let's wait for alternative opinions.


[1] Make Vars be outer-join-aware
[2] Improve performance of ORDER BY / DISTINCT aggregates

--
regards,
Andrei Lepikhov
Postgres Professional





improve performance of pg_dump with many sequences

2024-05-02 Thread Nathan Bossart
Similar to 'pg_dump --binary-upgrade' [0], we can speed up pg_dump with
many sequences by gathering the required information in a single query
instead of two queries per sequence.  The attached patches are
works-in-progress, but here are the results I see on my machine for
'pg_dump --schema-only --binary-upgrade' with a million sequences:

          HEAD : 6m22.809s
           [0] : 1m54.701s
[0] + attached : 0m38.233s

I'm not sure I have all the details correct in 0003, and we might want to
separate the table into two tables which are only populated when the
relevant section is dumped.  Furthermore, the query in 0003 is a bit goofy
because it needs to dance around a bug reported elsewhere [1].
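
For context, the single combined query is roughly of this shape (a sketch
only, not the exact query from the attached patches; pg_sequence has one
row per sequence):

SELECT s.seqrelid::regclass AS sequence_name,
       s.seqtypid::regtype  AS data_type,
       s.seqstart, s.seqincrement, s.seqmax, s.seqmin, s.seqcache, s.seqcycle
FROM pg_catalog.pg_sequence s
ORDER BY s.seqrelid;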

[0] https://postgr.es/m/20240418041712.GA3441570%40nathanxps13
[1] https://postgr.es/m/20240501005730.GA594666%40nathanxps13

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 32d1e994f31be002d4490f9239f819d6ee55cb36 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue, 30 Apr 2024 14:41:36 -0500
Subject: [PATCH v1 1/3] parse sequence information

---
 src/bin/pg_dump/pg_dump.c | 64 ---
 1 file changed, 26 insertions(+), 38 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 379debac24..b53c17aace 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -17566,18 +17566,16 @@ dumpSequence(Archive *fout, const TableInfo *tbinfo)
 {
 	DumpOptions *dopt = fout->dopt;
 	PGresult   *res;
-	char	   *startv,
-			   *incby,
-			   *maxv,
-			   *minv,
-			   *cache,
-			   *seqtype;
+	char		seqtype[10];
 	bool		cycled;
 	bool		is_ascending;
 	int64		default_minv,
-default_maxv;
-	char		bufm[32],
-bufx[32];
+default_maxv,
+minv,
+maxv,
+startv,
+incby,
+cache;
 	PQExpBuffer query = createPQExpBuffer();
 	PQExpBuffer delqry = createPQExpBuffer();
 	char	   *qseqname;
@@ -17619,16 +17617,21 @@ dumpSequence(Archive *fout, const TableInfo *tbinfo)
 		  PQntuples(res)),
  tbinfo->dobj.name, PQntuples(res));
 
-	seqtype = PQgetvalue(res, 0, 0);
-	startv = PQgetvalue(res, 0, 1);
-	incby = PQgetvalue(res, 0, 2);
-	maxv = PQgetvalue(res, 0, 3);
-	minv = PQgetvalue(res, 0, 4);
-	cache = PQgetvalue(res, 0, 5);
+	Assert(strlen(PQgetvalue(res, 0, 0)) < sizeof(seqtype));
+	strncpy(seqtype, PQgetvalue(res, 0, 0), sizeof(seqtype));
+	seqtype[sizeof(seqtype) - 1] = '\0';
+
+	startv = strtoi64(PQgetvalue(res, 0, 1), NULL, 10);
+	incby = strtoi64(PQgetvalue(res, 0, 2), NULL, 10);
+	maxv = strtoi64(PQgetvalue(res, 0, 3), NULL, 10);
+	minv = strtoi64(PQgetvalue(res, 0, 4), NULL, 10);
+	cache = strtoi64(PQgetvalue(res, 0, 5), NULL, 10);
 	cycled = (strcmp(PQgetvalue(res, 0, 6), "t") == 0);
 
+	PQclear(res);
+
 	/* Calculate default limits for a sequence of this type */
-	is_ascending = (incby[0] != '-');
+	is_ascending = (incby >= 0);
 	if (strcmp(seqtype, "smallint") == 0)
 	{
 		default_minv = is_ascending ? 1 : PG_INT16_MIN;
@@ -17650,19 +17653,6 @@ dumpSequence(Archive *fout, const TableInfo *tbinfo)
 		default_minv = default_maxv = 0;	/* keep compiler quiet */
 	}
 
-	/*
-	 * 64-bit strtol() isn't very portable, so convert the limits to strings
-	 * and compare that way.
-	 */
-	snprintf(bufm, sizeof(bufm), INT64_FORMAT, default_minv);
-	snprintf(bufx, sizeof(bufx), INT64_FORMAT, default_maxv);
-
-	/* Don't print minv/maxv if they match the respective default limit */
-	if (strcmp(minv, bufm) == 0)
-		minv = NULL;
-	if (strcmp(maxv, bufx) == 0)
-		maxv = NULL;
-
 	/*
 	 * Identity sequences are not to be dropped separately.
 	 */
@@ -17714,22 +17704,22 @@ dumpSequence(Archive *fout, const TableInfo *tbinfo)
 			appendPQExpBuffer(query, "AS %s\n", seqtype);
 	}
 
-	appendPQExpBuffer(query, "START WITH %s\n", startv);
+	appendPQExpBuffer(query, "START WITH " INT64_FORMAT "\n", startv);
 
-	appendPQExpBuffer(query, "INCREMENT BY %s\n", incby);
+	appendPQExpBuffer(query, "INCREMENT BY " INT64_FORMAT "\n", incby);
 
-	if (minv)
-		appendPQExpBuffer(query, "MINVALUE %s\n", minv);
+	if (minv != default_minv)
+		appendPQExpBuffer(query, "MINVALUE " INT64_FORMAT "\n", minv);
 	else
 		appendPQExpBufferStr(query, "NO MINVALUE\n");
 
-	if (maxv)
-		appendPQExpBuffer(query, "MAXVALUE %s\n", maxv);
+	if (maxv != default_maxv)
+		appendPQExpBuffer(query, "MAXVALUE " INT64_FORMAT "\n", maxv);
 	else
 		appendPQExpBufferStr(query, "NO MAXVALUE\n");
 
 	appendPQExpBuffer(query,
-	  "CACHE %s%s",
+	  "CACHE " INT64_FORMAT "%s",
 	  cache, (cycled ? "\nCYCLE" : ""));
 
 	if (tbinfo->is_identity_sequence)
@@ -17816,8 +17806,6 @@ dumpSequence(Archive *fout, const TableInfo *tbinfo)
 	 tbinfo->dobj.namespace->dobj.name, tbinfo->rolname,
 	 tbinfo->dobj.catId, 0, tbinfo->dobj.dumpId);
 
-	PQclear(res);
-
 	destroyPQExpBuffer(query);
 	destroyPQExpBuffer(delqry);
 	free(qseqname);
-- 
2.25.1

>From a58f6c33029659aef0a2139293da5ba9fd7f178f 

Re: allow changing autovacuum_max_workers without restarting

2024-05-02 Thread Nathan Bossart
On Mon, Apr 15, 2024 at 05:41:04PM +, Imseih (AWS), Sami wrote:
>> Another option could be to just remove the restart-only GUC and hard-code
>> the upper limit of autovacuum_max_workers to 64 or 128 or something. While
>> that would simplify matters, I suspect it would be hard to choose an
>> appropriate limit that won't quickly become outdated.
> 
> Hardcoded values are usually hard to deal with because they are hidden either
> in code or in docs.

That's true, but using a hard-coded limit means we no longer need to add a
new GUC.  Always allocating, say, 256 slots might require a few additional
kilobytes of shared memory, most of which will go unused, but that seems
unlikely to be a problem for the systems that will run Postgres v18.
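
(For reference, the user-visible point of the proposal is that something
like the following would take effect on a reload, whereas today changing
autovacuum_max_workers requires a server restart:)

ALTER SYSTEM SET autovacuum_max_workers = 16;
SELECT pg_reload_conf();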

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Weird "null" errors during DROP TYPE (pg_upgrade)

2024-05-02 Thread Tom Lane
Devrim Gündüz  writes:
> pg_upgrade from v15 to v16 failed in an environment. Often we get a
> reasonable message, but this time it was a bit weird. First, error
> message:

It seems like the source database must have been in quite a corrupt
state already --- here we have the array type _packagestoptemp, but
there's apparently no underlying type packagestoptemp, because
format_type seems to have failed for it:

> ELEMENT = ???,

Can you show a way to get into this state without manual catalog
hacking?

> ===
> DROP TYPE "foobar"."_packagestoptemp";
> ERROR:  cannot drop (null) because (null) requires it
> HINT:  You can drop (null) instead.
> ===

> What do these "null" mean here? Any hints?,

Probably some catalog lookup function failed and returned NULL,
and then sprintf decided to print "(null)" instead of dumping
core (cf 3779ac62d).  This is more evidence in favor of catalog
corruption, but it doesn't tell us exactly what kind.

Maybe reindexing pg_type would improve matters?
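
For instance, something along these lines could show whether the element
type's row is present at all, before and after the reindex (a diagnostic
sketch; the schema name follows the report):

SELECT oid, typname, typnamespace::regnamespace AS nsp, typelem, typarray
FROM pg_catalog.pg_type
WHERE typname IN ('packagestoptemp', '_packagestoptemp');

REINDEX TABLE pg_catalog.pg_type;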

regards, tom lane




Re: Support LIKE with nondeterministic collations

2024-05-02 Thread Robert Haas
On Thu, May 2, 2024 at 9:38 AM Peter Eisentraut  wrote:
> On 30.04.24 14:39, Daniel Verite wrote:
> > postgres=# SELECT '.foo.' like '_oo' COLLATE ign_punct;
> >  ?column?
> > ----------
> >  f
> > (1 row)
> >
> > The first two results look fine, but the next one is inconsistent.
>
> This is correct, because '_' means "any single character".  This is
> independent of the collation.

Seems really counterintuitive. I had to think for a long time to be
able to guess what was happening here. Finally I came up with this
guess:

If the collation-aware matching tries to match up f with the initial
period, the period is skipped and the f matches f. But when the
wildcard is matched to the initial period, that uses up the wildcard
and then we're left trying to match o with f, which doesn't work.

Is that right?

It'd probably be good to use something like this as an example in the
documentation. My intuition is that if foo matches a string, then _oo,
f_o, and fo_ should also match that string. Apparently that's not the
case, but I doubt I'll be the last one who thinks it should be.
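
For anyone trying to reproduce this: the collation definition isn't
repeated in this message. One plausible setup, with the proposed LIKE
patch applied, might be along these lines (an assumption, not necessarily
the exact definition used upstream):

CREATE COLLATION ign_punct
  (provider = icu, locale = 'und-u-ka-shifted', deterministic = false);

SELECT '.foo.' = 'foo'    COLLATE ign_punct;   -- t, punctuation is ignored
SELECT '.foo.' LIKE 'foo' COLLATE ign_punct;   -- t
SELECT '.foo.' LIKE '_oo' COLLATE ign_punct;   -- f, the surprising case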

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Proposal: Early providing of PGDG repositories for the major Linux distributions like Fedora or Debian

2024-05-02 Thread Devrim Gündüz
Hi,

On Wed, 2024-04-24 at 11:02 +, Hans Buschmann wrote:
> Today (24.4.2024) I upgraded my laptop to Fedora 40, but there were
> no repositories available, so I ended up with a mixed Fedora 40 and
> Fedora 39 installation.

This was caused by an unexpected and very long network outage that
blocked my access to the build instances. I released packages last
Thursday (2 days after F40 was released, which is 26 April 2024)

I sent emails to the RPM mailing list about the issue:

https://www.postgresql.org/message-id/aec36aec623741ae314692b318c890c646498ca6.camel%40gunduz.org
https://www.postgresql.org/message-id/1fe99b0def5d7539939421fa5b35db2c8f2a40ad.camel%40gunduz.org
https://www.postgresql.org/message-id/3a1b0f58673d35fae9979ed2b149972195c7b8bc.camel%40gunduz.org

> To mitigate this situation, I propose to provide these repositories
> much earlier in the beta phase of the distributions:

This is what I do for the Fedora releases. I'm sure you've noticed that
in the past.

Regards,

-- 
Devrim Gündüz
Open Source Solution Architect, PostgreSQL Major Contributor
Twitter: @DevrimGunduz , @DevrimGunduzTR




Re: HEAD build error on Fedora 39

2024-05-02 Thread Devrim Gündüz
Hi Andrew,

On Mon, 2024-04-15 at 06:35 -0400, Andrew Dunstan wrote:
> It's working on my Fedora 39. This error suggests to me that you don't
> have docbook-dtds installed. If you do, then I don't know :-)

Sorry for the noise. I got a new laptop, and apparently some of the
packages (like this) were missing.

Regards,
-- 
Devrim Gündüz
Open Source Solution Architect, PostgreSQL Major Contributor
Twitter: @DevrimGunduz , @DevrimGunduzTR




Weird "null" errors during DROP TYPE (pg_upgrade)

2024-05-02 Thread Devrim Gündüz


Hi,

pg_upgrade from v15 to v16 failed in an environment. Often we get a
reasonable message, but this time it was a bit weird. First, error
message:

=
pg_restore: creating TYPE "foobar._packagestoptemp"
pg_restore: while PROCESSING TOC:
pg_restore: from TOC entry 2418; 1247 20529 TYPE _packagestoptemp developer
pg_restore: error: could not execute query: ERROR:  type "_packagestoptemp" does not exist
HINT:  Create the type as a shell type, then create its I/O functions, then do a full CREATE TYPE.
Command was: 
-- For binary upgrade, must preserve pg_type oid
SELECT pg_catalog.binary_upgrade_set_next_pg_type_oid('20529'::pg_catalog.oid);

CREATE TYPE "foobar"."_packagestoptemp" (
INTERNALLENGTH = variable,
INPUT = "array_in",
OUTPUT = "array_out",
RECEIVE = "array_recv",
SEND = "array_send",
ANALYZE = "array_typanalyze",
SUBSCRIPT = "array_subscript_handler",
ELEMENT = ???,
CATEGORY = 'A',
ALIGNMENT = double,
STORAGE = extended
);
=

We noticed that this data type was actually not used, so decided to drop it:

===
DROP TYPE "foobar"."_packagestoptemp";
ERROR:  cannot drop (null) because (null) requires it
HINT:  You can drop (null) instead.

===

What do these "null" mean here? Any hints?,

Thanks!

Regards,
-- 
Devrim Gündüz
Open Source Solution Architect, PostgreSQL Major Contributor
Twitter: @DevrimGunduz , @DevrimGunduzTR




Re: [PATCH] json_lex_string: don't overread on bad UTF8

2024-05-02 Thread Jacob Champion
On Wed, May 1, 2024 at 8:40 PM Michael Paquier  wrote:
>
> On Thu, May 02, 2024 at 11:23:13AM +0900, Michael Paquier wrote:
> > About the fact that we may finish by printing unfinished UTF-8
> > sequences, I'd be curious to hear your thoughts.  Now, the information
> > provided about the partial byte sequences can be also useful for
> > debugging on top of having the error code, no?

Yes, but which information do you want? Do you want to know the bad
byte sequence, or see the glyph that corresponds to it (which is
probably �)? The glyph is better as long as it's complete; if it's a
bad sequence, then maybe you'd prefer to know the particular byte, but
that assumes a lot of technical knowledge on the part of whoever's
reading the message.

> By the way, as long as I have that in mind..  I am not sure that it is
> worth spending cycles in detecting the unfinished sequences and make
> these printable.  Wouldn't it be enough for more cases to adjust
> token_error() to truncate the byte sequences we cannot print?

Maybe. I'm beginning to wonder if I'm overthinking this particular
problem, and if we should just go ahead and print the bad sequence. At
least for the case of UTF-8 console encoding, replacement glyphs will
show up as needed.

There is the matter of a client that's not using UTF-8, though. Do we
deal with that correctly today? (I understand why it was done the way
it was, at least on the server side, but it's still really weird to
have code that parses "JSON" that isn't actually Unicode.)

> Another thing that I think would be nice would be to calculate the
> location of what we're parsing on a given line, and provide that in
> the error context.  That would not be backpatchable as it requires a
> change in JsonLexContext, unfortunately, but it would help in making
> more sense with an error if the incomplete byte sequence is at the
> beginning of a token or after an expected character.

+1, at least that way you can skip directly to the broken spot during
a postmortem.

Thanks,
--Jacob




Re: Removing unneeded self joins

2024-05-02 Thread Tom Lane
Alexander Korotkov  writes:
> I would appreciate your review of this patchset, and review from Andrei as 
> well.

I hate to say this ... but if we're still finding bugs this
basic in SJE, isn't it time to give up on it for v17?

I might feel better about it if there were any reason to think
these were the last major bugs.  But you have already committed
around twenty separate fixes for the original SJE patch, and
now here you come with several more; so it doesn't seem like
the defect rate has slowed materially.  There can be no doubt
whatever that the original patch was far from commit-ready.

I think we should revert SJE for v17 and do a thorough design
review before trying again in v18.

regards, tom lane




Re: Removing unneeded self joins

2024-05-02 Thread Alexander Korotkov
Hi, Richard!

On Thu, May 2, 2024 at 4:14 PM Richard Guo  wrote:
> On Thu, May 2, 2024 at 6:08 PM Alexander Korotkov  
> wrote:
>> On Thu, May 2, 2024 at 12:45 PM Andrei Lepikhov
>>  wrote:
>> > One question for me is: Do we anticipate other lateral self-references
>> > except the TABLESAMPLE case? Looking into the extract_lateral_references
>> > implementation, I see the only RTE_SUBQUERY case to be afraid of. But we
>> > pull up subqueries before extracting lateral references. So, if we have
>> > a reference to a subquery, it means we will not flatten this subquery
>> > and don't execute SJE. Do we need more code, as you have written in the
>> > first patch?
>>
>> I think my first patch was crap anyway.  Your explanation seems
>> reasonable to me.  I'm not sure this requires any more code.  Probably
>> it would be enough to add more comments about this.
>
>
> The tablesample case is not the only factor that can cause a relation to
> have a lateral dependency on itself after self-join removal.  It can
> also happen with PHVs.  As an example, consider
>
> explain (costs off)
> select * from t t1
> left join lateral
>   (select t1.a as t1a, * from t t2) t2
> on true
> where t1.a = t2.a;
> server closed the connection unexpectedly
>
> This is because after self-join removal, a PlaceHolderInfo's ph_lateral
> might contain rels mentioned in ph_eval_at, which we should get rid of.
>
> For the tablesample case, I agree that we should not consider relations
> with TABLESAMPLE clauses as candidates to be removed.  Removing such a
> relation could potentially change the syntax of the query, as shown by
> Alexander's example.  It seems to me that we can just check that in
> remove_self_joins_recurse, while we're collecting the base relations
> that are considered to be candidates for removal.
>
> This leads to the attached patch.  This patch also includes some code
> refactoring for the surrounding code.

Great, thank you for your work on this!

I'd like to split this into separate patches for better granularity of
git history.  I also added 0001 patch, which makes first usage of the
SJE acronym in file to come with disambiguation.  Also, I've added
assert that ph_lateral and ph_eval_at didn't overlap before the
changes.  I think this should help from the potential situation when
the changes we do could mask another bug.

I would appreciate your review of this patchset, and review from Andrei as well.

--
Regards,
Alexander Korotkov
Supabase


v2-0002-Minor-refactoring-for-self-join-elimination-code.patch
Description: Binary data


v2-0003-Forbid-self-join-elimination-on-table-sampling-sc.patch
Description: Binary data


v2-0004-Fix-self-join-elimination-work-with-PlaceHolderIn.patch
Description: Binary data


v2-0001-Clarify-the-SJE-self-join-elimination-acronym.patch
Description: Binary data


Re: enhance the efficiency of migrating particularly large tables

2024-05-02 Thread David Rowley
On Fri, 3 May 2024 at 09:33, David Zhang  wrote:
> Is there a simple way to get the min of ctid faster than using min(), but 
> similar to get the max of ctid using pg_relation_size?

The equivalent approximation is always '(0,1)'.
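
For illustration, a minimal sketch of using both approximations to bound a
TID range scan (hypothetical table name t; the block count comes from the
pg_relation_size trick discussed in the other thread):

SELECT pg_relation_size('t') / current_setting('block_size')::int AS n_blocks;

-- the minimum is never below '(0,1)', the maximum is below '(n_blocks,1)':
SELECT count(*) FROM t
WHERE ctid >= '(0,1)'::tid
  AND ctid <  '(12345,1)'::tid;   -- substitute n_blocks from above for 12345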

David




wrong comment in libpq.h

2024-05-02 Thread David Zhang

Hi Hackers,

There is a comment like below in src/include/libpq/libpq.h,

 /*
  * prototypes for functions in be-secure.c
  */
extern PGDLLIMPORT char *ssl_library;
extern PGDLLIMPORT char *ssl_cert_file;

...

However, 'ssl_library', 'ssl_cert_file' and the rest are global 
parameter settings, not functions. To address this confusion, it would 
be better to move all global configuration settings to the proper 
section, such as /* GUCs */, to maintain consistency.


I have attached an attempt to help address this issue.


Thank you,

David
From 9154102c34ec0c4c956d25942b8ea600dd740a07 Mon Sep 17 00:00:00 2001
From: David Zhang 
Date: Thu, 2 May 2024 15:15:13 -0700
Subject: [PATCH] fix the wrong comment to keep the consistency

---
 src/include/libpq/libpq.h | 26 +-
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/src/include/libpq/libpq.h b/src/include/libpq/libpq.h
index 83e338f604..bece50b69c 100644
--- a/src/include/libpq/libpq.h
+++ b/src/include/libpq/libpq.h
@@ -86,19 +86,6 @@ extern bool pq_check_connection(void);
 /*
  * prototypes for functions in be-secure.c
  */
-extern PGDLLIMPORT char *ssl_library;
-extern PGDLLIMPORT char *ssl_cert_file;
-extern PGDLLIMPORT char *ssl_key_file;
-extern PGDLLIMPORT char *ssl_ca_file;
-extern PGDLLIMPORT char *ssl_crl_file;
-extern PGDLLIMPORT char *ssl_crl_dir;
-extern PGDLLIMPORT char *ssl_dh_params_file;
-extern PGDLLIMPORT char *ssl_passphrase_command;
-extern PGDLLIMPORT bool ssl_passphrase_command_supports_reload;
-#ifdef USE_SSL
-extern PGDLLIMPORT bool ssl_loaded_verify_locations;
-#endif
-
 extern int secure_initialize(bool isServerStart);
 extern bool secure_loaded_verify_locations(void);
 extern void secure_destroy(void);
@@ -117,6 +104,19 @@ extern ssize_t secure_open_gssapi(Port *port);
 #endif
 
 /* GUCs */
+extern PGDLLIMPORT char *ssl_library;
+extern PGDLLIMPORT char *ssl_cert_file;
+extern PGDLLIMPORT char *ssl_key_file;
+extern PGDLLIMPORT char *ssl_ca_file;
+extern PGDLLIMPORT char *ssl_crl_file;
+extern PGDLLIMPORT char *ssl_crl_dir;
+extern PGDLLIMPORT char *ssl_dh_params_file;
+extern PGDLLIMPORT char *ssl_passphrase_command;
+extern PGDLLIMPORT bool ssl_passphrase_command_supports_reload;
+#ifdef USE_SSL
+extern PGDLLIMPORT bool ssl_loaded_verify_locations;
+#endif
+
 extern PGDLLIMPORT char *SSLCipherSuites;
 extern PGDLLIMPORT char *SSLECDHCurve;
 extern PGDLLIMPORT bool SSLPreferServerCiphers;
-- 
2.34.1



Re: BitmapHeapScan streaming read user and prelim refactoring

2024-05-02 Thread Tomas Vondra



On 4/24/24 22:46, Melanie Plageman wrote:
> On Tue, Apr 23, 2024 at 6:43 PM Tomas Vondra
>  wrote:
>>
>> On 4/23/24 18:05, Melanie Plageman wrote:
>>> The patch with a fix is attached. I put the test in
>>> src/test/regress/sql/join.sql. It isn't the perfect location because
>>> it is testing something exercisable with a join but not directly
>>> related to the fact that it is a join. I also considered
>>> src/test/regress/sql/select.sql, but it also isn't directly related to
>>> the query being a SELECT query. If there is a better place for a test
>>> of a bitmap heap scan edge case, let me know.
>>
>> I don't see a problem with adding this to join.sql - why wouldn't this
>> count as something related to a join? Sure, it's not like this code
>> matters only for joins, but if you look at join.sql that applies to a
>> number of other tests (e.g. there are a couple btree tests).
> 
> I suppose it's true that other tests in this file use joins to test
> other code. I guess if we limited join.sql to containing tests of join
> implementation, it would be rather small. I just imagined it would be
> nice if tests were grouped by what they were testing -- not how they
> were testing it.
> 
>> That being said, it'd be good to explain in the comment why we're
>> testing this particular plan, not just what the plan looks like.
> 
> You mean I should explain in the test comment why I included the
> EXPLAIN plan output? (because it needs to contain a bitmapheapscan to
> actually be testing anything)
> 

No, I meant that the comment before the test describes a couple
requirements the plan needs to meet (no need to process all inner
tuples, bitmapscan eligible for skip_fetch on outer side, ...), but it
does not explain why we're testing that plan.

I could get to that by doing git-blame to see what commit added this
code, and then read the linked discussion. Perhaps that's enough, but
maybe the comment could say something like "verify we properly discard
tuples on rescans" or something like that?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: enhance the efficiency of migrating particularly large tables

2024-05-02 Thread David Zhang

Thanks a lot, David Rowley, for your detailed suggestions.

On 2024-04-08 3:23 p.m., David Rowley wrote:

On Tue, 9 Apr 2024 at 09:52, David Zhang  wrote:
Finding the exact ctid seems overkill for what you need.  Why you
could just find the maximum block with:

N = pg_relation_size('name_of_your_table'::regclass) /
current_Setting('block_size')::int;

and do WHERE ctid < '(N,1)';
We experimented with this approach using pg_relation_size and compared
the performance. Below are some simple timing results for 100 million
records in a table:


Using system function max():
SELECT max(ctid) from t;
Time: 2126.680 ms (00:02.127)

Using pg_relation_size and where condition:
SELECT pg_relation_size('t'::regclass) / current_setting('block_size')::int;
Time: 0.561 ms

Using the experimental function introduced in previous patch:
SELECT ctid from get_ctid('t', 1);
Time: 0.452 ms


Delete about 1/3 records from the end of the table:
SELECT max(ctid) from t;
Time: 1552.975 ms (00:01.553)

SELECT pg_relation_size('t'::regclass) / current_setting('block_size')::int;
Time: 0.533 ms
But before vacuum, pg_relation_size always returns the same value as
before, so this relation size may not be accurate.


SELECT ctid from get_ctid('t', 1);
Time: 251.105 ms

After vacuum:
SELECT ctid from get_ctid('t', 1);
Time: 0.478 ms


Below is a comparison between the system function min() and the
experimental function:


SELECT min(ctid) from t;
Time: 1932.554 ms (00:01.933)

SELECT ctid from get_ctid('t', 0);
Time: 0.478 ms

After deleted about 1/3 records from the beginning of the table:
SELECT min(ctid) from t;
Time: 1305.799 ms (00:01.306)

SELECT ctid from get_ctid('t', 0);
Time: 244.336 ms

After vacuum:
SELECT ctid from get_ctid('t', 0);
Time: 0.468 ms


If we wanted to optimise this in PostgreSQL, the way to do it would
be, around set_plain_rel_pathlist(), check if the relation's ctid is a
required PathKey by the same means as create_index_paths() does, then
if found, create another seqscan path without synchronize_seqscans *
and tag that with the ctid PathKey sending the scan direction
according to the PathKey direction. nulls_first does not matter since
ctid cannot be NULL.

Min(ctid) query should be able to make use of this as the planner
should rewrite those to subqueries with a ORDER BY ctid LIMIT 1.


Is there a simple way to get the min of ctid faster than using min(), 
but similar to get the max of ctid using pg_relation_size?



Thank you,

David Zhang


Re: BitmapHeapScan streaming read user and prelim refactoring

2024-05-02 Thread Tomas Vondra
On 4/30/24 14:07, Daniel Gustafsson wrote:
>> On 26 Apr 2024, at 15:04, Melanie Plageman  wrote:
> 
>> If this seems correct to you, are you okay with the rest of the fix
>> and test? We could close this open item once the patch is acceptable.
> 
> From reading the discussion and the patch this seems like the right fix to me.

I agree.

> Does the test added here aptly cover 04e72ed617be in terms of its functionality?
> 

AFAIK the test fails without the fix and works with it, so I believe it
does cover the relevant functionality.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Why is FOR ORDER BY function getting called when the index is handling ordering?

2024-05-02 Thread Chris Cleveland
> > ... there's no reason the system needs to know the actual float value
> > returned by rank_match(), the ordering operator distance function. In any
> > case, that value can only be calculated based on information in the index
> > itself, and can't be calculated by rank_match().
>
> This seems to me to be a very poorly designed concept.  An index
> ordering operator is an optimization that the planner may or may
> not choose to employ.  If you've designed your code on the assumption
> that that's the only possible plan, it will break for any but the
> most trivial queries.
>

So the use-case here is an enterprise search engine built using an index
access method. A search engine ranks its result set based on many, many
factors. Among them: token-specific weights, token statistics calculated
across the entire index, field lengths, average field lengths calculated across
the index, various kinds of linguistic analysis (verbs? nouns?), additional
terms added to the query drawn from other parts of the index, fuzzy terms
based on ngrams from across the index, and a great many other magic tricks.
There are also index-specific parameters that are specified at index time,
both as parameters to the op classes attached to each column, and options
specified by CREATE INDEX ... WITH (...).

If the system must generate a score for ranking using a simple ORDER BY
column_op_constant, it can't, because all that information within the index
itself isn't available.

In any case, I'm uninterested in making the world safe for a
> design that's going to fail if the planner doesn't choose an
> indexscan on a specific index.  That's too fragile.
>
>
But that's the reality of search engines. It's also the reason that the
built-in pg full-text search has real limitations.

This isn't just a search engine problem. *Any* index type that depends on
whole-table statistics, or index options, or knowledge about other items in
the index will not be able to calculate a proper score without access to
the index. This applies to certain types of vector search. It could also
apply to a recommendation engine.

In general, it hasn't been a problem for my project because I adjust costs
to force the use of the index. (Yes, I know that doing this is
controversial, but I have little choice, and it works.)

The only area where it has become a problem is in the creation of the junk
column.

I do understand the need for the index to report the value of the sort key
up the food chain, because yes, all kinds of arbitrary re-ordering could
occur. We already have a mechanism for that, though:
IndexScanDesc.xs_orderbyvals. If there were a way for the system to use
that instead of a call to the ordering function, we'd be all set.

It would also be nice if the orderbyval could be made available in the
projection. That way we could report the score() in the result set.

Matthias' response and links touch on some of these issues.

-- 
Chris Cleveland
312-339-2677 mobile


Re: Weird test mixup

2024-05-02 Thread Noah Misch
On Thu, May 02, 2024 at 04:27:12PM +0900, Michael Paquier wrote:
> On Wed, May 01, 2024 at 04:12:14PM -0700, Noah Misch wrote:
> > While writing an injection point test, I encountered a variant of the race
> > condition that f4083c4 fixed.  It had three sessions and this sequence of
> > events:
> > 
> > s1: local-attach to POINT
> > s2: enter InjectionPointRun(POINT), yield CPU just before 
> > injection_callback()
> > s3: detach POINT, deleting the InjectionPointCondition record
> > s2: wake up and run POINT as though it had been non-local

I should have given a simpler example:

s1: local-attach to POINT
s2: enter InjectionPointRun(POINT), yield CPU just before injection_callback()
s1: exit
s2: wake up and run POINT as though it had been non-local

> Fun.  One thing I would ask is why it makes sense to be able to detach
> a local point from a different session than the one who defined it as
> local.  Shouldn't the operation of s3 be restricted rather than
> authorized as a safety measure, instead?

(That's orthogonal to the race condition.)  When s1 would wait at the
injection point multiple times in one SQL statement, I like issuing the detach
from s3 so s1 waits at just the first encounter with the injection point.
This mimics setting a gdb breakpoint and deleting that breakpoint before
"continue".  The alternative, waking s1 repeatedly until it finishes the SQL
statement, is less convenient.  (I also patched _detach() to wake the waiter,
and I plan to propose that.)

> > Separately, injection_points_cleanup() breaks the rules by calling
> > InjectionPointDetach() while holding a spinlock.  The latter has an
> > elog(ERROR), and reaching that elog(ERROR) leaves a stuck spinlock.  I 
> > haven't
> > given as much thought to solutions for this one.
> 
> Indeed.  That's a brain fade.  This one could be fixed by collecting
> the point names when cleaning up the conditions and detach after
> releasing the spinlock.  This opens a race condition between the
> moment when the spinlock is released and the detach, where another
> backend could come in and detach a point before the shmem_exit
> callback has the time to do its cleanup, even if detach() is
> restricted for local points.  So we could do the callback cleanup in

That race condition seems fine.  The test can be expected to control the
timing of backend exits vs. detach calls.  Unlike the InjectionPointRun()
race, it wouldn't affect backends unrelated to the test.

> three steps in the shmem exit callback: 
> - Collect the names of the points to detach, while holding the
> spinlock.
> - Do the Detach.
> - Take again the spinlock, clean up the conditions.
> 
> Please see the attached.

The injection_points_cleanup() parts look good.  Thanks.

> @@ -403,6 +430,10 @@ injection_points_detach(PG_FUNCTION_ARGS)
>  {
>   char   *name = text_to_cstring(PG_GETARG_TEXT_PP(0));
>  
> + if (!injection_point_allowed(name))
> + elog(ERROR, "cannot detach injection point \"%s\" not allowed 
> to run",
> +  name);
> +

As above, I disagree with the injection_points_detach() part.




Re: Idea Feedback: psql \h misses -> Offers Links?

2024-05-02 Thread Pavel Stehule
On Thu, 2 May 2024 at 19:50, Andrey M. Borodin
wrote:

>
>
> > On 17 Apr 2024, at 22:47, Kirk Wolak  wrote:
> >
> > Thoughts?
>
> Today we had a hacking session with Nik and Kirk. We produced a patch to
> assess how these links might look.
>
> Also, we needed a url_encode() and found none in the codebase. It would be
> nice to have this as an SQL-callable function.
>

+1

it has been requested several times

Pavel


> Thanks!
>
>
> Best regards, Andrey Borodin.
>
>


Re: Parallel CREATE INDEX for GIN indexes

2024-05-02 Thread Tomas Vondra



On 5/2/24 19:12, Matthias van de Meent wrote:
> On Thu, 2 May 2024 at 17:19, Tomas Vondra  
> wrote:
>>
>> Hi,
>>
>> In PG17 we shall have parallel CREATE INDEX for BRIN indexes, and back
>> when working on that I was thinking how difficult would it be to do
>> something similar to do that for other index types, like GIN. I even had
>> that on my list of ideas to pitch to potential contributors, as I was
>> fairly sure it's doable and reasonably isolated / well-defined.
>>
>> However, I was not aware of any takers, so a couple days ago on a slow
>> weekend I took a stab at it. And yes, it's doable - attached is a fairly
>> complete, tested and polished version of the feature, I think. It turned
>> out to be a bit more complex than I expected, for reasons that I'll get
>> into when discussing the patches.
> 
> This is great. I've been thinking about approximately the same issue
> recently, too, but haven't had time to discuss/implement any of this
> yet. I think some solutions may even be portable to the btree parallel
> build: it also has key deduplication (though to a much smaller degree)
> and could benefit from deduplication during the scan/ssup load phase,
> rather than only during insertion.
> 

Perhaps, although I'm not that familiar with the details of btree
builds, and I haven't thought about it when working on this over the
past couple days.

>> First, let's talk about the benefits - how much faster is that than the
>> single-process build we have for GIN indexes? I do have a table with the
>> archive of all our mailing lists - it's ~1.5M messages, table is ~21GB
>> (raw dump is about 28GB). This does include simple text data (message
>> body), JSONB (headers) and tsvector (full-text on message body).
> 
> Sidenote: Did you include the tsvector in the table to reduce time
> spent during index creation? I would have used an expression in the
> index definition, rather than a direct column.
> 

Yes, it's a materialized column, not computed during index creation.
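
For comparison, the two setups look roughly like this (hypothetical
table/column names; a generated column is just one way to materialize the
tsvector, and the expression index is what you're suggesting):

-- materialized column, kept up to date by the table itself
ALTER TABLE messages
  ADD COLUMN body_tsv tsvector
  GENERATED ALWAYS AS (to_tsvector('english', body)) STORED;
CREATE INDEX messages_body_tsv_idx ON messages USING gin (body_tsv);

-- expression index: no extra column, tsvector computed during the build
CREATE INDEX messages_body_expr_idx
  ON messages USING gin (to_tsvector('english', body));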

>> If I do CREATE index with different number of workers (0 means serial
>> build), I get this timings (in seconds):
> 
> [...]
> 
>> This shows the benefits are pretty nice, depending on the opclass. For
>> most indexes it's maybe ~3-4x faster, which is nice, and I don't think
>> it's possible to do much better - the actual index inserts can happen
>> from a single process only, which is the main limit.
> 
> Can we really not insert with multiple processes? It seems to me that
> GIN could be very suitable for that purpose, with its clear double
> tree structure distinction that should result in few buffer conflicts
> if different backends work on known-to-be-very-different keys.
> We'd probably need multiple read heads on the shared tuplesort, and a
> way to join the generated top-level subtrees, but I don't think that
> is impossible. Maybe it's work for later effort though.
> 

Maybe, but I took it as a restriction and it seemed too difficult to
relax (or at least I assume that).

> Have you tested and/or benchmarked this with multi-column GIN indexes?
> 

I did test that, and I'm not aware of any bugs/issues. Performance-wise
it depends on which opclasses are used by the columns - if you take the
speedup for each of them independently, the speedup for the whole index
is roughly the average of that.
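
(In case anyone wants to repeat the experiment: the worker count for an
index build can be pinned roughly like this -- an assumption about the
setup, with hypothetical object names; setting either value to 0 falls
back to the serial build.)

SET max_parallel_maintenance_workers = 4;
ALTER TABLE messages SET (parallel_workers = 4);
CREATE INDEX messages_body_tsv_idx ON messages USING gin (body_tsv);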

>> For some of the opclasses it can regress (like the jsonb_path_ops). I
>> don't think that's a major issue. Or more precisely, I'm not surprised
>> by it. It'd be nice to be able to disable the parallel builds in these
>> cases somehow, but I haven't thought about that.
> 
> Do you know why it regresses?
> 

No, but one thing that stands out is that the index is much smaller than
the other columns/opclasses, and the compression does not save much
(only about 5% for both phases). So I assume it's the overhead of
writing and reading a bunch of GB of data without really gaining
much from doing that.

>> I do plan to do some tests with btree_gin, but I don't expect that to
>> behave significantly differently.
>>
>> There are small variations in the index size, when built in the serial
>> way and the parallel way. It's generally within ~5-10%, and I believe
>> it's due to the serial build adding the TIDs incrementally, while the
>> build adds them in much larger chunks (possibly even in one chunk with
>> all the TIDs for the key).
> 
> I assume that was '[...] while the [parallel] build adds them [...]', right?
> 

Right. The parallel build adds them in larger chunks.

>> I believe the same size variation can happen
>> if the index gets built in a different way, e.g. by inserting the data
>> in a different order, etc. I did a number of tests to check if the index
>> produces the correct results, and I haven't found any issues. So I think
>> this is OK, and neither a problem nor an advantage of the patch.
>>
>>
>> Now, let's talk about the code - the series has 7 patches, with 6
>> non-trivial parts doing changes in focused and 

Re: Idea Feedback: psql \h misses -> Offers Links?

2024-05-02 Thread Andrey M. Borodin


> On 17 Apr 2024, at 22:47, Kirk Wolak  wrote:
> 
> Thoughts?

Today we had a hacking session with Nik and Kirk. We produced a patch to assess
how these links might look.

Also, we needed a url_encode() and found none in the codebase. It would be nice
to have this as an SQL-callable function.
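
Until something lands in core, a minimal SQL-level sketch (RFC 3986
unreserved characters kept as-is, everything else percent-encoded as
UTF-8) could look like this:

CREATE FUNCTION url_encode(input text) RETURNS text
LANGUAGE sql STABLE STRICT AS
$$
  SELECT coalesce(string_agg(
           CASE WHEN ch ~ '^[A-Za-z0-9_.~-]$' THEN ch
                ELSE regexp_replace(upper(encode(convert_to(ch, 'UTF8'), 'hex')),
                                    '(..)', '%\1', 'g')
           END, '' ORDER BY ord), '')
  FROM unnest(string_to_array(input, NULL)) WITH ORDINALITY AS t(ch, ord)
$$;

-- e.g. SELECT url_encode('ALTER TABLE');  returns 'ALTER%20TABLE'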

Thanks!


Best regards, Andrey Borodin.



v1-0001-Add-URLs-to-h-in-psql-when-there-is-no-match.patch
Description: Binary data


Re: Why is FOR ORDER BY function getting called when the index is handling ordering?

2024-05-02 Thread Tom Lane
Chris Cleveland  writes:
> I'm building an index access method which supports an ordering operator:

> CREATE OPERATOR pg_catalog.<<=>> (
> FUNCTION = rdb.rank_match,
> LEFTARG = record,
> RIGHTARG = rdb.RankSpec
> );

Okay ...

> ... there's no reason the system needs to know the actual float value
> returned by rank_match(), the ordering operator distance function. In any
> case, that value can only be calculated based on information in the index
> itself, and can't be calculated by rank_match().

This seems to me to be a very poorly designed concept.  An index
ordering operator is an optimization that the planner may or may
not choose to employ.  If you've designed your code on the assumption
that that's the only possible plan, it will break for any but the
most trivial queries.

> Nevertheless, the system calls rank_match() after every call to
> amgettuple(), and I can't figure out why.

The ORDER BY value is included in the set of values that the plan
is expected to output.  This is so because it's set up to still
work if the planner needs to use an explicit sort step.  For
instance, using a trivial table with a couple of bigint columns:

regression=# explain verbose select * from int8_tbl order by q1/2;
                           QUERY PLAN
-----------------------------------------------------------------------
 Sort  (cost=1.12..1.13 rows=5 width=24)
   Output: q1, q2, ((q1 / 2))
   Sort Key: ((int8_tbl.q1 / 2))
   ->  Seq Scan on public.int8_tbl  (cost=0.00..1.06 rows=5 width=24)
         Output: q1, q2, (q1 / 2)
(5 rows)

The q1/2 column is marked "resjunk", so it doesn't actually get
sent to the client, but it's computed so that the sort step can
use it.  Even if I do this:

regression=# create index on int8_tbl ((q1/2));
CREATE INDEX
regression=# set enable_seqscan TO 0;
SET
regression=# explain verbose select * from int8_tbl order by q1/2;
                                         QUERY PLAN
--------------------------------------------------------------------------------------------
 Index Scan using int8_tbl_expr_idx on public.int8_tbl  (cost=0.13..12.22 rows=5 width=24)
   Output: q1, q2, (q1 / 2)
(2 rows)

... it's still there, because the set of columns that the scan node is
expected to emit doesn't change based on the plan type.  We could make
it change perhaps, if we tried hard enough, but so far nobody has
wanted to invest work in that.  Note that even if this indexscan
is chosen, that doesn't ensure that we won't need an explicit sort
later, since the query might need joins or aggregation on top of
the scan node.  So it's far from trivial to decide that the scan
node doesn't need to emit the sort column.

In any case, I'm uninterested in making the world safe for a
design that's going to fail if the planner doesn't choose an
indexscan on a specific index.  That's too fragile.

regards, tom lane




Re: Why is FOR ORDER BY function getting called when the index is handling ordering?

2024-05-02 Thread Matthias van de Meent
On Thu, 2 May 2024 at 18:50, Chris Cleveland  wrote:
>
> Sorry to have to ask for help here, but no amount of stepping through code is 
> giving me the answer.
>
> I'm building an index access method which supports an ordering operator:
[...]
> so there's no reason the system needs to know the actual float value returned 
> by rank_match(), the ordering operator distance function. In any case, that 
> value can only be calculated based on information in the index itself, and 
> can't be calculated by rank_match().
>
> Nevertheless, the system calls rank_match() after every call to amgettuple(), 
> and I can't figure out why. I've stepped through the code, and it looks like 
> it has something to do with ScanState.ps.ps.ps_ProjInfo, but I can't figure 
> out where or why it's getting set.
>
> Here's a sample query. I have not found a query that does *not* call 
> rank_match():
[...]
> I'd be grateful for any help or insights.

The ordering clause produces a junk column that's used to keep track
of the ordering, and because it's a projected column (not the indexed
value, but an expression over that column) the executor will execute
that projection. This happens regardless of its use in downstream
nodes due to planner or executor limitations.

See also Heikki's thread over at [0], and a comment of me about the
same issue over at pgvector's issue board at [1].

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[0] 
https://www.postgresql.org/message-id/2ca5865b-4693-40e5-8f78-f3b45d5378fb%40iki.fi
[1] https://github.com/pgvector/pgvector/issues/359#issuecomment-1840786021




Re: Parallel CREATE INDEX for GIN indexes

2024-05-02 Thread Matthias van de Meent
On Thu, 2 May 2024 at 17:19, Tomas Vondra  wrote:
>
> Hi,
>
> In PG17 we shall have parallel CREATE INDEX for BRIN indexes, and back
> when working on that I was thinking how difficult would it be to do
> something similar to do that for other index types, like GIN. I even had
> that on my list of ideas to pitch to potential contributors, as I was
> fairly sure it's doable and reasonably isolated / well-defined.
>
> However, I was not aware of any takers, so a couple days ago on a slow
> weekend I took a stab at it. And yes, it's doable - attached is a fairly
> complete, tested and polished version of the feature, I think. It turned
> out to be a bit more complex than I expected, for reasons that I'll get
> into when discussing the patches.

This is great. I've been thinking about approximately the same issue
recently, too, but haven't had time to discuss/implement any of this
yet. I think some solutions may even be portable to the btree parallel
build: it also has key deduplication (though to a much smaller degree)
and could benefit from deduplication during the scan/ssup load phase,
rather than only during insertion.

> First, let's talk about the benefits - how much faster is that than the
> single-process build we have for GIN indexes? I do have a table with the
> archive of all our mailing lists - it's ~1.5M messages, table is ~21GB
> (raw dump is about 28GB). This does include simple text data (message
> body), JSONB (headers) and tsvector (full-text on message body).

Sidenote: Did you include the tsvector in the table to reduce time
spent during index creation? I would have used an expression in the
index definition, rather than a direct column.

> If I do CREATE index with different number of workers (0 means serial
> build), I get this timings (in seconds):

[...]

> This shows the benefits are pretty nice, depending on the opclass. For
> most indexes it's maybe ~3-4x faster, which is nice, and I don't think
> it's possible to do much better - the actual index inserts can happen
> from a single process only, which is the main limit.

Can we really not insert with multiple processes? It seems to me that
GIN could be very suitable for that purpose, with its clear double
tree structure distinction that should result in few buffer conflicts
if different backends work on known-to-be-very-different keys.
We'd probably need multiple read heads on the shared tuplesort, and a
way to join the generated top-level subtrees, but I don't think that
is impossible. Maybe it's work for later effort though.

Have you tested and/or benchmarked this with multi-column GIN indexes?

> For some of the opclasses it can regress (like the jsonb_path_ops). I
> don't think that's a major issue. Or more precisely, I'm not surprised
> by it. It'd be nice to be able to disable the parallel builds in these
> cases somehow, but I haven't thought about that.

Do you know why it regresses?

> I do plan to do some tests with btree_gin, but I don't expect that to
> behave significantly differently.
>
> There are small variations in the index size, when built in the serial
> way and the parallel way. It's generally within ~5-10%, and I believe
> it's due to the serial build adding the TIDs incrementally, while the
> build adds them in much larger chunks (possibly even in one chunk with
> all the TIDs for the key).

I assume that was '[...] while the [parallel] build adds them [...]', right?

> I believe the same size variation can happen
> if the index gets built in a different way, e.g. by inserting the data
> in a different order, etc. I did a number of tests to check if the index
> produces the correct results, and I haven't found any issues. So I think
> this is OK, and neither a problem nor an advantage of the patch.
>
>
> Now, let's talk about the code - the series has 7 patches, with 6
> non-trivial parts doing changes in focused and easier to understand
> pieces (I hope so).

The following comments are generally based on the descriptions; I
haven't really looked much at the patches yet except to validate some
assumptions.

> 1) v20240502-0001-Allow-parallel-create-for-GIN-indexes.patch
>
> This is the initial feature, adding the "basic" version, implemented as
> pretty much 1:1 copy of the BRIN parallel build and minimal changes to
> make it work for GIN (mostly about how to store intermediate results).
>
> The basic idea is that the workers do the regular build, but instead of
> flushing the data into the index after hitting the memory limit, it gets
> written into a shared tuplesort and sorted by the index key. And the
> leader then reads this sorted data, accumulates the TID for a given key
> and inserts that into the index in one go.

In the code, GIN insertions are still basically single btree
insertions, all starting from the top (but with many same-valued
tuples at once). Now that we have a tuplesort with the full table's
data, couldn't the code be adapted to do more efficient btree loading,
such as that seen in the 

Why is FOR ORDER BY function getting called when the index is handling ordering?

2024-05-02 Thread Chris Cleveland
Sorry to have to ask for help here, but no amount of stepping through code
is giving me the answer.

I'm building an index access method which supports an ordering operator:

CREATE OPERATOR pg_catalog.<<=>> (
FUNCTION = rdb.rank_match,
LEFTARG = record,
RIGHTARG = rdb.RankSpec
);

CREATE OPERATOR CLASS rdb_ops DEFAULT FOR TYPE record USING rdb AS
OPERATOR 1 pg_catalog.<===> (record, rdb.userqueryspec),
OPERATOR 2 pg_catalog.<<=>> (record, rdb.rankspec) FOR ORDER BY
pg_catalog.float_ops;

Here is the supporting code (in Rust):

#[derive(Serialize, Deserialize, PostgresType, Debug)]
pub struct RankSpec {
pub fastrank: String,
pub slowrank: String,
pub candidates: i32,
}

#[pg_extern(
sql = "CREATE FUNCTION rdb.rank_match(rec record, rankspec
rdb.RankSpec) \
RETURNS real IMMUTABLE STRICT PARALLEL SAFE LANGUAGE c \
AS 'MODULE_PATHNAME', 'rank_match_wrapper';",
requires = [RankSpec],
no_guard
)]
fn rank_match(_fcinfo: pg_sys::FunctionCallInfo) -> f32 {
// todo make this an error
pgrx::log!("Error -- this ranking method can only be called when
there is an RDB full-text search in the WHERE clause.");
42.0
}

#[pg_extern(immutable, strict, parallel_safe)]
fn rank(fastrank: String, slowrank: String, candidates: i32) ->
RankSpec {
RankSpec {
fastrank,
slowrank,
candidates,
}
}

The index access method works fine. It successfully gets the keys and the
orderbys in the amrescan() method, and it dutifully returns the appropriate
scan.xs_heaptid in the amgettuple() method. It returns the tids in the
proper order. It also sets:

scandesc.xs_recheck = false;
scandesc.xs_recheckorderby = false;

so there's no reason the system needs to know the actual float value
returned by rank_match(), the ordering operator distance function. In any
case, that value can only be calculated based on information in the index
itself, and can't be calculated by rank_match().

Nevertheless, the system calls rank_match() after every call to
amgettuple(), and I can't figure out why. I've stepped through the code,
and it looks like it has something to do with ScanState.ps.ps.ps_ProjInfo,
but I can't figure out where or why it's getting set.

Here's a sample query. I have not found a query that does *not* call
rank_match():

SELECT *
FROM products
WHERE products <===> rdb.userquery('teddy')
ORDER BY products <<=>> rdb.rank();

I'd be grateful for any help or insights.


-- 
Chris Cleveland
312-339-2677 mobile


Specify tranche name in error when not registered

2024-05-02 Thread Tristan Partin
I thought that this might be a small quality of life improvement for 
people scrolling through logs wondering which tranche name wasn't 
registered.


--
Tristan Partin
Neon (https://neon.tech)
From 63c8d92a8a82acc5f8859ab47da5105cef46b88e Mon Sep 17 00:00:00 2001
From: Tristan Partin 
Date: Thu, 2 May 2024 11:05:04 -0500
Subject: [PATCH v1] Specify tranche name in error when not registered

Useful when tracking down which tranche isn't registered.
---
 src/backend/storage/lmgr/lwlock.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index b1e388dc7c9..e370b54d9fd 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -593,7 +593,7 @@ GetNamedLWLockTranche(const char *tranche_name)
 		lock_pos += NamedLWLockTrancheRequestArray[i].num_lwlocks;
 	}
 
-	elog(ERROR, "requested tranche is not registered");
+	elog(ERROR, "requested tranche (%s) is not registered", tranche_name);
 
 	/* just to keep compiler quiet */
 	return NULL;
-- 
Tristan Partin
Neon (https://neon.tech)



Re: cataloguing NOT NULL constraints

2024-05-02 Thread Alvaro Herrera
Hello Alexander

On 2024-May-02, Alexander Lakhin wrote:

> Could you also clarify, please, how CREATE TABLE ... LIKE is expected to
> work with NOT NULL constraints?

It should behave identically to 16.  If in 16 you end up with a
not-nullable column, then in 17 you should get a not-null constraint.

> I wonder whether EXCLUDING CONSTRAINTS (ALL) should cover not-null
> constraints too. What I'm seeing now, is that:
> CREATE TABLE t1 (i int, CONSTRAINT nn NOT NULL i);
> CREATE TABLE t2 (LIKE t1 EXCLUDING ALL);
> \d+ t2
> -- ends with:
> Not-null constraints:
>     "nn" NOT NULL "i"

In 16, this results in
   Table "public.t2"
 Column │  Type   │ Collation │ Nullable │ Default │ Storage │ Compression │ 
Stats target │ Description 
┼─┼───┼──┼─┼─┼─┼──┼─
 i  │ integer │   │ not null │ │ plain   │ │
  │ 
Access method: heap

so the fact that we have a not-null constraint in pg17 is correct.


> Or a similar case with PRIMARY KEY:
> CREATE TABLE t1 (i int PRIMARY KEY);
> CREATE TABLE t2 (LIKE t1 EXCLUDING CONSTRAINTS EXCLUDING INDEXES);
> \d+ t2
> -- leaves:
> Not-null constraints:
>     "t2_i_not_null" NOT NULL "i"

Here you also end up with a not-nullable column in 16, so I made it do
that.

Now you could argue that EXCLUDING CONSTRAINTS is explicit in saying
that we don't want the constraints; but in that case why did 16 mark the
columns as not-null?  The answer seems to be that the standard requires
this.  Look at 11.3 <table definition>, syntax rule 9) b) iii) 4):

  4) If the nullability characteristic included in LCDi is known not
  nullable, then let LNCi be NOT NULL; otherwise, let LNCi be the
  zero-length character string.

where LCDi is "1) Let LCDi be the column descriptor of the i-th column
of LT." and then

  5) Let CDi be the <column definition>:
     LCNi LDTi LNCi


Now, you could claim that the standard doesn't mention
INCLUDING/EXCLUDING CONSTRAINTS, therefore since we have come up with
its definition then we should make it affect not-null constraints.
However, there's also this note:

  NOTE 520 — <column constraint definition>s, except for NOT NULL, are not included in
  CDi; <column constraint definition>s are effectively transformed to <table constraint definition>s and are thereby also excluded.

which is explicitly saying that not-null constraints are treated
differently; in essence, with INCLUDING CONSTRAINTS we choose to affect
the constraints that the standard says to ignore.


Thanks for looking!

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Learn about compilers. Then everything looks like either a compiler or
a database, and now you have two problems but one of them is fun."
https://twitter.com/thingskatedid/status/1456027786158776329




Re: New GUC autovacuum_max_threshold ?

2024-05-02 Thread Imseih (AWS), Sami
> And as far as that goes, I'd like you - and others - to spell out more
> precisely why you think 100 or 200 million tuples is too much. It
> might be, or maybe it is in some cases but not in others. To me,
> that's not a terribly large amount of data. Unless your tuples are
> very wide, it's a few tens of gigabytes. That is big enough that I can
> believe that you *might* want autovacuum to run when you hit that
> threshold, but it's definitely not *obvious* to me that you want
> autovacuum to run when you hit that threshold.


Vacuuming the heap alone gets faster the more I do it, thanks to the
visibility map. However, the more indexes I have, and the larger (in the TBs)
the indexes become, the more autovacuum workers will be
monopolized vacuuming these indexes.


At 500k tuples, I am constantly vacuuming large indexes
and monopolizing autovacuum workers. At 100 or 200
million tuples, I will also monopolize autovacuum workers
as they vacuum indexes for many minutes or hours. 


At 100 or 200 million, the monopolization will occur less often,
but it will still occur, perhaps leading an operator to terminate
the autovacuum and kick off a manual vacuum.


I am not convinced a new tuple-based threshold will address this,
but I may also be misunderstanding the intention of this GUC.


> To make that concrete: If the table is 10TB, do you want to vacuum to
> reclaim 20GB of bloat? You might be vacuuming 5TB of indexes to
> reclaim 20GB of heap space - is that the right thing to do? If yes,
> why?


No, I would not want to run autovacuum on 5TB indexes to reclaim 
a small amount of bloat.




Regards,


Sami







Re: Document NULL

2024-05-02 Thread David G. Johnston
On Wed, May 1, 2024 at 9:47 PM Tom Lane  wrote:

> David Rowley  writes:
> > Let's bash it into shape a bit more before going any further on actual
> wording.
>
> FWIW, I want to push back on the idea of making it a tutorial section.
> I too considered that, but in the end I think it's a better idea to
> put it into the "main" docs, for two reasons:
>
>
Version 2 attached.  Still a draft, focused on topic picking and overall
structure.  Examples and links planned plus the usual semantic markup stuff.

I chose to add a new sect1 in the user guide (The SQL Language) chapter,
"Data".  Don't tell Robert.

The "Data Basics" sub-section lets us readily slide this Chapter into the
main flow and here the NULL discussion feels like a natural fit.  In
hindsight, the lack of a Data chapter in a Database manual seems like an
oversight.  One easily made because we assume if you are here you "know"
what data is, but there is still stuff to be discussed, if nothing else to
establish a common understanding between us and our users.

David J.
From 7798121992154edab4768d7eab5a89be04730b2f Mon Sep 17 00:00:00 2001
From: "David G. Johnston" 
Date: Wed, 1 May 2024 07:45:48 -0700
Subject: [PATCH] Document NULL

---
 doc/src/sgml/data.sgml | 169 +
 doc/src/sgml/filelist.sgml |   1 +
 doc/src/sgml/postgres.sgml |   1 +
 3 files changed, 171 insertions(+)
 create mode 100644 doc/src/sgml/data.sgml

diff --git a/doc/src/sgml/data.sgml b/doc/src/sgml/data.sgml
new file mode 100644
index 00..2b09382494
--- /dev/null
+++ b/doc/src/sgml/data.sgml
@@ -0,0 +1,169 @@
+
+ Data
+
+ 
+  This chapter provides definitions for, and an overview of, data.
+  It discusses the basic design of the metadata related to values
+  and then goes on to describe the special value NULL which typically
+  represents "unknown".
+ 
+
+ 
+  Data Basics
+  
+   All literals, columns, variables, and expression results in PostgreSQL
+   are typed; the available types are listed in the next chapter.  Literals and columns
+   must only use one of the concrete types while variables can use
+   either a concrete type or a pseudo-type.  Expression results
+   are limited to concrete types and the pseudo-type record described below.
+  
+  
+   The pseudo-types prefixed with any implement polymorphism
+   in PostgreSQL.  Polymorphism allows a single function specification
+   to act on multiple concrete types.  At runtime, the function body
+   associates concrete types to all polymorphic types based upon the
+   concrete arguments of its inputs. See ... for more details.
+  
+  
+   The record pseudo-type is also polymorphic in nature but allows
+   the caller of the function to specify the row-like structure of
+   output within the containing query. See ... for more details.
+   The ROW(...) expression (see ...) will also produce a record
+   result comprised of the named columns.
+  
+ 
+
+  
+   Unknown Values (NULL)
+   
+This section first introduces the meaning of NULL and then goes
+on to explain how different parts of the system behave when faced
+with NULL input.
+   
+
+  
+   NULL in Data Models
+   
+Generally NULL is assumed to mean "unknown".  However,
+in practice meaning comes from context and so a model design may state that
+NULL is to be used to represent "not applicable" - i.e., that a value is not
+even possible.  SQL has only the single value NULL while there are multiple
+concepts that people have chosen to apply it to.  In any case the behavior
+of the system when dealing with NULL is the same regardless of the meaning
+given to it in the surrounding context.
+   
+   
+NULL also takes on a literal meaning of "not found" when produced as the
+result of an outer join.
+   
+  
+
+  
+   NULL Usage
+   
+As NULL is treated as a data value, it, like all values, must have
+a data type.  NULL is a valid value for all data types.
+   
+
+   
+A NULL literal is written as unquoted NULL.  Its type is unknown but
+can be cast to any concrete data type.  The [type 'string'] syntax
+however will not work since, unlike other literals, there is no way to
+express NULL using single quotes.
+   
+
+   
+The presence of NULL in the system results in three-valued logic.
+In binary logic every outcome is either true or false.  In
+three-valued logic unknown, represented using a NULL value, is
+also an outcome.  Aspects of the system that branch based upon
+whether a condition variable is true or false thus must also
+decide how to behave when the condition is NULL.  The remaining
+sub-sections summarize these decisions.
+   
+  
+
+  
+The Cardinal Rule of NULL
+   
+The cardinal rule: NULL is never equal or unequal to any non-null
+value (or itself).  [NULL = anything yields NULL].
+Checking for NULL has an explicit test documented
+[here] and additionally there are distinctness tests that treat NULL like
+a value equal 

Re: Limit index pages visited in planner's get_actual_variable_range

2024-05-02 Thread Peter Geoghegan
On Thu, May 2, 2024 at 2:12 AM Rian McGuire  wrote:
> The planner was burning a huge amount of CPU time looking through
> index pages for the first visible tuple. The problem eventually
> resolved when the affected index was vacuumed, but that took several
> hours to complete.

This is exactly the same problem that Mark Callaghan recently
encountered when benchmarking Postgres using a variant of his insert
benchmark:

https://smalldatum.blogspot.com/2024/01/updated-insert-benchmark-postgres-9x-to_10.html
https://smalldatum.blogspot.com/2024/01/updated-insert-benchmark-postgres-9x-to_27.html
https://smalldatum.blogspot.com/2024/03/trying-to-tune-postgres-for-insert.html

This is a pretty nasty sharp edge. I bet many users encounter this
problem without ever understanding it.

> The previous discussion [1] touched on the idea of also limiting the
> number of index page fetches, but there were doubts about the safety
> of back-patching and the ugliness of modifying the index AM API to
> support this.

Fundamentally, the problem is that the heuristics that we have don't
care about the cost of reading index leaf pages. All that it takes is
a workload where that becomes the dominant cost -- such a workload
won't be helped at all by the existing heap-focussed heuristic. This
seems obvious to me.

It seems natural to fix the problem by teaching the heuristics to give
at least some consideration to the cost of reading index leaf pages --
more than zero. The patch from Simon seems like the right general
approach to me, since it precisely targets the underlying problem.
What other approach is there, really?

> I would like to submit our experience as evidence that the lack of
> limit on index page fetches is a real problem. Even if a fix for this
> doesn't get back-patched, it would be nice to see it in a major
> version.

I find that very easy to believe.

-- 
Peter Geoghegan




Re: cataloguing NOT NULL constraints

2024-05-02 Thread Alexander Lakhin

Hello Alvaro,

01.05.2024 20:49, Alvaro Herrera wrote:

Here are two patches that I intend to push soon (hopefully tomorrow).



Thank you for fixing those issues!

Could you also clarify, please, how CREATE TABLE ... LIKE is expected to
work with NOT NULL constraints?

I wonder whether EXCLUDING CONSTRAINTS (ALL) should cover not-null
constraints too. What I'm seeing now, is that:
CREATE TABLE t1 (i int, CONSTRAINT nn NOT NULL i);
CREATE TABLE t2 (LIKE t1 EXCLUDING ALL);
\d+ t2
-- ends with:
Not-null constraints:
    "nn" NOT NULL "i"

Or a similar case with PRIMARY KEY:
CREATE TABLE t1 (i int PRIMARY KEY);
CREATE TABLE t2 (LIKE t1 EXCLUDING CONSTRAINTS EXCLUDING INDEXES);
\d+ t2
-- leaves:
Not-null constraints:
    "t2_i_not_null" NOT NULL "i"

Best regards,
Alexander




Re: Build with meson + clang + sanitizer resulted in undefined reference

2024-05-02 Thread Maxim Orlov
On Wed, 1 May 2024 at 00:11, Dmitry Dolgov <9erthali...@gmail.com> wrote:

> Seems to be a meson quirk [1]. I could reproduce this, and adding
> -Db_lundef=false on top of your configuration solved the issue.
>
> [1]: https://github.com/mesonbuild/meson/issues/3853
>

Thank you for a reply!  Yes, it seems to be that way.  Many thanks for the
clarification.

-- 
Best regards,
Maxim Orlov.


Re: [PoC] Reducing planning time when tables have many partitions

2024-05-02 Thread jian he
On Thu, May 2, 2024 at 3:57 PM Yuya Watari  wrote:
>

hi. sorry to bother you, maybe a dumb question.

trying to understand something under the hood.
currently I only applied
v24-0001-Speed-up-searches-for-child-EquivalenceMembers.patch.

on v24-0001:
+/*
+ * add_eq_member - build a new EquivalenceMember and add it to an EC
+ */
+static EquivalenceMember *
+add_eq_member(EquivalenceClass *ec, Expr *expr, Relids relids,
+  JoinDomain *jdomain, Oid datatype)
+{
+ EquivalenceMember *em = make_eq_member(ec, expr, relids, jdomain,
+   NULL, datatype);
+
+ ec->ec_members = lappend(ec->ec_members, em);
+ return em;
+}
+
this part seems so weird to me.
add_eq_member function was added very very long ago,
why do we create a function with the same function name?

also I didn't see deletion of original add_eq_member function
(https://git.postgresql.org/cgit/postgresql.git/tree/src/backend/optimizer/path/equivclass.c#n516)
in v24-0001.

Obviously, now I cannot compile it correctly.
What am I missing?




Re: Rename libpq trace internal functions

2024-05-02 Thread Peter Eisentraut

On 24.04.24 12:34, Yugo NAGATA wrote:

On Wed, 24 Apr 2024 09:39:02 +0200
Peter Eisentraut  wrote:


libpq's pqTraceOutputMessage() used to look like this:

  case 'Z':   /* Ready For Query */
  pqTraceOutputZ(conn->Pfdebug, message, );
  break;

Commit f4b54e1ed98 introduced macros for protocol characters, so now
it looks like this:

  case PqMsg_ReadyForQuery:
  pqTraceOutputZ(conn->Pfdebug, message, );
  break;

But this introduced a disconnect between the symbol in the switch case
and the function name to be called, so this made the manageability of
this file a bit worse.

This patch changes the function names to match, so now it looks like
this:

  case PqMsg_ReadyForQuery:
  pqTraceOutput_ReadyForQuery(conn->Pfdebug, message, );
  break;


+1

I prefer the new function names since it seems more natural and easier to read.

I noticed pqTraceOutputNR() is left as is, is this intentional? Or, should this
be changed to pqTraceOutput_NoticeResponse()?


pqTraceOutputNR() is shared code used internally by _ErrorResponse() and 
_NoticeResponse().  I have updated the comments a bit to make this clearer.


With that, I have committed it.  Thanks.





Re: EXPLAIN redundant options

2024-05-02 Thread Tom Lane
"David G. Johnston"  writes:
> On Thu, May 2, 2024 at 6:17 AM jian he  wrote:
>> explain (verbose, verbose off, analyze on, analyze off, analyze on)

> I have no desire to introduce breakage here.  The implemented concept is
> actually quite common.  The inconsistency with COPY seems like a minor
> point.  It would influence my green field choice but not enough for
> changing long-standing behavior.

The argument for changing this would be consistency, but if you want
to argue for it on those grounds, you'd need to change *every* command
that acts that way.  I really doubt EXPLAIN is the only one.

There's also a theological argument to be had about which
behavior is preferable.  For my own taste, I like last-one-wins.
That's extremely common with command line switches, for instance.
So maybe we should be making our commands consistent in the other
direction.

regards, tom lane




Re: EXPLAIN redundant options

2024-05-02 Thread Euler Taveira
On Thu, May 2, 2024, at 10:36 AM, David G. Johnston wrote:
> On Thu, May 2, 2024 at 6:17 AM jian he  wrote:
>> explain (verbose, verbose off, analyze on, analyze off, analyze on)
> 
> I would just update this paragraph to note the last one wins behavior.
> 
> "When the option list is surrounded by parentheses, the options can be 
> written in any order.  However, if options are repeated the last one listed 
> is used."
> 
> I have no desire to introduce breakage here.  The implemented concept is 
> actually quite common.  The inconsistency with COPY seems like a minor point. 
>  It would influence my green field choice but not enough for changing 
> long-standing behavior.
> 

There is no policy saying we cannot introduce incompatibility changes in major
releases. If you check for "conflicting or redundant options" or
"errorConflictingDefElem", you will notice that various SQL commands prevent
you from specifying redundant options. IMO avoiding redundant options is a good goal
because (i) it keeps the command short and (ii) it doesn't require you to check
all options to figure out the current option value. If the application
is already avoiding redundant options for other commands, the chance that it
relies on them being allowed just for EXPLAIN is low.

postgres=# create database foo with owner = bar owner = xpto;
ERROR:  conflicting or redundant options
LINE 1: create database foo with owner = bar owner = xpto;
   ^
postgres=# create user foo with createdb login createdb;
ERROR:  conflicting or redundant options
LINE 1: create user foo with createdb login createdb;
^


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: Support LIKE with nondeterministic collations

2024-05-02 Thread Peter Eisentraut

On 30.04.24 14:39, Daniel Verite wrote:

   postgres=# SELECT '.foo.' like '_oo' COLLATE ign_punct;
?column?
   --
f
   (1 row)

The first two results look fine, but the next one is inconsistent.


This is correct, because '_' means "any single character".  This is 
independent of the collation.


I think with nondeterministic collations, the single-character wildcard 
is often not going to be all that useful.
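
A minimal illustration of that rule with an ordinary deterministic collation
(the point being that the wildcard semantics do not depend on the collation):

-- '_' always consumes exactly one character of the input string
SELECT 'foo'   LIKE '_oo';   -- true:  '_' matches 'f'
SELECT '.foo.' LIKE '_oo';   -- false: the pattern describes a 3-character
                             --        string, the input has 5 characters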






Re: EXPLAIN redundant options

2024-05-02 Thread David G. Johnston
On Thu, May 2, 2024 at 6:17 AM jian he  wrote:

> explain (verbose, verbose off, analyze on, analyze off, analyze on)
>
>
I would just update this paragraph to note the last one wins behavior.

"When the option list is surrounded by parentheses, the options can be
written in any order.  However, if options are repeated the last one listed
is used."

I have no desire to introduce breakage here.  The implemented concept is
actually quite common.  The inconsistency with COPY seems like a minor
point.  It would influence my green field choice but not enough for
changing long-standing behavior.

David J.


EXPLAIN redundant options

2024-05-02 Thread jian he
hi.

just found out we can:
explain (verbose, verbose off, analyze on, analyze off, analyze on)
select count(*) from tenk1;

similar to COPY, do we want to error out these redundant options?




Re: Removing unneeded self joins

2024-05-02 Thread Richard Guo
On Thu, May 2, 2024 at 6:08 PM Alexander Korotkov 
wrote:

> On Thu, May 2, 2024 at 12:45 PM Andrei Lepikhov
>  wrote:
> > One question for me is: Do we anticipate other lateral self-references
> > except the TABLESAMPLE case? Looking into the extract_lateral_references
> > implementation, I see the only RTE_SUBQUERY case to be afraid of. But we
> > pull up subqueries before extracting lateral references. So, if we have
> > a reference to a subquery, it means we will not flatten this subquery
> > and don't execute SJE. Do we need more code, as you have written in the
> > first patch?
>
> I think my first patch was crap anyway.  Your explanation seems
> reasonable to me.  I'm not sure this requires any more code.  Probably
> it would be enough to add more comments about this.


The tablesample case is not the only factor that can cause a relation to
have a lateral dependency on itself after self-join removal.  It can
also happen with PHVs.  As an example, consider

explain (costs off)
select * from t t1
left join lateral
  (select t1.a as t1a, * from t t2) t2
on true
where t1.a = t2.a;
server closed the connection unexpectedly

This is because after self-join removal, a PlaceHolderInfo's ph_lateral
might contain rels mentioned in ph_eval_at, which we should get rid of.

For the tablesample case, I agree that we should not consider relations
with TABLESAMPLE clauses as candidates to be removed.  Removing such a
relation could potentially change the syntax of the query, as shown by
Alexander's example.  It seems to me that we can just check that in
remove_self_joins_recurse, while we're collecting the base relations
that are considered to be candidates for removal.

This leads to the attached patch.  This patch also includes some code
refactoring for the surrounding code.

Thanks
Richard


v1-0001-Fix-bogus-lateral-dependency-after-self-join-removal.patch
Description: Binary data


Re: Type and CAST error on lowest negative integer values for smallint, int and bigint

2024-05-02 Thread Hans Buschmann
Thank you for your quick response.


This was very helpful and resolved my problem.


But for me it is a bit counterintuitive that -32768 is not treated as a
negative constant but as a unary minus applied to a positive constant.


It could be helpful to remind the user of the nature of negative constants and
the high precedence of casts in the numeric type section of the docs.


Perhaps someone could add an index topic "precedence of operators", since this 
is very important information for every computer language.

(I just found out that a topic of "operator precedence" exists already in the 
index)


I just sent this message in case this is a problem that should be resolved
before the next minor versions scheduled for next week.


Regards


Hans Buschmann



From: David Rowley 
Sent: Thursday, 2 May 2024 13:33
To: Hans Buschmann
Cc: pgsql-hack...@postgresql.org
Subject: Re: Type and CAST error on lowest negative integer values for
smallint, int and bigint

On Thu, 2 May 2024 at 23:25, Hans Buschmann  wrote:
> postgres=# select -32768::smallint;
> ERROR:  smallint out of range

The precedence order of operations applies the cast before the unary
minus operator.

Any of the following will work:

postgres=# select cast(-32768 as smallint), (-32768)::smallint,
'-32768'::smallint;
  int2  |  int2  |  int2
++
 -32768 | -32768 | -32768
(1 row)

David


Re: Typos in the code and README

2024-05-02 Thread Alexander Lakhin

Hello,

28.04.2024 11:05, David Rowley wrote:


On Sat, 20 Apr 2024 at 16:09, David Rowley  wrote:

Here are a few more to see if it motivates anyone else to do a more
thorough search for another batch.

I've pushed these now.


Please look also at the list of other typos and inconsistencies I've found
on the master branch:
additional_notnulls -> old_notnulls (cf. b0e96f311)
ATAddCheckConstraint ->? ATAddCheckNNConstraint (cf. b0e96f311)
bt_page_check -> bt_target_page_check (cf. 5ae208720)
calblack_arg -> callback_arg
combinig -> combining
compaining -> complaining
ctllock - remove (cf. 53c2a97a9)
dabatase -> database
eval_const_exprs_mutator -> eval_const_expressions_mutator
ExecEndValuesScan - remove (cf. d060e921e)
get_json_nested_columns -> get_json_table_nested_columns
io_direct -> debug_io_direct
iS ->? IS (a neatnik-ism)
joing ->? join (cf. 20f90a0e4)
_jumbleRangeTblEntry - remove (cf. 367c989cd)
Mallroy -> Mallory
onstead -> instead
procedual -> procedural
psql_safe -> safe_psql
read_quoted_pattern -> read_quoted_string
repertiore -> repertoire
rightfirstdataoffset ->? rightfirstoffset (cf. 5ae208720)
Sincle ->? Since (perhaps the whole sentence could be rephrased)
sslnegotition=direct/requiredirectwould -> sslnegotiation=direct/requiredirect 
would
sslnegotitation -> sslnegotiation
walreciever -> walreceiver
xid_wrapround -> xid_wraparound

(some of them are located in doc/, so it's not a code-only change)
I've attached the patch for your convenience, though maybe some
of the suggestions are to be discarded.

Best regards,
Alexanderdiff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 20da4a46ba..70f65b645a 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -1086,7 +1086,7 @@ bt_entry_unique_check(BtreeCheckState *state, IndexTuple itup,
 
 	/*
 	 * Current tuple has no posting list. If TID is visible save info about it
-	 * for the next comparisons in the loop in bt_page_check(). Report
+	 * for the next comparisons in the loop in bt_target_page_check(). Report
 	 * duplicate if lVis_tid is already valid.
 	 */
 	else
@@ -1953,7 +1953,7 @@ bt_target_page_check(BtreeCheckState *state)
  * Note that !readonly callers must reverify that target page has not
  * been concurrently deleted.
  *
- * Save rightfirstdataoffset for detailed error message.
+ * Save rightfirstoffset for detailed error message.
  */
 static BTScanInsert
 bt_right_page_check_scankey(BtreeCheckState *state, OffsetNumber *rightfirstoffset)
diff --git a/doc/src/sgml/regress.sgml b/doc/src/sgml/regress.sgml
index d7e78204ad..f4b00399e4 100644
--- a/doc/src/sgml/regress.sgml
+++ b/doc/src/sgml/regress.sgml
@@ -331,7 +331,7 @@ make check-world PG_TEST_EXTRA='kerberos ldap ssl load_balance libpq_encryption'
  xid_wraparound
  
   
-   Runs the test suite under src/test/module/xid_wrapround.
+   Runs the test suite under src/test/module/xid_wraparound.
Not enabled by default because it is resource intensive.
   
  
diff --git a/doc/src/sgml/targets-meson.txt b/doc/src/sgml/targets-meson.txt
index 1c42dd3028..bd470c87a7 100644
--- a/doc/src/sgml/targets-meson.txt
+++ b/doc/src/sgml/targets-meson.txt
@@ -11,7 +11,7 @@ Code Targets:
   backend   Build backend and related modules
   bin   Build frontend binaries
   contrib   Build contrib modules
-  plBuild procedual languages
+  plBuild procedural languages
 
 Developer Targets:
   reformat-dat-filesRewrite catalog data files into standard format
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index bf28400dd8..d0f3848c95 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2598,7 +2598,7 @@ _brin_parallel_heapscan(BrinBuildState *state)
  *
  * After waiting for all workers to finish, merge the per-worker results into
  * the complete index. The results from each worker are sorted by block number
- * (start of the page range). While combinig the per-worker results we merge
+ * (start of the page range). While combining the per-worker results we merge
  * summaries for the same page range, and also fill-in empty summaries for
  * ranges without any tuples.
  *
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index f99ec38a4a..77b05cc0a7 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -228,7 +228,6 @@ SimpleLruAutotuneBuffers(int divisor, int max)
  * name: name of SLRU.  (This is user-visible, pick with care!)
  * nslots: number of page slots to use.
  * nlsns: number of LSN groups per page (set to zero if not relevant).
- * ctllock: LWLock to use to control access to the shared control structure.
  * subdir: PGDATA-relative subdirectory that will contain the files.
  * buffer_tranche_id: tranche ID to use for the SLRU's 

Re: Type and CAST error on lowest negative integer values for smallint, int and bigint

2024-05-02 Thread David Rowley
On Thu, 2 May 2024 at 23:25, Hans Buschmann  wrote:
> postgres=# select -32768::smallint;
> ERROR:  smallint out of range

The precedence order of operations applies the cast before the unary
minus operator.

Any of the following will work:

postgres=# select cast(-32768 as smallint), (-32768)::smallint,
'-32768'::smallint;
  int2  |  int2  |  int2
++
 -32768 | -32768 | -32768
(1 row)
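
Spelled out, the failing form is equivalent to applying the cast first
(a small sketch of how the expression is parsed):

postgres=# select -(32768::smallint);  -- same parse as -32768::smallint
ERROR:  smallint out of range
postgres=# select (-32768)::smallint;  -- negate first, then cast
  int2
--------
 -32768
(1 row)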

David




Type and CAST error on lowest negative integer values for smallint, int and bigint

2024-05-02 Thread Hans Buschmann
I tried to initialize a table with values for smallint columns.

The final goal is to get mask values for logical operations.


The insert failed with ERROR: smallint out of range.


the same occurs when using a simple select statement like:

select -32768::smallint;
select -2147483648::int;
select -9223372036854775808::bigint;

These limit values are taken from the documentation 8.1.1 Integer Types

This occurs on PG16.2 on Windows or Linux (64bit).

This prevents me from entering the binary value 0b1000_ into a smallint
column
(I could use some other tricks, but this is ugly!)

---

postgres=# select version ();
 version
--
 PostgreSQL 16.2 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 14.0.1 20240411 
(Red Hat 14.0.1-0), 64-bit
(1 Zeile)

postgres=# select -32768::smallint;
ERROR:  smallint out of range

Thank you for looking


Hans Buschmann


Re: Reducing the log spam

2024-05-02 Thread Jelte Fennema-Nio
On Thu, 2 May 2024 at 13:08, Jelte Fennema-Nio  wrote:
> 2. Change the newly added check in errcode() to only set
> output_to_server to false when IsLogicalWorker() returns false.

Actually a third, and probably even better solution would be to only
apply this new GUC to non-backgroundworker processes. That seems quite
reasonable, since often the only way to access background worker
errors is through the logs.




Re: Reducing the log spam

2024-05-02 Thread Jelte Fennema-Nio
On Thu, 2 May 2024 at 12:47, Laurenz Albe  wrote:
> Yes.  But I'd argue that that is a shortcoming of logical replication:
> there should be a way to get this information via SQL. Having to look into
> the log file is not a very useful option.

Definitely agreed that accessing the error details using SQL would be
much better. But having no way at all (by default) to find the cause
of the failure is clearly much worse.
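
For reference (an aside, not part of the patch): the SQL-visible information
today is limited to per-subscription error counters, which is why the server
log is currently the only place holding the details:

-- counts only; the offending statement and key are not available here
SELECT subname, apply_error_count, sync_error_count
FROM pg_stat_subscription_stats;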

> The feature will become much less useful if unique violations keep getting
> logged.

Agreed. How about changing the patch so that the GUC is not applied to
logical replication apply workers (and document that accordingly). I
can think of two ways of achieving that (but there might be
other/better ones):
1. Set the GUC to empty string when an apply worker is started.
2. Change the newly added check in errcode() to only set
output_to_server to false when IsLogicalWorker() returns false.




Re: Reducing the log spam

2024-05-02 Thread Laurenz Albe
On Mon, 2024-03-11 at 09:33 +0100, Jelte Fennema-Nio wrote:
> -   the subscriber's server log.
> +   the subscriber's server log if you remove 23505 from
> +   .
> 
> This seems like a pretty big regression. Being able to know why your
> replication got closed seems pretty critical.

Yes.  But I'd argue that that is a shortcoming of logical replication:
there should be a way to get this information via SQL.  Having to look into
the log file is not a very useful option.

The feature will become much less useful if unique violations keep getting
logged.

Yours,
Laurenz Albe




Re: Removing unneeded self joins

2024-05-02 Thread Alexander Korotkov
On Thu, May 2, 2024 at 12:45 PM Andrei Lepikhov
 wrote:
>
> On 5/1/24 18:59, Alexander Korotkov wrote:
> > I think we probably could forbid SJE for the tables with TABLESAMPLE
> > altogether.  Please, check the attached patch.
> Your patch looks good to me. I added some comments and test case into
> the join.sql.

Thank you

> One question for me is: Do we anticipate other lateral self-references
> except the TABLESAMPLE case? Looking into the extract_lateral_references
> implementation, I see the only RTE_SUBQUERY case to be afraid of. But we
> pull up subqueries before extracting lateral references. So, if we have
> a reference to a subquery, it means we will not flatten this subquery
> and don't execute SJE. Do we need more code, as you have written in the
> first patch?

I think my first patch was crap anyway.  Your explanation seems
reasonable to me.  I'm not sure this requires any more code.  Probably
it would be enough to add more comments about this.

--
Regards,
Alexander Korotkov




Re: Removing unneeded self joins

2024-05-02 Thread Andrei Lepikhov

On 5/1/24 18:59, Alexander Korotkov wrote:

I think we probably could forbid SJE for the tables with TABLESAMPLE
altogether.  Please, check the attached patch.
Your patch looks good to me. I added some comments and test case into 
the join.sql.


One question for me is: Do we anticipate other lateral self-references 
except the TABLESAMPLE case? Looking into the extract_lateral_references 
implementation, I see the only RTE_SUBQUERY case to be afraid of. But we 
pull up subqueries before extracting lateral references. So, if we have 
a reference to a subquery, it means we will not flatten this subquery 
and don't execute SJE. Do we need more code, as you have written in the 
first patch?


--
regards,
Andrei Lepikhov
Postgres Professional
From dac8afd2095586921dfcf5564e7f2979e89b77de Mon Sep 17 00:00:00 2001
From: "Andrei V. Lepikhov" 
Date: Thu, 2 May 2024 16:17:52 +0700
Subject: [PATCH] Forbid self-join elimination on table sampling scans.

Having custom table sampling methods, we can get stuck in unpredictable issues
when replacing a join with a scan operation. It might make sense to analyse possible
situations and enable SJE, but the real profit from this operation looks
too low.
---
 src/backend/optimizer/plan/analyzejoins.c |  8 
 src/backend/optimizer/plan/initsplan.c|  5 -
 src/test/regress/expected/join.out| 13 +
 src/test/regress/sql/join.sql |  5 +
 4 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c
index 506fccd20c..bb89558dcd 100644
--- a/src/backend/optimizer/plan/analyzejoins.c
+++ b/src/backend/optimizer/plan/analyzejoins.c
@@ -2148,6 +2148,14 @@ remove_self_joins_one_group(PlannerInfo *root, Relids relids)
 			Assert(root->simple_rte_array[k]->relid ==
    root->simple_rte_array[r]->relid);
 
+			/*
+			 * To avoid corner cases with table sampling methods just forbid
+			 * SJE for table sampling entries.
+			 */
+			if (root->simple_rte_array[k]->tablesample ||
+root->simple_rte_array[r]->tablesample)
+continue;
+
 			/*
 			 * It is impossible to eliminate join of two relations if they
 			 * belong to different rules of order. Otherwise planner can't be
diff --git a/src/backend/optimizer/plan/initsplan.c b/src/backend/optimizer/plan/initsplan.c
index e2c68fe6f9..bf839bcaf6 100644
--- a/src/backend/optimizer/plan/initsplan.c
+++ b/src/backend/optimizer/plan/initsplan.c
@@ -415,7 +415,10 @@ extract_lateral_references(PlannerInfo *root, RelOptInfo *brel, Index rtindex)
 	if (!rte->lateral)
 		return;
 
-	/* Fetch the appropriate variables */
+	/* Fetch the appropriate variables.
+	 * Changes in this place may need changes in SJE logic, see
+	 * the remove_self_joins_one_group routine.
+	 */
 	if (rte->rtekind == RTE_RELATION)
 		vars = pull_vars_of_level((Node *) rte->tablesample, 0);
 	else if (rte->rtekind == RTE_SUBQUERY)
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 8b640c2fc2..63143fe55f 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -6900,6 +6900,19 @@ where s1.x = 1;
Filter: (1 = 1)
 (9 rows)
 
+-- Check that SJE doesn't touch TABLESAMPLE joins
+explain (costs off)
+SELECT * FROM emp1 t1 NATURAL JOIN LATERAL
+  (SELECT * FROM emp1 t2 TABLESAMPLE SYSTEM (t1.code));
+ QUERY PLAN  
+-
+ Nested Loop
+   ->  Seq Scan on emp1 t1
+   ->  Sample Scan on emp1 t2
+ Sampling: system (t1.code)
+ Filter: ((t1.id = id) AND (t1.code = code))
+(5 rows)
+
 -- Check that PHVs do not impose any constraints on removing self joins
 explain (verbose, costs off)
 select * from emp1 t1 join emp1 t2 on t1.id = t2.id left join
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index c4c6c7b8ba..184fd0876b 100644
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -2652,6 +2652,11 @@ select 1 from emp1 t1 left join
 on true
 where s1.x = 1;
 
+-- Check that SJE doesn't touch TABLESAMPLE joins
+explain (costs off)
+SELECT * FROM emp1 t1 NATURAL JOIN LATERAL
+  (SELECT * FROM emp1 t2 TABLESAMPLE SYSTEM (t1.code));
+
 -- Check that PHVs do not impose any constraints on removing self joins
 explain (verbose, costs off)
 select * from emp1 t1 join emp1 t2 on t1.id = t2.id left join
-- 
2.39.2



Re: Extend ALTER DEFAULT PRIVILEGES for large objects

2024-05-02 Thread Yugo NAGATA
On Fri, 26 Apr 2024 12:23:45 +0200
Matthias van de Meent  wrote:

> On Fri, 26 Apr 2024 at 10:54, Yugo NAGATA  wrote:
> >
> > On Wed, 24 Apr 2024 16:08:39 -0500
> > Nathan Bossart  wrote:
> >
> > > On Tue, Apr 23, 2024 at 11:47:38PM -0400, Tom Lane wrote:
> > > > On the whole I find this proposed feature pretty unexciting
> > > > and dubiously worthy of the implementation/maintenance effort.
> > >
> > > I don't have any particularly strong feelings on $SUBJECT, but I'll admit
> > > I'd be much more interested in resolving any remaining reasons folks are
> > > using large objects over TOAST.  I see a couple of reasons listed in the
> > > docs [0] that might be worth examining.
> > >
> > > [0] https://www.postgresql.org/docs/devel/lo-intro.html
> >
> > If we could replace large objects with BYTEA in any use cases, large objects
> > would be completely obsolete. However, currently some users use large 
> > objects
> > in fact, so improvement in this feature seems beneficial for them.
> >
> >
> > Apart from that, extending TOAST to support more than 1GB data and
> > stream-style access seems a good challenge. I don't know if there was a
> > proposal for this in past. This is  just a thought, for this purpose, we
> > will need a new type of varlena that can contains large size information,
> > and a new toast table schema that can store offset information or some way
> > to convert a offset to chunk_seq.
> 
> If you're interested in this, you may want to check out [0] and [1] as
> threads on the topic of improving TOAST handling of large values ([1]
> being a thread where the limitations of our current external TOAST
> pointer became clear once more), and maybe talk with Aleksander
> Alekseev and Nikita Malakhov. They've been working closely with
> systems that involve toast pointers and their limitations.
> 
> The most recent update on the work of Nikita (reworking TOAST
> handling) [2] is that he got started adapting their externally
> pluggable toast into type-internal methods only, though I've not yet
> noticed any updated patches appear on the list.

Thank you for your information. I'll check the threads you mentioned.

> As for other issues with creating larger TOAST values:
> TOAST has a value limit of ~1GB, which means a single large value (or
> two, for that matter) won't break anything in the wire protocol, as
> DataRow messages have a message size field of uint32 [^3]. However, if
> we're going to allow even larger values to be stored in table's
> attributes, we'll have to figure out how we're going to transfer those
> larger values to (and from) clients. For large objects, this is much
> less of an issue because the IO operations are already chunked by
> design, but this may not work well for types that you want to use in
> your table's columns.

I overlooked this issue. I faced a similar issue when I tried to
pg_dump large text values, although the error was raised from
enlargeStringInfo() in that case.

Regards,
Yugo Nagata


> Kind regards,
> 
> Matthias van de Meent
> 
> [0] 
> https://postgr.es/m/flat/CAN-LCVMq2X%3Dfhx7KLxfeDyb3P%2BBXuCkHC0g%3D9GF%2BJD4izfVa0Q%40mail.gmail.com
> [1] 
> https://postgr.es/m/flat/CAJ7c6TOtAB0z1UrksvGTStNE-herK-43bj22%3D5xVBg7S4vr5rQ%40mail.gmail.com
> [2] 
> https://postgr.es/m/CAN-LCVOgMrda9hOdzGkCMdwY6dH0JQa13QvPsqUwY57TEn6jww%40mail.gmail.com
> 
> [^3] Most, if not all PostgreSQL wire protocol messages have this
> uint32 message size field, but the DataRow one is relevant here as
> it's the one way users get their data out of the database.


-- 
Yugo NAGATA 




Re: Weird test mixup

2024-05-02 Thread Andrey M. Borodin



> On 2 May 2024, at 13:43, Michael Paquier  wrote:
> 
> A detach is not a wakeup.

Oh, now I see. Sorry for the noise.

Detaching a local injection point of another backend seems to be useless and
can be forbidden.
As far as I understand, your patch is already doing this in
+   if (!injection_point_allowed(name))
+   elog(ERROR, "cannot detach injection point \"%s\" not allowed 
to run",
+name);
+

As far as I understand, this will effectively forbid calling
injection_points_detach() for a local injection point of another backend. Do I
get it right?


Best regards, Andrey Borodin.



Re: Weird test mixup

2024-05-02 Thread Michael Paquier
On Thu, May 02, 2024 at 01:33:45PM +0500, Andrey M. Borodin wrote:
> That seems to prevent meaningful use case. If we want exactly one
> session to be waiting just before some specific point, the only way
> to achieve this is to create local injection point. But the session
> must be resumable from another session.
> Without this local waiting injection points are meaningless.

I am not quite sure I follow your argument here.  It is still
possible to attach a local injection point with a wait callback that
can be awaken by a different backend:
s1: select injection_points_set_local();
s1: select injection_points_attach('popo', 'wait');
s1: select injection_points_run('popo'); -- waits
s2: select injection_points_wakeup('popo');
s1: -- ready for action.

A detach is not a wakeup.
--
Michael


signature.asc
Description: PGP signature


Re: Weird test mixup

2024-05-02 Thread Andrey M. Borodin



> On 2 May 2024, at 12:27, Michael Paquier  wrote:
> 
> On Wed, May 01, 2024 at 04:12:14PM -0700, Noah Misch wrote:
>> While writing an injection point test, I encountered a variant of the race
>> condition that f4083c4 fixed.  It had three sessions and this sequence of
>> events:
>> 
>> s1: local-attach to POINT
>> s2: enter InjectionPointRun(POINT), yield CPU just before 
>> injection_callback()
>> s3: detach POINT, deleting the InjectionPointCondition record
>> s2: wake up and run POINT as though it had been non-local
> 
> Fun.  One thing I would ask is why it makes sense to be able to detach
> a local point from a different session than the one who defined it as
> local.  Shouldn't the operation of s3 be restricted rather than
> authorized as a safety measure, instead?

That seems to prevent a meaningful use case. If we want exactly one session to
be waiting just before some specific point, the only way to achieve this is to
create a local injection point. But the session must be resumable from another
session.
Without this, local waiting injection points are meaningless.


Best regards, Andrey Borodin.



Re: [PoC] Reducing planning time when tables have many partitions

2024-05-02 Thread Yuya Watari
Hello Ashutosh,

Thank you for your email and for reviewing the patch. I sincerely
apologize for the delay in responding.

On Wed, Mar 6, 2024 at 11:16 PM Ashutosh Bapat
 wrote:
> here's summary of observations
> 1. The patch improves planning time significantly (3X to 20X) and the 
> improvement increases with the number of tables being joined.
> 2. In the assert enabled build the patch slows down (in comparison to HEAD) 
> planning with higher number of tables in the join. You may want to 
> investigate this. But this is still better than my earlier measurements.
> 3. The patch increased memory consumption by planner. But the numbers have 
> improved since my last measurement. Still it will be good to investigate what 
> causes this extra memory consumption.
> 4. Generally with the assert enabled build planner consumes more memory with 
> or without patch. From my previous experience this might be due to Bitmapset 
> objects created within Assert() calls.

Thank you for testing the patch and sharing the results. For comment
#1, these results show the effectiveness of the patch.

For comment #2, I agree that we should not slow down assert-enabled
builds. The patch adds a lot of assertions to avoid adding bugs, but
they might be too excessive. I will reconsider these assertions and
remove unnecessary ones.

For comments #3 and #4, while the patch improves time complexity, it
has some negative impacts on space complexity. The patch uses a
Bitmapset-based index to speed up searching for EquivalenceMembers and
RestrictInfos. Reducing this memory consumption is a little hard, but
this is a very important problem in committing this patch, so I will
investigate this further.

> Does v24-0002 have any relation/overlap with my patches to reduce memory 
> consumed by RestrictInfos? Those patches have code to avoid creating 
> duplicate RestrictInfos (including commuted RestrictInfos) from ECs. [2]

Thank you for sharing these patches. My patch may be related to your
patches. My patch speeds up slow linear searches over
EquivalenceMembers and RestrictInfos. It uses several approaches, one
of which is the Bitmapset-based index. Bitmapsets are space
inefficient, so if there are many EquivalenceMembers and
RestrictInfos, this index becomes large. This is true for highly
partitioned cases, where there are a lot of similar (or duplicate)
elements. Eliminating such duplicate elements may help my patch reduce
memory consumption. I will investigate this further.

Unfortunately, I've been busy due to work, so I may not be able to
respond soon. I really apologize for this. However, I will look into
the patches, including yours, and share further information if found.

Again, I apologize for my late response and appreciate your kind review.

-- 
Best regards,
Yuya Watari




Re: Weird test mixup

2024-05-02 Thread Michael Paquier
On Wed, May 01, 2024 at 04:12:14PM -0700, Noah Misch wrote:
> While writing an injection point test, I encountered a variant of the race
> condition that f4083c4 fixed.  It had three sessions and this sequence of
> events:
> 
> s1: local-attach to POINT
> s2: enter InjectionPointRun(POINT), yield CPU just before injection_callback()
> s3: detach POINT, deleting the InjectionPointCondition record
> s2: wake up and run POINT as though it had been non-local

Fun.  One thing I would ask is why it makes sense to be able to detach
a local point from a different session than the one who defined it as
local.  Shouldn't the operation of s3 be restricted rather than
authorized as a safety measure, instead?

> On Sat, Mar 16, 2024 at 08:40:21AM +0900, Michael Paquier wrote:
>> On Fri, Mar 15, 2024 at 11:23:31AM +0200, Heikki Linnakangas wrote:
>>> Wrt. the spinlock and shared memory handling, I think this would be simpler
>>> if you could pass some payload in the InjectionPointAttach() call, which
>>> would be passed back to the callback function:
>>> 
>>> In this case, the payload would be the "slot index" in shared memory.
>>> 
>>> Or perhaps always allocate, say, 1024 bytes of working area for every
>>> attached injection point that the test module can use any way it wants. Like
>>> for storing extra conditions, or for the wakeup counter stuff in
>>> injection_wait(). A fixed size working area is a little crude, but would be
>>> very handy in practice.
> 
> That would be one good way to solve it.  (Storing a slot index has the same
> race condition, but it fixes the race to store a struct containing the PID.)
> 
> The best alternative I see is to keep an InjectionPointCondition forever after
> creating it.  Give it a "bool valid" field that we set on detach.  I don't see
> a major reason to prefer one of these over the other.  One puts a negligible
> amount of memory pressure on the main segment, but it simplifies the module
> code.  I lean toward the "1024 bytes of working area" idea.  Other ideas or
> opinions?

If more state data is needed, the fixed area injection_point.c would
be better.  Still, I am not sure that this is required here, either.

> Separately, injection_points_cleanup() breaks the rules by calling
> InjectionPointDetach() while holding a spinlock.  The latter has an
> elog(ERROR), and reaching that elog(ERROR) leaves a stuck spinlock.  I haven't
> given as much thought to solutions for this one.

Indeed.  That's a brain fade.  This one could be fixed by collecting
the point names when cleaning up the conditions and detaching after
releasing the spinlock.  This opens a race condition between the
moment when the spinlock is released and the detach, where another
backend could come in and detach a point before the shmem_exit
callback has the time to do its cleanup, even if detach() is
restricted for local points.  So we could do the callback cleanup in
three steps in the shmem exit callback: 
- Collect the names of the points to detach, while holding the
spinlock.
- Do the detach.
- Take the spinlock again and clean up the conditions.

Please see the attached.
--
Michael
diff --git a/src/test/modules/injection_points/injection_points.c b/src/test/modules/injection_points/injection_points.c
index ace386da50..caf58ef9db 100644
--- a/src/test/modules/injection_points/injection_points.c
+++ b/src/test/modules/injection_points/injection_points.c
@@ -159,10 +159,39 @@ injection_point_allowed(const char *name)
 static void
 injection_points_cleanup(int code, Datum arg)
 {
+	char		names[INJ_MAX_CONDITION][INJ_NAME_MAXLEN] = {0};
+	int			count = 0;
+
 	/* Leave if nothing is tracked locally */
 	if (!injection_point_local)
 		return;
 
+	/*
+	 * This is done in three steps: detect the points to detach,
+	 * detach them and release their conditions.
+	 */
+	SpinLockAcquire(_state->lock);
+	for (int i = 0; i < INJ_MAX_CONDITION; i++)
+	{
+		InjectionPointCondition *condition = _state->conditions[i];
+
+		if (condition->name[0] == '\0')
+			continue;
+
+		if (condition->pid != MyProcPid)
+			continue;
+
+		/* Extract the point name to detach */
+		strlcpy(names[count], condition->name, INJ_NAME_MAXLEN);
+		count++;
+	}
+	SpinLockRelease(_state->lock);
+
+	/* Detach, without holding the spinlock */
+	for (int i = 0; i < count; i++)
+		InjectionPointDetach(names[i]);
+
+	/* Clear all the conditions */
 	SpinLockAcquire(_state->lock);
 	for (int i = 0; i < INJ_MAX_CONDITION; i++)
 	{
@@ -174,8 +203,6 @@ injection_points_cleanup(int code, Datum arg)
 		if (condition->pid != MyProcPid)
 			continue;
 
-		/* Detach the injection point and unregister condition */
-		InjectionPointDetach(condition->name);
 		condition->name[0] = '\0';
 		condition->pid = 0;
 	}
@@ -403,6 +430,10 @@ injection_points_detach(PG_FUNCTION_ARGS)
 {
 	char	   *name = text_to_cstring(PG_GETARG_TEXT_PP(0));
 
+	if (!injection_point_allowed(name))
+		elog(ERROR, "cannot detach injection point \"%s\" not allowed to run",
+			 name);
+
 	

Re: New GUC autovacuum_max_threshold ?

2024-05-02 Thread Frédéric Yhuel




Le 01/05/2024 à 20:50, Robert Haas a écrit :

Possibly what we need here is
something other than a cap, where, say, we vacuum a 10GB table twice
as often as now, a 100GB table four times as often, and a 1TB table
eight times as often. Or whatever the right answer is.


IMO, it would make more sense. So maybe something like this:

vacthresh = Min(vac_base_thresh + vac_scale_factor * reltuples, 
vac_base_thresh + vac_scale_factor * sqrt(reltuples) * 1000);


(it could work to compute a score, too, like in David's proposal)
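
A quick back-of-the-envelope check of that formula (a sketch, plugging in the
default autovacuum_vacuum_threshold = 50 and autovacuum_vacuum_scale_factor =
0.2) shows how the sqrt term takes over as tables grow:

SELECT reltuples::bigint,
       (50 + 0.2 * reltuples)::bigint                   AS linear_thresh,
       least(50 + 0.2 * reltuples,
             50 + 0.2 * sqrt(reltuples) * 1000)::bigint AS capped_thresh
FROM (VALUES (1e6), (1e8), (1e9)) AS t(reltuples);
-- 1e6 rows:     200,050 vs   200,050  (the two are equal at the crossover)
-- 1e8 rows:  20,000,050 vs 2,000,050
-- 1e9 rows: 200,000,050 vs 6,324,605  (roughly)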





Re: small documentation fixes related to collations/ICU

2024-05-02 Thread Peter Eisentraut

On 29.04.24 09:18, Kashif Zeeshan wrote:

Looks good.

On Mon, Apr 29, 2024 at 12:05 PM Peter Eisentraut > wrote:


I found two mistakes related to collation and/or ICU support in the
documentation that should probably be fixed and backpatched.  See
attached patches.


Committed, thanks.





Limit index pages visited in planner's get_actual_variable_range

2024-05-02 Thread Rian McGuire
Hi hackers,

It seems the get_actual_variable_range function has a long history of
fixes attempting to improve its worst-case behaviour, most recently in
9c6ad5eaa95, which limited the number of heap page fetches to 100.
There's currently no limit on the number of index pages fetched.

We managed to get into trouble after deleting a large number of rows
(~8 million) from the start of an index, which caused planning time to
blow up on a hot (~250 calls/s) query. During the incident `explain
(analyze, buffers)` looked like this:
Planning:
  Buffers: shared hit=88311
Planning Time: 249.902 ms
Execution Time: 0.066 ms

The planner was burning a huge amount of CPU time looking through
index pages for the first visible tuple. The problem eventually
resolved when the affected index was vacuumed, but that took several
hours to complete.

There's a reproduction with a smaller dataset below.

Our current workaround to safely bulk delete from these large tables
involves delaying deletion of the minimum row until after a vacuum has
run, so there's always a visible tuple near the start of the index.
It's not realistic for us to run vacuums more frequently (ie. after
deleting a smaller number of rows) because they're so time-consuming.
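
A minimal sketch of that workaround, using the repro table below (the real
queries on our side are more involved):

-- delete everything in the target range except the current minimum row,
-- so the index always keeps a visible tuple near its left end
DELETE FROM test
WHERE id <= 4000000
  AND id > (SELECT min(id) FROM test);
-- once a vacuum has run, the spared minimum row can be deleted as well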

The previous discussion [1] touched on the idea of also limiting the
number of index page fetches, but there were doubts about the safety
of back-patching and the ugliness of modifying the index AM API to
support this.

I would like to submit our experience as evidence that the lack of
limit on index page fetches is a real problem. Even if a fix for this
doesn't get back-patched, it would be nice to see it in a major
version.

As a starting point, I've updated the WIP index page limit patch from
Simon Riggs [2] to apply cleanly to master.

Regards,
Rian

[1] 
https://www.postgresql.org/message-id/flat/CAKZiRmznOwi0oaV%3D4PHOCM4ygcH4MgSvt8%3D5cu_vNCfc8FSUug%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/CANbhV-GUAo5cOw6XiqBjsLVBQsg%2B%3DkpcCCWYjdTyWzLP28ZX-Q%40mail.gmail.com

=# create table test (id bigint primary key) with (autovacuum_enabled = 'off');
=# insert into test select generate_series(1,1000);
=# analyze test;

An explain with no dead tuples looks like this:
=# explain (analyze, buffers) select id from test where id in (select
id from test order by id desc limit 1);
Planning:
  Buffers: shared hit=8
Planning Time: 0.244 ms
Execution Time: 0.067 ms

But if we delete a large number of rows from the start of the index:
=# delete from test where id <= 400;

The performance doesn't become unreasonable immediately - it's limited
to visiting 100 heap pages while it's setting killed bits on the index
tuples:
=# explain (analyze, buffers) select id from test where id in (select
id from test order by id desc limit 1);
Planning:
  Buffers: shared hit=1 read=168 dirtied=163
Planning Time: 5.910 ms
Execution Time: 0.107 ms

But the number of index buffers visited increases on each query, and
eventually all the killed bits are set:
$ for i in {1..500}; do psql test -c 'select id from test where id in
(select id from test order by id desc limit 1)' >/dev/null; done
=# explain (analyze, buffers) select id from test where id in (select
id from test order by id desc limit 1);
Planning:
  Buffers: shared hit=11015
Planning Time: 35.772 ms
Execution Time: 0.070 ms

With the patch:
=# explain (analyze, buffers) select id from test where id in (select
id from test order by id desc limit 1);
Planning:
  Buffers: shared hit=107
Planning Time: 0.377 ms
Execution Time: 0.045 ms
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index dcd04b813d..8d97a5b0c1 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -333,6 +333,9 @@ index_beginscan_internal(Relation indexRelation,
 	scan->parallel_scan = pscan;
 	scan->xs_temp_snap = temp_snap;
 
+	scan->xs_page_limit = 0;
+	scan->xs_pages_visited = 0;
+
 	return scan;
 }
 
@@ -366,6 +369,9 @@ index_rescan(IndexScanDesc scan,
 	scan->kill_prior_tuple = false; /* for safety */
 	scan->xs_heap_continue = false;
 
+	scan->xs_page_limit = 0;
+	scan->xs_pages_visited = 0;
+
 	scan->indexRelation->rd_indam->amrescan(scan, keys, nkeys,
 			orderbys, norderbys);
 }
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 57bcfc7e4c..4d8b8f1c83 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -2189,6 +2189,11 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
 BTScanPosInvalidate(so->currPos);
 return false;
 			}
+			if (unlikely(scan->xs_page_limit > 0) && ++scan->xs_pages_visited > scan->xs_page_limit)
+			{
+BTScanPosInvalidate(so->currPos);
+return false;
+			}
 			/* check for interrupts while we're not holding any buffer lock */
 			CHECK_FOR_INTERRUPTS();
 			/* step right one page */
diff --git a/src/backend/utils/adt/selfuncs.c 
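
The selfuncs.c hunk is cut off here.  As a rough sketch of how a caller
such as get_actual_variable_endpoint() could opt into the limit (the
exact call site, variable names and cap value below are illustrative,
not text from the attached patch):

	/*
	 * Sketch only: cap how many index pages the planner's endpoint
	 * probe may visit.  The limit must be set after index_rescan(),
	 * which resets xs_page_limit to 0 ("no limit").
	 */
	index_scan = index_beginscan(heapRel, indexRel,
								 &SnapshotNonVacuumable, 1, 0);
	index_rescan(index_scan, scankeys, 1, NULL, 0);
	index_scan->xs_page_limit = 100;	/* illustrative cap */

	if ((tid = index_getnext_tid(index_scan, indexscandir)) != NULL)
	{
		/* endpoint found within the limit: use it as before */
	}
	else
	{
		/* limit hit or nothing visible: fall back to the histogram bound */
	}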

Re: Partitioned tables and [un]loggedness

2024-05-02 Thread Michael Paquier
On Thu, Apr 25, 2024 at 08:55:27AM +0900, Michael Paquier wrote:
> On Wed, Apr 24, 2024 at 04:43:58PM -0700, David G. Johnston wrote:
>> My point is that if you feel that treating logged as a copy-able property
>> is OK then doing the following should also just work:
>> 
>> postgres=# create temp table parentt ( id integer ) partition by range (id);
>> CREATE TABLE
>> postgres=# create table child10t partition of parentt for values from (0)
>> to (9);
>> ERROR:  cannot create a permanent relation as partition of temporary
>> relation "parentt"
>> 
>> i.e., child10t should be created as a temporary partition under parentt.
> 
> Ah, indeed, I've missed your point here.  Lifting the error and
> inheriting temporary in this case would make sense.

The case of temporary persistence is actually *very* tricky.  The
namespace where the relation is created is guessed and locked, with
permission checks done based on the RangeVar, when the CreateStmt is
transformed, which is before we look at the inheritance tree to find
the partitioned parent.  So we would need to shortcut the existing
RangeVar lookup and include the parents in that loop to find the
correct namespace, and that would have to happen much earlier than it
does now.  The code complexity is not trivial, so I am getting cold
feet about supporting this case in a robust fashion.  For now, I have
set this case aside and focused on the main problem with SET LOGGED
and UNLOGGED.

Switching between logged <-> unlogged does not have such
complications, because the namespace where the relation is created is
going to be the same.  So we won't lock or perform permission checks
on an incorrect namespace.

The addition of LOGGED makes the logic that decides the loggedness of
a partition, based on its partitioned table or on the query, quite
easy to follow, but it needs some safety nets in the sequence, view
and CTAS code paths to handle the case where the query specifies no
relpersistence.
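
As an illustration (sketch only; the object names are made up and this
is not output from the patch), the new keyword would presumably allow
spelling out the default persistence explicitly, in the same position
as UNLOGGED today:

=# create logged table parent_l (id integer) partition by range (id);
=# create logged table child_l partition of parent_l
     for values from (0) to (9);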

I have also looked at support for ONLY, and was surprised that it is
not that complicated.  tablecmds.c has an ATSimpleRecursion() that is
smart enough to do an inheritance tree lookup and apply the rewrites
where they should happen in step 3 of ALTER TABLE, while handling ONLY
on its own.  The relpersistence of partitioned tables is updated in
step 2, with the catalog changes.
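
With that in place, the intended difference would look roughly like
this (a sketch of the expected behavior, not output from the patches;
the table name continues the made-up example above):

=# alter table parent_l set unlogged;
     -- recurses to the partitions once the recursion patch is in
=# alter table only parent_l set unlogged;
     -- updates only the partitioned table's own relpersistence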

Attached is a new patch series:
- 0001 refactors some code around ATPrepChangePersistence() that I
found confusing after applying the operation to partitioned tables.
- 0002 adds support for a LOGGED keyword.
- 0003 expands ALTER TABLE SET [UN]LOGGED to partitioned tables,
without recursion to partitions.
- 0004 adds the recursion logic, expanding regression tests to show
the difference.

0003 and 0004 should be merged together, I think.  Still, splitting
them makes reviews a bit easier.
--
Michael
From 30d572fac2aa58ce2d62ba929c30fded9b020e0b Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Thu, 2 May 2024 08:16:33 +0900
Subject: [PATCH v2 1/4] Refactor some code of ALTER TABLE SET LOGGED/UNLOGGED

This is in preparation for an upcoming patch set to extend the
possibilities in this area, making the code more consistent with the
surroundings related to access methods and tablespaces.
---
 src/backend/commands/tablecmds.c | 38 +---
 1 file changed, 15 insertions(+), 23 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 08c87e6029..baabbf82e7 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -602,7 +602,8 @@ static ObjectAddress ATExecClusterOn(Relation rel, const char *indexName,
 static void ATExecDropCluster(Relation rel, LOCKMODE lockmode);
 static void ATPrepSetAccessMethod(AlteredTableInfo *tab, Relation rel, const char *amname);
 static void ATExecSetAccessMethodNoStorage(Relation rel, Oid newAccessMethod);
-static bool ATPrepChangePersistence(Relation rel, bool toLogged);
+static void ATPrepSetPersistence(AlteredTableInfo *tab, Relation rel,
+ bool toLogged);
 static void ATPrepSetTableSpace(AlteredTableInfo *tab, Relation rel,
 const char *tablespacename, LOCKMODE lockmode);
 static void ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode);
@@ -5037,13 +5038,7 @@ ATPrepCmd(List **wqueue, Relation rel, AlterTableCmd *cmd,
 ereport(ERROR,
 		(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 		 errmsg("cannot change persistence setting twice")));
-			tab->chgPersistence = ATPrepChangePersistence(rel, true);
-			/* force rewrite if necessary; see comment in ATRewriteTables */
-			if (tab->chgPersistence)
-			{
-tab->rewrite |= AT_REWRITE_ALTER_PERSISTENCE;
-tab->newrelpersistence = RELPERSISTENCE_PERMANENT;
-			}
+			ATPrepSetPersistence(tab, rel, true);
 			pass = AT_PASS_MISC;
 			break;
 		case AT_SetUnLogged:	/* SET UNLOGGED */
@@ -5052,13 +5047,7 @@ ATPrepCmd(List **wqueue, Relation rel, AlterTableCmd *cmd,