Re: [Patch] Use *other* indexes on the subscriber when REPLICA IDENTITY is FULL

2023-07-07 Thread Amit Kapila
On Fri, Jul 7, 2023 at 1:31 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> Thank you for your analysis!
>
> >
> Yes, I agree, it is (and was before my patch as well) an undocumented
> limitation of REPLICA IDENTITY FULL.
> And, as far as I can see, my patch actually didn't have any impact on the
> limitation. The unsupported cases are still unsupported, but now the same
> error is thrown in a slightly different place.
> I think that is a minor limitation, but maybe it should be listed [1]?
> >

+1.

>
> Yes, your modification did not touch the restriction. It has existed before 
> the
> commit. I (or my colleague) will post the patch to add the description, maybe
> after [1] is committed.
>

I don't think there is any dependency between the two. So, feel free
to post the patch separately.

-- 
With Regards,
Amit Kapila.




Re: Check lateral references within PHVs for memoize cache keys

2023-07-07 Thread Paul A Jungwirth
On Tue, Jul 4, 2023 at 12:33 AM Richard Guo  wrote:
>
> Rebase the patch on HEAD as cfbot reminds.

All of this seems good to me. I can reproduce the problem, tests pass,
and the change is sensible as far as I can tell.

One adjacent thing I noticed is that when we renamed "Result Cache" to
"Memoize" this bit of the docs in config.sgml got skipped (probably
because of the line break):

Hash tables are used in hash joins, hash-based aggregation, result
cache nodes and hash-based processing of IN
subqueries.

I believe that should say "memoize nodes" instead. Is it worth
correcting that as part of this patch? Or perhaps another one?

Regards,

-- 
Paul  ~{:-)
p...@illuminatedcomputing.com




Re: Add hint message for check_log_destination()

2023-07-07 Thread Michael Paquier
On Fri, Jul 07, 2023 at 07:23:47PM +0800, Japin Li wrote:
> + appendStringInfoString(, "\"stderr\"");
> +#ifdef HAVE_SYSLOG
> + appendStringInfoString(, ", \"syslog\"");
> +#endif
> +#ifdef WIN32
> + appendStringInfoString(, ", \"eventlog\"");
> +#endif
> + appendStringInfoString(, ", \"csvlog\", and 
> \"jsonlog\"");

Hmm.  Is that OK as a translatable string?
--
Michael


signature.asc
Description: PGP signature


Re: Add more sanity checks around callers of changeDependencyFor()

2023-07-07 Thread Michael Paquier
On Thu, Jul 06, 2023 at 10:09:20AM -0700, Andres Freund wrote:
> Well, it adds an exploitation opportunity. If other functions in the extension
> reference the original location (explicitly or via search_path), somebody else
> can create a function there, which might be called from a more privileged
> context. Obviously permissions limit the likelihood of this being a real
> issue.

Yeah..

> I also don't think pg_dump will dump the changed schema, which means a
> dump/restore leads to a different schema - IMO something to avoid.

Yes, you're right here.  The function dumped is restored in the same
schema as the extension.  For instance:
psql postgres << EOF
CREATE SCHEMA test_func_dep1;
CREATE SCHEMA test_func_dep2;
CREATE EXTENSION test_ext_req_schema1 SCHEMA test_func_dep1;
ALTER FUNCTION test_func_dep1.dep_req1() SET SCHEMA test_func_dep2;
EOF
pg_dump -f dump.sql postgres
createdb popo
psql -f dump.sql popo
psql -c '\dx+ test_ext_req_schema1' popo

Objects in extension "test_ext_req_schema1"
         Object description
------------------------------------
 function test_func_dep1.dep_req1()
(1 row)

I am honestly not sure how many restrictions we should have here, as
this could hurt anybody relying on the existing behavior, if there is
anyone.  (It seems that schema modification restrictions for extension
objects would need to go through ExecAlterObjectSchemaStmt().)

Anyway, I think that I'll just go ahead and fix the SET SCHEMA bug, as
that's wrong as it stands.  Regarding these restrictions, perhaps
something could be done on HEAD, though it impacts usability IMO.
--
Michael


signature.asc
Description: PGP signature


Re: add non-option reordering to in-tree getopt_long

2023-07-07 Thread Nathan Bossart
I spent some time tidying up the patch and adding a more detailed commit
message.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 2525e6aed066fe8eafdaab0d33c8c5df055c7e09 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 7 Jul 2023 20:00:47 -0700
Subject: [PATCH v5 1/1] Teach in-tree getopt_long() to move non-options to the
 end of argv.

Unlike the other implementations of getopt_long() I could find, the
in-tree implementation does not reorder non-options to the end of
argv.  Instead, it returns -1 as soon as the first non-option is
found, even if there are other options listed afterwards.  By
moving non-options to the end of argv, getopt_long() can parse all
specified options and return -1 when only non-options remain.
This quirk is periodically missed by hackers (e.g., 869aa40a27,
ffd398021c, and d9ddc50baf).

This commit introduces the aforementioned non-option reordering
behavior to the in-tree getopt_long() implementation.  The only
systems I'm aware of that use it are Windows and AIX, both of which
have been tested.

Special thanks to Noah Misch for his help validating behavior on
AIX.

Reviewed-by: Kyotaro Horiguchi, Michael Paquier
Discussion: https://postgr.es/m/20230609232257.GA121461%40nathanxps13
---
 src/bin/scripts/t/040_createuser.pl | 10 ++--
 src/port/getopt_long.c             | 79 -
 2 files changed, 60 insertions(+), 29 deletions(-)

diff --git a/src/bin/scripts/t/040_createuser.pl b/src/bin/scripts/t/040_createuser.pl
index 9ca282181d..3290e30c88 100644
--- a/src/bin/scripts/t/040_createuser.pl
+++ b/src/bin/scripts/t/040_createuser.pl
@@ -42,9 +42,8 @@ $node->issues_sql_like(
 	'add a role as a member with admin option of the newly created role');
 $node->issues_sql_like(
 	[
-		'createuser', '-m',
-		'regress_user3', '-m',
-		'regress user #4', 'REGRESS_USER5'
+		'createuser', 'REGRESS_USER5', '-m', 'regress_user3',
+		'-m', 'regress user #4'
 	],
 	qr/statement: CREATE ROLE "REGRESS_USER5" NOSUPERUSER NOCREATEDB NOCREATEROLE INHERIT LOGIN NOREPLICATION NOBYPASSRLS ROLE regress_user3,"regress user #4";/,
 	'add a role as a member of the newly created role');
@@ -73,11 +72,14 @@ $node->issues_sql_like(
 	qr/statement: CREATE ROLE regress_user11 NOSUPERUSER NOCREATEDB NOCREATEROLE INHERIT LOGIN NOREPLICATION NOBYPASSRLS IN ROLE regress_user1;/,
 	'--role');
 $node->issues_sql_like(
-	[ 'createuser', '--member-of', 'regress_user1', 'regress_user12' ],
+	[ 'createuser', 'regress_user12', '--member-of', 'regress_user1' ],
 	qr/statement: CREATE ROLE regress_user12 NOSUPERUSER NOCREATEDB NOCREATEROLE INHERIT LOGIN NOREPLICATION NOBYPASSRLS IN ROLE regress_user1;/,
 	'--member-of');
 
 $node->command_fails([ 'createuser', 'regress_user1' ],
 	'fails if role already exists');
+$node->command_fails(
+	[ 'createuser', 'regress_user1', '-m', 'regress_user2', 'regress_user3' ],
+	'fails for too many non-options');
 
 done_testing();
diff --git a/src/port/getopt_long.c b/src/port/getopt_long.c
index c989276988..3fc2cf36e9 100644
--- a/src/port/getopt_long.c
+++ b/src/port/getopt_long.c
@@ -50,8 +50,9 @@
  * This implementation does not use optreset.  Instead, we guarantee that
  * it can be restarted on a new argv array after a previous call returned -1,
  * if the caller resets optind to 1 before the first call of the new series.
- * (Internally, this means we must be sure to reset "place" to EMSG before
- * returning -1.)
+ *
+ * Note that this routine reorders the pointers in argv (despite the const
+ * qualifier) so that all non-options will be at the end when -1 is returned.
  */
 int
 getopt_long(int argc, char *const argv[],
@@ -60,40 +61,68 @@ getopt_long(int argc, char *const argv[],
 {
 	static char *place = EMSG;	/* option letter processing */
 	char	   *oli;			/* option letter list index */
+	static int	nonopt_start = -1;
+	static bool force_nonopt = false;
 
 	if (!*place)
 	{			/* update scanning pointer */
-		if (optind >= argc)
+		for (;;)
 		{
-			place = EMSG;
-			return -1;
-		}
+			char	  **args = (char **) argv;
 
-		place = argv[optind];
+			if (optind >= argc)
+			{
+				place = EMSG;
+				nonopt_start = -1;
+				force_nonopt = false;
+				return -1;
+			}
 
-		if (place[0] != '-')
-		{
-			place = EMSG;
-			return -1;
-		}
+			place = argv[optind];
+
+			/*
+			 * Any string that starts with '-' but is not equivalent to '-' or
+			 * '--' is considered an option.  If we see an option, we can
+			 * immediately break out of the loop since no argument reordering
+			 * is required.  Note that we treat '--' as the end of options and
+			 * immediately force reordering for every subsequent argument.
+			 * This reinstates the original order of all non-options (which
+			 * includes everything after the '--') before we return.
+			 */
+			if (!force_nonopt && place[0] == '-' && place[1])
+			{
+				if (place[1] != '-' || place[2])
+					break;
 
-		place++;
+				optind++;
+				force_nonopt = true;
+				continue;
+		
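
To illustrate the behavior the commit message describes, here is a minimal,
hypothetical caller (not part of the patch).  It is written against a GNU-style
getopt_long with its default permutation behavior (POSIXLY_CORRECT unset), which
is the same behavior this patch adds to the in-tree implementation: after the
option loop, optind points at the first of the reordered non-options, no matter
where they appeared on the command line.

#include <stdio.h>
#include <getopt.h>

int
main(int argc, char *argv[])
{
    static const struct option long_options[] = {
        {"verbose", no_argument, NULL, 'v'},
        {NULL, 0, NULL, 0}
    };
    int         c;

    /* "demo a.txt -v b.txt" and "demo -v a.txt b.txt" behave identically */
    while ((c = getopt_long(argc, argv, "v", long_options, NULL)) != -1)
    {
        if (c == 'v')
            puts("verbose enabled");
    }

    /* everything from optind onward is a non-option argument */
    for (int i = optind; i < argc; i++)
        printf("non-option argument: %s\n", argv[i]);

    return 0;
}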

Re: doc: improve the restriction description of using indexes on REPLICA IDENTITY FULL table.

2023-07-07 Thread Amit Kapila
On Fri, Jul 7, 2023 at 1:36 PM Masahiko Sawada  wrote:
>
> I prefer the first suggestion. I've attached the updated patch.
>

This looks mostly good to me, but I think it would be better if we can
also add the information that the leftmost index column must be a
non-expression. So, how about: "Candidate indexes must be btree,
non-partial, and the leftmost index column must be a non-expression
and a reference to a published table column (i.e. it cannot consist
only of expressions)."?

-- 
With Regards,
Amit Kapila.




Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-07-07 Thread John Naylor
On Fri, Jul 7, 2023 at 2:19 PM Masahiko Sawada 
wrote:
>
> On Wed, Jul 5, 2023 at 8:21 PM John Naylor 
wrote:
> > Well, it's going to be a bit of a mess until I can demonstrate it
working (and working well) with bitmap heap scan. Fixing that now is just
going to create conflicts. I do have a couple small older patches laying
around that were quick experiments -- I think at least some of them should
give a performance boost in loading speed, but haven't had time to test.
Would you like to take a look?
>
> Yes, I can experiment with these patches in the meantime.

Okay, here it is in v36. 0001-6 are same as v35.

0007 removes a wasted extra computation newly introduced by refactoring
growing nodes. 0008 just makes 0011 nicer. Not worth testing by themselves,
but better to be tidy.
0009 is an experiment to get rid of slow memmoves in node4, addressing a
long-standing inefficiency. It looks a bit tricky, but I think it's
actually straightforward after drawing out the cases with pen and paper. It
works if the fanout is either 4 or 5, so we have some wiggle room. This may
give a noticeable boost if the input is reversed or random.
0010 allows RT_EXTEND_DOWN to reduce function calls, so should help with
sparse trees.
0011 reduces function calls when growing the smaller nodes. Not sure about
this one -- possibly worth it for node4 only?

If these help, it'll show up more easily in smaller inputs. Large inputs
tend to be more dominated by RAM latency.

--
John Naylor
EDB: http://www.enterprisedb.com


v36-ART.tar.gz
Description: application/gzip


Re: [PoC] Federated Authn/z with OAUTHBEARER

2023-07-07 Thread Thomas Munro
On Fri, Jul 7, 2023 at 4:57 AM Jacob Champion  wrote:
> On Wed, Jul 5, 2023 at 3:07 PM Thomas Munro  wrote:
> > BTW I will happily do the epoll->kqueue port work if necessary.
>
> And I will happily take you up on that; thanks!

Some initial hacking, about 2 coffees' worth:
https://github.com/macdice/postgres/commits/oauth-kqueue

This compiles on FreeBSD and macOS, but I didn't have time to figure
out all your Python testing magic, so I don't know if it works yet and
it's still red on CI...  One thing I wondered about is the "*altsock =
timerfd" part, which I couldn't do.

The situation on macOS is a little odd: the man page says EVFILT_TIMER
is not implemented.  But clearly it is, we can read the source code as
I had to do to find out which unit of time it defaults to[1] (huh,
Apple's github repo for Darwin appears to have been archived recently
-- no more source code updates?  that'd be a shame!), and it works
exactly as expected in simple programs.  So I would just assume it
works until we see evidence otherwise.  (We already use a couple of
other things on macOS more or less by accident because configure finds
them, where they are undocumented or undeclared.)

[1] https://github.com/apple/darwin-xnu/blob/main/bsd/kern/kern_event.c#L1345
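
For reference, EVFILT_TIMER in a simple standalone program looks like this.
This is a minimal sketch, not taken from the branch above; the identifier, the
100ms interval, and relying on the default milliseconds unit for the data field
are illustrative assumptions:

#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    int         kq = kqueue();
    struct kevent ev;

    if (kq < 0)
        return 1;

    /* one-shot timer, identifier 1, fires after 100ms (default unit) */
    EV_SET(&ev, 1, EVFILT_TIMER, EV_ADD | EV_ONESHOT, 0, 100, NULL);
    if (kevent(kq, &ev, 1, NULL, 0, NULL) < 0)
        return 1;

    /* block until the timer fires */
    if (kevent(kq, NULL, 0, &ev, 1, NULL) == 1 && ev.filter == EVFILT_TIMER)
        printf("timer fired, ident=%lu\n", (unsigned long) ev.ident);

    close(kq);
    return 0;
}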




Re: Add more sanity checks around callers of changeDependencyFor()

2023-07-07 Thread Michael Paquier
On Thu, Jul 06, 2023 at 06:41:49PM +0200, Peter Eisentraut wrote:
> On 29.06.23 01:36, Michael Paquier wrote:
>> While working on a different patch, I have noted three code paths that
>> call changeDependencyFor() but don't check that they do not return
>> errors.  In all the three cases (support function, extension/schema
>> and object/schema), it seems to me that only one dependency update is
>> expected.
> 
> Why can't changeDependencyFor() raise the error itself?

There is appeal in that, but I can't really get excited about any
out-of-core callers of this routine.  Even if you would not lose much
error context, it would not be completely flexible if the number of
dependencies to switch varies.
--
Michael


signature.asc
Description: PGP signature


Re: DecodeInterval fixes

2023-07-07 Thread Tom Lane
Jacob Champion  writes:
> Hi Joe, here's a partial review:
> On Sun, Apr 9, 2023 at 5:44 PM Joseph Koshakow  wrote:
>> 1) Removes dead code for handling unit type RESERVE.

> Looks good. From a quick skim it looks like the ECPG copy of this code
> (ecpg/pgtypeslib/interval.c) might need to be updated as well?

The ECPG datetime datatype support code has been basically unmaintained
for years, and has diverged quite far at this point from the core code.
I wouldn't expect that a patch to the core code necessarily applies
easily to ECPG, nor would I expect somebody patching the core to bother
trying.

Perhaps modernizing/resyncing that ECPG code would be a worthwhile
undertaking, but it'd be a mighty large one, and I'm not sure about
the size of the return.  In the meantime, benign neglect is the policy.

regards, tom lane




Re: MERGE ... RETURNING

2023-07-07 Thread Dean Rasheed
On Thu, 6 Jul 2023 at 06:13, Gurjeet Singh  wrote:
>
> Most of the review was done with the v6 of the patch, minus the doc
> build step. The additional changes in v7 of the patch were eyeballed,
> and tested with `make check`.
>

Thanks for the review!

> > One minor annoyance is that, due to the way that pg_merge_action() and
> > pg_merge_when_clause() require access to the MergeActionState, they
> > only work if they appear directly in the RETURNING list. They can't,
> > for example, appear in a subquery in the RETURNING list, and I don't
> > see an easy way round that limitation.
>
> I believe that's a serious limitation, and would be a blocker for the feature.
>

Yes, that was bugging me for quite a while.

Attached is a new version that removes that restriction, allowing the
merge support functions to appear anywhere. This requires a bit of
planner support, to deal with merge support functions in subqueries,
in a similar way to how aggregates and GROUPING() expressions are
handled. But a number of the changes from the previous patch are no
longer necessary, so overall, this version of the patch is slightly
smaller.

> I think the name of function pg_merge_when_clause() can be improved.
> How about pg_merge_when_clause_ordinal().
>

That's a bit of a mouthful, but I don't have a better idea right now.
I've left the names alone for now, in case something better occurs to
anyone.

> In the doc page of MERGE, you've moved the 'with_query' from the
> bottom of Parameters section to the top of that section. Any reason
> for this? Since the Parameters section is quite large, for a moment I
> thought the patch was _adding_ the description of 'with_query'.
>

Ah yes, I was just making the order consistent with the
INSERT/UPDATE/DELETE pages. That could probably be committed
separately.

> +/*
> + * Merge support functions should only be called directly from a MERGE
> + * command, and need access to the parent ModifyTableState.  The parser
> + * should have checked that such functions only appear in the RETURNING
> + * list of a MERGE, so this should never fail.
> + */
> +if (IsMergeSupportFunction(funcid))
> +{
> +if (!state->parent ||
> +!IsA(state->parent, ModifyTableState) ||
> +((ModifyTableState *) state->parent)->operation != CMD_MERGE)
> +elog(ERROR, "merge support function called in non-merge 
> context");
>
> As the comment says, this is an unexpected condition, and should have
> been caught and reported by the parser. So I think it'd be better to
> use an Assert() here. Moreover, there's ERROR being raised by the
> pg_merge_action() and pg_merge_when_clause() functions, when they
> detect the function context is not appropriate.
>
> I found it very innovative to allow these functions to be called only
> in a certain part of a certain SQL command. I don't think there's a
> precedent for this in Postgres; I'd be glad to learn if there are
> other functions that Postgres exposes this way.
>
> -EXPR_KIND_RETURNING,/* RETURNING */
> +EXPR_KIND_RETURNING,/* RETURNING in INSERT/UPDATE/DELETE */
> +EXPR_KIND_MERGE_RETURNING,  /* RETURNING in MERGE */
>
> Having to invent a whole new ParseExprKind enum, feels like a bit of
> an overkill. My only objection is that this is exactly like
> EXPR_KIND_RETURNING, hence EXPR_KIND_MERGE_RETURNING needs to be dealt
> with exactly in as many places. But I don't have any alternative
> proposals.
>

That's all gone now from the new patch, since there is no longer any
restriction on where these functions can appear.

> --- a/src/include/catalog/pg_proc.h
> +++ b/src/include/catalog/pg_proc.h
> +/* Is this a merge support function?  (Requires fmgroids.h) */
> +#define IsMergeSupportFunction(funcid) \
> +((funcid) == F_PG_MERGE_ACTION || \
> + (funcid) == F_PG_MERGE_WHEN_CLAUSE)
>
> If it doesn't cause recursion or other complications, I think we
> should simply include the fmgroids.h in pg_proc.h. I understand that
> not all .c files/consumers that include pg_proc.h may need fmgroids.h,
> but #include'ing it will eliminate the need to keep the "Requires..."
> note above, and avoid confusion, too.
>

There's now only one place that uses this macro, whereas there are a
lot of places that include pg_proc.h and not fmgroids.h, so I don't
think forcing them all to include fmgroids.h is a good idea. (BTW,
this approach and comment is borrowed from IsTrueArrayType() in
pg_type.h)

> --- a/src/test/regress/expected/merge.out
> +++ b/src/test/regress/expected/merge.out
>
> -WHEN MATCHED AND tid > 2 THEN
> +WHEN MATCHED AND tid >= 2 THEN
>
> This change can be treated as a bug fix :-)
>
> +-- COPY (MERGE ... RETURNING) TO ...
> +BEGIN;
> +COPY (
> +MERGE INTO sq_target t
> +USING v
> +ON tid = sid
> +WHEN MATCHED AND tid > 2 THEN
>
> For consistency, the check should be tid >= 2, like you've fixed in
> command referenced above.
>
> +CREATE FUNCTION 

Re: check_strxfrm_bug()

2023-07-07 Thread Thomas Munro
On Sat, Jul 8, 2023 at 8:52 AM Tristan Partin  wrote:
> Should you wrap the two _l function replacements in HAVE_USELOCALE
> instead of WIN32?

I find that more confusing, and I'm also not sure if HAVE_USELOCALE is
even going to survive (based on your nearby thread).  I mean, by the
usual criteria that we applied to a lot of other system library
functions in the 16 cycle, I'd drop it.  It's in the standard and all
relevant systems have it except Windows which we have to handle with
special pain-in-the-neck logic anyway.

> > +if not cc.has_type('locale_t', prefix: '#include <locale.h>') and 
> > cc.has_type('locale_t', prefix: '#include <xlocale.h>')
>
> I wouldn't mind a line break after the 'and'.

Ah, right, I am still learning what is allowed in this syntax...  will do.

> Other than these comments, the patch looks fine to me.

Thanks.  I will wait a bit to see if Peter has any other comments and
then push this.  I haven't actually tested with Solution.pm due to
lack of CI for that, but fingers crossed, since the build systems will
now agree, reducing the screw-up surface.




Re: [PoC] Federated Authn/z with OAUTHBEARER

2023-07-07 Thread Andrey Chudnovsky
Thanks Jacob for making progress on this.

> 3) Is the current conn->async_auth() entry point sufficient for an
> application to implement the Microsoft flows discussed upthread?

Please confirm my understanding of the flow is correct:
1. Client calls PQconnectStart.
  - The client doesn't know yet what is the issuer and the scope.
  - Parameters are strings, so callback is not provided yet.
2. Client gets PgConn from PQconnectStart return value and updates
conn->async_auth to its own callback.
3. Client polls PQconnectPoll and checks conn->sasl_state until the
value is SASL_ASYNC
4. Client accesses conn->oauth_issuer and conn->oauth_scope and uses
that information to trigger the token flow.
5. Expectations on async_auth:
a. It returns PGRES_POLLING_READING while token acquisition is going on
b. It returns PGRES_POLLING_OK and sets conn->sasl_state->token
when token acquisition succeeds.
6. Is the client supposed to do anything with the altsock parameter?

Is the above accurate understanding?

If yes, it looks workable with a couple of improvements I think would be nice:
1. Currently, the oauth_exchange function sets conn->async_auth =
pg_fe_run_oauth_flow and starts the Device Code flow automatically when
receiving the challenge and metadata from the server.
There should probably be a way for the client to prevent the default
Device Code flow from triggering.
2. The current signature and expectations from async_auth function
seems to be tightly coupled with the internal implementation:
- Pieces of information need to be picked and updated in different
places in the PgConn structure.
- Function is expected to return PostgresPollingStatusType which
is used to communicate internal state to the client.
   Would it make sense to separate the internal callback used to
communicate with Device Code flow from client facing API?
   I.e. introduce a new client facing structure and enum to facilitate
callback and its return value.

---
On a separate note:
The backend code currently spawns an external command for token validation.
As we discussed before, an extension hook would be a more efficient
extensibility option.
We see clients make 10k+ connections using OAuth tokens per minute to
our service, and starting external processes would be too much overhead
here.

---

> 5) Does this maintenance tradeoff (full control over the client vs. a
> large amount of RFC-governed code) seem like it could be okay?

It's nice for psql to have the Device Code flow; it can be made even more
convenient with refresh token support.
It's also nice for clients on resource-constrained devices to be able to
authenticate with Client Credentials (app secret) without bringing in
more dependencies.

In most other cases, upstream PostgreSQL drivers written in higher
level languages have libraries / abstractions to implement OAUTH flows
for the platforms they support.

On Fri, Jul 7, 2023 at 11:48 AM Jacob Champion  wrote:
>
> On Thu, Jul 6, 2023 at 1:48 PM Thomas Munro  wrote:
> > On Fri, Jul 7, 2023 at 4:57 AM Jacob Champion  
> > wrote:
> > > Something analogous to libcurl's socket and timeout callbacks [1],
> > > then? Or is there an existing libpq API you were thinking about using?
> >
> > Yeah.  Libpq already has an event concept.
>
> Thanks -- I don't know how I never noticed libpq-events.h before.
>
> Per-connection events (or callbacks) might bring up the same
> chicken-and-egg situation discussed above, with the notice hook. We'll
> be fine as long as PQconnectStart is guaranteed to return before the
> PQconnectPoll engine gets to authentication, and it looks like that's
> true with today's implementation, which returns pessimistically at
> several points instead of just trying to continue the exchange. But I
> don't know if that's intended as a guarantee for the future. At the
> very least we would have to pin that implementation detail.
>
> > > > Or, more likely in the
> > > > first version, you just can't do it at all...  Doesn't seem that bad
> > > > to me.
> > >
> > > Any initial opinions on whether it's worse or better than a worker thread?
> >
> > My vote is that it's perfectly fine to make a new feature that only
> > works on some OSes.  If/when someone wants to work on getting it going
> > on Windows/AIX/Solaris (that's the complete set of no-epoll, no-kqueue
> > OSes we target), they can write the patch.
>
> Okay. I'm curious to hear others' thoughts on that, too, if anyone's lurking.
>
> Thanks!
> --Jacob




Re: check_strxfrm_bug()

2023-07-07 Thread Tristan Partin
On Fri Jul 7, 2023 at 3:30 PM CDT, Thomas Munro wrote:
> On Fri, Jul 7, 2023 at 4:20 AM Peter Eisentraut  wrote:
> > I think the correct solution is to set HAVE_MBSTOWCS_L in Solution.pm.
> > Compare HAVE_FSEEKO, which is set in Solution.pm with fseeko being
> > defined in win32_port.h.
>
> In this version I have it in there, but set it to undef.  This way
> configure (MinGW), meson (MinGW or MSVC), and Solution.pm all agree.
> This version passes on CI (not quite sure what I screwed up before).
>
> To restate the problem I am solving with this Solution.pm change +
> associated changes in the C: configure+MinGW disabled the libc
> provider completely before.  With this patch that is fixed, and that
> forced me to address the inconsistency (because if you have the libc
> provider but no _l functions, you hit uselocale() code paths that
> Windows can't do).  It was a bug, really, but I don't plan to
> back-patch anything (nobody really uses configure+MinGW builds AFAIK,
> they exist purely as a developer [in]convenience).  But that explains
> why I needed to make a change.
>
> Thinking about how to bring this all into "normal" form -- where
> HAVE_XXX means "system defines XXX", not "system defines XXX || we
> define a replacement" -- leads to the attached.  Do you like this one?

Should you wrap the two _l function replacements in HAVE_USELOCALE
instead of WIN32?

> +if not cc.has_type('locale_t', prefix: '#include <locale.h>') and 
> cc.has_type('locale_t', prefix: '#include <xlocale.h>')

I wouldn't mind a line break after the 'and'.

Other than these comments, the patch looks fine to me.

-- 
Tristan Partin
Neon (https://neon.tech)




Re: check_strxfrm_bug()

2023-07-07 Thread Thomas Munro
On Fri, Jul 7, 2023 at 4:20 AM Peter Eisentraut  wrote:
> I think the correct solution is to set HAVE_MBSTOWCS_L in Solution.pm.
> Compare HAVE_FSEEKO, which is set in Solution.pm with fseeko being
> defined in win32_port.h.

In this version I have it in there, but set it to undef.  This way
configure (MinGW), meson (MinGW or MSVC), and Solution.pm all agree.
This version passes on CI (not quite sure what I screwed up before).

To restate the problem I am solving with this Solution.pm change +
associated changes in the C: configure+MinGW disabled the libc
provider completely before.  With this patch that is fixed, and that
forced me to address the inconsistency (because if you have the libc
provider but no _l functions, you hit uselocale() code paths that
Windows can't do).  It was a bug, really, but I don't plan to
back-patch anything (nobody really uses configure+MinGW builds AFAIK,
they exist purely as a developer [in]convenience).  But that explains
why I needed to make a change.

Thinking about how to bring this all into "normal" form -- where
HAVE_XXX means "system defines XXX", not "system defines XXX || we
define a replacement" -- leads to the attached.  Do you like this one?
From a5decf23c6198eb469835b23e9f4ea7778df869c Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Tue, 4 Jul 2023 16:32:35 +1200
Subject: [PATCH v3] All supported systems have locale_t.

locale_t is defined by POSIX.1-2008 and SUSv4, and available on all
targeted systems.  For Windows, win32_port.h redirects to a partial
implementation called _locale_t.  It's time to remove a lot of
compile-time tests for HAVE_LOCALE_T, and a few associated comments and
dead code branches.

Since configure + MinGW builds didn't detect locale_t but now we assume
that all systems have it, further inconsistencies among the 3 Windows build
systems were revealed.  With this commit, we no longer define
HAVE_WCSTOMBS_L and HAVE_MBSTOWCS_L on any Windows build system, but
we have logic to deal with that so that replacements are available where
appropriate.

Reviewed-by: Noah Misch 
Reviewed-by: Tristan Partin 
Reviewed-by: Peter Eisentraut 
Discussion: https://postgr.es/m/CA%2BhUKGLg7_T2GKwZFAkEf0V7vbnur-NfCjZPKZb%3DZfAXSV1ORw%40mail.gmail.com
---
 config/c-library.m4                  | 10 +---
 configure                            |  5 --
 meson.build                          | 22 ++-
 src/backend/commands/collationcmds.c |  2 +-
 src/backend/regex/regc_pg_locale.c   | 52 +
 src/backend/utils/adt/formatting.c   | 18 --
 src/backend/utils/adt/like.c         |  2 -
 src/backend/utils/adt/like_support.c |  2 -
 src/backend/utils/adt/pg_locale.c    | 86 +++-
 src/include/pg_config.h.in           |  3 -
 src/include/utils/pg_locale.h        |  7 +--
 src/tools/msvc/Solution.pm           |  5 +-
 12 files changed, 45 insertions(+), 169 deletions(-)

diff --git a/config/c-library.m4 b/config/c-library.m4
index c1dd804679..aa8223d2ef 100644
--- a/config/c-library.m4
+++ b/config/c-library.m4
@@ -86,9 +86,9 @@ AC_DEFUN([PGAC_STRUCT_SOCKADDR_SA_LEN],
 # PGAC_TYPE_LOCALE_T
 # --
 # Check for the locale_t type and find the right header file.  macOS
-# needs xlocale.h; standard is locale.h, but glibc also has an
-# xlocale.h file that we should not use.
-#
+# needs xlocale.h; standard is locale.h, but glibc <= 2.25 also had an
+# xlocale.h file that we should not use, so we check the standard
+# header first.
 AC_DEFUN([PGAC_TYPE_LOCALE_T],
 [AC_CACHE_CHECK([for locale_t], pgac_cv_type_locale_t,
 [AC_COMPILE_IFELSE([AC_LANG_PROGRAM(
@@ -102,10 +102,6 @@ locale_t x;],
 [])],
 [pgac_cv_type_locale_t='yes (in xlocale.h)'],
 [pgac_cv_type_locale_t=no])])])
-if test "$pgac_cv_type_locale_t" != no; then
-  AC_DEFINE(HAVE_LOCALE_T, 1,
-[Define to 1 if the system has the type `locale_t'.])
-fi
 if test "$pgac_cv_type_locale_t" = 'yes (in xlocale.h)'; then
   AC_DEFINE(LOCALE_T_IN_XLOCALE, 1,
 [Define to 1 if `locale_t' requires .])
diff --git a/configure b/configure
index c4463cb17a..842ef855fc 100755
--- a/configure
+++ b/configure
@@ -15120,11 +15120,6 @@ rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
 fi
 { $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_type_locale_t" >&5
 $as_echo "$pgac_cv_type_locale_t" >&6; }
-if test "$pgac_cv_type_locale_t" != no; then
-
-$as_echo "#define HAVE_LOCALE_T 1" >>confdefs.h
-
-fi
 if test "$pgac_cv_type_locale_t" = 'yes (in xlocale.h)'; then
 
 $as_echo "#define LOCALE_T_IN_XLOCALE 1" >>confdefs.h
diff --git a/meson.build b/meson.build
index 62942ead09..3d4a85f8c6 100644
--- a/meson.build
+++ b/meson.build
@@ -2283,17 +2283,12 @@ else
   cdata.set('STRERROR_R_INT', false)
 endif
 
-# Check for the locale_t type and find the right header file.  macOS
-# needs xlocale.h; standard is locale.h, but glibc also has an
-# xlocale.h file that we should not use.  MSVC has a replacement
-# defined in src/include/port/win32_port.h.
-if 

Re: 010_database.pl fails on openbsd w/ LC_ALL=LANG=C

2023-07-07 Thread Jeff Davis
On Sat, 2023-07-08 at 07:04 +1200, Thomas Munro wrote:
> Doesn't look too hopeful: https://man.openbsd.org/setlocale.3

Hmm. I could try using a bogus encoding, but that may be too clever.
I'll just remove the test.

Regards,
Jeff Davis





Re: 010_database.pl fails on openbsd w/ LC_ALL=LANG=C

2023-07-07 Thread Thomas Munro
On Sat, Jul 8, 2023 at 3:52 AM Jeff Davis  wrote:
> The test is assuming that locale "@colStrength=primary" is valid for
> ICU but invalid for libc. It seems that on that platform, setlocale()
> is accepting it?
>
> If some libc implementations are too permissive, I might need to just
> disable this test. But if we can find a locale that is consistently
> acceptable in ICU but invalid in libc, then I can keep it... perhaps
> "und@colStrength=primary"?

Doesn't look too hopeful: https://man.openbsd.org/setlocale.3




Re: [PoC] Federated Authn/z with OAUTHBEARER

2023-07-07 Thread Jacob Champion
On Thu, Jul 6, 2023 at 1:48 PM Thomas Munro  wrote:
> On Fri, Jul 7, 2023 at 4:57 AM Jacob Champion  wrote:
> > Something analogous to libcurl's socket and timeout callbacks [1],
> > then? Or is there an existing libpq API you were thinking about using?
>
> Yeah.  Libpq already has an event concept.

Thanks -- I don't know how I never noticed libpq-events.h before.
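
For anyone else who hadn't noticed it either, a minimal sketch of that API is
below.  The event hooks themselves (PQregisterEventProc, PGEVT_* ids) are the
documented libpq-events.h interface; the proc, its name, and the connection
string are just illustrative:

#include <stdio.h>
#include "libpq-fe.h"
#include "libpq-events.h"

static int
demo_event_proc(PGEventId evtId, void *evtInfo, void *passThrough)
{
    switch (evtId)
    {
        case PGEVT_REGISTER:
            fprintf(stderr, "event proc attached to connection\n");
            break;
        case PGEVT_CONNDESTROY:
            fprintf(stderr, "connection going away\n");
            break;
        default:
            break;
    }
    return 1;                   /* nonzero means success */
}

int
main(void)
{
    PGconn     *conn = PQconnectdb("dbname=postgres");

    if (PQstatus(conn) == CONNECTION_OK &&
        PQregisterEventProc(conn, demo_event_proc, "demo_event_proc", NULL))
        fprintf(stderr, "registered\n");

    PQfinish(conn);
    return 0;
}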

Per-connection events (or callbacks) might bring up the same
chicken-and-egg situation discussed above, with the notice hook. We'll
be fine as long as PQconnectStart is guaranteed to return before the
PQconnectPoll engine gets to authentication, and it looks like that's
true with today's implementation, which returns pessimistically at
several points instead of just trying to continue the exchange. But I
don't know if that's intended as a guarantee for the future. At the
very least we would have to pin that implementation detail.

> > > Or, more likely in the
> > > first version, you just can't do it at all...  Doesn't seem that bad
> > > to me.
> >
> > Any initial opinions on whether it's worse or better than a worker thread?
>
> My vote is that it's perfectly fine to make a new feature that only
> works on some OSes.  If/when someone wants to work on getting it going
> on Windows/AIX/Solaris (that's the complete set of no-epoll, no-kqueue
> OSes we target), they can write the patch.

Okay. I'm curious to hear others' thoughts on that, too, if anyone's lurking.

Thanks!
--Jacob




Re: pgsql: Fix search_path to a safe value during maintenance operations.

2023-07-07 Thread Nathan Bossart
On Fri, Jul 07, 2023 at 09:57:10AM -0700, Nathan Bossart wrote:
> Yeah, I guess I should just revert it in both.  Given your fix will
> hopefully be committed soon, I was hoping to avoid reverting and
> un-reverting in quick succession to prevent affecting git-blame too much.
> 
> I found an example of a post-beta2 revert on both master and a stable
> branch where Tom set the catversions to different values (20b6847,
> e256312).  I'll do the same here.

reverted

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Standardize type of variable when extending Buffers

2023-07-07 Thread Gurjeet Singh
On Fri, Jul 7, 2023 at 6:12 AM Ranier Vilela  wrote:
>
> Hi,
>
> This has already been discussed in [1].
> But I thought it best to start a new thread.
>
> The commit 31966b1 introduced the infrastructure to extend
> buffers.
> But the patch mixed types with int and uint32.
> The correct type of the variable counter is uint32.
>
> Fix by standardizing the int type to uint32.
>
> patch attached.

LGTM.

+CC Kyotaro, as they were involved in the previous discussion.

>
> [1] 
> https://www.postgresql.org/message-id/CAEudQAr_oWHpZk4uumZijYS362gp4KHAah-yUe08CQY4a4SsOQ%40mail.gmail.com


Best regards,
Gurjeet
http://Gurje.et




Re: PATCH: Using BRIN indexes for sorted output

2023-07-07 Thread Tomas Vondra
Hi,

I finally had time to look at this patch again. There's a bit of bitrot,
so here's a rebased version (no other changes).

[more comments inline]

On 2/27/23 16:40, Matthias van de Meent wrote:
> Hi,
> 
> On Thu, 23 Feb 2023 at 17:44, Matthias van de Meent
>  wrote:
>>
>> I'll see to further reviewing 0004 and 0005 when I have additional time.
> 
> Some initial comments on 0004:
> 
>> +/*
>> + * brin_minmax_ranges
>> + *Load the BRIN ranges and sort them.
>> + */
>> +Datum
>> +brin_minmax_ranges(PG_FUNCTION_ARGS)
>> +{
> 
> Like in 0001, this seems to focus on only single columns. Can't we put
> the scan and sort infrastructure in brin, and put the weight- and
> compare-operators in the opclasses? I.e.
> brin_minmax_minorder(PG_FUNCTION_ARGS=brintuple) -> range.min and
> brin_minmax_maxorder(PG_FUNCTION_ARGS=brintuple) -> range.max, and a
> brin_minmax_compare(order, order) -> int? I'm thinking of something
> similar to GIST's distance operators, which would make implementing
> ordering by e.g. `pointcolumn <-> (1, 2)::point` implementable in the
> brin infrastructure.
> 
> Note: One big reason I don't really like the current
> brin_minmax_ranges (and the analyze code in 0001) is because it breaks
> the operatorclass-vs-index abstraction layer. Operator classes don't
> (shouldn't) know or care about which attribute number they have, nor
> what the index does with the data.
> Scanning the index is not something that I expect the operator class
> to do, I expect that the index code organizes the scan, and forwards
> the data to the relevant operator classes.
> Scanning the index N times for N attributes can be excused if there
> are good reasons, but I'd rather have that decision made in the
> index's core code rather than at the design level.
> 

I think you're right. It'd be more appropriate to have most of the core
scanning logic in brin.c, and then delegate only some minor decisions to
the opclass. Like, comparisons, extraction of min/max from ranges etc.

However, it's not quite clear to me what you mean by the weight- and
compare-operators. That is, what are

 - brin_minmax_minorder(PG_FUNCTION_ARGS=brintuple) -> range.min
 - brin_minmax_maxorder(PG_FUNCTION_ARGS=brintuple) -> range.max
 - brin_minmax_compare(order, order) -> int

supposed to do? Or what does "PG_FUNCTION_ARGS=brintuple" mean?

In principle we just need a procedure that tells us min/max for a given
page range - I guess that's what the minorder/maxorder functions do? But
why would we need the compare one? We're comparing by the known data
type, so we can just delegate the comparison to that, no?

Also, the existence of these opclass procedures should be enough to
identify the opclasses which can support this.

>> +/*
>> + * XXX Does it make sense (is it possible) to have a sort by more than one
>> + * column, using a BRIN index?
>> + */
> 
> Yes, even if only one prefix column is included in the BRIN index
> (e.g. `company` in `ORDER BY company, date`, the tuplesort with table
> tuples can add additional sorting without first reading the whole
> table, potentially (likely) reducing the total resource usage of the
> query. That utilizes the same idea as incremental sorts, but with the
> caveat that the input sort order is approximately likely but not at
> all guaranteed. So, even if the range sort is on a single index
> column, we can still do the table's tuplesort on all ORDER BY
> attributes, as long as a prefix of ORDER BY columns are included in
> the BRIN index.
> 

That's not quite what I meant, though. When I mentioned sorting by more
than one column, I meant using a multi-column BRIN index on those
columns. Something like this:

   CREATE TABLE t (a int, b int);
   INSERT INTO t ...
   CREATE INDEX ON t USING brin (a,b);

   SELECT * FROM t ORDER BY a, b;

Now, what I think you described is using BRIN to sort by "a", and then
do incremental sort for "b". What I had in mind is whether it's possible
to use BRIN to sort by "b" too.

I was suspecting it might be made to work, but now that I think about it
again it probably can't - BRIN pretty much sorts the columns separately,
it's not like having an ordering by (a,b) - first by "a", then "b".

>> +/*
>> + * XXX We can be a bit smarter for LIMIT queries - once we
>> + * know we have more rows in the tuplesort than we need to
>> + * output, we can stop spilling - those rows are not going
>> + * to be needed. We can discard the tuplesort (no need to
>> + * respill) and stop spilling.
>> + */
> 
> Shouldn't that be "discard the tuplestore"?
> 

Yeah, definitely.

>> +#define BRIN_PROCNUM_RANGES 12/* optional */
> 
> It would be useful to add documentation for this in this patch.
> 

Right, this should be documented in doc/src/sgml/brin.sgml.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Add more sanity checks around callers of changeDependencyFor()

2023-07-07 Thread Akshat Jaimini
The patch looks fine and passes all the tests. I am using Arch Linux on an 
x86_64 system.
The patch does not introduce any new bugs and does not make any non-trivial
changes to the source code.
I believe it is ready to be committed!

The new status of this patch is: Ready for Committer


Re: [17] CREATE COLLATION default provider

2023-07-07 Thread Gurjeet Singh
On Fri, Jul 7, 2023 at 9:33 AM Jeff Davis  wrote:
>
> On Sat, 2023-06-17 at 09:09 -0700, Gurjeet Singh wrote:
> > The docs for the CREATE COLLATION option 'locale' say: "This is a
> > shortcut for setting LC_COLLATE and LC_CTYPE at once."
> >
> > So it's not intuitive why the check does not include a test for the
> > presence of 'localeEl', as well? If we consider the presence of
> > LC_COLLATE _or_ LC_CTYPE options to be a determining factor for some
> > decision, then the presence of LOCALE option should also lead to the
> > same outcome.
> >
>
> The docs say: "If provider is libc, this is a shortcut...". The point
> is that LC_COLLATE and LC_CTYPE act as a signal that what the user
> really wants is a libc collation. LOCALE works for either, so we need a
> default.

Sorry about the noise, I was consulting current/v15 docs online. Now
that v16 docs are online, I can see that the option in fact says this
is the case only if libc is the provider.

(note to self: for reviewing patches to master, consult devel docs [1] online)

[1]: https://www.postgresql.org/docs/devel/

Best regards,
Gurjeet
http://Gurje.et




Re: psql's FETCH_COUNT (cursor) is not being respected for CTEs

2023-07-07 Thread Daniel Verite
Tom Lane wrote:

> This gives me several "-Wincompatible-pointer-types" warnings
> [...]
> I think what you probably ought to do to avoid all that is to change
> the arguments of PrintQueryResult and nearby routines to be "const
> PGresult *result" not just "PGresult *result".

The const-ness issue that I ignored in the previous patch is that
while C is fine with passing T* to a function expecting const T*, it's
not okay with passing T** to a function expecting const T**,
or more generally converting T** to const T**.

When callers need to pass arrays of PGresult* instead of const
PGresult*, I've opted to remove the const qualifiers for the functions
that are concerned by this change.
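
A minimal standalone sketch of why that conversion is rejected (an
illustration, not taken from the patch):

#include <stdio.h>

/*
 * If "int **" silently converted to "const int **", f() could store the
 * address of a const object into a plain "int *", defeating const.
 */
static const int c = 42;

static void
f(const int **pp)
{
    *pp = &c;                   /* legal: *pp has type "const int *" */
}

int
main(void)
{
    const int  *cp = NULL;
    int        *p = NULL;

    f(&cp);                     /* OK: "const int **" matches exactly */
    /* f(&p); -- compilers reject this: "int **" is not "const int **" */
    /* were it accepted, "*p = 0;" would write through to 'c' */
    (void) p;

    printf("%d\n", cp ? *cp : 0);
    return 0;
}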


PFA an updated patch.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
diff --git a/src/bin/psql/common.c b/src/bin/psql/common.c
index 5973df2e39..476a9770f0 100644
--- a/src/bin/psql/common.c
+++ b/src/bin/psql/common.c
@@ -372,6 +372,7 @@ AcceptResult(const PGresult *result, bool show_error)
 		{
 			case PGRES_COMMAND_OK:
 			case PGRES_TUPLES_OK:
+			case PGRES_SINGLE_TUPLE:
 			case PGRES_EMPTY_QUERY:
 			case PGRES_COPY_IN:
 			case PGRES_COPY_OUT:
@@ -695,7 +696,7 @@ PrintNotifications(void)
  * Returns true if successful, false otherwise.
  */
 static bool
-PrintQueryTuples(const PGresult *result, const printQueryOpt *opt,
+PrintQueryTuples(PGresult *result, const printQueryOpt *opt,
  FILE *printQueryFout)
 {
 	bool		ok = true;
@@ -1391,6 +1392,47 @@ DescribeQuery(const char *query, double *elapsed_msec)
 	return OK;
 }
 
+/*
+ * Check if an output stream for \g needs to be opened, and if
+ * yes, open it.
+ * Return false if an error occurred, true otherwise.
+ */
+static bool
+SetupGOutput(PGresult *result, FILE **gfile_fout, bool *is_pipe)
+{
+	ExecStatusType status = PQresultStatus(result);
+	if (pset.gfname != NULL &&			/* there is a \g file or program */
+		*gfile_fout == NULL &&   /* and it's not already opened */
+		(status == PGRES_TUPLES_OK ||
+		 status == PGRES_SINGLE_TUPLE ||
+		 status == PGRES_COPY_OUT))
+	{
+		if (openQueryOutputFile(pset.gfname, gfile_fout, is_pipe))
+		{
+			if (*is_pipe)
+				disable_sigpipe_trap();
+		}
+		else
+			return false;
+	}
+	return true;
+}
+
+static void
+CloseGOutput(FILE *gfile_fout, bool is_pipe)
+{
+	/* close \g file if we opened it */
+	if (gfile_fout)
+	{
+		if (is_pipe)
+		{
+			SetShellResultVariables(pclose(gfile_fout));
+			restore_sigpipe_trap();
+		}
+		else
+			fclose(gfile_fout);
+	}
+}
 
 /*
  * ExecQueryAndProcessResults: utility function for use by SendQuery()
@@ -1422,10 +1464,18 @@ ExecQueryAndProcessResults(const char *query,
 	bool		success;
 	instr_time	before,
 after;
+	int fetch_count = pset.fetch_count;
 	PGresult   *result;
+
 	FILE	   *gfile_fout = NULL;
 	bool		gfile_is_pipe = false;
 
+	PGresult   **result_array = NULL; /* to collect results in single row mode */
+	int64		total_tuples = 0;
+	int			ntuples;
+	int			flush_error = 0;
+	bool		is_pager = false;
+
 	if (timing)
 		INSTR_TIME_SET_CURRENT(before);
 	else
@@ -1448,6 +1498,33 @@ ExecQueryAndProcessResults(const char *query,
 		return -1;
 	}
 
+	/*
+	 * If FETCH_COUNT is set and the context allows it, use the single row
+	 * mode to fetch results and have no more than FETCH_COUNT rows in
+	 * memory.
+	 */
+	if (fetch_count > 0 && !pset.crosstab_flag && !pset.gexec_flag && !is_watch
+		&& !pset.gset_prefix && pset.show_all_results)
+	{
+		/*
+		 * The row-by-row fetch is not enabled when SHOW_ALL_RESULTS is false,
+		 * since we would need to accumulate all rows before knowing
+		 * whether they need to be discarded or displayed, which contradicts
+		 * FETCH_COUNT.
+		 */
+		if (!PQsetSingleRowMode(pset.db))
+		{
+			pg_log_warning("fetching results in single row mode is unavailable");
+			fetch_count = 0;
+		}
+		else
+		{
+			result_array = (PGresult**) pg_malloc(fetch_count * sizeof(PGresult*));
+		}
+	}
+	else
+		fetch_count = 0;		/* disable single-row mode */
+
 	/*
 	 * If SIGINT is sent while the query is processing, the interrupt will be
 	 * consumed.  The user's intention, though, is to cancel the entire watch
@@ -1467,6 +1544,8 @@ ExecQueryAndProcessResults(const char *query,
 		ExecStatusType result_status;
 		PGresult   *next_result;
 		bool		last;
+		/* whether the output starts before results are fully fetched */
+		bool		partial_display = false;
 
 		if (!AcceptResult(result, false))
 		{
@@ -1593,6 +1672,94 @@ ExecQueryAndProcessResults(const char *query,
 			success &= HandleCopyResult(&result, copy_stream);
 		}
 
+		if (fetch_count > 0 && result_status == PGRES_SINGLE_TUPLE)
+		{
+			FILE	   *tuples_fout = printQueryFout ? printQueryFout : stdout;
+			printQueryOpt my_popt = pset.popt;
+
+			ntuples = 0;
+			total_tuples = 0;
+			partial_display = true;
+
+			success = SetupGOutput(result, &gfile_fout, &gfile_is_pipe);
+			if (gfile_fout)
+				tuples_fout = gfile_fout;
+
+			/* initialize print options for partial 

Re: HOT readme missing documentation on summarizing index handling

2023-07-07 Thread Matthias van de Meent
On Fri, 7 Jul 2023 at 19:06, Tomas Vondra  wrote:
>
> On 7/7/23 18:34, Matthias van de Meent wrote:
> > On Fri, 7 Jul 2023 at 00:14, Tomas Vondra  
> > wrote:
> >> The original text was really about on/off, and I'm not quite sure the
> >> part about "exception" makes this very clear.
> >
> > Agreed on both points. Attached an updated version which incorporates
> > your points.
> >
>
> Thanks, pushed after correcting a couple typos.

Thanks!

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: HOT readme missing documentation on summarizing index handling

2023-07-07 Thread Tomas Vondra
On 7/7/23 18:34, Matthias van de Meent wrote:
> On Fri, 7 Jul 2023 at 00:14, Tomas Vondra  
> wrote:
>>
>> Yeah, README.HOT should have been updated, and I see no reason not to
>> backpatch this to v16. Barring objections, I'll do that tomorrow.
>>
>> I have two suggesting regarding the README.HOT changes:
>>
>> 1) I'm not entirely sure it's very clear what "referential integrity of
>> indexes across tuple updates" actually means. I'm afraid "referential
>> integrity" may lead readers to think about foreign keys. Maybe it'd be
>> better to explain this is about having index pointers to the new tuple
>> version, etc.
>>
>> 2) Wouldn't it be good to make it a bit more explicit we now have three
>> "levels" of HOT:
>>
>>   (a) no indexes need update
>>   (b) update only summarizing indexes
>>   (c) update all indexes
>>
>> The original text was really about on/off, and I'm not quite sure the
>> part about "exception" makes this very clear.
> 
> Agreed on both points. Attached an updated version which incorporates
> your points.
> 

Thanks, pushed after correcting a couple typos.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pgsql: Fix search_path to a safe value during maintenance operations.

2023-07-07 Thread Nathan Bossart
On Fri, Jul 07, 2023 at 09:22:22AM -0700, Jeff Davis wrote:
> On Thu, 2023-07-06 at 22:14 -0700, Nathan Bossart wrote:
>> Since we are only reverting from v16, the REL_16_STABLE catversion
>> will be
>> bumped ahead of the one on master.
> 
> I don't object to you doing it this way, but FWIW, I'd just revert in
> both branches to avoid this kind of weirdness.
> 
> Also I'm not quite sure how quickly my search_path fix will be
> committed. Hopefully soon, because the current state is not great, but
> it's hard for me to say for sure.

Yeah, I guess I should just revert it in both.  Given your fix will
hopefully be committed soon, I was hoping to avoid reverting and
un-reverting in quick succession to prevent affecting git-blame too much.

I found an example of a post-beta2 revert on both master and a stable
branch where Tom set the catversions to different values (20b6847,
e256312).  I'll do the same here.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: DecodeInterval fixes

2023-07-07 Thread Jacob Champion
Hi Joe, here's a partial review:

On Sun, Apr 9, 2023 at 5:44 PM Joseph Koshakow  wrote:
> 1) Removes dead code for handling unit type RESERVE.

Looks good. From a quick skim it looks like the ECPG copy of this code
(ecpg/pgtypeslib/interval.c) might need to be updated as well?

> 2) Restrict the unit "ago" to only appear at the end of the
> interval. According to the docs [0], this is the only valid place to
> put it, but we allowed it multiple times at any point in the input.

Also looks reasonable to me. (Same note re: ECPG.)

> 3) Error when the user has multiple consecutive units or a unit without
> an accompanying value. I spent a lot of time trying to come up with
> robust ways to detect this and ultimately settled on using the "type"
> field. I'm not entirely happy with this approach, because it involves
> having to look ahead to the next field in a couple of places. The other
> approach I was considering was to introduce a boolean flag called
> "unhandled_unit". After parsing a unit it would be set to true, after
> applying the unit to a number it would be set to false. If it was true
> right before parsing a unit, then we would error. Please let me know
> if you have any suggestions here.

I'm new to this code, but I agree that the use of `type` and the
lookahead are not particularly obvious/intuitive. At the very least,
they'd need some more explanation in the code. Your boolean flag idea
sounds reasonable, though.

> There is one more problem I noticed, but didn't fix. We allow multiple
> "@" to be sprinkled anywhere in the input, even though the docs [0]
> only allow it to appear at the beginning of the input.

(No particular opinion on this.)

It looks like this patch needs a rebase for the CI, too, but there are
no conflicts.

Thanks!
--Jacob




Re: HOT readme missing documentation on summarizing index handling

2023-07-07 Thread Matthias van de Meent
On Fri, 7 Jul 2023 at 00:14, Tomas Vondra  wrote:
>
> Yeah, README.HOT should have been updated, and I see no reason not to
> backpatch this to v16. Barring objections, I'll do that tomorrow.
>
> I have two suggesting regarding the README.HOT changes:
>
> 1) I'm not entirely sure it's very clear what "referential integrity of
> indexes across tuple updates" actually means. I'm afraid "referential
> integrity" may lead readers to think about foreign keys. Maybe it'd be
> better to explain this is about having index pointers to the new tuple
> version, etc.
>
> 2) Wouldn't it be good to make it a bit more explicit we now have three
> "levels" of HOT:
>
>   (a) no indexes need update
>   (b) update only summarizing indexes
>   (c) update all indexes
>
> The original text was really about on/off, and I'm not quite sure the
> part about "exception" makes this very clear.

Agreed on both points. Attached an updated version which incorporates
your points.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)


v2-0001-Add-documentation-in-README.HOT-for-handling-summ.patch
Description: Binary data


Re: [17] CREATE COLLATION default provider

2023-07-07 Thread Jeff Davis
On Sat, 2023-06-17 at 09:09 -0700, Gurjeet Singh wrote:
> The docs for the CREATE COLLATION option 'locale' say: "This is a
> shortcut for setting LC_COLLATE and LC_CTYPE at once."
> 
> So it's not intuitive why the check does not include a test for the
> presence of 'localeEl', as well? If we consider the presence of
> LC_COLLATE _or_ LC_CTYPE options to be a determining factor for some
> decision, then the presence of LOCALE option should also lead to the
> same outcome.
> 

The docs say: "If provider is libc, this is a shortcut...". The point
is that LC_COLLATE and LC_CTYPE act as a signal that what the user
really wants is a libc collation. LOCALE works for either, so we need a
default.

That being said, I'm now having second thoughts about where that
default should come from. While getting the default from datlocprovider
is convenient, I'm not sure that the datlocprovider provides a good
signal. A lot of users will have datlocprovider=c and datcollate=C,
which really means they want the built-in memcmp behavior, and to me
that doesn't signal that they want CREATE COLLATION to use libc for a
non-C locale.

A GUC might be a better default, and we could have CREATE COLLATION
default to ICU if the server is built with ICU and if PROVIDER,
LC_COLLATE and LC_CTYPE are unspecified.

Regards,
Jeff Davis





Re: pgsql: Fix search_path to a safe value during maintenance operations.

2023-07-07 Thread Jeff Davis
On Thu, 2023-07-06 at 22:14 -0700, Nathan Bossart wrote:
> Since we are only reverting from v16, the REL_16_STABLE catversion
> will be
> bumped ahead of the one on master.

I don't object to you doing it this way, but FWIW, I'd just revert in
both branches to avoid this kind of weirdness.

Also I'm not quite sure how quickly my search_path fix will be
committed. Hopefully soon, because the current state is not great, but
it's hard for me to say for sure.

Regards,
Jeff Davis





Re: ICU locale validation / canonicalization

2023-07-07 Thread Jeff Davis
On Sat, 2023-07-01 at 10:31 -0700, Noah Misch wrote:
> As of commit b8c3f6d, InstallCheck-C got daticulocale=en-US-u-va-
> posix.  Check
> got daticulocale=NULL.

With the same test setup, that locale takes about 8.6 seconds (opening
it 10M times), about 2.5X slower than "en-US" and about 7X slower than
"und". I think that explains it.

The locale "en-US-u-va-posix" normally happens when passing a locale
beginning with "C" to ICU. After 2535c74b1a we don't get ICU locales
from the environment anywhere, so that should be rare (and probably
indicates a user mistake). I don't think this is a practical problem
any more.

Regards,
Jeff Davis





Re: Issue attaching a table to a partitioned table with an auto-referenced foreign key

2023-07-07 Thread Jehan-Guillaume de Rorthais
So I gave a look at this one... And it's a tricky one.

The current policy about DETACHing a partition is to keep/adjust all FK
referencing it or referenced by it.

However, in this exact self-referencing use case, we can have rows referencing
rows from the same partition OR another one. It seems like an
impossible issue to solve.

Here is an example based on Guillaume's scenario ([c1_old, c2_old] -> [c1, c2]):

  t1:
    t1_a:
      c1 | c1_old | c2 | c2_old
     ----+--------+----+--------
       1 |   NULL |  2 |   NULL
       1 |      1 |  3 |      2
       1 |      2 |  4 |      2
    t1_b:
      c1 | c1_old | c2 | c2_old
     ----+--------+----+--------
       2 |      1 |  2 |      3

Now, what happens with the FK when we DETACH t1_a?
  * it's not enough for t1_a to keep only a self-FK, as it references some
    rows from t1_b:
    (1, 2, 4, 2) -> (2, 1, 2, 3)
  * and t1_a cannot keep only an FK referencing t1 either, as it references
    some rows from itself:
    (1, 1, 3, 2) -> (1, NULL, 2, NULL)

I'm currently not able to think about a constraint we could build to address
this situation after the DETACH.

The only clean way out would be to drop the FK between the old partition and
the partitioned table. But then, it breaks the current policy of keeping the
constraint after DETACH. Not to mention the nightmare of detecting this
situation among other ones.

Thoughts?

On Wed, 22 Mar 2023 11:14:19 +0100
Guillaume Lelarge  wrote:

> One last ping, hoping someone will have more time now than in january.
> 
> Perhaps my test is wrong, but I'd like to know why.
> 
> Thanks.
> 
> Le mar. 17 janv. 2023 à 16:53, Guillaume Lelarge  a
> écrit :
> 
> > Quick ping, just to make sure someone can get a look at this issue :)
> > Thanks.
> >
> >
> > Le ven. 6 janv. 2023 à 11:07, Guillaume Lelarge 
> > a écrit :
> >  
> >> Hello,
> >>
> >> One of our customers has an issue with partitions and foreign keys. He
> >> works on a v13, but the issue is also present on v15.
> >>
> >> I attach a SQL script showing the issue, and the results on 13.7, 13.9,
> >> and 15.1. But I'll explain the script here, and its behaviour on 13.9.
> >>
> >> There is one partitioned table, two partitions and a foreign key. The
> >> foreign key references the same table:
> >>
> >> create table t1 (
> >>   c1 bigint not null,
> >>   c1_old bigint null,
> >>   c2 bigint not null,
> >>   c2_old bigint null,
> >>   primary key (c1, c2)
> >>   )
> >>   partition by list (c1);
> >> create table t1_a   partition of t1 for values in (1);
> >> create table t1_def partition of t1 default;
> >> alter table t1 add foreign key (c1_old, c2_old) references t1 (c1, c2) on
> >> delete restrict on update restrict;
> >>
> >> I've a SQL function that shows me some information from pg_constraints
> >> (code of the function in the SQL script attached). Here is the result of
> >> this function after creating the table, its partitions, and its foreign
> >> key:
> >>
> >> select * from show_constraints();
> >> conname |   t|  tref  |   coparent
> >> +++---
> >>  t1_c1_old_c2_old_fkey  | t1 | t1 |
> >>  t1_c1_old_c2_old_fkey  | t1_a   | t1 | t1_c1_old_c2_old_fkey
> >>  t1_c1_old_c2_old_fkey  | t1_def | t1 | t1_c1_old_c2_old_fkey
> >>  t1_c1_old_c2_old_fkey1 | t1 | t1_a   | t1_c1_old_c2_old_fkey
> >>  t1_c1_old_c2_old_fkey2 | t1 | t1_def | t1_c1_old_c2_old_fkey
> >> (5 rows)
> >>
> >> The constraint works great :
> >>
> >> insert into t1 values(1, NULL, 2, NULL);
> >> insert into t1 values(2, 1,2, 2);
> >> delete from t1 where c1 = 1;
> >> psql:ticket15010_v3.sql:34: ERROR:  update or delete on table "t1_a"
> >> violates foreign key constraint "t1_c1_old_c2_old_fkey1" on table "t1"
> >> DETAIL:  Key (c1, c2)=(1, 2) is still referenced from table "t1".
> >>
> >> This error is normal since the line I want to delete is referenced on the
> >> other line.
> >>
> >> If I try to detach the partition, it also gives me an error.
> >>
> >> alter table t1 detach partition t1_a;
> >> psql:ticket15010_v3.sql:36: ERROR:  removing partition "t1_a" violates
> >> foreign key constraint "t1_c1_old_c2_old_fkey1"
> >> DETAIL:  Key (c1_old, c2_old)=(1, 2) is still referenced from table "t1".
> >>
> >> Sounds good to me too (well, I'd like it to be smarter and find that the
> >> constraint is still good after the detach, but I can understand why it
> >> won't allow it).
> >>
> >> The pg_constraint didn't change of course:
> >>
> >> select * from show_constraints();
> >> conname |   t|  tref  |   coparent
> >> +++---
> >>  t1_c1_old_c2_old_fkey  | t1 | t1 |
> >>  t1_c1_old_c2_old_fkey  | t1_a   | t1 | t1_c1_old_c2_old_fkey
> >>  t1_c1_old_c2_old_fkey  | t1_def | t1 | t1_c1_old_c2_old_fkey
> >>  t1_c1_old_c2_old_fkey1 | t1 | t1_a   | t1_c1_old_c2_old_fkey
> >>  t1_c1_old_c2_old_fkey2 | 

Re: 010_database.pl fails on openbsd w/ LC_ALL=LANG=C

2023-07-07 Thread Jeff Davis
On Sun, 2023-07-02 at 09:56 -0700, Andres Freund wrote:
> # expected: anything else
> [07:25:06.424](0.001s) not ok 7 - ICU-specific locale must be
> specified with ICU_LOCALE: error message
> [07:25:06.424](0.001s) #   Failed test 'ICU-specific locale must be
> specified with ICU_LOCALE: error message'
> #   at /home/postgres/postgres/src/test/icu/t/010_database.pl line
> 80.
> [07:25:06.424](0.000s) #   'psql::2: NOTICE: 
> using standard form "und-u-ks-level1" for ICU locale
> "@colStrength=primary"'
> # doesn't match '(?^:ERROR:  invalid LC_COLLATE locale name)'
> [07:25:06.425](0.000s) 1..7

[I apologize for the delay.]

The test is assuming that locale "@colStrength=primary" is valid for
ICU but invalid for libc. It seems that on that platform, setlocale()
is accepting it?

If some libc implementations are too permissive, I might need to just
disable this test. But if we can find a locale that is consistently
acceptable in ICU but invalid in libc, then I can keep it... perhaps
"und@colStrength=primary"?

Regards,
Jeff Davis





Re: Fix search_path for all maintenance commands

2023-07-07 Thread Jeff Davis
Hi,

On Thu, 2023-07-06 at 23:22 -0400, Isaac Morland wrote:
> Maybe pg_upgrade could apply "SET search_path TO pg_catalog, pg_temp"
> to any function used in a functional index that doesn't have a
> search_path setting of its own?

I don't think we want to go down the road of trying to solve this at
upgrade time.

> Now I'm doing more reading and I'm worried about SET TIME ZONE (or
> more precisely, its absence) and maybe some other ones.

That's a good point that it's not limited to search_path, but
search_path is by far the biggest problem.

For one thing, functions affected by TimeZone or other GUCs are
typically marked STABLE, and can't be used in index expressions. Also,
search_path affects a lot more functions.
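
For instance (illustrative only, assuming a table t with a timestamptz
column ts): to_char(timestamptz, text) depends on the TimeZone GUC, is
marked STABLE, and is already rejected in an index expression:

CREATE INDEX ON t ((to_char(ts, 'YYYY-MM-DD')));
ERROR:  functions in index expression must be marked IMMUTABLE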
 
> Change it so by default each function gets handled as if "SET
> search_path FROM CURRENT" was applied to it?

Yes, that's one idea, along with some syntax to get the old behavior
(inherit search_path at runtime) if you want.
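
For context, the existing per-function form (this is real, current syntax;
the function itself is just an illustration) looks like this, and the idea
above would effectively make it the default:

CREATE FUNCTION add_one(i int) RETURNS int
    LANGUAGE sql
    SET search_path FROM CURRENT   -- capture search_path at creation time
    AS $$ SELECT i + 1 $$;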

It feels weird to make search_path too special in the syntax though. If
we want a general solution, we could do something like:

  CREATE FUNCTION ...
[DEPENDS ON CONFIGURATION {NONE|{some_guc}[, ...]}]
[CONFIGURATION IS {STATIC|DYNAMIC}]

where STATIC means "all of the GUC dependencies are SET FROM CURRENT
unless specified otherwise" and DYNAMIC means "all of the GUC
dependencies come from the session at runtime unless specified
otherwise".

The default would be "DEPENDS CONFIGURATION search_path CONFIGURATION
IS STATIC".

That would make search_path special only because, by default, every
function would depend on it. Which I think summarizes the reason
search_path really is special.

That also opens up opportunities to do other things we might want to
do:

  * have a compatibility GUC to set the default back to DYNAMIC
  * track other dependencies of functions better ("DEPENDS ON TABLE
...")
  * provide better error messages, like "can't use function xyz in
index expression because it depends on configuration parameter foo"
  * be more consistent about using STABLE to mean that the function
depends on a snapshot, rather than overloading it for GUC dependencies

The question is, given that the acute problem is search_path, do we
want to invent all of the syntax above? Are there other use cases for
it, or should we just focus on search_path?

>  That's what I do for all my functions (maybe hurting performance?).

It doesn't look cheap, although I think we could optimize it.

> If a view calls a function, shouldn't it be called in the context of
> the view's definer/owner?

Yeah, there are a bunch of problems along those lines. I don't know if
we can solve them all in one release.

> Is the fundamental problem that we now find ourselves wanting to do
> things that require different defaults to work smoothly? On some
> level I suspect we want lexical scoping, which is what most of us
> have in our programming languages, in the database; but the database
> has many elements of dynamic scoping, and changing that is both a
> compatibility break and requires significant changes in the way the
> database is designed.

Does that suggest another approach?

Regards,
Jeff Davis





Re: Implement missing join selectivity estimation for range types

2023-07-07 Thread Schoemans Maxime
Hi,

Thank you for picking up this patch.

 > The patch doesn't apply to the current postgres version. Could you 
please update it?
Indeed, the code was initially written on pg15.
You can find attached a new version of the patch that can be applied on 
the current master branch of postgres.

Please let us know if there is anything else we can do.

Best regards,
Maxime Schoemans

diff --git a/src/backend/utils/adt/multirangetypes_selfuncs.c b/src/backend/utils/adt/multirangetypes_selfuncs.c
index cefc4710fd..c670d225a0 100644
--- a/src/backend/utils/adt/multirangetypes_selfuncs.c
+++ b/src/backend/utils/adt/multirangetypes_selfuncs.c
@@ -1335,3 +1335,558 @@ calc_hist_selectivity_contains(TypeCacheEntry *typcache,
 
 	return sum_frac;
 }
+
+/*
+ * This is a utility function used to estimate the join selectivity of
+ * range attributes using rangebound histogram statistics as described
+ * in this paper:
+ *
+ * Diogo Repas, Zhicheng Luo, Maxime Schoemans and Mahmoud Sakr, 2022
+ * Selectivity Estimation of Inequality Joins In Databases
+ * https://doi.org/10.48550/arXiv.2206.07396
+ *
+ * The attributes being joined will be treated as random variables
+ * that follow a distribution modeled by a Probability Density Function (PDF).
+ * Let the two attributes be denoted X, Y.
+ * This function finds the probability P(X < Y).
+ * Note that the PDFs of the two variables can easily be obtained
+ * from their bounds histogram, respectively hist1 and hist2 .
+ *
+ * Let the PDF of X, Y be denoted as f_X, f_Y.
+ * The probability P(X < Y) can be formalized as follows:
+ * P(X < Y)= integral_-inf^inf( integral_-inf^y ( f_X(x) * f_Y(y) dx dy ) )
+ *= integral_-inf^inf( F_X(y) * f_Y(y) dy )
+ * where F_X(y) denote the Cumulative Distribution Function of X at y.
+ * Note that F_X is the selectivity estimation (non-join),
+ * which is implemented using the function calc_hist_selectivity_scalar.
+ *
+ * Now given the histograms of the two attributes X, Y, we note the following:
+ * - The PDF of Y is a step function
+ * (constant piece-wise, where each piece is defined in a bin of Y's histogram)
+ * - The CDF of X is linear piece-wise
+ *   (each piece is defined in a bin of X's histogram)
+ * This leads to the conclusion that their product
+ * (used to calculate the equation above) is also linear piece-wise.
+ * A new piece starts whenever either the bin of X or the bin of Y changes.
+ * By parallel scanning the two rangebound histograms of X and Y,
+ * we evaluate one piece of the result between every two consecutive rangebounds
+ * in the union of the two histograms.
+ *
+ * Given that the product F_X * f_y is linear in the interval
+ * between every two consecutive rangebounds, let them be denoted prev, cur,
+ * it can be shown that the above formula can be discretized into the following:
+ * P(X < Y) =
+ *   0.5 * sum_0^{n+m-1} ( ( F_X(prev) + F_X(cur) ) * ( F_Y(cur) - F_Y(prev) ) )
+ * where n, m are the lengths of the two histograms.
+ *
+ * As such, it is possible to fully compute the join selectivity
+ * as a summation of CDFs, iterating over the bounds of the two histograms.
+ * This maximizes the code reuse, since the CDF is computed using
+ * the calc_hist_selectivity_scalar function, which is the function used
+ * for selectivity estimation (non-joins).
+ *
+ */
+static double
+calc_hist_join_selectivity(TypeCacheEntry *typcache,
+		   const RangeBound *hist1, int nhist1,
+		   const RangeBound *hist2, int nhist2)
+{
+	int			i,
+j;
+	double		selectivity,
+cur_sel1,
+cur_sel2,
+prev_sel1,
+prev_sel2;
+	RangeBound	cur_sync;
+
+	/*
+	 * Histograms will never be empty. In fact, a histogram will never have
+	 * less than 2 values (1 bin)
+	 */
+	Assert(nhist1 > 1);
+	Assert(nhist2 > 1);
+
+	/* Fast-forwards i and j to start of iteration */
+	for (i = 0; range_cmp_bound_values(typcache, &hist1[i], &hist2[0]) < 0; i++);
+	for (j = 0; range_cmp_bound_values(typcache, &hist2[j], &hist1[0]) < 0; j++);
+
+	if (range_cmp_bound_values(typcache, &hist1[i], &hist2[j]) < 0)
+		cur_sync = hist1[i++];
+	else if (range_cmp_bound_values(typcache, &hist1[i], &hist2[j]) > 0)
+		cur_sync = hist2[j++];
+	else
+	{
+		/* If equal, skip one */
+		cur_sync = hist1[i];
+		i++;
+		j++;
+	}
+	prev_sel1 = calc_hist_selectivity_scalar(typcache, &cur_sync,
+			 hist1, nhist1, false);
+	prev_sel2 = calc_hist_selectivity_scalar(typcache, &cur_sync,
+			 hist2, nhist2, false);
+
+	/*
+	 * Do the estimation on overlapping region
+	 */
+	selectivity = 0.0;
+	while (i < nhist1 && j < nhist2)
+	{
+		if (range_cmp_bound_values(typcache, &hist1[i], &hist2[j]) < 0)
+			cur_sync = hist1[i++];
+		else if (range_cmp_bound_values(typcache, &hist1[i], &hist2[j]) > 0)
+			cur_sync = hist2[j++];
+		else
+		{
+			/* If equal, skip one */
+			cur_sync = hist1[i];
+			i++;
+			j++;
+		}
+		cur_sel1 = calc_hist_selectivity_scalar(typcache, &cur_sync,
+hist1, nhist1, false);
+		cur_sel2 = calc_hist_selectivity_scalar(typcache, &cur_sync,
+hist2, nhist2, false);
+
+		

Re: Unlogged relations and WAL-logging

2023-07-07 Thread Heikki Linnakangas

On 28/01/2022 15:57, Robert Haas wrote:

On Thu, Jan 27, 2022 at 2:32 PM Heikki Linnakangas  wrote:

Unlogged relations are not WAL-logged, but creating the init-fork is.
There are a few things around that seem sloppy:

1. In index_build(), we do this:


*/
   if (indexRelation->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
   !smgrexists(RelationGetSmgr(indexRelation), INIT_FORKNUM))
   {
   smgrcreate(RelationGetSmgr(indexRelation), INIT_FORKNUM, false);
   indexRelation->rd_indam->ambuildempty(indexRelation);
   }


Shouldn't we call log_smgrcreate() here? Creating the init fork is
otherwise not WAL-logged at all.


Yes, that's a bug.


Pushed and backpatched this patch (commit 3142a8845b).


2. Some implementations of ambuildempty() use the buffer cache (hash,
gist, gin, brin), while others bypass it and call smgrimmedsync()
instead (btree, spgist, bloom). I don't see any particular reason for
those decisions, it seems to be based purely on which example the author
happened to copy-paste.


I thought that this inconsistency was odd when I was developing the
unlogged feature, but I tried to keep each routine's ambuildempty()
consistent with whatever ambuild() was doing. I don't mind if you want
to change it, though.


3. Those ambuildempty implementations that bypass the buffer cache use
smgrwrite() to write the pages. That doesn't make any difference in
practice, but in principle it's wrong: You are supposed to use
smgrextend() when extending a relation.


That's a mistake on my part.


4. Also, the smgrwrite() calls are performed before WAL-logging the
pages, so the page that's written to disk has 0/0 as the LSN, not the
LSN of the WAL record. That's harmless too, but seems a bit sloppy.


That is also a mistake on my part.


I'm still sitting on these fixes. I think the patch I posted still makes 
sense, but I got carried away with a more invasive approach that 
introduces a whole new set of functions for bulk-creating a relation, 
which would handle WAL-logging, smgrimmedsync() and all that (see 
below). We have some repetitive, error-prone code in all the index build 
functions for that. But that's not backpatchable, so I'll rebase the 
original approach next week.



5. In heapam_relation_set_new_filenode(), we do this:



   /*
* If required, set up an init fork for an unlogged table so that it can
* be correctly reinitialized on restart.  An immediate sync is required
* even if the page has been logged, because the write did not go through
* shared_buffers and therefore a concurrent checkpoint may have moved 
the
* redo pointer past our xlog record.  Recovery may as well remove it
* while replaying, for example, XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE
* record. Therefore, logging is necessary even if wal_level=minimal.
*/
   if (persistence == RELPERSISTENCE_UNLOGGED)
   {
   Assert(rel->rd_rel->relkind == RELKIND_RELATION ||
  rel->rd_rel->relkind == RELKIND_MATVIEW ||
  rel->rd_rel->relkind == RELKIND_TOASTVALUE);
   smgrcreate(srel, INIT_FORKNUM, false);
   log_smgrcreate(newrnode, INIT_FORKNUM);
   smgrimmedsync(srel, INIT_FORKNUM);
   }


The comment doesn't make much sense, we haven't written nor WAL-logged
any page here, with nor without the buffer cache. It made more sense
before commit fa0f466d53.


Well, it seems to me (and perhaps I am just confused) that complaining
that there's no page written here might be a technicality. The point
is that there's no synchronization between the work we're doing here
-- which is creating a fork, not writing a page -- and any concurrent
checkpoint. So we both need to log it, and also sync it immediately.


I see. I pushed the fix from the other thread that makes smgrcreate() 
call register_dirty_segment (commit 4b4798e13). I believe that makes 
this smgrimmedsync() unnecessary. If a concurrent checkpoint happens 
with a redo pointer greater than this WAL record, it must've received 
the fsync request created by smgrcreate(). That depends on the fact that 
we write the WAL record *after* smgrcreate(). Subtle..


Hmm, we have a similar smgrimmedsync() call after index build, because 
we have written pages directly with smgrextend(skipFsync=true). If no 
checkpoints have occurred during the index build, we could call 
register_dirty_segment() instead of smgrimmedsync(). That would avoid 
the fsync() latency when creating an index on an empty or small index.


This is all very subtle to get right though. That's why I'd like to 
invent a new bulk-creation facility that would handle this stuff, and 
make the callers less error-prone.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: CHECK Constraint Deferrable

2023-07-07 Thread David G. Johnston
On Friday, July 7, 2023, Himanshu Upadhyaya 
wrote:

> I can think of one scenario, as below
>
> 1) any department should have an employee
> 2)any employee should be assigned to a department
> so, the employee table has a FK to the department table, and another check
> constraint should be added to the department table to ensure there should
> be one/more employees in this department.
>
>
That isn’t a valid/allowed check constraint - it contains a prohibited
reference to another table.

David J.


Re: CHECK Constraint Deferrable

2023-07-07 Thread Himanshu Upadhyaya
I can think of one scenario, as below:

1) any department should have an employee
2) any employee should be assigned to a department

So, the employee table has a FK to the department table, and another check
constraint should be added to the department table to ensure there are
one or more employees in this department. It's kind of a deadlock
situation: each one depends on the other one. We can't insert a new
department, because there is no employee. Also, we can't insert a new employee
belonging to this new department, because the department hasn't been and can't
be added. So if we have a check constraint defined as deferrable, we can solve
this problem.

‘postgres[685143]=#’CREATE FUNCTION checkEmpPresent(did int) RETURNS int AS
$$ SELECT count(*) from emp where emp.deptno = did $$ IMMUTABLE LANGUAGE
SQL;
CREATE FUNCTION
‘postgres[685143]=#’alter table dept add constraint check_cons check
(checkEmpPresent(deptno) > 0);
ALTER TABLE
‘postgres[685143]=#’\d dept;
Table "public.dept"
  Column  | Type  | Collation | Nullable | Default
--+---+---+--+-
 deptno   | integer   |   | not null |
 deptname | character(20) |   |  |
Indexes:
"dept_pkey" PRIMARY KEY, btree (deptno)
Check constraints:
"check_cons" CHECK (checkemppresent(deptno) > 0)
Referenced by:
TABLE "emp" CONSTRAINT "fk_cons" FOREIGN KEY (deptno) REFERENCES
dept(deptno)

‘postgres[685143]=#’insert into dept values (1, 'finance');
ERROR:  23514: new row for relation "dept" violates check constraint
"check_cons"
DETAIL:  Failing row contains (1, finance ).
SCHEMA NAME:  public
TABLE NAME:  dept
CONSTRAINT NAME:  check_cons
LOCATION:  ExecConstraints, execMain.c:2069
‘postgres[685143]=#’\d emp;
   Table "public.emp"
 Column | Type  | Collation | Nullable | Default
+---+---+--+-
 empno  | integer   |   |  |
 ename  | character(20) |   |  |
 deptno | integer   |   |  |
Foreign-key constraints:
"fk_cons" FOREIGN KEY (deptno) REFERENCES dept(deptno)

‘postgres[685143]=#’insert into emp values (1001, 'test', 1);
ERROR:  23503: insert or update on table "emp" violates foreign key
constraint "fk_cons"
DETAIL:  Key (deptno)=(1) is not present in table "dept".
SCHEMA NAME:  public
TABLE NAME:  emp
CONSTRAINT NAME:  fk_cons
LOCATION:  ri_ReportViolation, ri_triggers.c:2608

I have tried with v1 patch as below;

‘postgres[685143]=#’alter table dept drop constraint check_cons;
ALTER TABLE
‘postgres[685143]=#’alter table dept add constraint check_cons check
(checkEmpPresent(deptno) > 0) DEFERRABLE INITIALLY DEFERRED;
ALTER TABLE
‘postgres[685143]=#’BEGIN;
BEGIN
‘postgres[685143]=#*’insert into dept values (1, 'finance');
INSERT 0 1
‘postgres[685143]=#*’insert into emp values (1001, 'test', 1);
INSERT 0 1
‘postgres[685143]=#*’commit;
COMMIT
‘postgres[685143]=#’select * from dept;
 deptno |   deptname
+--
  1 | finance
(1 row)

‘postgres[685143]=#’select * from emp;
 empno |ename | deptno
---+--+
  1001 | test |  1
(1 row)

Thanks,
Himanshu

On Fri, Jul 7, 2023 at 5:21 PM Dilip Kumar  wrote:

> On Wed, Jul 5, 2023 at 3:08 PM Himanshu Upadhyaya
>  wrote:
> >
> > Hi,
> >
> > Currently, there is no support for CHECK constraint DEFERRABLE in a
> create table statement.
> > SQL standard specifies that CHECK constraint can be defined as
> DEFERRABLE.
>
> I think this is a valid argument that this is part of SQL standard so
> it would be good addition to PostgreSQL.  So +1 for the feature.
>
> But I am wondering whether there are some real-world use cases for
> deferred CHECK/NOT NULL constraints?  I mean like for foreign key
> constraints if there is a cyclic dependency between two tables then
> deferring the constraint is the simplest way to insert without error.
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com
>


-- 
Regards,
Himanshu Upadhyaya
EnterpriseDB: http://www.enterprisedb.com


Re: pg_upgrade instructions involving "rsync --size-only" might lead to standby corruption?

2023-07-07 Thread Stephen Frost
Greetings,

* Nikolay Samokhvalov (n...@postgres.ai) wrote:
> On Fri, Jun 30, 2023 at 14:33 Bruce Momjian  wrote:
> > On Fri, Jun 30, 2023 at 04:16:31PM -0400, Robert Haas wrote:
> > > I'm not quite clear on how Nikolay got into trouble here. I don't
> > > think I understand under exactly what conditions the procedure is
> > > reliable and under what conditions it isn't. But there is no way in
> > > heck I would ever advise anyone to use this procedure on a database
> > > they actually care about. This is a great party trick or something to
> > > show off in a lightning talk at PGCon, not something you ought to be
> > > doing with valuable data that you actually care about.
> >
> > Well, it does get used, and if we remove it perhaps we can have it on
> > our wiki and point to it from our docs.

I was never a fan of having it actually documented because it's a pretty
complex and involved process that really requires someone doing it to have
a strong understanding of how PG works.

> In my case, we performed some additional writes on the primary before
> running "pg_upgrade -k" and we did it *after* we shut down all the
> standbys. So those changes were not replicated and then "rsync --size-only"
> ignored them. (By the way, that cluster has wal_log_hints=on to allow
> Patroni run pg_rewind when needed.)

That's certainly going to cause problems..

> But this can happen with anyone who follows the procedure from the docs as
> is and doesn't do any additional steps, because in step 9 "Prepare for
> standby server upgrades":
> 
> 1) there is no requirement to follow specific order to shut down the nodes
>- "Streaming replication and log-shipping standby servers can remain
> running until a later step" should probably be changed to a
> requirement-like "keep them running"

Agreed that it would be good to clarify that the primary should be shut
down first, to make sure everything written by the primary has been
replicated to all of the replicas.

> 2) checking the latest checkpoint position with pg_controldata now looks
> like a thing that is good to do, but with uncertainty purpose -- it does
> not seem to be used to support any decision
>   - "There will be a mismatch if old standby servers were shut down before
> the old primary or if the old standby servers are still running" should
> probably be rephrased saying that if there is mismatch, it's a big problem

Yes, it's absolutely a big problem and that's the point of the check.
Slightly surprised that we need to explicitly say "if they don't match
then you need to figure out what you did wrong and don't move forward
until you get everything shut down and with matching values", but that's
also why it isn't a great idea to try and do this without a solid
understanding of how PG works.

> So following the steps as is, if some writes on the primary are not
> replicated (due to whatever reason) before execution of pg_upgrade -k +
> rsync --size-only, then those writes are going to be silently lost on
> standbys.

Yup.

> I wonder, if we ensure that standbys are fully caught up before upgrading
> the primary, if we check the latest checkpoint positions, are we good to
> use "rsync --size-only", or there are still some concerns? It seems so to
> me, but maybe I'm missing something.

I've seen a lot of success with it.

Ultimately, when I was describing this process, it was always with the
idea that it would be performed by someone quite familiar with the
internals of PG or, ideally, could be an outline of how an interested PG
hacker could write a tool to do it.  Hard to say, but I do feel like
having it documented has actually reduced the interest in writing a tool
to do it, which, if that's the case, is quite unfortunate.

Thanks,

Stephen




RE: UUID v7

2023-07-07 Thread Kyzer Davis (kydavis)
Great discussions group,

> I think it would be reasonable to review this patch now.
I am happy to review the format and logic for any proposed v7 and/or v8
UUID. Just point me to a PR or some code review location.

> Distributed UUID Generation" states that "node ID" is there to decrease
> the likelihood of a collision.
Correct, node identifiers help provide some bit space that ensures no
collision in the event the stars align where two nodes create the exact
same UUID.

From what I have seen, UUIDv7 should meet the requirements outlined thus far
in this thread.

Also to add, there are two UUID prototypes for Postgres from my checks
(although they are outdated relative to the latest draft sent up for official
publication, so review them from an academic perspective):
- https://github.com/uuid6/prototypes
- pg_uuid_next (see this thread which nicely summarizes some UUIDv7
"checkboxes" https://github.com/x4m/pg_uuid_next/issues/1)
- UUID_v7_for_Postgres.sql

Don't forget: since UUIDv1 is already implemented in the Postgres code, you
may want to examine UUIDv6.
UUIDv6 is simply a fork of that code with a swap of the timestamp bits.
In terms of effort, UUIDv6 is easy to implement and gives you a time-ordered
UUID re-using 99% of the code you may already have.

Lastly, my advice on v8 is that I would examine/implement v6 or v7 first
before jumping to v8
because whatever you do for implementing v6 or v7 will help you implement a
better v8.
There are also a number of v8 prototype implementations (at the previous
link) if somebody wants to give them a scroll.

Happy to answer any other questions where I can provide input.

Thanks,

-Original Message-
From: Andrey M. Borodin  
Sent: Friday, July 7, 2023 8:06 AM
To: Peter Eisentraut 
Cc: Tom Lane ; Daniel Gustafsson ;
Matthias van de Meent ; Nikolay Samokhvalov
; Kyzer Davis (kydavis) ; Andres
Freund ; Andrey Borodin ;
PostgreSQL Hackers ; b...@peabody.io;
wol...@gmail.com
Subject: Re: UUID v7



> On 6 Jul 2023, at 21:38, Peter Eisentraut
 wrote:
> 
> I think it would be reasonable to review this patch now.
+1.

Also, I think we should discuss UUID v8. UUID version 8 provides an
RFC-compatible format for experimental or vendor-specific use cases.
Revision 1 of IETF draft contained interesting code for v8: almost similar
to v7, but with fields for "node ID" and "rolling sequence number".
I think this is reasonable approach, thus I attach implementation of UUID v8
per [0]. But from my point of view this implementation has some flaws.
These two new fields "node ID" and "sequence" are there not for uniqueness,
but rather for data locality.
But they are placed at the end, in bytes 14 and 15, after randomly generated
numbers.

I think that "sequence" is there to help generate local ascending
identifiers when the real time clock do not provide enough resolution. So
"sequence" field must be placed after 6 bytes of time-generated identifier.

On a contrary "node ID" must differentiate identifiers generated on
different nodes. So it makes sense to place "node ID" before timing. So
identifiers generated on different nodes will tend to be in different
ranges.
Although, section "6.4. Distributed UUID Generation" states that "node ID"
is there to decrease the likelihood of a collision. So my intuition might be
wrong here.


Do we want to provide this "vendor-specific" UUID with tweaks for databases?
Or should we limit the scope with well defined UUID v7?


Best regards, Andrey Borodin.


[0] https://datatracker.ietf.org/doc/html/draft-ietf-uuidrev-rfc4122bis-01





Standardize type of variable when extending Buffers

2023-07-07 Thread Ranier Vilela
Hi,

This has already been discussed in [1].
But I thought it best to start a new thread.

The commit 31966b1 introduced the infrastructure to extend buffers.
But the patch mixed types with int and uint32.
The correct type of the variable counter is uint32.

Fix by standardizing the int type to uint32.

patch attached.

best regards,
Ranier Vilela

[1]
https://www.postgresql.org/message-id/CAEudQAr_oWHpZk4uumZijYS362gp4KHAah-yUe08CQY4a4SsOQ%40mail.gmail.com


0001-Standardize-type-of-extend_by-counter.patch
Description: Binary data


Re: benchmark results comparing versions 15.2 and 16

2023-07-07 Thread Alexander Lakhin

Hello David,

11.05.2023 16:00, Alexander Lakhin wrote:

Yeah, I see. It's also interesting to me, which tests perform better after
that commit. It takes several hours to run all tests, so I can't present
results quickly, but I'll try to collect this information next week.


To my regret, I could not find such tests that week, so I decided to try
later, after reviewing my benchmark running infrastructure.

But for now, as Postgres Pro company graciously shared the benchmarking
infrastructure project [1], it's possible for anyone to confirm or deny my
results. (You can also see which tests were performed and how.)

Having done one more round of comprehensive testing, I still couldn't find
winning tests for commit 3c6fc5820 (but reassured that
test s64da_tpcds.query87 loses a little (and also s64da_tpcds.query38)).
So it seems to me that the tests I performed or their parameters is not
representative enough for that improvement, unfortunately.

[1] https://github.com/postgrespro/pg-mark

Best regards,
Alexander




RE: [Patch] Use *other* indexes on the subscriber when REPLICA IDENTITY is FULL

2023-07-07 Thread Hayato Kuroda (Fujitsu)
Dear Peter,

Thanks for reviewing. PSA new version.
I planned to post a new version after supporting more indexes, but it may take
more time.
So I want to address your comments now.

> ==
> src/backend/executor/execReplication.c
> 
> 1. get_equal_strategy_number
> 
> +/*
> + * Return the appropriate strategy number which corresponds to the equality
> + * comparisons.
> + *
> + * TODO: support other indexes: GiST, GIN, SP-GiST, BRIN
> + */
> +static int
> +get_equal_strategy_number(Oid opclass)
> +{
> + Oid am_method = get_opclass_method(opclass);
> + int ret;
> +
> + switch (am_method)
> + {
> + case BTREE_AM_OID:
> + ret = BTEqualStrategyNumber;
> + break;
> + case HASH_AM_OID:
> + ret = HTEqualStrategyNumber;
> + break;
> + case GIST_AM_OID:
> + case GIN_AM_OID:
> + case SPGIST_AM_OID:
> + case BRIN_AM_OID:
> + /* TODO */
> + default:
> + /* XXX: Do we have to support extended indexes? */
> + ret = InvalidStrategy;
> + break;
> + }
> +
> + return ret;
> +}
> 
> 1a.
> In the file syscache.c there are already some other functions like
> get_op_opfamily_strategy so I am wondering if this new function also
> really belongs in that file.

According to the comment atop syscache.c, it contains routines which access
the system catalog cache. get_equal_strategy_number() does not do that itself,
so I don't think it should be in that file.

> 1b.
> Should that say "operator" instead of "comparisons"?

Changed.

> 1c.
> "am" stands for "access method" so "am_method" is like "access method
> method" – is it correct?

Changed to "am".

> 2. build_replindex_scan_key
> 
>   int table_attno = indkey->values[index_attoff];
> + int strategy_number;
> 
> 
> Ought to say this is a strategy for "equality", so a better varname
> might be "equality_strategy_number" or "eq_strategy" or similar.

Changed to "eq_strategy".

> src/backend/replication/logical/relation.c
> 
> 3. IsIndexUsableForReplicaIdentityFull
> 
> It seems there is some overlap between this code which hardwired 2
> valid AMs and the switch statement in your other
> get_equal_strategy_number function which returns a strategy number for
> those 2 AMs.
> 
> Would it be better to make another common function like
> get_equality_strategy_for_am(), and then you don’t have to hardwire
> anything? Instead, you can say:
> 
> is_usable_type = get_equality_strategy_for_am(indexInfo->ii_Am) !=
> InvalidStrategy;

Added. What do you think?

> src/backend/utils/cache/lsyscache.c
> 
> 4. get_opclass_method
> 
> +/*
> + * get_opclass_method
> + *
> + * Returns the OID of the index access method operator family is for.
> + */
> +Oid
> +get_opclass_method(Oid opclass)
> +{
> + HeapTuple tp;
> + Form_pg_opclass cla_tup;
> + Oid result;
> +
> + tp = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclass));
> + if (!HeapTupleIsValid(tp))
> + elog(ERROR, "cache lookup failed for opclass %u", opclass);
> + cla_tup = (Form_pg_opclass) GETSTRUCT(tp);
> +
> + result = cla_tup->opcmethod;
> + ReleaseSysCache(tp);
> + return result;
> +}
> 
> Is the comment correct? IIUC, this function is not doing anything for
> "families".

Modified to "class".

> src/test/subscription/t/034_hash.pl
> 
> 5.
> +# insert some initial data within the range 0-9 for y
> +$node_publisher->safe_psql('postgres',
> + "INSERT INTO test_replica_id_full SELECT i, (i%10)::text FROM
> generate_series(0,10) i"
> +);
> 
> Why does the comment only say "for y"?

After considering more, I thought we do not have to mention data.
So removed the part " within the range 0-9 for y".

> 6.
> +# wait until the index is used on the subscriber
> +# XXX: the test will be suspended here
> +$node_publisher->wait_for_catchup($appname);
> +$node_subscriber->poll_query_until('postgres',
> + q{select (idx_scan = 4) from pg_stat_all_indexes where indexrelname
> = 'test_replica_id_full_idx';}
> +  )
> +  or die
> +  "Timed out while waiting for check subscriber tap_sub_rep_full
> updates 4 rows via index";
> +
> 
> AFAIK this is OK but it was slightly misleading because it says
> "updates 4 rows" whereas the previous UPDATE was only for 2 rows. So
> here I think the 4 also counts the 2 DELETED rows. The comment can be
> expanded slightly to clarify this.

Clarified two rows were deleted.

> 7.
> FYI, when I ran the TAP test the result was this:
> 
> t/034_hash.pl .. 1/? # Tests were run but no plan
> was declared and done_testing() was not seen.
> t/034_hash.pl .. All 2 subtests passed
> t/100_bugs.pl .. ok
> 
> Test Summary Report
> ---
> t/034_hash.pl(Wstat: 0 Tests: 2 Failed: 0)
>   Parse errors: No plan found in TAP output
> Files=36, Tests=457, 338 wallclock secs ( 0.29 usr  0.07 sys + 206.73
> cusr 47.94 csys = 255.03 CPU)
> Result: FAIL
> 
> ~
> 
> Just adding the missing done_testing() seemed to fix this.

Right. I used meson build system and it said OK. Added.

Furthermore, I added a check to reject indexes which do not implement 

Re: pg_upgrade and cross-library upgrades

2023-07-07 Thread Daniel Gustafsson
> On 6 Jul 2023, at 02:19, Michael Paquier  wrote:
> On Wed, Jul 05, 2023 at 07:03:56AM -0400, Tom Lane wrote:

>> Not entirely sure this is worth the effort.
> 
> I am not sure either..

I can't see off the cuff that it would bring better test coverage or find bugs
relative to the effort of making it stable.

>  Anyway, the buildfarm code does similar things
> already, see around $bad_module in TestUpgradeXversion.pm, for
> instance.  So this kind of workaround exists already.  It seems to me
> that we should try to pull that out of the buildfarm code and have
> that in the core module instead as a routine that would be used by the
> in-core TAP tests of pg_upgrade and the buildfarm code.

That however, would be a more interesting outcome.

--
Daniel Gustafsson





Re: Disabling Heap-Only Tuples

2023-07-07 Thread Ants Aasma
On Fri, 7 Jul 2023 at 13:18, Tomas Vondra  wrote:
> On 7/7/23 11:55, Matthias van de Meent wrote:
> > On Fri, 7 Jul 2023 at 06:53, Dilip Kumar  wrote:
> >>
> >> On Fri, Jul 7, 2023 at 1:48 AM Matthias van de Meent
> >>  wrote:
> >>>
> >>> On Wed, 5 Jul 2023 at 19:55, Thom Brown  wrote:
> 
>  On Wed, 5 Jul 2023 at 18:05, Matthias van de Meent
>   wrote:
> > So what were you thinking of? A session GUC? A table option?
> 
>  Both.
> >>>
> >>> Here's a small patch implementing a new table option max_local_update
> >>> (name very much bikesheddable). Value is -1 (default, disabled) or the
> >>> size of the table in MiB that you still want to allow to update on the
> >>> same page. I didn't yet go for a GUC as I think that has too little
> >>> control on the impact on the system.
> >>
> >> So IIUC, this parameter we can control that instead of putting the new
> >> version of the tuple on the same page, it should choose using
> >> RelationGetBufferForTuple(), and that can reduce the fragmentation
> >> because now if there is space then most of the updated tuple will be
> >> inserted in same pages.  But this still can not truncate the pages
> >> from the heap right? because we can not guarantee that the new page
> >> selected by RelationGetBufferForTuple() is not from the end of the
> >> heap, and until we free the pages from the end of the heap, the vacuum
> >> can not truncate any page.  Is my understanding correct?
> >
> > Yes. If you don't have pages with (enough) free space for the updated
> > tuples in your table, or if the FSM doesn't accurately reflect the
> > actual state of free space in your table, this won't help (which is
> > also the reason why I run vacuum in the tests). It also won't help if
> > you don't update the tuples physically located at the end of your
> > table, but in the targeted workload this would introduce a bias where
> > new tuple versions are moved to the front of the table.
> >
> > Something to note is that this may result in very bad bloat when this
> > is combined with a low fillfactor: All blocks past max_local_update
> > will be unable to use space reserved by fillfactor because FSM lookups
> > always take fillfactor into account, and all updates (which ignore
> > fillfactor when local) would go through the FSM instead, thus reducing
> > the space available on each block to exactly the fillfactor. So, this
> > might need some extra code to make sure we don't accidentally blow up
> > the table's size with UPDATEs when max_local_update is combined with
> > low fillfactors. I'm not sure where that would fit best.
> >
>
> I know the thread started as "let's disable HOT" and this essentially
> just proposes to do that using a table option. But I wonder if that's
> far too simple to be reliable, because hoping RelationGetBufferForTuple
> happens to do the right thing does not seem great.
>
> I wonder if we should invent some definition of "strategy" that would
> tell RelationGetBufferForTuple what it should aim for ...
>
> I'm imagining either a table option with a couple possible values
> (default, non-hot, first-page, ...) or maybe something even more
> elaborate (perhaps even a callback?).
>
> Now, it's not my intention to hijack this thread, but this discussion
> reminds me one of the ideas from my "BRIN improvements" talk, about
> maybe using BRIN indexes for routing. UPDATEs may be a major issue for
> BRIN, making them gradually worse over time. If we could "tell"
> RelationGetBufferForTuple() which buffers are more suitable (by looking
> at an index, histogram or some approximate mapping), that might help.

Just as another point in support of strategy based/extensible tuple
placement, I would at some point try out placing INSERT ON CONFLICT
tuples on the same page as the preceding key in the index. The use case is
tables with a (series, timestamp) primary key, to get locality of
access when range scanning a single series. Placement will always be a
tradeoff that is dependent on hardware and workload, and the effect
can be pretty large. For the mentioned use case, if placement can
maintain some semblance of clustering, there will be a 10-100x
reduction in buffers accessed for a relatively minor increase in
bloat.

--
Ants Aasma
Senior Database Engineer
www.cybertec-postgresql.com




Re: remaining sql/json patches

2023-07-07 Thread Alvaro Herrera
Looking at 0001 now.

I noticed that it adds JSON, JSON_SCALAR and JSON_SERIALIZE as reserved
keywords to doc/src/sgml/keywords/sql2016-02-reserved.txt; but those
keywords do not appear in the 2016 standard as reserved.  I see that
those keywords appear as reserved in sql2023-02-reserved.txt, so I
suppose you're covered as far as that goes; you don't need to patch
sql2016, and indeed that's the wrong thing to do.

I see that you add json_returning_clause_opt, but we already have
json_output_clause_opt.  Shouldn't these two be one and the same?
I think the new name is more sensible than the old one, since the
governing keyword is RETURNING; I suppose naming it "output" comes from
the fact that the standard calls this .

typo "requeted"

I'm not in love with the fact that JSON and JSONB have pretty much
parallel type categorizing functionality. It seems entirely artificial.
Maybe this didn't matter when these were contained inside each .c file
and nobody else had to deal with that, but I think it's not good to make
this an exported concept.  Is it possible to do away with that?  I mean,
reduce both to a single categorization enum, and a single categorization
API.  Here you have to cast the enum value to int in order to make
ExecInitExprRec work, and that seems a bit lame; moreso when the
"is_jsonb" is determined separately (cf. ExecEvalJsonConstructor)

In the 2023 standard, JSON_SCALAR is just

 ::= JSON_SCALAR   

but we seem to have added a  clause to it.  Should
we really?

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Entristecido, Wutra (canción de Las Barreras)
echa a Freyr a rodar
y a nosotros al mar"




gcc -Wclobbered in PostgresMain

2023-07-07 Thread Sergey Shinderuk

Hi, hackers!

While analyzing -Wclobbered warnings from gcc we found a true one in 
PostgresMain():


postgres.c: In function ‘PostgresMain’:
postgres.c:4118:25: warning: variable 
‘idle_in_transaction_timeout_enabled’ might be clobbered by ‘longjmp’ or 
‘vfork’ [-Wclobbered]
 4118 | boolidle_in_transaction_timeout_enabled = 
false;

  | ^~~
postgres.c:4119:25: warning: variable ‘idle_session_timeout_enabled’ 
might be clobbered by ‘longjmp’ or ‘vfork’ [-Wclobbered]

 4119 | boolidle_session_timeout_enabled = false;
  | ^~~~

These variables must be declared volatile, because they are read after
longjmp(). (send_ready_for_query, declared there, is already volatile.)


Without volatile, these variables are kept in registers and restored by
longjmp(). I think this is harmless because the error handling code
calls disable_all_timeouts() anyway. But strictly speaking the code is
invoking undefined behavior by reading those variables after longjmp(),
so it's worth fixing, and for consistency with send_ready_for_query too.
I believe making them volatile doesn't affect performance in any way.


I also moved firstchar's declaration inside the loop where it's used, to 
make it clear that this variable needn't be volatile and is not 
preserved after longjmp().


Best regards,

--
Sergey Shinderuk    https://postgrespro.com/

diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 01b6cc1f7d3..56d8b0814b5 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -4111,12 +4111,11 @@ PostgresSingleUserMain(int argc, char *argv[],
 void
 PostgresMain(const char *dbname, const char *username)
 {
-	int			firstchar;
 	StringInfoData input_message;
 	sigjmp_buf	local_sigjmp_buf;
 	volatile bool send_ready_for_query = true;
-	bool		idle_in_transaction_timeout_enabled = false;
-	bool		idle_session_timeout_enabled = false;
+	volatile bool idle_in_transaction_timeout_enabled = false;
+	volatile bool idle_session_timeout_enabled = false;
 
 	Assert(dbname != NULL);
 	Assert(username != NULL);
@@ -4418,6 +4417,8 @@ PostgresMain(const char *dbname, const char *username)
 
 	for (;;)
 	{
+		int			firstchar;
+
 		/*
 		 * At top of loop, reset extended-query-message flag, so that any
 		 * errors encountered in "idle" state don't provoke skip.


Re: DROP DATABASE is interruptible

2023-07-07 Thread Daniel Gustafsson
> On 25 Jun 2023, at 19:03, Andres Freund  wrote:
> On 2023-06-21 12:02:04 -0700, Andres Freund wrote:
>> I'm hacking on this bugfix again, 

This patch LGTM from reading through and testing (manually and with your
supplied tests in the patch), I think this is a sound approach to deal with
this.

>> I've been adding checks for partiall-dropped databases to the following 
>> places
>> so far:
>> - vac_truncate_clog(), as autovacuum can't process it anymore. Otherwise a
>>  partially dropped database could easily lead to shutdown-due-to-wraparound.
>> - get_database_list() - so autovacuum workers don't error out when connecting
>> - template database used by CREATE DATABASE
>> - pg_dumpall, so we don't try to connect to the database
>> - vacuumdb, clusterdb, reindexdb, same
> 
> Also pg_amcheck.

That seems like an exhaustive list to me; I was unable to think of any other
place which would need the same treatment.  pg_checksums does come to mind, but
it clearly cannot see the required info, so there doesn't seem like there's a
lot to do about that.

>> I haven't yet added checks to pg_upgrade, even though that's clearly
>> needed. I'm waffling a bit between erroring out and just ignoring the
>> database? pg_upgrade already fails when datallowconn is set "wrongly", see
>> check_proper_datallowconn().  Any opinions?
> 
> There don't need to be explict checks, because pg_upgrade will fail, because
> it connects to every database. Obviously the error could be nicer, but it
> seems ok for something hopefully very rare. I did add a test ensuring that the
> behaviour is caught.

I don't see any pg_upgrade test in the patch?

> It's somewhat odd that pg_upgrade prints errors on stdout...

There are many odd things about pg_upgrade logging, updating it to use the
common logging framework of other utils would be nice.

>> I'm not sure what should be done for psql. It's probably not a good idea to
>> change tab completion, that'd just make it appear the database is gone. But 
>> \l
>> could probably show dropped databases more prominently?
> 
> I have not done that. I wonder if this is something that should be done in the
> back branches?

Possibly, I'm not sure where we usually stand on changing the output format of
\ commands in psql in minor revisions.
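
For reference, with the patch's datconnlimit = -2 convention an invalid
(partially dropped) database can be spotted with a simple catalog query,
e.g. (illustrative only):

SELECT datname FROM pg_database WHERE datconnlimit = -2;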

A few small comments on the patch:

+ * Max connections allowed (-1=no limit, -2=invalid database). A database
+ * is set to invalid partway through eing dropped.  Using datconnlimit=-2
+ * for this purpose isn't particularly clean, but is backpatchable.
Typo: s/eing/being/.  A limit of -1 makes sense, but the meaning of -2 is less
intuitive IMO.  Would it make sense to add a #define with a more descriptive
name to save folks reading this having to grep around and figure out what -2
means?

+   errhint("Use DROP DATABASE to drop invalid databases"));
Should end with a period as a complete sentence?

+   errmsg("cannot alter invalid database \"%s\"", stmt->dbname),
+   errdetail("Use DROP DATABASE to drop invalid databases"));
Shouldn't this be an errhint() instead? Also ending with a period.

+   if (database_is_invalid_form((Form_pg_database) dbform))
+   continue;
Would it make sense to stick a DEBUG2 log entry in there to signal that such a
database exists? (The same would apply for the similar hunk in autovacuum.c.)

--
Daniel Gustafsson





Re: UUID v7

2023-07-07 Thread Andrey M. Borodin


> On 6 Jul 2023, at 21:38, Peter Eisentraut  
> wrote:
> 
> I think it would be reasonable to review this patch now.
+1.

Also, I think we should discuss UUID v8. UUID version 8 provides an 
RFC-compatible format for experimental or vendor-specific use cases. Revision 1
of the IETF draft contained interesting code for v8: almost similar to v7, but with
fields for "node ID" and "rolling sequence number".
I think this is a reasonable approach, thus I attach an implementation of UUID v8
per [0]. But from my point of view this implementation has some flaws.
These two new fields "node ID" and "sequence" are there not for uniqueness, but 
rather for data locality.
But they are placed at the end, in bytes 14 and 15, after randomly generated 
numbers.

I think that "sequence" is there to help generate local ascending identifiers 
when the real time clock do not provide enough resolution. So "sequence" field 
must be placed after 6 bytes of time-generated identifier.

On a contrary "node ID" must differentiate identifiers generated on different 
nodes. So it makes sense to place "node ID" before timing. So identifiers 
generated on different nodes will tend to be in different ranges.
Although, section "6.4. Distributed UUID Generation" states that "node ID" is 
there to decrease the likelihood of a collision. So my intuition might be wrong 
here.


Do we want to provide this "vendor-specific" UUID with tweaks for databases? Or 
should we limit the scope with well defined UUID v7?


Best regards, Andrey Borodin.


[0] https://datatracker.ietf.org/doc/html/draft-ietf-uuidrev-rfc4122bis-01



v2-0001-Implement-UUID-v7-and-v8-as-per-IETF-draft.patch
Description: Binary data


Re: remaining sql/json patches

2023-07-07 Thread Amit Langote
On Fri, Jul 7, 2023 at 8:31 PM Peter Eisentraut  wrote:
> On 21.06.23 10:25, Amit Langote wrote:
> > I realized that the patch for the "other sql/json functions" part is
> > relatively straightforward and has no dependence on the "sql/json
> > query functions" part getting done first.  So I've made that one the
> > 0001 patch.  The patch I posted in the last email is now 0002, though
> > it only has changes related to changing the order of the patch, so I
> > decided not to change the patch version marker (v1).
>
> (I suggest you change the version number anyway, next time.  There are
> plenty of numbers available.)

Will do. :)

> The 0001 patch contains a change to
> doc/src/sgml/keywords/sql2016-02-reserved.txt, which seems
> inappropriate.  The additional keywords are already listed in the 2023
> file, and they are not SQL:2016 keywords.

Ah, indeed.  Will remove.

> Another thing, I noticed that the SQL/JSON patches in PG16 introduced
> some nonstandard indentation in gram.y.  I would like to apply the
> attached patch to straighten this out.

Sounds fine to me.

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com




Re: CHECK Constraint Deferrable

2023-07-07 Thread Dilip Kumar
On Wed, Jul 5, 2023 at 3:08 PM Himanshu Upadhyaya
 wrote:
>
> Hi,
>
> Currently, there is no support for CHECK constraint DEFERRABLE in a create 
> table statement.
> SQL standard specifies that CHECK constraint can be defined as DEFERRABLE.

I think this is a valid argument that this is part of the SQL standard, so
it would be a good addition to PostgreSQL.  So +1 for the feature.

But I am wondering whether there are some real-world use cases for
deferred CHECK/NOT NULL constraints?  I mean, for foreign key
constraints, if there is a cyclic dependency between two tables then
deferring the constraint is the simplest way to insert without error.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: remaining sql/json patches

2023-07-07 Thread Peter Eisentraut

On 21.06.23 10:25, Amit Langote wrote:

I realized that the patch for the "other sql/json functions" part is
relatively straightforward and has no dependence on the "sql/json
query functions" part getting done first.  So I've made that one the
0001 patch.  The patch I posted in the last email is now 0002, though
it only has changes related to changing the order of the patch, so I
decided not to change the patch version marker (v1).


(I suggest you change the version number anyway, next time.  There are 
plenty of numbers available.)


The 0001 patch contains a change to 
doc/src/sgml/keywords/sql2016-02-reserved.txt, which seems 
inappropriate.  The additional keywords are already listed in the 2023 
file, and they are not SQL:2016 keywords.


Another thing, I noticed that the SQL/JSON patches in PG16 introduced 
some nonstandard indentation in gram.y.  I would like to apply the 
attached patch to straighten this out.

diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 39ab7eac0d..edb6c00ece 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -645,23 +645,20 @@ static Node *makeRecursiveViewSelect(char *relname, List 
*aliases, Node *query);
 %typehash_partbound
 %type  hash_partbound_elem
 
+%typejson_format_clause_opt
+   json_value_expr
+   json_output_clause_opt
+   json_name_and_value
+   json_aggregate_func
+%typejson_name_and_value_list
+   json_value_expr_list
+   json_array_aggregate_order_by_clause_opt
+%typejson_encoding_clause_opt
+   json_predicate_type_constraint
+%type json_key_uniqueness_constraint_opt
+   json_object_constructor_null_clause_opt
+   json_array_constructor_null_clause_opt
 
-%typejson_format_clause_opt
-   json_value_expr
-   json_output_clause_opt
-   json_name_and_value
-   json_aggregate_func
-
-%typejson_name_and_value_list
-   json_value_expr_list
-   json_array_aggregate_order_by_clause_opt
-
-%typejson_encoding_clause_opt
-   json_predicate_type_constraint
-
-%type json_key_uniqueness_constraint_opt
-   json_object_constructor_null_clause_opt
-   json_array_constructor_null_clause_opt
 
 /*
  * Non-keyword token types.  These are hard-wired into the "flex" lexer.


Re: Add hint message for check_log_destination()

2023-07-07 Thread Japin Li

On Fri, 07 Jul 2023 at 16:21, Masahiko Sawada  wrote:
> On Fri, Jul 7, 2023 at 4:53 PM Japin Li  wrote:
>>
>>
>> On Fri, 07 Jul 2023 at 14:46, jian he  wrote:
>> > On Fri, Jul 7, 2023 at 1:06 PM Japin Li  wrote:
>> >>
>> >>
>> >> Hi, hackers
>> >>
>> >> When I try to change log_destination using ALTER SYSTEM with the wrong 
>> >> value,
>> >> it complains of the "Unrecognized key word" without available values.  
>> >> This
>> >> patch tries to add a hint message that provides available values for
>> >> log_destination.  Any thoughts?
>
> +1
>
> +   appendStringInfo(, "\"stderr\"");
> +#ifdef HAVE_SYSLOG
> +   appendStringInfo(, ", \"syslog\"");
> +#endif
> +#ifdef WIN32
> +   appendStringInfo(, ", \"eventlog\"");
> +#endif
> +   appendStringInfo(, ", \"csvlog\", and \"jsonlog\"");
>
> I think using appendStringInfoString() is a bit more natural and faster.
>

Thanks for your review! Fixed as per your suggestion.

>From e0338c5085e655f9a25f13cd5acea9a3110806fd Mon Sep 17 00:00:00 2001
From: Japin Li 
Date: Fri, 7 Jul 2023 15:48:53 +0800
Subject: [PATCH v3 1/1] Add hint message for check_log_destination

---
 src/backend/utils/error/elog.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index 5898100acb..a7545dcbd9 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -2228,8 +2228,22 @@ check_log_destination(char **newval, void **extra, GucSource source)
 #endif
 		else
 		{
+			StringInfoData errhint;
+
+			initStringInfo(&errhint);
+			appendStringInfoString(&errhint, "\"stderr\"");
+#ifdef HAVE_SYSLOG
+			appendStringInfoString(&errhint, ", \"syslog\"");
+#endif
+#ifdef WIN32
+			appendStringInfoString(&errhint, ", \"eventlog\"");
+#endif
+			appendStringInfoString(&errhint, ", \"csvlog\", and \"jsonlog\"");
+
 			GUC_check_errdetail("Unrecognized key word: \"%s\".", tok);
+			GUC_check_errhint("Valid values are combinations of %s.", errhint.data);
 			pfree(rawstring);
+			pfree(errhint.data);
 			list_free(elemlist);
 			return false;
 		}
-- 
2.25.1


-- 
Regards,
Japin Li.



Re: Disabling Heap-Only Tuples

2023-07-07 Thread Thom Brown
On Thu, 6 Jul 2023 at 21:18, Matthias van de Meent
 wrote:
>
> On Wed, 5 Jul 2023 at 19:55, Thom Brown  wrote:
> >
> > On Wed, 5 Jul 2023 at 18:05, Matthias van de Meent
> >  wrote:
> > > So what were you thinking of? A session GUC? A table option?
> >
> > Both.
>
> Here's a small patch implementing a new table option max_local_update
> (name very much bikesheddable). Value is -1 (default, disabled) or the
> size of the table in MiB that you still want to allow to update on the
> same page. I didn't yet go for a GUC as I think that has too little
> control on the impact on the system.
>
> I decided that max_local_update would be in MB because there is no
> reloption value that can contain MaxBlockNumber and -1/disabled; and 1
> MiB seems like enough granularity for essentially all use cases.
>
> The added regression tests show how this feature works, that the new
> feature works, and validate that lock levels are acceptable
> (ShareUpdateExclusiveLock, same as for updating fillfactor).

Wow, thanks for working on this.

I've given it a test, and it does what I would expect it to do.

I'm aware of the concerns about the potential for the relocation to
land in an undesirable location, so perhaps that needs addressing.
But this is already considerably better than the current need to
update a row until it gets pushed off its current page.  Ideally there
would be tooling built around this where the user wouldn't need to
figure out how much of the table to UPDATE, or deal with VACUUMing
concerns.

But here's my quick test:

CREATE OR REPLACE FUNCTION compact_table(table_name IN TEXT)
RETURNS VOID AS $$
DECLARE
current_row RECORD;
old_ctid TID;
new_ctid TID;
keys TEXT;
update_query TEXT;
row_counter INTEGER := 0;
BEGIN
SELECT string_agg(a.attname || ' = ' || a.attname, ', ')
INTO keys
FROM
pg_index i
JOIN
pg_attribute a ON a.attnum = ANY(i.indkey)
WHERE
i.indrelid = table_name::regclass
AND a.attrelid = table_name::regclass
AND i.indisprimary;

IF keys IS NULL THEN
RAISE EXCEPTION 'Table % does not have a primary key.', table_name;
END IF;

FOR current_row IN
EXECUTE FORMAT('SELECT ctid, * FROM %I ORDER BY ctid DESC', table_name)
LOOP
old_ctid := current_row.ctid;

update_query := FORMAT('UPDATE %I SET %s WHERE ctid = $1
RETURNING ctid', table_name, keys);
EXECUTE update_query USING old_ctid INTO new_ctid;

row_counter := row_counter + 1;

IF row_counter % 1000 = 0 THEN
RAISE NOTICE '% rows relocated.', row_counter;
END IF;

IF new_ctid <= old_ctid THEN
CONTINUE;
ELSE
RAISE NOTICE 'All non-contiguous rows relocated.';
EXIT;
END IF;
END LOOP;
END; $$
LANGUAGE plpgsql;


postgres=# CREATE TABLE bigtable (id int, content text);
CREATE TABLE
postgres=# INSERT INTO bigtable SELECT x, 'This is just a way to fill
up space.' FROM generate_series(1,10000000) a(x);
INSERT 0 10000000
postgres=# DELETE FROM bigtable WHERE id % 7 = 0;
DELETE 1428571
postgres=# VACUUM bigtable;
VACUUM
postgres=# ALTER TABLE bigtable SET (max_local_update = 0);
ALTER TABLE
postgres=# ALTER TABLE bigtable ADD PRIMARY KEY (id);
ALTER TABLE
postgres=# \dt+ bigtable
   List of relations
 Schema |   Name   | Type  | Owner | Persistence | Access method |
Size  | Description
+--+---+---+-+---++-
 public | bigtable | table | thom  | permanent   | heap  | 730 MB |
(1 row)

postgres=# SELECT * FROM pgstattuple('bigtable');
 table_len | tuple_count | tuple_len | tuple_percent |
dead_tuple_count | dead_tuple_len | dead_tuple_percent | free_space |
free_percent
---+-+---+---+--++++--
 765607936 | 8571429 | 557142885 | 72.77 |
0 |  0 |  0 |  105901628 |13.83
(1 row)

postgres=# SELECT compact_table('bigtable');
NOTICE:  1000 rows relocated.
NOTICE:  2000 rows relocated.
NOTICE:  3000 rows relocated.
NOTICE:  4000 rows relocated.
...
NOTICE:  1221000 rows relocated.
NOTICE:  1222000 rows relocated.
NOTICE:  1223000 rows relocated.
NOTICE:  1224000 rows relocated.
NOTICE:  All non-contiguous rows relocated.
 compact_table
---

(1 row)

postgres=# VACUUM bigtable;
VACUUM
postgres=# \dt+ bigtable;
   List of relations
 Schema |   Name   | Type  | Owner | Persistence | Access method |
Size  | Description
+--+---+---+-+---++-
 public | bigtable | table | thom  | permanent   | heap  | 626 MB |
(1 row)

postgres=# SELECT * FROM pgstattuple('bigtable');
 table_len | tuple_count | tuple_len | tuple_percent |
dead_tuple_count | dead_tuple_len | 

Re: [PATCH] Add support function for containment operators

2023-07-07 Thread Laurenz Albe
I wrote:
> You implement both "SupportRequestIndexCondition" and 
> "SupportRequestSimplify",
> but when I experimented, the former was never called.  That does not
> surprise me, since any expression of the shape "expr <@ range constant"
> can be simplified.  Is the "SupportRequestIndexCondition" branch dead code?
> If not, do you have an example that triggers it?

I had an idea about this:
So far, you only consider constant ranges.  But if we have a STABLE range
expression, you could use an index scan for "expr <@ range", for example
Index Cond (expr >= lower(range) AND expr < upper(range)).
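
As a hypothetical sketch (table and column names are invented), a query like

    SELECT * FROM events e
    WHERE e.created_at <@ tstzrange(date_trunc('day', now()), now());

cannot be simplified to a constant because the range expression is only STABLE,
but it could still be turned into something like
Index Cond: ((created_at >= lower(range)) AND (created_at < upper(range)))
with the bounds evaluated once per scan.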

Yours,
Laurenz Albe




Re: Disabling Heap-Only Tuples

2023-07-07 Thread Laurenz Albe
On Fri, 2023-07-07 at 16:27 +0530, Dilip Kumar wrote:
> On Fri, Jul 7, 2023 at 3:48 PM Tomas Vondra  
> wrote:
> > I'm imagining either a table option with a couple possible values
> > (default, non-hot, first-page, ...) or maybe something even more
> > elaborate (perhaps even a callback?).
> > 
> > Now, it's not my intention to hijack this thread, but this discussion
> > reminds me one of the ideas from my "BRIN improvements" talk, about
> > maybe using BRIN indexes for routing. UPDATEs may be a major issue for
> > BRIN, making them gradually worse over time. If we could "tell"
> > RelationGetBufferForTuple() which buffers are more suitable (by looking
> > at an index, histogram or some approximate mapping), that might help.
> 
> IMHO that seems like the right direction for this feature to be
> useful.

Right, I agree.  A GUC/storage parameter like "update_strategy"
that is an enum (try-hot | first-page | ...).
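
Purely as a sketch of hypothetical syntax (none of this exists yet; the parameter
name and values are just the ones floated above), that could look like:

    -- per-table storage parameter
    ALTER TABLE measurements SET (update_strategy = 'first-page');
    -- or a session-level GUC
    SET update_strategy = 'try-hot';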

To preserve BRIN indexes or CLUSTERed tables, there could be an additional
"insert_strategy", but that would somehow have to be tied to a certain
index.  I think that is out of scope for this effort.

Yours,
Laurenz Albe




Re: Disabling Heap-Only Tuples

2023-07-07 Thread Dilip Kumar
On Fri, Jul 7, 2023 at 3:48 PM Tomas Vondra
 wrote:
>

> On 7/7/23 11:55, Matthias van de Meent wrote:
> > On Fri, 7 Jul 2023 at 06:53, Dilip Kumar  wrote:
> >>
> >> On Fri, Jul 7, 2023 at 1:48 AM Matthias van de Meent
> >>  wrote:
> >>>
> >>> On Wed, 5 Jul 2023 at 19:55, Thom Brown  wrote:
> 
>  On Wed, 5 Jul 2023 at 18:05, Matthias van de Meent
>   wrote:
> > So what were you thinking of? A session GUC? A table option?
> 
>  Both.
> >>>
> >>> Here's a small patch implementing a new table option max_local_update
> >>> (name very much bikesheddable). Value is -1 (default, disabled) or the
> >>> size of the table in MiB that you still want to allow to update on the
> >>> same page. I didn't yet go for a GUC as I think that has too little
> >>> control on the impact on the system.
> >>
> >> So IIUC, this parameter we can control that instead of putting the new
> >> version of the tuple on the same page, it should choose using
> >> RelationGetBufferForTuple(), and that can reduce the fragmentation
> >> because now if there is space then most of the updated tuple will be
> >> inserted in same pages.  But this still can not truncate the pages
> >> from the heap right? because we can not guarantee that the new page
> >> selected by RelationGetBufferForTuple() is not from the end of the
> >> heap, and until we free the pages from the end of the heap, the vacuum
> >> can not truncate any page.  Is my understanding correct?
> >
> > Yes. If you don't have pages with (enough) free space for the updated
> > tuples in your table, or if the FSM doesn't accurately reflect the
> > actual state of free space in your table, this won't help (which is
> > also the reason why I run vacuum in the tests). It also won't help if
> > you don't update the tuples physically located at the end of your
> > table, but in the targeted workload this would introduce a bias where
> > new tuple versions are moved to the front of the table.
> >
> > Something to note is that this may result in very bad bloat when this
> > is combined with a low fillfactor: All blocks past max_local_update
> > will be unable to use space reserved by fillfactor because FSM lookups
> > always take fillfactor into account, and all updates (which ignore
> > fillfactor when local) would go through the FSM instead, thus reducing
> > the space available on each block to exactly the fillfactor. So, this
> > might need some extra code to make sure we don't accidentally blow up
> > the table's size with UPDATEs when max_local_update is combined with
> > low fillfactors. I'm not sure where that would fit best.
> >
>
> I know the thread started as "let's disable HOT" and this essentially
> just proposes to do that using a table option. But I wonder if that's
> far too simple to be reliable, because hoping RelationGetBufferForTuple
> happens to do the right thing does not seem great.
>
> I wonder if we should invent some definition of "strategy" that would
> tell RelationGetBufferForTuple what it should aim for ...
>
> I'm imagining either a table option with a couple possible values
> (default, non-hot, first-page, ...) or maybe something even more
> elaborate (perhaps even a callback?).
>
> Now, it's not my intention to hijack this thread, but this discussion
> reminds me one of the ideas from my "BRIN improvements" talk, about
> maybe using BRIN indexes for routing. UPDATEs may be a major issue for
> BRIN, making them gradually worse over time. If we could "tell"
> RelationGetBufferForTuple() which buffers are more suitable (by looking
> at an index, histogram or some approximate mapping), that might help.

IMHO that seems like the right direction for this feature to be
useful.  Otherwise just forcing it to select a page using
RelationGetBufferForTuple() without any input or direction to this
function can behave pretty randomly.  In fact, there should be some
way to say insert a new tuple in a smaller block number first
(provided they have free space) and with that, we might get an
opportunity to truncate some heap pages by vacuum.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Disabling Heap-Only Tuples

2023-07-07 Thread Tomas Vondra



On 7/7/23 11:55, Matthias van de Meent wrote:
> On Fri, 7 Jul 2023 at 06:53, Dilip Kumar  wrote:
>>
>> On Fri, Jul 7, 2023 at 1:48 AM Matthias van de Meent
>>  wrote:
>>>
>>> On Wed, 5 Jul 2023 at 19:55, Thom Brown  wrote:

 On Wed, 5 Jul 2023 at 18:05, Matthias van de Meent
  wrote:
> So what were you thinking of? A session GUC? A table option?

 Both.
>>>
>>> Here's a small patch implementing a new table option max_local_update
>>> (name very much bikesheddable). Value is -1 (default, disabled) or the
>>> size of the table in MiB that you still want to allow to update on the
>>> same page. I didn't yet go for a GUC as I think that has too little
>>> control on the impact on the system.
>>
>> So IIUC, this parameter we can control that instead of putting the new
>> version of the tuple on the same page, it should choose using
>> RelationGetBufferForTuple(), and that can reduce the fragmentation
>> because now if there is space then most of the updated tuple will be
>> inserted in same pages.  But this still can not truncate the pages
>> from the heap right? because we can not guarantee that the new page
>> selected by RelationGetBufferForTuple() is not from the end of the
>> heap, and until we free the pages from the end of the heap, the vacuum
>> can not truncate any page.  Is my understanding correct?
> 
> Yes. If you don't have pages with (enough) free space for the updated
> tuples in your table, or if the FSM doesn't accurately reflect the
> actual state of free space in your table, this won't help (which is
> also the reason why I run vacuum in the tests). It also won't help if
> you don't update the tuples physically located at the end of your
> table, but in the targeted workload this would introduce a bias where
> new tuple versions are moved to the front of the table.
> 
> Something to note is that this may result in very bad bloat when this
> is combined with a low fillfactor: All blocks past max_local_update
> will be unable to use space reserved by fillfactor because FSM lookups
> always take fillfactor into account, and all updates (which ignore
> fillfactor when local) would go through the FSM instead, thus reducing
> the space available on each block to exactly the fillfactor. So, this
> might need some extra code to make sure we don't accidentally blow up
> the table's size with UPDATEs when max_local_update is combined with
> low fillfactors. I'm not sure where that would fit best.
> 

I know the thread started as "let's disable HOT" and this essentially
just proposes to do that using a table option. But I wonder if that's
far too simple to be reliable, because hoping RelationGetBufferForTuple
happens to do the right thing does not seem great.

I wonder if we should invent some definition of "strategy" that would
tell RelationGetBufferForTuple what it should aim for ...

I'm imagining either a table option with a couple possible values
(default, non-hot, first-page, ...) or maybe something even more
elaborate (perhaps even a callback?).

Now, it's not my intention to hijack this thread, but this discussion
reminds me one of the ideas from my "BRIN improvements" talk, about
maybe using BRIN indexes for routing. UPDATEs may be a major issue for
BRIN, making them gradually worse over time. If we could "tell"
RelationGetBufferForTuple() which buffers are more suitable (by looking
at an index, histogram or some approximate mapping), that might help.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [feature]COPY FROM enable FORCE_NULL/FORCE_NOT_NULL on all columns

2023-07-07 Thread Damir Belyalov
Hello!

The patch does not work for the current version of postgres, it needs to be
updated.
I tested your patch. Everything looks simple and works well.

There is a suggestion to simplify the code: instead of using

if (cstate->opts.force_notnull_all)
{
    int i;

    for (i = 0; i < num_phys_attrs; i++)
        cstate->opts.force_notnull_flags[i] = true;
}

you can use MemSet():

if (cstate->opts.force_notnull_all)
    MemSet(cstate->opts.force_notnull_flags, true, num_phys_attrs * sizeof(bool));

The same for the force_null case.
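
As a usage sketch (the exact user-visible spelling depends on the patch; I am
assuming here that "*" is the shorthand for "all columns", and the table name and
file path are invented):

    COPY target_table FROM '/tmp/data.csv'
        WITH (FORMAT csv, FORCE_NOT_NULL *);
    COPY target_table FROM '/tmp/data.csv'
        WITH (FORMAT csv, FORCE_NULL *);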

Regards,
Damir Belyalov,
Postgres Professional


Re: Disabling Heap-Only Tuples

2023-07-07 Thread Matthias van de Meent
On Fri, 7 Jul 2023 at 06:53, Dilip Kumar  wrote:
>
> On Fri, Jul 7, 2023 at 1:48 AM Matthias van de Meent
>  wrote:
> >
> > On Wed, 5 Jul 2023 at 19:55, Thom Brown  wrote:
> > >
> > > On Wed, 5 Jul 2023 at 18:05, Matthias van de Meent
> > >  wrote:
> > > > So what were you thinking of? A session GUC? A table option?
> > >
> > > Both.
> >
> > Here's a small patch implementing a new table option max_local_update
> > (name very much bikesheddable). Value is -1 (default, disabled) or the
> > size of the table in MiB that you still want to allow to update on the
> > same page. I didn't yet go for a GUC as I think that has too little
> > control on the impact on the system.
>
> So IIUC, this parameter we can control that instead of putting the new
> version of the tuple on the same page, it should choose using
> RelationGetBufferForTuple(), and that can reduce the fragmentation
> because now if there is space then most of the updated tuple will be
> inserted in same pages.  But this still can not truncate the pages
> from the heap right? because we can not guarantee that the new page
> selected by RelationGetBufferForTuple() is not from the end of the
> heap, and until we free the pages from the end of the heap, the vacuum
> can not truncate any page.  Is my understanding correct?

Yes. If you don't have pages with (enough) free space for the updated
tuples in your table, or if the FSM doesn't accurately reflect the
actual state of free space in your table, this won't help (which is
also the reason why I run vacuum in the tests). It also won't help if
you don't update the tuples physically located at the end of your
table, but in the targeted workload this would introduce a bias where
new tuple versions are moved to the front of the table.

Something to note is that this may result in very bad bloat when this
is combined with a low fillfactor: All blocks past max_local_update
will be unable to use space reserved by fillfactor because FSM lookups
always take fillfactor into account, and all updates (which ignore
fillfactor when local) would go through the FSM instead, thus reducing
the space available on each block to exactly the fillfactor. So, this
might need some extra code to make sure we don't accidentally blow up
the table's size with UPDATEs when max_local_update is combined with
low fillfactors. I'm not sure where that would fit best.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech/)




Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-07-07 Thread Peter Smith
Hi. Here are some review comments for the patch v16-0001

==
Commit message.

1.
Also; most of the code shared by both worker types are already combined
in LogicalRepApplyLoop(). There is no need to combine the rest in
ApplyWorkerMain() anymore.

~

/are already/is already/

/Also;/Also,/

~~~

2.
This commit introduces TablesyncWorkerMain() as a new entry point for
tablesync workers and separates both type of workers from each other.
This aims to increase code readability and help to maintain logical
replication workers separately.

2a.
/This commit/This patch/

~

2b.
"and separates both type of workers from each other"

Maybe that part can all be removed. The following sentence says the
same again anyhow.

==
src/backend/replication/logical/worker.c

3.
 static void stream_write_change(char action, StringInfo s);
 static void stream_open_and_write_change(TransactionId xid, char
action, StringInfo s);
 static void stream_close_file(void);
+static void set_stream_options(WalRcvStreamOptions *options,
+char *slotname,
+XLogRecPtr *origin_startpos);

~

Maybe a blank line is needed here, because this static declaration should not be
grouped with the other functions that sit under the "Serialize and deserialize
changes for a toplevel transaction." comment.

~~~

4. set_stream_options

+ /* set_stream_options
+  * Set logical replication streaming options.
+  *
+  * This function sets streaming options including replication slot name and
+  * origin start position. Workers need these options for logical replication.
+  */
+static void
+set_stream_options(WalRcvStreamOptions *options,

The indentation is not right for this function comment.

~~~

5. set_stream_options

+ /*
+ * Even when the two_phase mode is requested by the user, it remains as
+ * the tri-state PENDING until all tablesyncs have reached READY state.
+ * Only then, can it become ENABLED.
+ *
+ * Note: If the subscription has no tables then leave the state as
+ * PENDING, which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to
+ * work.
+ */
+ if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
+ AllTablesyncsReady())
+ options->proto.logical.twophase = true;
+}

This part of the refactoring seems questionable...

IIUC this new function was extracted from code in originally in
function ApplyWorkerMain()

But in that original code, this fragment above was guarded by the condition
if (!am_tablesync_worker())

But now where is that condition? e.g. What is stopping tablesync
workers from getting into this code that they previously would not have
executed?

~~~

6.
  AbortOutOfAnyTransaction();
- pgstat_report_subscription_error(MySubscription->oid, !am_tablesync_worker());
+ pgstat_report_subscription_error(MySubscription->oid,
+ !am_tablesync_worker());

Does this change have anything to do with this patch? Is it a quirk of
running pg_indent?

~~~
7. run_tablesync_worker

Since the stated intent of the patch is the separation of apply and
tablesync workers then shouldn't this function belong in the
tablesync.c file?

~~~
8. run_tablesync_worker

+ * Runs the tablesync worker.
+ * It starts syncing tables. After a successful sync, sets streaming options
+ * and starts streaming to catchup.
+ */
+static void
+run_tablesync_worker(WalRcvStreamOptions *options,

Nicer to have a blank line after the first sentence of that function comment?

~~~
9. run_apply_worker

+/*
+ * Runs the leader apply worker.
+ * It sets up replication origin, streaming options and then starts streaming.
+ */
+static void
+run_apply_worker(WalRcvStreamOptions *options,

Nicer to have a blank line after the first sentence of that function comment?

~~~
10. InitializeLogRepWorker

+/*
+ * Common initialization for logical replication workers; leader apply worker,
+ * parallel apply worker and tablesync worker.
  *
  * Initialize the database connection, in-memory subscription and necessary
  * config options.
  */
 void
-InitializeApplyWorker(void)
+InitializeLogRepWorker(void)

typo:

/workers;/workers:/

~~~
11. TablesyncWorkerMain

Since the stated intent of the patch is the separation of apply and
tablesync workers then shouldn't this function belong in the
tablesync.c file?

==
src/include/replication/worker_internal.h

12.
 #define isParallelApplyWorker(worker) ((worker)->leader_pid != InvalidPid)

+extern void finish_sync_worker(void);

~

I think the macro isParallelApplyWorker is associated with the am_XXX
inline functions that follow it, so it doesn’t seem the best place to
jam this extern in the middle of that.

--
Kind Regards,
Peter Smith.
Fujitsu Australia




RE: Initial Schema Sync for Logical Replication

2023-07-07 Thread Kumar, Sachin
> From: Masahiko Sawada 
> So I've implemented a different approach; doing schema synchronization at a
> CREATE SUBSCRIPTION time. The backend executing CREATE SUBSCRIPTION
> uses pg_dump and restores the table schemas including both partitioned tables
> and their partitions regardless of publish_via_partition_root option, and then
> creates pg_subscription_rel entries for tables while respecting
> publish_via_partition_root option.
> 
> There is a window between table creations and the tablesync workers starting 
> to
> process the tables. If DDLs are executed in this window, the tablesync worker
> might fail because the table schema might have already been changed. We need
> to mention this note in the documentation. BTW, I think we will be able to get
> rid of this downside if we support DDL replication. DDLs executed in the 
> window
> are applied by the apply worker and it takes over the data copy to the 
> tablesync
> worker at a certain LSN.

I don't think we will be able to get rid of this window even with DDL replication.
There are some issues:
1. Even with the tablesync worker taking over at a certain LSN, the publisher can
make more changes until the tablesync worker acquires a lock on the publisher table
via the table copy.
2. How do we make sure that the apply worker has caught up with all the changes
from the publisher before it starts the tablesync worker? It can lag behind the
publisher.

I think the easiest option would be to just recreate the table. This way we don't
have to worry about complex race conditions. Tablesync already creates a slot for
the data copy; we can use the same slot for getting an up-to-date table definition,
and dropping the table won't be very expensive since there won't be any data in it.
The apply worker will skip all the DDLs/DMLs until the table is synced.

Although for partitioned tables we will be able to keep up with published table
schema changes only when publish_via_partition_root is true.

Regards
Sachin
Amazon Web Services: https://aws.amazon.com


Re: pg_column_toast_chunk_id: a function to get a chunk ID of a TOASTed value

2023-07-07 Thread Yugo NAGATA
On Fri, 7 Jul 2023 17:21:36 +0900
Yugo NAGATA  wrote:

> Hi Nikita,
> 
> On Wed, 5 Jul 2023 17:49:20 +0300
> Nikita Malakhov  wrote:
> 
> > Hi!
> > 
> > I like the idea of having a standard function which shows a TOAST value ID
> > for a row. I've used my own to handle TOAST errors. Just, maybe, more
> > correct
> > name would be "...value_id", because you actually retrieve valueid field
> > from the TOAST pointer, and chunk ID consists of valueid + chunk_seq.
> 
> Thank you for your review!
> 
> Although, the retrieved field is "va_valueid" and it is called "value ID" in 
> the
> code, I chose the name "..._chunk_id" because I found the description in the
> documentation as followings:
> 
> -
> Every TOAST table has the columns chunk_id (an OID identifying the particular 
> TOASTed value), chunk_seq (a sequence number for the chunk within its value), 
> and chunk_data (the actual data of the chunk). A unique index on chunk_id and 
> chunk_seq provides fast retrieval of the values. A pointer datum representing 
> an out-of-line on-disk TOASTed value therefore needs to store the OID of the 
> TOAST table in which to look and the OID of the specific value (its chunk_id)
> -
> https://www.postgresql.org/docs/devel/storage-toast.html
> 
> Here, chunk_id defined separately from chunk_seq. Therefore, I wonder  
> pg_column_toast_chunk_id would be ok. However, I don't insist on this
> and I would be happy to change it if the other name is more natural for users.

I attached v2 patch that contains the documentation fix.
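
As a quick usage sketch (table, column, and OID values below are made up), the
function can be combined with the table's TOAST relation to inspect the chunks
behind a value:

    SELECT id, pg_column_toast_chunk_id(payload) AS chunk_id
    FROM my_table;

    -- then look at the chunks of one value in the table's TOAST relation
    SELECT chunk_seq, length(chunk_data)
    FROM pg_toast.pg_toast_12345      -- hypothetical TOAST table name
    WHERE chunk_id = 98765;           -- value returned by the function above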

Regards,
Yugo Nagata

> 
> Regards,
> Yugo Nagata
> 
> > 
> > -- 
> > Regards,
> > Nikita Malakhov
> > Postgres Professional
> > The Russian Postgres Company
> > https://postgrespro.ru/
> 
> 
> -- 
> Yugo NAGATA 
> 
> 


-- 
Yugo NAGATA 
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 5a47ce4343..c2c3156cd4 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -27608,6 +27608,20 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset

   
 
+  
+   
+
+ pg_column_toast_chunk_id
+
+pg_column_toast_chunk_id ( "any" )
+oid
+   
+   
+Returns the chunk id of the TOASTed value, or NULL
+if the value is not TOASTed.
+   
+  
+
   

 
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 884bfbc8ce..fe8788c1b1 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -5069,6 +5069,47 @@ pg_column_compression(PG_FUNCTION_ARGS)
 	PG_RETURN_TEXT_P(cstring_to_text(result));
 }
 
+/*
+ * Return the chunk id of the TOASTed value.
+ * Return NULL for unTOASTed value.
+ */
+Datum
+pg_column_toast_chunk_id(PG_FUNCTION_ARGS)
+{
+	int			typlen;
+	struct varlena *attr;
+	struct varatt_external toast_pointer;
+
+	/* On first call, get the input type's typlen, and save at *fn_extra */
+	if (fcinfo->flinfo->fn_extra == NULL)
+	{
+		/* Lookup the datatype of the supplied argument */
+		Oid			argtypeid = get_fn_expr_argtype(fcinfo->flinfo, 0);
+
+		typlen = get_typlen(argtypeid);
+		if (typlen == 0)		/* should not happen */
+			elog(ERROR, "cache lookup failed for type %u", argtypeid);
+
+		fcinfo->flinfo->fn_extra = MemoryContextAlloc(fcinfo->flinfo->fn_mcxt,
+	  sizeof(int));
+		*((int *) fcinfo->flinfo->fn_extra) = typlen;
+	}
+	else
+		typlen = *((int *) fcinfo->flinfo->fn_extra);
+
+	if (typlen != -1)
+		PG_RETURN_NULL();
+
+	attr = (struct varlena *) DatumGetPointer(PG_GETARG_DATUM(0));
+
+	if (!VARATT_IS_EXTERNAL_ONDISK(attr))
+		PG_RETURN_NULL();
+
+	VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
+
+	PG_RETURN_OID(toast_pointer.va_valueid);
+}
+
 /*
  * string_agg - Concatenates values and returns string.
  *
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 6996073989..0cacd0391d 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -7403,6 +7403,9 @@
 { oid => '2121', descr => 'compression method for the compressed datum',
   proname => 'pg_column_compression', provolatile => 's', prorettype => 'text',
   proargtypes => 'any', prosrc => 'pg_column_compression' },
+{ oid => '8393', descr => 'chunk ID of TOASTed value',
+  proname => 'pg_column_toast_chunk_id', provolatile => 's', prorettype => 'oid',
+  proargtypes => 'any', prosrc => 'pg_column_toast_chunk_id' },
 { oid => '2322',
   descr => 'total disk space usage for the specified tablespace',
   proname => 'pg_tablespace_size', provolatile => 'v', prorettype => 'int8',


Re: Add hint message for check_log_destination()

2023-07-07 Thread Masahiko Sawada
On Fri, Jul 7, 2023 at 4:53 PM Japin Li  wrote:
>
>
> On Fri, 07 Jul 2023 at 14:46, jian he  wrote:
> > On Fri, Jul 7, 2023 at 1:06 PM Japin Li  wrote:
> >>
> >>
> >> Hi, hackers
> >>
> >> When I try to change log_destination using ALTER SYSTEM with the wrong 
> >> value,
> >> it complains of the "Unrecognized key word" without available values.  This
> >> patch tries to add a hint message that provides available values for
> >> log_destination.  Any thoughts?

+1

+   appendStringInfo(&errhint, "\"stderr\"");
+#ifdef HAVE_SYSLOG
+   appendStringInfo(&errhint, ", \"syslog\"");
+#endif
+#ifdef WIN32
+   appendStringInfo(&errhint, ", \"eventlog\"");
+#endif
+   appendStringInfo(&errhint, ", \"csvlog\", and \"jsonlog\"");

I think using appendStringInfoString() is a bit more natural and faster.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: pg_column_toast_chunk_id: a function to get a chunk ID of a TOASTed value

2023-07-07 Thread Yugo NAGATA
Hi Nikita,

On Wed, 5 Jul 2023 17:49:20 +0300
Nikita Malakhov  wrote:

> Hi!
> 
> I like the idea of having a standard function which shows a TOAST value ID
> for a row. I've used my own to handle TOAST errors. Just, maybe, more
> correct
> name would be "...value_id", because you actually retrieve valueid field
> from the TOAST pointer, and chunk ID consists of valueid + chunk_seq.

Thank you for your review!

Although, the retrieved field is "va_valueid" and it is called "value ID" in the
code, I chose the name "..._chunk_id" because I found the description in the
documentation as followings:

-
Every TOAST table has the columns chunk_id (an OID identifying the particular 
TOASTed value), chunk_seq (a sequence number for the chunk within its value), 
and chunk_data (the actual data of the chunk). A unique index on chunk_id and 
chunk_seq provides fast retrieval of the values. A pointer datum representing 
an out-of-line on-disk TOASTed value therefore needs to store the OID of the 
TOAST table in which to look and the OID of the specific value (its chunk_id)
-
https://www.postgresql.org/docs/devel/storage-toast.html

Here, chunk_id defined separately from chunk_seq. Therefore, I wonder  
pg_column_toast_chunk_id would be ok. However, I don't insist on this
and I would be happy to change it if the other name is more natural for users.

Regards,
Yugo Nagata

> 
> -- 
> Regards,
> Nikita Malakhov
> Postgres Professional
> The Russian Postgres Company
> https://postgrespro.ru/


-- 
Yugo NAGATA 




Re: POC, WIP: OR-clause support for indexes

2023-07-07 Thread Alena Rybakina
Hi! Thank you for your detailed review, your changes have greatly helped 
to improve this patch.


On 06.07.2023 13:20, Andrey Lepikhov wrote:

On 6/7/2023 03:06, Alena Rybakina wrote:

I corrected this constant in the patch.

The patch don't apply cleanly: it contains some trailing spaces.

I fixed it.


Also, quick glance into the code shows some weak points;
1. transformBoolExprOr should have input type BoolExpr.

Agreed.
2. You can avoid the switch operator at the beginning of the function, 
because you only need one option.

Agreed.

3. Stale comments: RestrictIinfos definitely not exists at this point.
Yes, unfortunately, I missed this from the previous version when I tried 
to perform such a transformation at the index creation stage.
4. I don't know, you really need to copy the expr or not, but it is 
better to do as late, as possible.

Yes, I agree with you, copying "expr" is not necessary in this patch

5. You assume, that leftop is non-constant and rightop - constant. Why?

Agreed, It was too presumptuous on my part and I agree with your changes.
6.I doubt about equivalence operator. Someone can invent a custom '=' 
operator with another semantics, than usual. May be better to check 
mergejoinability?
Yes, I agree with you, and I hadn't thought about it before. I haven't found any
existing functions in PostgreSQL to arrange this, but using mergejoinability turns
out to be more elegant here.
7. I don't know how to confidently identify constant expressions at 
this level. So, I guess, You can only merge here expressions like 
"F(X)=Const", not an 'F(X)=ConstExpression'.
I see, you can find solution for this case, thank you for this, and I 
think it's reliable enough.


On 07.07.2023 05:43, Andrey Lepikhov wrote:

On 6/7/2023 03:06, Alena Rybakina wrote:
The test was performed on the same benchmark database generated by 2 
billion values.


I corrected this constant in the patch.
In attempt to resolve some issues had mentioned in my previous letter 
I used op_mergejoinable to detect mergejoinability of a clause.
Constant side of the expression is detected by call of 
eval_const_expressions() and check each side on the Const type of node.


See 'diff to diff' in attachment.



I noticed you removed the condition that checks for the equality operation:

strcmp(strVal(linitial((arg)->name)), "=") == 0

At first that did not look correct to me, but a simple example convinced me
otherwise:


postgres=# explain analyze select x from a where x=1 or x>5 or x<3 or x=2;
   QUERY PLAN

 Seq Scan on a  (cost=0.00..2291.00 rows=97899 width=4) (actual 
time=0.038..104.168 rows=99000 loops=1)
   Filter: ((x > '5'::numeric) OR (x < '3'::numeric) OR (x = ANY 
('{1,2}'::numeric[])))

   Rows Removed by Filter: 1000
 Planning Time: 9.938 ms
 Execution Time: 113.457 ms
(5 rows)

It surprises me that such a check can be written in a similarly simple way:

eval_const_expressions(NULL, orqual).


Yes, I see we can remove this code:

bare_orarg = transformExprRecurse(pstate, (Node *)arg);
bare_orarg = coerce_to_boolean(pstate, bare_orarg, "OR");

because we already perform a similar manipulation here:

foreach(l, gentry->consts)
{
      Node       *rexpr = (Node *) lfirst(l);

      rexpr = coerce_to_common_type(pstate, rexpr,
                                            scalar_type,
                                            "IN");
     aexprs = lappend(aexprs, rexpr);
}

--
Regards,
Alena Rybakina
Postgres Professional
From 9134099ef9808ca27494bc131558d04b24d9bdeb Mon Sep 17 00:00:00 2001
From: Alena Rybakina 
Date: Thu, 29 Jun 2023 17:46:58 +0300
Subject: [PATCH] Replace OR clause to ANY expressions. Replace (X=N1) OR
 (X=N2) ... with X = ANY(N1, N2) on the stage of the optimiser when we are
 still working with a tree expression. Firstly, we do not try to make a
 transformation for "non-or" expressions or inequalities and the creation of a
 relation with "or" expressions occurs according to the same scenario.
 Secondly, we do not make transformations if there are less than 500 or
 expressions. Thirdly, it is worth considering that we consider "or"
 expressions only at the current level.

---
 src/backend/parser/parse_expr.c  | 269 ++-
 src/tools/pgindent/typedefs.list |   1 +
 2 files changed, 269 insertions(+), 1 deletion(-)

diff --git a/src/backend/parser/parse_expr.c b/src/backend/parser/parse_expr.c
index 346fd272b6d..961ca3e482c 100644
--- a/src/backend/parser/parse_expr.c
+++ b/src/backend/parser/parse_expr.c
@@ -95,6 +95,273 @@ static Expr *make_distinct_op(ParseState *pstate, List *opname,
 static Node *make_nulltest_from_distinct(ParseState *pstate,
 		 A_Expr *distincta, Node *arg);
 
+typedef struct OrClauseGroupEntry
+{
+	Node		   *node;
+	List		   *consts;
+	Oid				scalar_type;
+	Oid				opno;
+	Expr 		   *expr;
+} OrClauseGroupEntry;
+
+static int 

Re: Implement missing join selectivity estimation for range types

2023-07-07 Thread Damir Belyalov
Hello!

Thank you for the patch, very interesting article.
The patch doesn't apply to the current postgres version. Could you please
update it?

Regards,
Damir Belyalov,
Postgres Professional


Re: doc: improve the restriction description of using indexes on REPLICA IDENTITY FULL table.

2023-07-07 Thread Masahiko Sawada
On Fri, Jul 7, 2023 at 10:55 AM Peter Smith  wrote:
>
> Hi, Here are my review comments for patch v2.
>
> ==
> 1. doc/src/sgml/logical-replication.sgml
>
> Candidate indexes must be btree, non-partial, and have at least one
> column reference to the published table column at the leftmost column
> (i.e. cannot consist of only expressions).
>
> ~
>
> There is only one column which can be the leftmost one, so the wording
> "At least one ... at the leftmost"  seemed a bit strange to me.
> Personally, I would phrase it something like below:
>
> SUGGESTION #1
> Candidate indexes must be btree, non-partial, and the leftmost index
> column must reference a published table column (i.e. the index cannot
> consist of only expressions).

I prefer the first suggestion. I've attached the updated patch.
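
For illustration (a sketch with made-up table and column names), under that wording:

    -- usable: btree, non-partial, leftmost column is a plain published column
    CREATE INDEX ON tab (col1, lower(col2));
    -- not usable: the index consists only of expressions
    CREATE INDEX ON tab (lower(col1));
    -- not usable: partial index
    CREATE INDEX ON tab (col1) WHERE col1 > 100;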

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v3-0001-Doc-clarify-the-conditions-of-usable-indexes-for-.patch
Description: Binary data


RE: [Patch] Use *other* indexes on the subscriber when REPLICA IDENTITY is FULL

2023-07-07 Thread Hayato Kuroda (Fujitsu)
Dear Önder,

Thank you for your analysis!

>
Yes, I agree, it is (and was before my patch as well) un-documented limitation 
of REPLICA IDENTITY FULL.
And, as far as I can see, my patch actually didn't have any impact on the 
limitation. The unsupported
cases are still unsupported, but now the same error is thrown in a slightly 
different place.
I think that is a minor limitation, but maybe should be listed [1]?
>

Yes, your modification did not touch the restriction. It has existed before the
commit. I (or my colleague) will post the patch to add the description, maybe
after [1] is committed.

>
For this one, I did some research in the code, but  I'm not very
comfortable with the answer. Still, I wanted to share my observations so
that it might be useful for the discussion.

First, I checked if the function get_op_btree_interpretation() could be
used here. But, I think that is again btree-only and I couldn't find
anything generic that does something similar.
>

Thanks for checking. The function seems to return the list of operator families and
their strategy numbers when the OID of the operator is given, but what we want to do
here is get the operator OID. I think the input and output of that function are the
opposite of what we need. And, as you mentioned, the index must be btree.

>
Then, I tried to come up with a SQL query, actually based on the link [2]
you shared. I think we should always have an "equality" strategy (e.g.,
"same", "overlaps", "contains" etc sounds wrong to me).
>

I can agree that "overlaps" and "contains" are not "equal", but I am not sure about
"same". That is something we will have to discuss, but not now.

>
And, it seems btree, hash and brin supports "equal". So, a query like the
following seems to provide the list of (index type, strategy_number,
data_type) that we might be allowed to use.
>

Note that the strategy numbers listed in the docs are just examples - access methods
other than BTree and Hash do not have a fixed set of strategies at all.
E.g., the operator classes for BTree, Hash and BRIN (Minmax) have "equal" and its
strategy number is documented, but other user-defined operator classes for BRIN may
use a different number, or may not have an equality comparison at all.

>
  SELECT
am.amname AS index_type,
 amop.amoplefttype::regtype,amop.amoprighttype::regtype,
op.oprname AS operator,
amop.amopstrategy AS strategy_number
FROM
pg_amop amop
JOIN
pg_am am ON am.oid = amop.amopmethod
JOIN
pg_operator op ON op.oid = amop.amopopr
WHERE
(am.amname = 'btree' and amop.amopstrategy = 3) OR
(am.amname = 'hash' and amop.amopstrategy = 1) OR
(am.amname = 'brin' and amop.amopstrategy = 3)
ORDER BY
index_type,
strategy_number;

What do you think?
>

Good SQL. You have listed the equality operator and related strategy number for 
given
operator classes.

While analyzing further, however, I found that it might be difficult to support GIN,
BRIN, and bloom indexes in the first version. These indexes do not implement the
"amgettuple" function, which is called in
RelationFindReplTupleByIndex()->index_getnext_slot()->index_getnext_tid().
For example, in the brinhandler():

```
/*
 * BRIN handler function: return IndexAmRoutine with access method parameters
 * and callbacks.
 */
Datum
brinhandler(PG_FUNCTION_ARGS)
{
...
amroutine->amgettuple = NULL;
amroutine->amgetbitmap = bringetbitmap;
...
```

According to the documentation [2], every index access method must implement either
the amgettuple or the amgetbitmap API. "amgettuple" is used when the table is
scanned and tuples are fetched one by one, which is what
RelationFindReplTupleByIndex() does. "amgetbitmap" is used when all tuples are
fetched at once, and RelationFindReplTupleByIndex() does not support such indexes.
To support them, the implemented API would have to be checked and the function
changed accordingly. That may be difficult to do in the first step, so I would
prefer not to support them for now. Fortunately, Hash, GiST, and SP-GiST implement
amgettuple, so we can focus on them.
In the next patch I will add a mechanism for rejecting such indexes.

Anyway, thank you for keeping your interest in the patch; it is a difficult topic
nevertheless.

[1]: 
https://www.postgresql.org/message-id/CAHut%2BPsFdTZJ7DG1jyu7BpA_1d4hwEd-Q%2BmQAPWcj1ZLD_X5Dw%40mail.gmail.com
[2]: https://www.postgresql.org/docs/current/index-functions.html

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Add hint message for check_log_destination()

2023-07-07 Thread Japin Li

On Fri, 07 Jul 2023 at 14:46, jian he  wrote:
> On Fri, Jul 7, 2023 at 1:06 PM Japin Li  wrote:
>>
>>
>> Hi, hackers
>>
>> When I try to change log_destination using ALTER SYSTEM with the wrong value,
>> it complains of the "Unrecognized key word" without available values.  This
>> patch tries to add a hint message that provides available values for
>> log_destination.  Any thoughts?
>>
>
> select  * from pg_settings where name ~* 'log.*destin*' \gx
>
> short_desc  | Sets the destination for server log output.
> extra_desc  | Valid values are combinations of "stderr", "syslog",
> "csvlog", "jsonlog", and "eventlog", depending on the platform.
>
> you can just reuse extra_desc in the pg_settings (view) column ?

Thanks for your review!

Yeah, the description of extra_desc is more accurate. Updated.

-- 
Regards,
Japin Li.

>From 715f9811717c9d27f6b2f41db639b18e6ba58625 Mon Sep 17 00:00:00 2001
From: Japin Li 
Date: Fri, 7 Jul 2023 15:48:53 +0800
Subject: [PATCH v2 1/1] Add hint message for check_log_destination

---
 src/backend/utils/error/elog.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index 5898100acb..a32f9613be 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -2228,8 +2228,22 @@ check_log_destination(char **newval, void **extra, GucSource source)
 #endif
 		else
 		{
+			StringInfoData errhint;
+
+			initStringInfo(&errhint);
+			appendStringInfo(&errhint, "\"stderr\"");
+#ifdef HAVE_SYSLOG
+			appendStringInfo(&errhint, ", \"syslog\"");
+#endif
+#ifdef WIN32
+			appendStringInfo(&errhint, ", \"eventlog\"");
+#endif
+			appendStringInfo(&errhint, ", \"csvlog\", and \"jsonlog\"");
+
 			GUC_check_errdetail("Unrecognized key word: \"%s\".", tok);
+			GUC_check_errhint("Valid values are combinations of %s.", errhint.data);
 			pfree(rawstring);
+			pfree(errhint.data);
 			list_free(elemlist);
 			return false;
 		}
-- 
2.25.1



Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-07-07 Thread Masahiko Sawada
On Wed, Jul 5, 2023 at 8:21 PM John Naylor  wrote:
>
> On Tue, Jul 4, 2023 at 12:49 PM Masahiko Sawada  wrote:
> >
> > As you mentioned, the 1-byte value is embedded into 8 byte so 7 bytes
> > are unused, but we use less memory since we use less slab contexts and
> > save fragmentations.
>
> Thanks for testing. This tree is sparse enough that most of the space is 
> taken up by small inner nodes, and not by leaves. So, it's encouraging to see 
> a small space savings even here.
>
> > I've also tested some large value cases (e.g. the value is 80-bytes)
> > and got a similar result.
>
> Interesting. With a separate allocation per value the overhead would be 8 
> bytes, or 10% here. It's plausible that savings elsewhere can hide that, 
> globally.
>
> > Regarding the codes, there are many todo and fixme comments so it
> > seems to me that your recent work is still in-progress. What is the
> > current status? Can I start reviewing the code or should I wait for a
> > while until your recent work completes?
>
> Well, it's going to be a bit of a mess until I can demonstrate it working 
> (and working well) with bitmap heap scan. Fixing that now is just going to 
> create conflicts. I do have a couple small older patches laying around that 
> were quick experiments -- I think at least some of them should give a 
> performance boost in loading speed, but haven't had time to test. Would you 
> like to take a look?

Yes, I can experiment with these patches in the meantime.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Initial Schema Sync for Logical Replication

2023-07-07 Thread Masahiko Sawada
On Wed, Jul 5, 2023 at 11:14 AM Masahiko Sawada  wrote:
>
> On Mon, Jun 19, 2023 at 5:29 PM Peter Smith  wrote:
> >
> > Hi,
> >
> > Below are my review comments for the PoC patch 0001.
> >
> > In addition,  the patch needed rebasing, and, after I rebased it
> > locally in my private environment there were still test failures:
> > a) The 'make check' tests fail but only in a minor way due to changes 
> > colname
> > b) the subscription TAP test did not work at all for me -- many errors.
>
> Thank you for reviewing the patch.
>
> While updating the patch, I realized that the current approach won't
> work well or at least has the problem with partition tables. If a
> publication has a partitioned table with publish_via_root = false, the
> subscriber launches tablesync workers for its partitions so that each
> tablesync worker copies data of each partition. Similarly, if it has a
> partition table with publish_via_root = true, the subscriber launches
> a tablesync worker for the parent table. With the current design,
> since the tablesync worker is responsible for both schema and data
> synchronization for the target table, it won't be possible to
> synchronize both the parent table's schema and partitions' schema. For
> example, there is no pg_subscription_rel entry for the parent table if
> the publication has publish_via_root = false. In addition to that, we
> need to be careful about the order of synchronization of the parent
> table and its partitions. We cannot start schema synchronization for
> partitions before its parent table. So it seems to me that we need to
> consider another approach.

So I've implemented a different approach; doing schema synchronization
at a CREATE SUBSCRIPTION time. The backend executing CREATE
SUBSCRIPTION uses pg_dump and restores the table schemas including
both partitioned tables and their partitions regardless of
publish_via_partition_root option, and then creates
pg_subscription_rel entries for tables while respecting
publish_via_partition_root option.

There is a window between table creations and the tablesync workers
starting to process the tables. If DDLs are executed in this window,
the tablesync worker might fail because the table schema might have
already been changed. We need to mention this note in the
documentation. BTW, I think we will be able to get rid of this
downside if we support DDL replication. DDLs executed in the window
are applied by the apply worker and it takes over the data copy to the
tablesync worker at a certain LSN.

I've attached PoC patches. It has regression tests but doesn't have
the documentation yet.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v2-0001-PoC-Add-no-table-dependents-option-to-pg_dump.patch
Description: Binary data


v2-0002-PoC-intitial-table-schema-synchronization-in-logi.patch
Description: Binary data


Re: Add hint message for check_log_destination()

2023-07-07 Thread jian he
On Fri, Jul 7, 2023 at 1:06 PM Japin Li  wrote:
>
>
> Hi, hackers
>
> When I try to change log_destination using ALTER SYSTEM with the wrong value,
> it complains of the "Unrecognized key word" without available values.  This
> patch tries to add a hint message that provides available values for
> log_destination.  Any thoughts?
>
> --
> Regards,
> Japin Li.
>

select  * from pg_settings where name ~* 'log.*destin*' \gx

short_desc  | Sets the destination for server log output.
extra_desc  | Valid values are combinations of "stderr", "syslog",
"csvlog", "jsonlog", and "eventlog", depending on the platform.

you can just reuse extra_desc in the pg_settings (view) column ?