Re: Allow single table VACUUM in transaction block

2022-11-06 Thread Simon Riggs
On Sun, 6 Nov 2022 at 18:50, Peter Geoghegan  wrote:
>
> On Thu, Oct 27, 2022 at 2:31 AM Simon Riggs
>  wrote:
> > Fix, so that this works without issue:
> >
> > BEGIN;
> > 
> > VACUUM (ANALYZE) vactst;
> > 
> > COMMIT;
> >
> > Allows both ANALYZE and vacuum of toast tables, but not VACUUM FULL.
> >
> > When in a xact block, we do not set PROC_IN_VACUUM,
> > nor update datfrozenxid.
>
> It doesn't seem like a good idea to add various new special cases to
> VACUUM just to make scripts like this work.

Usability is a major concern that doesn't get a high enough priority.

> I'm pretty sure that there
> are several deep, subtle reasons why VACUUM cannot be assumed safe to
> run in a user transaction.

I expected there were, so it's good to discuss them. Thanks for the input.

> If we absolutely have to do this, then the least worst approach might
> be to make VACUUM into a no-op rather than throwing an ERROR -- demote
> the ERROR into a WARNING. You could argue that we're just arbitrarily
> deciding to not do a VACUUM just to be able to avoid throwing an error
> if we do that. But isn't that already true with the patch that we
> have? Is it really a "true VACUUM" if the operation can never advance
> datfrozenxid? At least selectively demoting the ERROR to a WARNING is
> "transparent" about it.

I'll answer that part in my reply to Tom, since there are good ideas in both.

-- 
Simon Riggs
http://www.EnterpriseDB.com/




Re: Tracking last scan time

2022-11-06 Thread Michael Paquier
On Thu, Nov 03, 2022 at 04:44:16PM -0400, Dave Page wrote:
> Here's a patch to fix this issue. Many thanks to Peter Eisentraut who
> figured it out in a few minutes after I spent far too long looking down
> rabbit holes in entirely the wrong place.

FWIW, all the other areas of pgstatfuncs.c manipulate timestamptz
fields with a style like the attached.  That's a nit, but it keeps
with the rule of consistency with the surroundings.

Anyway, it seems to me that a regression test is in order, checking
the state just after the relation creation and before any scan
happens, as the same problem shows up with last_idx_scan.
--
Michael
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 96bffc0f2a..ae3365d917 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -56,12 +56,18 @@ Datum
 pg_stat_get_lastscan(PG_FUNCTION_ARGS)
 {
Oid relid = PG_GETARG_OID(0);
+   TimestampTz result;
PgStat_StatTabEntry *tabentry;
 
if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+   result = 0;
+   else
+   result = tabentry->lastscan;
+
+   if (result == 0)
PG_RETURN_NULL();
else
-   PG_RETURN_TIMESTAMPTZ(tabentry->lastscan);
+   PG_RETURN_TIMESTAMPTZ(result);
 }
 
 
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 257a6a9da9..1d84407a03 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -573,6 +573,12 @@ SELECT pg_stat_force_next_flush();
  
 (1 row)
 
+SELECT last_seq_scan, last_idx_scan FROM pg_stat_all_tables WHERE relid = 'test_last_scan'::regclass;
+ last_seq_scan | last_idx_scan 
+---------------+---------------
+               | 
+(1 row)
+
 COMMIT;
 SELECT pg_stat_reset_single_table_counters('test_last_scan'::regclass);
  pg_stat_reset_single_table_counters 
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index f6270f7bad..b4d6753c71 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -303,6 +303,7 @@ BEGIN;
 CREATE TEMPORARY TABLE test_last_scan(idx_col int primary key, noidx_col int);
 INSERT INTO test_last_scan(idx_col, noidx_col) VALUES(1, 1);
 SELECT pg_stat_force_next_flush();
+SELECT last_seq_scan, last_idx_scan FROM pg_stat_all_tables WHERE relid = 'test_last_scan'::regclass;
 COMMIT;
 
 SELECT pg_stat_reset_single_table_counters('test_last_scan'::regclass);


signature.asc
Description: PGP signature


Re: Add semi-join pushdown to postgres_fdw

2022-11-06 Thread Alexander Pyhalov

Ian Lawrence Barwick wrote on 2022-11-04 02:21:


This entry was marked as "Needs review" in the CommitFest app but cfbot
reports the patch no longer applies.

We've marked it as "Waiting on Author". As CommitFest 2022-11 is
currently underway, this would be an excellent time to update the patch.

Once you think the patchset is ready for review again, you (or any
interested party) can move the patch entry forward by visiting

https://commitfest.postgresql.org/40/3838/

and changing the status to "Needs review".



Hi. I've rebased the patch.
--
Best regards,
Alexander Pyhalov,
Postgres Professional

From 1ac2a9e3611f716da688c04a4ec36888f62078ce Mon Sep 17 00:00:00 2001
From: Alexander Pyhalov 
Date: Mon, 7 Nov 2022 10:23:32 +0300
Subject: [PATCH] postgres_fdw: add support for deparsing semi joins

We deparse semi-joins as EXISTS subqueries. So, deparsing a
semi-join leads to generating an addl_conds condition, which is
then added to the uppermost JOIN's WHERE clause.
---
 contrib/postgres_fdw/deparse.c| 198 +---
 .../postgres_fdw/expected/postgres_fdw.out| 297 --
 contrib/postgres_fdw/postgres_fdw.c   |  78 -
 contrib/postgres_fdw/postgres_fdw.h   |   3 +-
 contrib/postgres_fdw/sql/postgres_fdw.sql | 119 ++-
 5 files changed, 613 insertions(+), 82 deletions(-)

diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c
index 95247656504..45885442418 100644
--- a/contrib/postgres_fdw/deparse.c
+++ b/contrib/postgres_fdw/deparse.c
@@ -179,12 +179,13 @@ static void appendLimitClause(deparse_expr_cxt *context);
 static void appendConditions(List *exprs, deparse_expr_cxt *context);
 static void deparseFromExprForRel(StringInfo buf, PlannerInfo *root,
   RelOptInfo *foreignrel, bool use_alias,
-  Index ignore_rel, List **ignore_conds,
+  Index ignore_rel, List **ignore_conds, StringInfo addl_conds,
   List **params_list);
+static void appendWhereClause(List *exprs, StringInfo addl_conds, deparse_expr_cxt *context);
 static void deparseFromExpr(List *quals, deparse_expr_cxt *context);
 static void deparseRangeTblRef(StringInfo buf, PlannerInfo *root,
 			   RelOptInfo *foreignrel, bool make_subquery,
-			   Index ignore_rel, List **ignore_conds, List **params_list);
+			   Index ignore_rel, List **ignore_conds, StringInfo addl_conds, List **params_list);
 static void deparseAggref(Aggref *node, deparse_expr_cxt *context);
 static void appendGroupByClause(List *tlist, deparse_expr_cxt *context);
 static void appendOrderBySuffix(Oid sortop, Oid sortcoltype, bool nulls_first,
@@ -1370,23 +1371,20 @@ deparseFromExpr(List *quals, deparse_expr_cxt *context)
 {
 	StringInfo	buf = context->buf;
 	RelOptInfo *scanrel = context->scanrel;
+	StringInfoData addl_conds;
 
 	/* For upper relations, scanrel must be either a joinrel or a baserel */
 	Assert(!IS_UPPER_REL(context->foreignrel) ||
 		   IS_JOIN_REL(scanrel) || IS_SIMPLE_REL(scanrel));
 
+	initStringInfo(&addl_conds);
 	/* Construct FROM clause */
 	appendStringInfoString(buf, " FROM ");
 	deparseFromExprForRel(buf, context->root, scanrel,
 		  (bms_membership(scanrel->relids) == BMS_MULTIPLE),
-		  (Index) 0, NULL, context->params_list);
-
-	/* Construct WHERE clause */
-	if (quals != NIL)
-	{
-		appendStringInfoString(buf, " WHERE ");
-		appendConditions(quals, context);
-	}
+		  (Index) 0, NULL, &addl_conds, context->params_list);
+	appendWhereClause(quals, &addl_conds, context);
+	pfree(addl_conds.data);
 }
 
 /*
@@ -1598,6 +1596,33 @@ appendConditions(List *exprs, deparse_expr_cxt *context)
 	reset_transmission_modes(nestlevel);
 }
 
+/*
+ * Append WHERE clause, containing conditions
+ * from exprs and addl_conds, to context->buf.
+ */
+static void
+appendWhereClause(List *exprs, StringInfo addl_conds, deparse_expr_cxt *context)
+{
+	StringInfo	buf = context->buf;
+	bool		need_and = false;
+
+	if (exprs != NIL || addl_conds->len > 0)
+		appendStringInfoString(buf, " WHERE ");
+
+	if (exprs != NIL)
+	{
+		appendConditions(exprs, context);
+		need_and = true;
+	}
+
+	if (addl_conds->len > 0)
+	{
+		if (need_and)
+			appendStringInfoString(buf, " AND ");
+		appendStringInfo(buf, "(%s)", addl_conds->data);
+	}
+}
+
 /* Output join name for given join type */
 const char *
 get_jointype_name(JoinType jointype)
@@ -1616,6 +1641,9 @@ get_jointype_name(JoinType jointype)
 		case JOIN_FULL:
 			return "FULL";
 
+		case JOIN_SEMI:
+			return "SEMI";
+
 		default:
 			/* Shouldn't come here, but protect from buggy code. */
 			elog(ERROR, "unsupported join type %d", jointype);
@@ -1715,7 +1743,7 @@ deparseSubqueryTargetList(deparse_expr_cxt *context)
  */
 static void
 deparseFromExprForRel(StringInfo buf, PlannerInfo *root, RelOptInfo *foreignrel,
-	  bool use_alias, Index ignore_rel, List **ignore_conds,
+	  bool use_alias, Index ignore_rel, List **ignore_conds, StringInfo addl_conds,
 	  List **params_list)
 {
 	PgFdwRela

Re: New docs chapter on Transaction Management and related changes

2022-11-06 Thread Bruce Momjian
On Fri, Nov  4, 2022 at 04:17:28PM +0100, Laurenz Albe wrote:
> On Sat, 2022-10-15 at 21:08 -0400, Bruce Momjian wrote:
> > I therefore merged all three paragraphs into
> > one and tried to make the text saner;  release_savepoint.sgml diff
> > attached, URL content updated.
> 
> I wanted to have a look at this, but I am confused.  The original patch
> was much bigger.  Is this just an incremental patch?  If yes, it would
> be nice to have a "grand total" patch, so that I can read it all
> in one go.

Yeah, I said:

Yes, I didn't like global either --- I like your wording.  I added your
other changes too, with slight rewording.  Merged patch to be posted in
                                           ----------------------------
a later email.

but that was unclear.  Let me post one now, and Simon posted one too.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson





Re: [PATCH] Add native windows on arm64 support

2022-11-06 Thread Michael Paquier
On Sat, Nov 05, 2022 at 11:31:36AM -0700, Andres Freund wrote:
> On 2022-11-03 11:06:46 +, Niyas Sait wrote:
> Note that we're planning to remove the custom windows build scripts before the
> next release, relying on the meson build instead.

Yippee.

>>  #if defined(_WIN64)
>>  static __forceinline void
>>  spin_delay(void)
>>  {
>> +#ifdef _M_ARM64
>> +/*
>> + * arm64 way of hinting processor for spin loops optimisations
>> + * ref: 
>> https://community.arm.com/support-forums/f/infrastructure-solutions-forum/48654/ssetoneon-faq
>> +*/
>> +__isb(_ARM64_BARRIER_SY);
>> +#else
>>  _mm_pause();
>> +#endif
>>  }
>>  #else
>>  static __forceinline void
> 
> I think we should just apply this, there seems very little risk of making
> anything worse, given the gating to _WIN64 && _M_ARM64.

Seems so.  Hmm, where does _ARM64_BARRIER_SY come from?  Perhaps it
would be better to have a comment referring to it from somewhere other
than the Arm forums, like some actual docs?

> The meson checking logic is used both for msvc and other compilers, so this
> will need to work with both.

The patch has been switched as waiting on author for now.
--
Michael


signature.asc
Description: PGP signature


Re: [patch] Have psql's \d+ indicate foreign partitions

2022-11-06 Thread Tom Lane
Michael Paquier  writes:
> On Sun, Nov 06, 2022 at 09:23:01PM +0900, Ian Lawrence Barwick wrote:
>> Fair enough, makes sense.

> Fine by me and the patch looks OK.  I'd like to apply this if there
> are no objections.

WFM.

regards, tom lane




Re: [patch] Have psql's \d+ indicate foreign partitions

2022-11-06 Thread Michael Paquier
On Sun, Nov 06, 2022 at 09:23:01PM +0900, Ian Lawrence Barwick wrote:
> Fair enough, makes sense.

Fine by me and the patch looks OK.  I'd like to apply this if there
are no objections.
--
Michael


signature.asc
Description: PGP signature


Re: [PoC] Reducing planning time when tables have many partitions

2022-11-06 Thread Zhang Mingli
HI,

Regards,
Zhang Mingli
On Nov 7, 2022, 14:26 +0800, Tom Lane , wrote:
> Andrey Lepikhov  writes:
> > I'm still reviewing your patch now. For the most part it seems ok, but do
> > you really need both eq_sources and eq_derives lists now?
>
> Didn't we just have this conversation? eq_sources needs to be kept
> separate to support the "broken EC" logic. We don't want to be
> regurgitating derived clauses as well as originals in that path.
>
Aha, we had that conversation in another thread (Reducing
duplicativeness of EquivalenceClass-derived clauses):
https://www.postgresql.org/message-id/644164.1666877342%40sss.pgh.pa.us


Re: [PoC] Reducing planning time when tables have many partitions

2022-11-06 Thread Tom Lane
Andrey Lepikhov  writes:
> I'm still reviewing your patch now. For the most part it seems ok, but do
> you really need both eq_sources and eq_derives lists now?

Didn't we just have this conversation?  eq_sources needs to be kept
separate to support the "broken EC" logic.  We don't want to be
regurgitating derived clauses as well as originals in that path.

regards, tom lane




Re: [PoC] Reducing planning time when tables have many partitions

2022-11-06 Thread Andrey Lepikhov

On 2/11/2022 15:27, Yuya Watari wrote:

Hello,

I noticed that the previous patch does not apply to the current HEAD.
I attached the rebased version to this email.

I'm still reviewing your patch now. For the most part it seems ok, but do 
you really need both eq_sources and eq_derives lists now? As far as I can 
see, access to these lists is everywhere guided by eclass_source_indexes 
and eclass_derive_indexes correspondingly. Maybe merge them?


--
regards,
Andrey Lepikhov
Postgres Professional





Re: archive modules

2022-11-06 Thread Michael Paquier
On Sat, Nov 05, 2022 at 02:08:58PM -0700, Nathan Bossart wrote:
> Such a module could define a custom GUC that accepts a shell command.  I
> don't think we should overload the meaning of archive_command based on the
> whims of whatever archive module is loaded.  Besides the potential end-user
> confusion, your archive_command might be unexpectedly used incorrectly if
> you forget to set archive_library.

That would mean mostly copying the logic from shell_archive.c to build
the command to execute (aka shell_archive_file), which is not great
either.  But well, perhaps my whole line of argument is just moot..

> Perhaps we could eventually move the archive_command functionality to a
> contrib module (i.e., "shell_archive") so that users must always set
> archive_library.  But until then, I suspect it's better to treat modules
> and commands as two separate interfaces to ease migration from older major
> versions (even though archive_command is now essentially a built-in archive
> module).

I agree that this is a fine long-term goal, removing all traces of the
archive_command from the backend core code.  This is actually an
argument in favor of having no traces of XLogArchiveCommand in
pgarch.c, no? ;p

I am not sure how long we should wait before being able to do that,
perhaps a couple of years at least?  I'd like to think the sooner the
better (like v17?) but we are usually conservative, and the removal of
the exclusive backup mode took 5~6 years if I recall correctly..
--
Michael


signature.asc
Description: PGP signature


Re: Allow file inclusion in pg_hba and pg_ident files

2022-11-06 Thread Michael Paquier
On Thu, Nov 03, 2022 at 08:55:05PM +0900, Michael Paquier wrote:
> On Wed, Nov 02, 2022 at 09:06:02PM +0800, Julien Rouhaud wrote:
>>> The addition of a check for the depth in two places seems unnecessary,
>>> and it looks like we should do this kind of check in only one place.
>> 
>> I usually prefer to add a maybe unnecessary depth check since it's basically
>> free rather than risking an unfriendly stack size error (including in 
>> possible
>> later refactoring), but no objection to get rid of it.
> 
> The depth check is a good idea, though I guess that there is an
> argument for having it in only one code path, not two.

So, I have looked at this specific part, and I think that we should
try to solve this issue on its own as the error message leading to a
failure at startup is confusing now, giving:
FATAL:  exceeded maxAllocatedDescs (166) while trying to open file \"blah\".

It took me some time to ponder about what kind of interface we should
use for this job, but I think that it comes down to three things:
- The depth ought to be checked just before doing AllocateFile() for a
file.
- We should have one code path calling AllocateFile().
- The same interface should be used for hba.c and hbafuncs.c.

I have settled down to the following in the attached patch 0002:
open_auth_file(const char *filename, int elevel, int depth, char **err_msg)
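
As a rough sketch of what that interface could look like (the nesting
limit value and the error wording here are illustrative, not the actual
patch):

static FILE *
open_auth_file(const char *filename, int elevel, int depth, char **err_msg)
{
	FILE	   *file;

	if (depth > 10)				/* illustrative nesting limit */
	{
		ereport(elevel,
				(errcode(ERRCODE_CONFIG_FILE_ERROR),
				 errmsg("could not open file \"%s\": maximum nesting depth exceeded",
						filename)));
		if (err_msg)
			*err_msg = psprintf("could not open file \"%s\": maximum nesting depth exceeded",
								filename);
		return NULL;
	}

	file = AllocateFile(filename, "r");
	if (file == NULL)
	{
		ereport(elevel,
				(errcode_for_file_access(),
				 errmsg("could not open file \"%s\": %m", filename)));
		if (err_msg)
			*err_msg = psprintf("could not open file \"%s\"", filename);
	}
	return file;
}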

More checks could be added in this code path (guc-file.l does more
with immediate recursions, for example), but the depth level should be
enough anyway.  I am aware of the fact that this reduces a bit the
information provided for what's called a "secondary" file in your
patch set, but I tend toward moving into something that's
minimalistic.  Currently, we provide the file and its '@' shortcut,
and we don't mention at all the file where it comes from.  This is a
minimal problem now because we only have pg_hba.conf to worry about
for most users, still I am wondering whether it would be better in the
long run to complete that with an error context that mentions the
file(s) where this comes from and report a full chain of the events
that led to this opening failure.  At the end, I think that 0002 moves
us closer toward the goal of this thread.  I have not been a fan of
HbaIncludeKind since the patch was proposed, TBH.

Another thing is that this removes the need for open_inc_file(), which
in my opinion is overcomplicated.  Also, it seems like under the
non-strict mode (aka include_if_exists), we would generate a LOG to
mention that a configuration file is skipped, but there would be
nothing in err_msg for the TokenizedAuthLine, which is a mistake I
guess?

>>> I have not tried yet, but if we actually move the AllocateFile() and
>>> FreeFile() calls within tokenize_auth_file(), it looks like we may be
>>> able to get to a simpler logic without the need of the with_context()
>>> flavor (even no process_included_auth_file required)?  That could make
>>> the interface easier to follow as a whole, while limiting the presence
>>> of AllocateFile() and FreeFile() to a single code path, impacting
>>> open_inc_file() that relies on what the patch uses for
>>> SecondaryAuthFile and IncludedAuthFile (we could attempt to use the
>>> same error message everywhere as well, as one could expect that
>>> expanded and included files have different names which is enough to
>>> guess which one is an inclusion and which one is a secondary).
>> 
>> You meant tokenize_auth_file_internal?  Yes possibly, in general I tried
>> to avoid changing too much the existing code to ease the patch acceptance, 
>> but
>> I agree it would make things simpler.
> 
> I *guess* that my mind implied tokenize_auth_file() as of
> yesterday's.

While thinking more about the logic, I saw that your approach to use
the same MemoryContext "linecxt" and pass it down to the various
layers with tokenize_file_with_context() has the advantage to remove
the need to copy a list of TokenizedAuthLine coming from other
included files and/or dirs, which could easily lead to bugs especially
for the sourcefile if someone is not careful enough.  There is no need
to touch the existing tokenize_inc_file() that copies a set of
AuthTokens to an existing list in TokenizedAuthLine, of course.

Should we do without tokenize_file_with_context() though?  There is an
argument for making everything go through tokenize_auth_file(), having
a MemoryContext as argument (aka if NULL create it, if not use it),
while still returning the memory context used?  Similarly, it looks
like we should have no need for process_included_authfile() at the
end.

I have applied as a1a7bb8 the refactoring of the directory logic for
configuration files, and noticed that this could also be used for
tokenize_inc_file().  This reduces the whole patch by ~15%.

Attached is a set of three patches:
- 0001 changes tokenize_inc_file() to use AbsoluteConfigLocation().
AbsoluteConfigLocation() uses a static buffer and a MAXPGPATH, but
we'd rather change it to use a

Re: Perform streaming logical transactions by background workers and parallel apply

2022-11-06 Thread Amit Kapila
On Mon, Nov 7, 2022 at 10:02 AM Masahiko Sawada  wrote:
>
> On Mon, Nov 7, 2022 at 12:58 PM Amit Kapila  wrote:
> >
> > > > I agree that it could be better to have a new lock tag. Another point 
> > > > is that
> > > > the remote xid and Local xid could be the same in some rare cases, so I 
> > > > think
> > > > we might need to add another identifier to make it unique.
> > > >
> > > > Maybe :
> > > > locktag_field1 : subscription oid
> > > > locktag_field2 : xid(remote or local)
> > > > locktag_field3 : 0(lock for stream block)/1(lock for transaction)
> > >
> > > Or I think we can use locktag_field2 for remote xid and locktag_field3
> > > for local xid.
> > >
> >
> > We can do that way as well but OTOH, I think for the local
> > transactions we don't need subscription oid, so field1 could be
> > InvalidOid and field2 will be xid of local xact. Won't that be better?
>
> This would work. But I'm a bit concerned that we cannot identify which
> subscriptions the lock belongs to when checking pg_locks view.
>

Fair point. I think if the user wants, she can join with
pg_stat_subscription based on PID and find the corresponding
subscription. However, if we want to identify everything via pg_locks
then I think we should also mention classid or database id as field1.
So, it would look like: field1: (pg_subscription's oid or current db
id); field2: OID of subscription in pg_subscription; field3: local or
remote xid; field4: 0/1 to differentiate between remote and local xid.
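
As a minimal sketch, modeled on the existing SET_LOCKTAG_OBJECT (the
macro name, the LOCKTAG_APPLY_TRANSACTION tag, and the parameter names
here are assumptions for illustration, not committed code), that could
look like:

#define SET_LOCKTAG_APPLY_TRANSACTION(tag, classid, suboid, xid, islocal) \
	((tag).locktag_field1 = (classid),	/* pg_subscription's oid or db id */ \
	 (tag).locktag_field2 = (suboid),	/* OID of the subscription */ \
	 (tag).locktag_field3 = (xid),		/* remote or local xid */ \
	 (tag).locktag_field4 = (islocal),	/* 0/1: remote vs. local xid */ \
	 (tag).locktag_type = LOCKTAG_APPLY_TRANSACTION, \
	 (tag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)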

-- 
With Regards,
Amit Kapila.




Re: Free list same_input_transnos in preprocess_aggref

2022-11-06 Thread Zhang Mingli
HI,


On Nov 7, 2022, 04:12 +0800, Tom Lane , wrote:
>
> The NIL lists are of course occupying no storage. The two one-element
> lists are absolutely, completely negligible in the context of planning
> any nontrivial statement. Even the aggtransinfos list that is the
> primary output of preprocess_aggref will dwarf that; and we leak
> similarly small data structures in probably many hundred places in
> the planner.
>
> I went a bit further and ran the core regression tests, then aggregated
> the results:
>
> $ grep 'leaking list' postmaster.log | sed 's/.*] //' | sort | uniq -c
> 4516 LOG: leaking list of length 0
> 95 LOG: leaking list of length 1
> 15 LOG: leaking list of length 2
>
> You can quibble of course about how representative the regression tests
> are, but there's sure no evidence at all here that we'd be saving
> anything measurable.
>
> If anything, I'd be inclined to get rid of the
>
> list_free(*same_input_transnos);
>
> in find_compatible_agg, because it seems like a waste of code on
> the same grounds. Instrumenting that in the same way, I find
> that it's not reached at all in your example, while the
> regression tests give
>
> 49 LOG: freeing list of length 0
> 2 LOG: freeing list of length 1
>
Thanks for the investigation.
Yeah, the saving from this patch is negligible. I’ll withdraw it in the CF.

Regards,
Zhang Mingli


Re: Perform streaming logical transactions by background workers and parallel apply

2022-11-06 Thread Masahiko Sawada
On Mon, Nov 7, 2022 at 12:58 PM Amit Kapila  wrote:
>
> On Mon, Nov 7, 2022 at 8:26 AM Masahiko Sawada  wrote:
> >
> > On Sun, Nov 6, 2022 at 3:40 PM houzj.f...@fujitsu.com
> >  wrote:
> > >
> > > On Saturday, November 5, 2022 1:43 PM Amit Kapila 
> > > 
> > > >
> > > > On Fri, Nov 4, 2022 at 7:35 PM houzj.f...@fujitsu.com
> > > >  wrote:
> > > > >
> > > > > On Friday, November 4, 2022 4:07 PM Amit Kapila
> > > >  wrote:
> > > > > >
> > > > > > On Thu, Nov 3, 2022 at 6:36 PM houzj.f...@fujitsu.com
> > > > > >  wrote:
> > > > > > >
> > > > > > > Thanks for the analysis and summary !
> > > > > > >
> > > > > > > I tried to implement the above idea and here is the patch set.
> > > > > > >
> > > > > >
> > > > > > Few comments on v42-0001
> > > > > > ===
> > > > >
> > > > > Thanks for the comments.
> > > > >
> > > > > >
> > > > > > 10.
> > > > > > + winfo->shared->stream_lock_id = parallel_apply_get_unique_id();
> > > > > > + winfo->shared->transaction_lock_id =
> > > > > > + winfo->shared->parallel_apply_get_unique_id();
> > > > > >
> > > > > > Why can't we use xid (remote_xid) for one of these and local_xid
> > > > > > (one generated by parallel apply) for the other?
> > > > ...
> > > > ...
> > > > >
> > > > > I also considered using xid for these locks, but it seems the objsubid
> > > > > for the shared object lock is 16bit while xid is 32 bit. So, I tried
> > > > > to generate a unique 16bit id here.
> > > > >
> > > >
> > > > Okay, I see your point. Can we think of having a new lock tag for this 
> > > > with classid,
> > > > objid, objsubid for the first three fields of locktag field? We can use 
> > > > a new
> > > > macro SET_LOCKTAG_APPLY_TRANSACTION and a common function to set the
> > > > tag and acquire the lock. One more point related to this is that I am 
> > > > suggesting
> > > > classid by referring to SET_LOCKTAG_OBJECT as that is used in the 
> > > > current
> > > > patch but do you think we need it for our purpose, won't subscription 
> > > > id and
> > > > xid can uniquely identify the tag?
> > >
> > > I agree that it could be better to have a new lock tag. Another point is 
> > > that
> > > the remote xid and Local xid could be the same in some rare cases, so I 
> > > think
> > > we might need to add another identifier to make it unique.
> > >
> > > Maybe :
> > > locktag_field1 : subscription oid
> > > locktag_field2 : xid(remote or local)
> > > locktag_field3 : 0(lock for stream block)/1(lock for transaction)
> >
> > Or I think we can use locktag_field2 for remote xid and locktag_field3
> > for local xid.
> >
>
> We can do that way as well but OTOH, I think for the local
> transactions we don't need subscription oid, so field1 could be
> InvalidOid and field2 will be xid of local xact. Won't that be better?

This would work. But I'm a bit concerned that we cannot identify which
subscriptions the lock belongs to when checking pg_locks view.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: explain analyze rows=%.0f

2022-11-06 Thread Andrey Lepikhov

On 22/7/2022 16:47, Amit Kapila wrote:

I feel the discussion has slightly deviated which makes it unclear
whether this patch is required or not?


After a quick review I want to express my thoughts.
First, we have been waiting for this feature for years. Clients often 
give us an explain where we see something like:

"rows=0, loops=100".
Without verbose mode, I can't even tell whether this node produces 
any rows or not.

So, I think this feature is useful mostly for parameterized plans.
Also, printing two or even three decimal digits isn't meaningful - 
sometimes we have a plan where the number of loops is about 1E6 or even 
1E7, but the number of real rows is 100 or 1000, so the per-loop 
average still rounds to zero.

To overcome this issue, I see two options:
1. Show the exact number of tuples without division by loops (fair, 
but invasive and bad for automation tools).

2. Show rows in scientific format like X.XXEXX.
I vote for the second option.
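
As a tiny standalone illustration of the problem (the values here are
assumed, not taken from the patch):

#include <stdio.h>

int
main(void)
{
	double		ntuples = 100.0;	/* real rows produced across all loops */
	double		nloops = 1000000.0; /* 1E6 loops */

	printf("rows=%.0f\n", ntuples / nloops);	/* prints rows=0 */
	printf("rows=%.2e\n", ntuples / nloops);	/* prints rows=1.00e-04 */
	return 0;
}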

--
regards,
Andrey Lepikhov
Postgres Professional





Re: Perform streaming logical transactions by background workers and parallel apply

2022-11-06 Thread Amit Kapila
On Mon, Nov 7, 2022 at 8:26 AM Masahiko Sawada  wrote:
>
> On Sun, Nov 6, 2022 at 3:40 PM houzj.f...@fujitsu.com
>  wrote:
> >
> > On Saturday, November 5, 2022 1:43 PM Amit Kapila 
> > >
> > > On Fri, Nov 4, 2022 at 7:35 PM houzj.f...@fujitsu.com
> > >  wrote:
> > > >
> > > > On Friday, November 4, 2022 4:07 PM Amit Kapila
> > >  wrote:
> > > > >
> > > > > On Thu, Nov 3, 2022 at 6:36 PM houzj.f...@fujitsu.com
> > > > >  wrote:
> > > > > >
> > > > > > Thanks for the analysis and summary !
> > > > > >
> > > > > > I tried to implement the above idea and here is the patch set.
> > > > > >
> > > > >
> > > > > Few comments on v42-0001
> > > > > ===
> > > >
> > > > Thanks for the comments.
> > > >
> > > > >
> > > > > 10.
> > > > > + winfo->shared->stream_lock_id = parallel_apply_get_unique_id();
> > > > > + winfo->shared->transaction_lock_id =
> > > > > + winfo->shared->parallel_apply_get_unique_id();
> > > > >
> > > > > Why can't we use xid (remote_xid) for one of these and local_xid
> > > > > (one generated by parallel apply) for the other?
> > > ...
> > > ...
> > > >
> > > > I also considered using xid for these locks, but it seems the objsubid
> > > > for the shared object lock is 16bit while xid is 32 bit. So, I tried
> > > > to generate a unique 16bit id here.
> > > >
> > >
> > > Okay, I see your point. Can we think of having a new lock tag for this 
> > > with classid,
> > > objid, objsubid for the first three fields of locktag field? We can use a 
> > > new
> > > macro SET_LOCKTAG_APPLY_TRANSACTION and a common function to set the
> > > tag and acquire the lock. One more point related to this is that I am 
> > > suggesting
> > > classid by referring to SET_LOCKTAG_OBJECT as that is used in the current
> > > patch but do you think we need it for our purpose, won't subscription id 
> > > and
> > > xid can uniquely identify the tag?
> >
> > I agree that it could be better to have a new lock tag. Another point is 
> > that
> > the remote xid and Local xid could be the same in some rare cases, so I 
> > think
> > we might need to add another identifier to make it unique.
> >
> > Maybe :
> > locktag_field1 : subscription oid
> > locktag_field2 : xid(remote or local)
> > locktag_field3 : 0(lock for stream block)/1(lock for transaction)
>
> Or I think we can use locktag_field2 for remote xid and locktag_field3
> for local xid.
>

We can do that way as well but OTOH, I think for the local
transactions we don't need subscription oid, so field1 could be
InvalidOid and field2 will be xid of local xact. Won't that be better?

-- 
With Regards,
Amit Kapila.




Re: Perform streaming logical transactions by background workers and parallel apply

2022-11-06 Thread Masahiko Sawada
On Sun, Nov 6, 2022 at 3:40 PM houzj.f...@fujitsu.com
 wrote:
>
> On Saturday, November 5, 2022 1:43 PM Amit Kapila 
> >
> > On Fri, Nov 4, 2022 at 7:35 PM houzj.f...@fujitsu.com
> >  wrote:
> > >
> > > On Friday, November 4, 2022 4:07 PM Amit Kapila
> >  wrote:
> > > >
> > > > On Thu, Nov 3, 2022 at 6:36 PM houzj.f...@fujitsu.com
> > > >  wrote:
> > > > >
> > > > > Thanks for the analysis and summary !
> > > > >
> > > > > I tried to implement the above idea and here is the patch set.
> > > > >
> > > >
> > > > Few comments on v42-0001
> > > > ===
> > >
> > > Thanks for the comments.
> > >
> > > >
> > > > 10.
> > > > + winfo->shared->stream_lock_id = parallel_apply_get_unique_id();
> > > > + winfo->shared->transaction_lock_id =
> > > > + winfo->shared->parallel_apply_get_unique_id();
> > > >
> > > > Why can't we use xid (remote_xid) for one of these and local_xid
> > > > (one generated by parallel apply) for the other?
> > ...
> > ...
> > >
> > > I also considered using xid for these locks, but it seems the objsubid
> > > for the shared object lock is 16bit while xid is 32 bit. So, I tried
> > > to generate a unique 16bit id here.
> > >
> >
> > Okay, I see your point. Can we think of having a new lock tag for this with 
> > classid,
> > objid, objsubid for the first three fields of locktag field? We can use a 
> > new
> > macro SET_LOCKTAG_APPLY_TRANSACTION and a common function to set the
> > tag and acquire the lock. One more point related to this is that I am 
> > suggesting
> > classid by referring to SET_LOCKTAG_OBJECT as that is used in the current
> > patch but do you think we need it for our purpose, won't subscription id and
> > xid can uniquely identify the tag?
>
> I agree that it could be better to have a new lock tag. Another point is that
> the remote xid and Local xid could be the same in some rare cases, so I think
> we might need to add another identifier to make it unique.
>
> Maybe :
> locktag_field1 : subscription oid
> locktag_field2 : xid(remote or local)
> locktag_field3 : 0(lock for stream block)/1(lock for transaction)

Or I think we can use locktag_field2 for remote xid and locktag_field3
for local xid.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Do we need to pass down nonnullable_vars when reducing outer joins?

2022-11-06 Thread Richard Guo
On Sun, Nov 6, 2022 at 4:00 AM Tom Lane  wrote:

> Richard Guo  writes:
> > AFAICS, the Vars forced nonnullable by given clause are only used to
> > check if we can reduce JOIN_LEFT to JOIN_ANTI, and it is checking the
> > join's own quals there. It seems to me we do not need to pass down
> > nonnullable_vars by upper quals to the children of a join.
>
> Hmm, you are right, we are not doing anything useful with that data.
> I can't remember if I had a concrete plan for doing something with it
> or not, but we sure aren't using it now.  So pushed.


Thanks for pushing it!

Thanks
Richard


Re: Check SubPlan clause for nonnullable rels/Vars

2022-11-06 Thread Richard Guo
On Sun, Nov 6, 2022 at 3:33 AM Tom Lane  wrote:

> Richard Guo  writes:
> > [ v2-0001-Check-SubPlan-clause-for-nonnullable-rels-Vars.patch ]
>
> Pushed with cosmetic changes:
>
> * I don't believe in "add at the end" as a principle for placement
> of new code.  There's usually some other logic that will give more
> consistent results.  In cases like this, ordering the treatment of
> Node types in the same way as they appear in the include/nodes/
> headers is the standard answer.  (Not that everybody's been totally
> consistent about that :-( ... but that's not an argument for
> introducing even more entropy.)
>
> * I rewrote the comments a bit.
>
> * I didn't like the test case too much: spinning up a whole new set
> of tables seems like a lot of useless cycles.  Plus it makes it
> harder to experiment with the test query manually.  I usually like
> to write such queries using the regression database's standard tables,
> so I rewrote this example that way.


Thanks for the changes. They make the patch look better.  And thanks for
pushing it.

Thanks
Richard


Re: O(n) tasks cause lengthy startups and checkpoints

2022-11-06 Thread Nathan Bossart
On Fri, Sep 23, 2022 at 10:41:54AM -0700, Nathan Bossart wrote:
> v11 adds support for building with meson.

rebased

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 367c5f3863457cfbd0fe8add0e8df3e630aaaea9 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 5 Jan 2022 19:24:22 +
Subject: [PATCH v12 1/6] Introduce custodian.

The custodian process is a new auxiliary process that is intended
to help offload tasks that could otherwise delay startup and
checkpointing.  This commit simply adds the new process; it does
not yet do anything useful.
---
 src/backend/postmaster/Makefile |   1 +
 src/backend/postmaster/auxprocess.c |   8 +
 src/backend/postmaster/custodian.c  | 383 
 src/backend/postmaster/meson.build  |   1 +
 src/backend/postmaster/postmaster.c |  44 ++-
 src/backend/storage/ipc/ipci.c  |   3 +
 src/backend/storage/lmgr/proc.c |   1 +
 src/backend/utils/activity/wait_event.c |   3 +
 src/backend/utils/init/miscinit.c   |   3 +
 src/include/miscadmin.h |   3 +
 src/include/postmaster/custodian.h  |  32 ++
 src/include/storage/proc.h  |  11 +-
 src/include/utils/wait_event.h  |   1 +
 13 files changed, 489 insertions(+), 5 deletions(-)
 create mode 100644 src/backend/postmaster/custodian.c
 create mode 100644 src/include/postmaster/custodian.h

diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 3a794e54d6..e1e1d1123f 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -18,6 +18,7 @@ OBJS = \
 	bgworker.o \
 	bgwriter.o \
 	checkpointer.o \
+	custodian.o \
 	fork_process.o \
 	interrupt.o \
 	pgarch.o \
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index 7765d1c83d..c275271c95 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -20,6 +20,7 @@
 #include "pgstat.h"
 #include "postmaster/auxprocess.h"
 #include "postmaster/bgwriter.h"
+#include "postmaster/custodian.h"
 #include "postmaster/startup.h"
 #include "postmaster/walwriter.h"
 #include "replication/walreceiver.h"
@@ -74,6 +75,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
 		case CheckpointerProcess:
 			MyBackendType = B_CHECKPOINTER;
 			break;
+		case CustodianProcess:
+			MyBackendType = B_CUSTODIAN;
+			break;
 		case WalWriterProcess:
 			MyBackendType = B_WAL_WRITER;
 			break;
@@ -153,6 +157,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
 			CheckpointerMain();
 			proc_exit(1);
 
+		case CustodianProcess:
+			CustodianMain();
+			proc_exit(1);
+
 		case WalWriterProcess:
 			WalWriterMain();
 			proc_exit(1);
diff --git a/src/backend/postmaster/custodian.c b/src/backend/postmaster/custodian.c
new file mode 100644
index 00..e90f5d0d1f
--- /dev/null
+++ b/src/backend/postmaster/custodian.c
@@ -0,0 +1,383 @@
+/*-
+ *
+ * custodian.c
+ *
+ * The custodian process handles a variety of non-critical tasks that might
+ * otherwise delay startup, checkpointing, etc.  Offloaded tasks should not
+ * be synchronous (e.g., checkpointing shouldn't wait for the custodian to
+ * complete a task before proceeding).  However, tasks can be synchronously
+ * executed when necessary (e.g., single-user mode).  The custodian is not
+ * an essential process and can shut down quickly when requested.  The
+ * custodian only wakes up to perform its tasks when its latch is set.
+ *
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *   src/backend/postmaster/custodian.c
+ *
+ *-
+ */
+#include "postgres.h"
+
+#include "libpq/pqsignal.h"
+#include "pgstat.h"
+#include "postmaster/custodian.h"
+#include "postmaster/interrupt.h"
+#include "storage/bufmgr.h"
+#include "storage/condition_variable.h"
+#include "storage/fd.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/smgr.h"
+#include "utils/memutils.h"
+
+static void DoCustodianTasks(bool retry);
+static CustodianTask CustodianGetNextTask(void);
+static void CustodianEnqueueTask(CustodianTask task);
+static const struct cust_task_funcs_entry *LookupCustodianFunctions(CustodianTask task);
+
+typedef struct
+{
+	slock_t		cust_lck;
+
+	CustodianTask task_queue_elems[NUM_CUSTODIAN_TASKS];
+	int			task_queue_head;
+} CustodianShmemStruct;
+
+static CustodianShmemStruct *CustodianShmem;
+
+typedef void (*CustodianTaskFunction) (void);
+typedef void (*CustodianTaskHandleArg) (Datum arg);
+
+struct cust_task_funcs_entry
+{
+	CustodianTask task;
+	CustodianTaskFunction task_func;		/* performs task */
+	CustodianTaskHandleArg handle_arg_func;	/* handles additional info in request */
+};
+
+/*
+ * Add new tasks here.
+ *
+ * task_func is the logic that will be executed via DoCustodianTasks() when the
+ * matching task is r

Re: Allow single table VACUUM in transaction block

2022-11-06 Thread Peter Geoghegan
On Sun, Nov 6, 2022 at 11:14 AM Tom Lane  wrote:
> In general, I do not believe in encouraging users to run VACUUM
> manually in the first place.  We would be far better served by
> spending our effort to improve autovacuum's shortcomings.

I couldn't agree more. A lot of problems seem related to the idea that
VACUUM is just a command that the DBA periodically runs to get a
predictable fixed result, a little like CREATE INDEX. That conceptual
model isn't exactly wrong; it just makes it much harder to apply any
kind of context about the needs of the table over time. There is a
natural cycle to how VACUUM (really autovacuum) is run, and the
details matter.

There is a significant amount of relevant context that we can't really
use right now. That wouldn't be true if VACUUM only ran within an
autovacuum worker (by definition). The VACUUM command itself would
still be available, and support the same user interface, more or less.
Under the hood the VACUUM command would work by enqueueing a VACUUM
job, to be performed asynchronously by an autovacuum worker. Perhaps
the initial enqueue operation could be transactional, fixing Simon's complaint.

"No more VACUUMs outside of autovacuum" would enable more advanced
autovacuum.c scheduling, allowing us to apply a lot more context about
the costs and benefits, without having to treat manual VACUUM as an
independent thing. We could coalesce together redundant VACUUM jobs,
suspend and resume VACUUM operations, and have more strategies to deal
with problems as they emerge.

> I'd like to see some sort of direct attack on its inability to deal
> with temp tables, for instance.  (Force the owning backend to
> do it?  Temporarily change the access rules so that the data
> moves to shared buffers?  Dunno, but we sure haven't tried hard.)

This is a good example of the kind of thing I have in mind. Perhaps it
could work by killing the backend that owns the temp relation when
things truly get out of hand? I think that that would be a perfectly
reasonable trade-off.

Another related idea: better behavior in the event of a manually
issued VACUUM (now just an enqueued autovacuum) that cannot do useful
work due to the presence of a long running snapshot. The VACUUM
doesn't have to dutifully report "success" when there is no practical
sense in which it was successful. There could be a back and forth
conversation between autovacuum.c and vacuumlazy.c that makes sure
that something useful happens sooner or later. The passage of time
really matters here.

As a bonus, we might be able to get rid of the autovacuum GUC
variants. Plus the current autovacuum logging would just be how we'd
log every VACUUM.

--
Peter Geoghegan




Re: Free list same_input_transnos in preprocess_aggref

2022-11-06 Thread Tom Lane
Zhang Mingli  writes:
> Correction: SaveBytes = Sum results of accumulate_list_size: 24(4+4+8+8),

What I did was to stick in

elog(LOG, "leaking list of length %d", 
list_length(same_input_transnos));

at the end of preprocess_aggref.  What I see on your five-aggregate
example is

2022-11-06 14:59:25.666 EST [3046253] LOG:  leaking list of length 0
2022-11-06 14:59:25.666 EST [3046253] STATEMENT:  explain select max(id), min(id), sum(id), count(id), avg(id) from t1;
2022-11-06 14:59:25.666 EST [3046253] LOG:  leaking list of length 1
2022-11-06 14:59:25.666 EST [3046253] STATEMENT:  explain select max(id), min(id), sum(id), count(id), avg(id) from t1;
2022-11-06 14:59:25.666 EST [3046253] LOG:  leaking list of length 0
2022-11-06 14:59:25.666 EST [3046253] STATEMENT:  explain select max(id), min(id), sum(id), count(id), avg(id) from t1;
2022-11-06 14:59:25.666 EST [3046253] LOG:  leaking list of length 1
2022-11-06 14:59:25.666 EST [3046253] STATEMENT:  explain select max(id), min(id), sum(id), count(id), avg(id) from t1;
2022-11-06 14:59:25.666 EST [3046253] LOG:  leaking list of length 0
2022-11-06 14:59:25.666 EST [3046253] STATEMENT:  explain select max(id), min(id), sum(id), count(id), avg(id) from t1;

The NIL lists are of course occupying no storage.  The two one-element
lists are absolutely, completely negligible in the context of planning
any nontrivial statement.  Even the aggtransinfos list that is the
primary output of preprocess_aggref will dwarf that; and we leak
similarly small data structures in probably many hundred places in
the planner.

I went a bit further and ran the core regression tests, then aggregated
the results:

$ grep 'leaking list' postmaster.log | sed 's/.*] //' | sort | uniq -c 
   4516 LOG:  leaking list of length 0
     95 LOG:  leaking list of length 1
     15 LOG:  leaking list of length 2

You can quibble of course about how representative the regression tests
are, but there's sure no evidence at all here that we'd be saving
anything measurable.

If anything, I'd be inclined to get rid of the

list_free(*same_input_transnos);

in find_compatible_agg, because it seems like a waste of code on
the same grounds.  Instrumenting that in the same way, I find
that it's not reached at all in your example, while the
regression tests give

     49 LOG:  freeing list of length 0
      2 LOG:  freeing list of length 1

regards, tom lane




Re: Allow single table VACUUM in transaction block

2022-11-06 Thread Tom Lane
Peter Geoghegan  writes:
> My guess is that there are more things like that. Possibly even things
> that were never directly considered. VACUUM evolved in a world where
> we absolutely took not running in a transaction for granted. Changing
> that now is a pretty big deal. Maybe it would all be worth it if the end
> result was a super compelling feature. But I for one don't think that
> it will be.

Yeah.  To be blunt, this proposal scares the sh*t out of me.  I agree
that there are tons of subtle dependencies on our current assumptions
about how VACUUM works, and I strongly suspect that we won't find all of
them until after users have lost data.  I cannot believe that running
VACUUM inside transactions is valuable enough to take that risk ...
especially if it's a modified limited kind of VACUUM that doesn't
eliminate the need for periodic real VACUUMs.

In general, I do not believe in encouraging users to run VACUUM
manually in the first place.  We would be far better served by
spending our effort to improve autovacuum's shortcomings.  I'd
like to see some sort of direct attack on its inability to deal
with temp tables, for instance.  (Force the owning backend to
do it?  Temporarily change the access rules so that the data
moves to shared buffers?  Dunno, but we sure haven't tried hard.)
However many bugs such a thing might have initially, at least
they'd only put temporary data at hazard.

regards, tom lane




Re: Allow single table VACUUM in transaction block

2022-11-06 Thread Peter Geoghegan
On Thu, Oct 27, 2022 at 2:31 AM Simon Riggs
 wrote:
> Fix, so that this works without issue:
>
> BEGIN;
> 
> VACUUM (ANALYZE) vactst;
> 
> COMMIT;
>
> Allows both ANALYZE and vacuum of toast tables, but not VACUUM FULL.
>
> When in a xact block, we do not set PROC_IN_VACUUM,
> nor update datfrozenxid.

It doesn't seem like a good idea to add various new special cases to
VACUUM just to make scripts like this work. I'm pretty sure that there
are several deep, subtle reasons why VACUUM cannot be assumed safe to
run in a user transaction.

For example, the whole way that index page deletion is decoupled from
recycling in access methods like nbtree (see "Placing deleted pages in
the FSM" from the nbtree README) rests upon delicate assumptions about
whether or not there could be an "in-flight" B-tree descent that is
at risk of landing on a deleted page as it is concurrently recycled.
In general the deleted page has to remain in place as a tombstone,
until that is definitely not possible anymore. This relies on the backend
that runs VACUUM having no references to the page pending deletion.
(Commit 9dd963ae25 added an optimization that heavily leaned on the
idea that the state within the backend running VACUUM couldn't
possibly have a live page reference that is at risk of being broken by
the optimization, though I'm pretty sure that you'd have problems even
without that commit/optimization in place.)

My guess is that there are more things like that. Possibly even things
that were never directly considered. VACUUM evolved in a world where
we absolutely took not running in a transaction for granted. Changing
that now is a pretty big deal. Maybe it would all be worth it if the end
result was a super compelling feature. But I for one don't think that
it will be.

If we absolutely have to do this, then the least worst approach might
be to make VACUUM into a no-op rather than throwing an ERROR -- demote
the ERROR into a WARNING. You could argue that we're just arbitrarily
deciding to not do a VACUUM just to be able to avoid throwing an error
if we do that. But isn't that already true with the patch that we
have? Is it really a "true VACUUM" if the operation can never advance
datfrozenxid? At least selectively demoting the ERROR to a WARNING is
"transparent" about it.

--
Peter Geoghegan




Re: [DOCS] Stats views and functions not in order?

2022-11-06 Thread Tom Lane
Peter Smith  writes:
> Sorry, I forgot the attachments in the previous post. PSA.

I spent a bit of time looking at this.  I agree that a lot of the
current ordering choices here look like they were made with the
advice of a dartboard, and there's a number of things that are
pretty blatantly just sloppy merging (like the out-of-order
wait-event items).  However, I'm not a big fan of "alphabetical
order at all costs", because that frequently leads to ordering
decisions that are not a lot better than random from a semantic
standpoint.  For example, I resist the idea that it's sensible
to put pg_stat_all_indexes before pg_stat_all_tables.
I'm unconvinced that putting pg_stat_sys_tables and
pg_stat_user_tables far away from pg_stat_all_tables is great,
either.

So ... how do we proceed?

One thing I'm unhappy about that you didn't address is that
the subsection ordering in "28.4. Progress Reporting" could
hardly have been invented even with a dartboard.  Perhaps it
reflects development order, but that's a poor excuse.
I'd be inclined to alphabetize by SQL command name, but maybe
leave Base Backup to the end since it's not a SQL command.

regards, tom lane




Re: remap the .text segment into huge pages at run time

2022-11-06 Thread Andres Freund
Hi,

On 2022-11-06 13:56:10 +0700, John Naylor wrote:
> On Sat, Nov 5, 2022 at 3:27 PM Andres Freund  wrote:
> > I don't think the dummy functions are a good approach, there were plenty
> > things after it when I played with them.
>
> To be technical, the point wasn't to have no code after it, but to have no
> *hot* code *before* it, since with the iodlr approach the first 1.99MB of
> .text is below the first aligned boundary within that section. But yeah,
> I'm happy to ditch that hack entirely.

Just because code is colder than the alternative branch doesn't necessarily
mean it's entirely cold overall. I saw hits to things after the dummy
function that had a perf effect.


> > > > With these flags the "R E" segments all start on a 0x200000/2MiB boundary and
> > > > are padded to the next 2MiB boundary. However the OS / dynamic loader only
> > > > maps the necessary part, not all the zero padding.
> > > >
> > > > This means that if we were to issue a MADV_COLLAPSE, we can before it do an
> > > > mremap() to increase the length of the mapping.
> > >
> > > I see, interesting. What location are you passing for madvise() and
> > > mremap()? The beginning of the segment (for me has .init/.plt) or an
> > > aligned boundary within .text?
>
> >	/*
> >	 * Make huge pages out of it. Requires at least linux 6.1.  We could
> >	 * fall back to MADV_HUGEPAGE if it fails, but it doesn't do all that
> >	 * much in older kernels.
> >	 */
>
> About madvise(), I take it MADV_HUGEPAGE and MADV_COLLAPSE only work for
> THP? The man page seems to indicate that.

MADV_HUGEPAGE works as long as /sys/kernel/mm/transparent_hugepage/enabled is
to always or madvise.  My understanding is that MADV_COLLAPSE will work even
if /sys/kernel/mm/transparent_hugepage/enabled is set to never.
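
To make the mremap()-before-MADV_COLLAPSE idea above concrete, here is a
minimal sketch (the function name and constants are illustrative; as
noted below, a real version would take the segment bounds from
/proc/self/maps rather than hardcoding anything):

#define _GNU_SOURCE
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25		/* added in linux 6.1 */
#endif

/* grow the text mapping into the linker's 2MiB zero padding, then collapse */
static void
remap_text_to_huge_pages(void *start, size_t len)
{
	const size_t huge = 2 * 1024 * 1024;	/* 2MiB */
	size_t		padded = (len + huge - 1) & ~(huge - 1);

	/* extend in place; the segment was padded to the next 2MiB boundary */
	if (mremap(start, len, padded, 0) == MAP_FAILED)
		return;

	/* ask the kernel to back the range with huge pages (linux 6.1+) */
	(void) madvise(start, padded, MADV_COLLAPSE);
}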


> In the support work I've done, the standard recommendation is to turn THP
> off, especially if they report sudden performance problems.

I think that's pretty much an outdated suggestion FWIW. Largely caused by Red
Hat extremely aggressively backpatching transparent hugepages into RHEL 6
(IIRC). Lots of improvements have been made to THP since then. I've tried to
see negative effects maybe 2-3 years back, without success.

I really don't see a reason to ever set
/sys/kernel/mm/transparent_hugepage/enabled to 'never', rather than just 
'madvise'.


> If explicit HP's are used for shared mem, maybe THP is less of a risk? I
> need to look back at the tests that led to that advice...

I wouldn't give that advice to customers anymore, unless they use extremely
old platforms or unless there's very concrete evidence.


> > A real version would have to open /proc/self/maps and do this for at least
>
> I can try and generalize your above sketch into a v2 patch.

Cool.


> > postgres' r-xp mapping. We could do it for libraries too, if they're
> suitably
> > aligned (both in memory and on-disk).
>
> It looks like plpgsql is only 27 standard pages in size...
>
> Regarding glibc, we could try moving a couple of the hotter functions into
> PG, using smaller and simpler coding, if that has better frontend cache
> behavior. The paper "Understanding and Mitigating Front-End Stalls in
> Warehouse-Scale Computers" talks about this, particularly section 4.4
> regarding memcmp().

I think the amount of work necessary for that is nontrivial and continual. So
I'm loath to go there.


> > > I quickly tried to align the segments with the linker and then in my
> patch
> > > have the address for mmap() rounded *down* from the .text start to the
> > > beginning of that segment. It refused to start without logging an error.
> >
> > Hm, what linker was that? I did note that you need some additional flags
> for
> > some of the linkers.
>
> BFD, but I wouldn't worry about that failure too much, since the
> mremap()/madvise() strategy has a lot fewer moving parts.
>
> On the subject of linkers, though, one thing that tripped me up was trying
> to change the linker with Meson. First I tried
>
> -Dc_args='-fuse-ld=lld'

It's -Dc_link_args=...


> but that led to warnings like this:
> /usr/bin/ld: warning: -z separate-loadable-segments ignored
>
> When using this in the top level meson.build
>
> elif host_system == 'linux'
>   sema_kind = 'unnamed_posix'
>   cppflags += '-D_GNU_SOURCE'
>   # Align the loadable segments to 2MB boundaries to support remapping to
>   # huge pages.
>   ldflags += cc.get_supported_link_arguments([
> '-Wl,-zmax-page-size=0x200000',
> '-Wl,-zcommon-page-size=0x200000',
> '-Wl,-zseparate-loadable-segments'
>   ])
>
>
> According to
>
> https://mesonbuild.com/howtox.html#set-linker
>
> I need to add CC_LD=lld to the env vars before invoking, which got rid of
> the warning. Then I wanted to verify that lld was actually used, and in
>
> https://releases.llvm.org/14.0.0/tools/lld/docs/index.html

You can just look at build.ninja, fwiw. Or use ninja -v (in postgres's cases
with -d keeprsp, be

Re: BUG #17434: CREATE/DROP DATABASE can be executed in the same transaction with other commands

2022-11-06 Thread Tom Lane
Yugo NAGATA  writes:
>> The attached patch tries to add comments explaining it on the functions.

> I forward it to the hackers list because the patch is to fix comments.

What do you think of the attached wording?

I don't think the pipeline angle is of concern to anyone who might be
reading these comments with the aim of understanding what guarantees
they have.  Perhaps there should be more about that in the user-facing
docs, though.

regards, tom lane

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 883d6c0f70..8086b857b9 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -3448,6 +3448,10 @@ AbortCurrentTransaction(void)
  *	a transaction block, typically because they have non-rollback-able
  *	side effects or do internal commits.
  *
+ *	If this routine completes successfully, then the calling statement is
+ *	guaranteed that if it completes without error, its results will be
+ *	committed immediately.
+ *
  *	If we have already started a transaction block, issue an error; also issue
  *	an error if we appear to be running inside a user-defined function (which
  *	could issue more commands and possibly cause a failure after the statement
@@ -3573,6 +3577,10 @@ CheckTransactionBlock(bool isTopLevel, bool throwError, const char *stmtType)
  *	a transaction block than when running as single commands.  ANALYZE is
  *	currently the only example.
  *
+ *	If this routine returns "false", then the calling statement is
+ *	guaranteed that if it completes without error, its results will be
+ *	committed immediately.
+ *
  *	isTopLevel: passed down from ProcessUtility to determine whether we are
  *	inside a function.
  */


Re: ssl tests aren't concurrency safe due to get_free_port()

2022-11-06 Thread Andrew Dunstan


On 2022-11-05 Sa 14:36, Andres Freund wrote:
>>  
>>  use Carp;
>>  use Config;
>> -use Fcntl qw(:mode);
>> +use Fcntl qw(:mode :flock :seek O_CREAT O_RDWR);
> Does this do anything useful on windows?


All we're doing here on Windows and elsewhere is getting access to some
constants used in calls to flock(), seek() and sysopen(). It's not
actually doing anything else anywhere.
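
For illustration, a minimal sketch of the pattern those constants support
(the file name, layout and locking details are assumptions for the example,
not the actual patch):

    use strict;
    use warnings;
    use Fcntl qw(:flock :seek O_CREAT O_RDWR);

    my $port = 54321;                        # candidate port
    my $file = "/tmp/pg_port_$port.lock";    # hypothetical reservation file

    # Open without clobbering, take an exclusive lock, then read any
    # previously recorded PID before recording our own.
    sysopen(my $fh, $file, O_CREAT | O_RDWR) or die "sysopen $file: $!";
    flock($fh, LOCK_EX) or die "flock $file: $!";
    seek($fh, 0, SEEK_SET);
    my $old_pid = <$fh> // '';
    seek($fh, 0, SEEK_SET);
    truncate($fh, 0) or die "truncate: $!";
    print $fh "$$\n";
    close($fh) or die "close: $!";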


>
>> +if ($pid +0 > 0)
> Gotta love perl.


Think of it as a typecast.
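
In other words, "+ 0" forces numeric context, so whatever junk might have
ended up in the file cannot sneak past the "> 0" test:

    no warnings 'numeric';           # coercing junk input is deliberate here
    printf "%d\n", "1234\n" + 0;     # 1234 -- trailing newline is ignored
    printf "%d\n", "garbage" + 0;    # 0    -- fails the "> 0" test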


>
>
>> +{
>> +if (kill 0, $pid)
> Does this work on windows?
>
Yes, it's supposed to. It doesn't actually send a signal; it checks whether
the process exists. There's some suggestion it might give false
positives on Windows, but that won't really hurt us here: we'll just
look for a different port.
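
Roughly like this (the PID is a made-up example):

    # Signal 0 performs only the existence/permission check.
    my $pid = 12345;                 # PID read from a reservation file
    if (kill(0, $pid) || $!{EPERM})  # EPERM still proves the process exists
    {
        print "process $pid still around; try another port\n";
    }
    else
    {
        print "process $pid is gone; the reservation is stale\n";
    }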

One possible addition would be to remove the reservation files in
an END handler. That would be pretty simple.
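
Something like this minimal sketch, assuming each successful reservation
pushes its file path onto a package-level array:

    our @reserved_files;    # appended to whenever a reservation file is made

    END
    {
        # Runs at interpreter exit, including after die(); unlink errors
        # are harmless since the files may already be gone.
        unlink @reserved_files if @reserved_files;
    }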


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: bug: pg_dump use strange tag for trigger

2022-11-06 Thread Pavel Stehule
On Sun, 6 Nov 2022 at 15:52, Tom Lane  wrote:

> Pavel Stehule  writes:
> > When I played with regression tests for pg_restore, I tested -T filtering
> > triggers too. I had problems with restoring triggers. I found that the
> > name for trigger uses the pattern "tablename triggername" (not just (and
> > correct) triggername).
>
> > I propose to generate tag just like trigger name
>
> Trigger names by themselves aren't even a little bit unique, so that
> doesn't seem like a great idea to me.  There's backwards compatibility
> to worry about, too.  Maybe we need a documentation adjustment, instead?
>

I understand, but the space is a little bit unintuitive. Maybe use a dot
there instead, along with better documentation.

Regards

Pavel


>
> regards, tom lane
>


Re: Draft back-branch release notes are up

2022-11-06 Thread Tom Lane
Justin Pryzby  writes:
> +  Fix planner failure with extended statistics on partitioned tables
> +  (Richard Guo, Justin Pryzby)

> This can also happen with inheritance tables.

> +  Add missing guards for NULL connection pointer

> Maybe should be <literal>NULL</literal> or <symbol>NULL</symbol>?

Done, thanks.

regards, tom lane




Re: Pluggable toaster

2022-11-06 Thread Nikita Malakhov
Hi,

Pluggable TOAST is provided as an interface to allow developers to plug
in custom TOAST mechanics. It does not determine whether a Toaster is
universal or tied to one data type, but syntax for a universal Toaster is
out of scope for this patchset.

>True, but besides Jsonb Toaster one can implement a universal one as
>well (your own example: encryption, or a Toaster that bypasses a 1 GB
>value limitation). However we can probably keep this out of scope of
>this particular patchset for now. As mentioned before this is going to
>be just an additional syntax for the user convenience.

To transparently bypass the 1 GB limit you have to increase the amount of
data the varlena type is able to hold. This is out of scope for this
patchset too, but as I mentioned before, there are means to do this with
Pluggable TOAST using toast and detoast iterators.

>Compression is actually a part of the four-stage TOAST algorithm. So
>what you're doing is using the default TOAST most of the time, but if
>the storage strategy is EXTERNAL and a custom TOASTer is specified
>then you use a type-aware "TOASTer".

We provide several custom Toasters for particular types of data that cause
a lot of problems in storage. The main idea behind Pluggable TOAST is
to use special data-aware Toasters where they are needed.
The extended storage mode supports only two compression algorithms, though
there are more efficient ones for different types of data.

>If the goal is to implement true "Pluggable TOASTers" then the TOAST
>should be substituted entirely. If the goal is to implement type-aware
>sub-TOASTers we should be honest about this and call it otherwise:
>e.g. "Type-aware TOASTer" or perhaps "subTOASTer". Additionally in
>this case there should be no validate() method since this is going to
>work only with the default heapam that implements the default TOASTer.

>So to clarify, the goal is to deliver true "Pluggable TOASTers" or
>only "type-aware sub-TOASTers"?

Pluggable TOAST does not presuppose complete substitution of the existing
TOAST mechanics - that is out of scope for this patchset. It provides the
means to substitute it, or to plug in additional custom TOAST mechanics
for particular data types.

>I don't see any reason why the semantics for Toasters should be any
>different from user-defined types implemented in an extension. If
>there are columns that use a given Toaster we should forbid DROPping
>the extension. Otherwise "DROP extension" should succeed. No one says
>that this should be a fast operation but it should be possible.

I'm currently working on a revision of Pluggable TOAST that would make
dropping a Toaster possible if no data is TOASTed with it, along with
several other major changes. It will be available in this commitfest
(I hope) or, if I don't make it in time, the following one.

On Fri, Nov 4, 2022 at 12:35 PM Aleksander Alekseev <
aleksan...@timescale.com> wrote:

> Hi again,
>
> > [1]:
> https://www.postgresql.org/message-id/flat/CAJ7c6TOtAB0z1UrksvGTStNE-herK-43bj22=5xvbg7s4vr...@mail.gmail.com
>
> I added a link but forgot to use it, d'oh!
>
> Please note that if the goal is to merely implement "type-aware
> sub-TOASTers" then compression dictionaries [1] arguably provide the
> same value with MUCH less complexity.
>
> --
> Best regards,
> Aleksander Alekseev
>


-- 
Regards,
Nikita Malakhov
Postgres Professional
https://postgrespro.ru/


Re: Add LZ4 compression in pg_dump

2022-11-06 Thread gkokolatos
On Wed, Nov 2, 2022 at 14:28, Justin Pryzby  wrote:

> Checking if you'll be able to submit new patches soon ?

Thank you for checking up. Expect new versions within this commitfest cycle.

Re: bug: pg_dump use strange tag for trigger

2022-11-06 Thread Tom Lane
Pavel Stehule  writes:
> When I played with regression tests for pg_restore, I tested -T filtering
> triggers too. I had problems with restoring triggers. I found that the name
> for trigger uses the pattern "tablename triggername" (not just (and
> correct) triggername).

> I propose to generate tag just like trigger name

Trigger names by themselves aren't even a little bit unique, so that
doesn't seem like a great idea to me.  There's backwards compatibility
to worry about, too.  Maybe we need a documentation adjustment, instead?

regards, tom lane




Re: Small TAP improvements

2022-11-06 Thread Álvaro Herrera
On 2022-Jun-14, Andrew Dunstan wrote:

> OK, here's a more principled couple of patches. For config_data, if you
> give multiple options it gives you back the list of values. If you don't
> specify any, in scalar context it just gives you back all of pg_config's
> output, but in array context it gives you a map, so you should be able
> to say things like:
> 
> my %node_config = $node->config_data;

Hi, it looks to me like these were forgotten?

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: Improve logging when using Huge Pages

2022-11-06 Thread Justin Pryzby
On Sun, Nov 06, 2022 at 02:04:29PM +0700, John Naylor wrote:
> On Thu, Nov 3, 2022 at 8:11 AM Thomas Munro  wrote:
> >
> > I wonder if the problem here is that people are reluctant to add noise
> > to every starting system.   There are people who have not configured
> > their system and don't want to see that noise, and then some people
> > have configured their system and would like to know about it if it
> > doesn't work so they can be aware of that, but don't want to use "off"
> > because they don't want a hard failure.  Would it be better if there
> > were a new level "try_log" (or something), which only logs a message
> > if it tries and fails?
> 
> I think the best thing to do is change huge_pages='on' to log a WARNING and
> fallback to regular pages if the mapping fails. That way, both dev and prod
> can keep the same settings, since 'on' will have both visibility and
> robustness. I don't see a good reason to refuse to start -- seems like an
> anti-pattern.

I'm glad to see there's still discussion on this topic :)

Another idea is to add a RUNTIME_COMPUTED GUC to *display* the state of
huge pages, so I can stop parsing /proc/maps to find it.
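
For the record, roughly what that parsing amounts to -- a sketch assuming
Linux, where huge page mappings report a KernelPageSize larger than the
usual 4 kB in /proc/PID/smaps:

    my $pid = $ARGV[0] or die "usage: $0 <postmaster_pid>\n";
    open my $fh, '<', "/proc/$pid/smaps" or die "open: $!";
    my $huge = 0;
    while (<$fh>)
    {
        $huge = 1 if /^KernelPageSize:\s+(\d+) kB/ && $1 > 4;
    }
    print $huge ? "huge page mappings present\n" : "no huge page mappings\n";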

-- 
Justin




Re: [patch] Have psql's \d+ indicate foreign partitions

2022-11-06 Thread Ian Lawrence Barwick
On Sun, 6 Nov 2022 at 1:39, Tom Lane  wrote:
>
> Michael Paquier  writes:
> > On Mon, Oct 24, 2022 at 09:44:18PM +0900, Ian Lawrence Barwick wrote:
> >> Recently I have been working a lot with partitioned tables which
> >> contain a mix of local and foreign partitions, and find it would be
> >> very useful to be able to easily obtain an overview of which
> >> partitions are foreign and where they are located.
>
> > Hmm.  I am not sure that we should add this much amount of
> > information, particularly for the server bits.
>
> FWIW, I am also in favor of adding ", FOREIGN" but no more.
> My concern is that as submitted, the patch greatly increases
> the cost of the underlying query by adding two more catalogs
> to the join.  I don't think imposing such a cost on everybody
> (whether they use foreign partitions or not) is worth that.  But
> we can add ", FOREIGN" for free since we have the relkind anyway.

Fair enough, makes sense.

A revised version is attached per those suggestions; it produces output like this:

postgres=# \d+ parttest
                              Partitioned table "public.parttest"
 Column |  Type   | Collation | Nullable | Default | Storage  | Compression | Stats target | Description
--------+---------+-----------+----------+---------+----------+-------------+--------------+-------------
 id     | integer |           | not null |         | plain    |             |              |
 val1   | text    |           |          |         | extended |             |              |
 val2   | text    |           |          |         | extended |             |              |
Partition key: HASH (id)
Partitions: parttest_10_0 FOR VALUES WITH (modulus 10, remainder 0),
            parttest_10_1 FOR VALUES WITH (modulus 10, remainder 1), FOREIGN,
            parttest_10_2 FOR VALUES WITH (modulus 10, remainder 2),
            parttest_10_3 FOR VALUES WITH (modulus 10, remainder 3), FOREIGN,
            parttest_10_4 FOR VALUES WITH (modulus 10, remainder 4),
            parttest_10_5 FOR VALUES WITH (modulus 10, remainder 5), FOREIGN,
            parttest_10_6 FOR VALUES WITH (modulus 10, remainder 6),
            parttest_10_7 FOR VALUES WITH (modulus 10, remainder 7), FOREIGN,
            parttest_10_8 FOR VALUES WITH (modulus 10, remainder 8),
            parttest_10_9 FOR VALUES WITH (modulus 10, remainder 9), FOREIGN


Regards

Ian Barwick
commit 0b330a67e5941bacb815fa6dfae914c56563f7a9
Author: Ian Barwick 
Date:   Sun Nov 6 21:08:26 2022 +0900

psql: in \d+, indicate foreign partitions

Currently with a partitioned table, \d+ lists the partitions and their
partition key, but it would be useful to see which ones, if any, are
foreign partitions.

diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index c645d66418..2eae519b1d 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -3445,6 +3445,8 @@ describeOneTableDetails(const char *schemaname,
 if (child_relkind == RELKIND_PARTITIONED_TABLE ||
 	child_relkind == RELKIND_PARTITIONED_INDEX)
 	appendPQExpBufferStr(&buf, ", PARTITIONED");
+else if (child_relkind == RELKIND_FOREIGN_TABLE)
+	appendPQExpBufferStr(&buf, ", FOREIGN");
 if (strcmp(PQgetvalue(result, i, 2), "t") == 0)
 	appendPQExpBufferStr(&buf, " (DETACH PENDING)");
 if (i < tuples - 1)
diff --git a/src/test/regress/expected/foreign_data.out b/src/test/regress/expected/foreign_data.out
index 9d7610b948..47bf56adbf 100644
--- a/src/test/regress/expected/foreign_data.out
+++ b/src/test/regress/expected/foreign_data.out
@@ -1404,7 +1404,7 @@ CREATE FOREIGN TABLE ft2 () INHERITS (fd_pt1)
  c1 | integer |   | not null | | plain|  | 
  c2 | text|   |  | | extended |  | 
  c3 | date|   |  | | plain|  | 
-Child tables: ft2
+Child tables: ft2, FOREIGN
 
 \d+ ft2
Foreign table "public.ft2"
@@ -1449,7 +1449,7 @@ ALTER FOREIGN TABLE ft2 INHERIT fd_pt1;
  c1 | integer |   | not null | | plain|  | 
  c2 | text|   |  | | extended |  | 
  c3 | date|   |  | | plain|  | 
-Child tables: ft2
+Child tables: ft2, FOREIGN
 
 \d+ ft2
Foreign table "public.ft2"
@@ -1483,7 +1483,7 @@ Server: s0
 FDW options: (delimiter ',', quote '"', "be quoted" 'value')
 Inherits: fd_pt1
 Child tables: ct3,
-  ft3
+  ft3, FOREIGN
 
 \d+ ct3
 Table "public.ct3"
@@ -1522,7 +1522,7 @@ ALTER TABLE fd_pt1 ADD COLUMN c8 integer;
  c6 | integer |   |  | | plain|  | 
  c7 | integer |   | not null | | plain|  | 
  c8 | integer | 

pg_upgrade, tables_with_oids.txt -> tables_with_oids.sql?

2022-11-06 Thread Daniel Westermann (DWE)
Hi,

as I've just upgraded an instance which contained tables "WITH OIDS", I wonder
if it would make sense for pg_upgrade to directly create a script to fix those. I
know it is easy to do that with e.g. sed over tables_with_oids.txt, but it would be
more convenient to have the script generated directly.
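
In the meantime a Perl one-liner can stand in -- a sketch assuming the file
lists one schema-qualified table per line beneath the "In database" headers
(which the pattern skips, since they contain no dot):

    perl -ne 'print "ALTER TABLE $1 SET WITHOUT OIDS;\n" if m{^\s*(\S+\.\S+)}' \
        tables_with_oids.txt > tables_with_oids.sql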

Thoughts?

Regards
Daniel



Re: Improve logging when using Huge Pages

2022-11-06 Thread John Naylor
On Thu, Nov 3, 2022 at 8:11 AM Thomas Munro  wrote:
>
> I wonder if the problem here is that people are reluctant to add noise
> to every starting system.   There are people who have not configured
> their system and don't want to see that noise, and then some people
> have configured their system and would like to know about it if it
> doesn't work so they can be aware of that, but don't want to use "off"
> because they don't want a hard failure.  Would it be better if there
> were a new level "try_log" (or something), which only logs a message
> if it tries and fails?

I think the best thing to do is change huge_pages='on' to log a WARNING and
fallback to regular pages if the mapping fails. That way, both dev and prod
can keep the same settings, since 'on' will have both visibility and
robustness. I don't see a good reason to refuse to start -- seems like an
anti-pattern.

--
John Naylor
EDB: http://www.enterprisedb.com