Re: [PoC] Let libpq reject unexpected authentication requests

2022-06-13 Thread Michael Paquier
On Thu, Jun 09, 2022 at 04:29:38PM -0700, Jacob Champion wrote:
> On Wed, Jun 8, 2022 at 9:58 PM Michael Paquier  wrote:
>> Er, this one could be a problem protocol-wise for SASL, because that
>> would mean that the authentication gets completed but that the
>> exchange has begun and is not finished?
> 
> I think it's already a problem, if you're not using channel_binding.
> The cert behavior here makes it even less intuitive.

Ah, right.  I forgot about the part where we need to have the backend
publish the set of supported mechanisms.  If we don't get back
SCRAM-SHA-256-PLUS, we currently complain after the exchange has
been initialized, true.  Maybe I should look at the RFC more closely.
The backend is very strict about that and needs to return an error
to the client only once the exchange is done, but I don't recall
all the bits about the client-side handling.

> Personally I think the ability to provide a default of `!password` is
> very compelling. Any allowlist we hardcode won't be future-proof (see
> also my response to Laurenz upthread [1]).

Hm, perhaps.  You could use that as a default at the application level,
but the default set in libpq should be backward-compatible (aka allow
everything, even trust, where the backend just sends AUTH_REQ_OK).
This is unfortunate, but there is a point in not breaking any user's
application as well, particularly with ldap.  Note also that we have to
keep a certain amount of backward compatibility, as a very common
packaging practice with Postgres is to have binaries across *multiple*
major versions link to the newest version of libpq (Debian does this,
if I recall correctly).  One argument in favor of !password would be to
control whether one wants to use ldap at all, but I'd still expect most
users to just specify one or more methods in line with their HBA policy,
as an approved list.
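
For what it's worth, that usage could look like this from an application's
side (a sketch only: require_auth is the parameter proposed in this patch,
the host and database names are made up, the rest is plain libpq):

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	/* Refuse any plain "password" exchange, per the proposed syntax above. */
	PGconn	   *conn = PQconnectdb("host=db.example.com dbname=app "
								   "require_auth=!password");

	if (PQstatus(conn) != CONNECTION_OK)
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));

	PQfinish(conn);
	return 0;
}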

> I'm pretty motivated to provide the ability to say "I want cert auth
> only, nothing else." Using a separate parameter would mean we'd need
> something like `require_auth=none`, but I think that makes a certain
> amount of sense.

If the default of require_auth is backward-compatible and allows
everything, using a different parameter for the certs won't matter
anyway?

>> +   appendPQExpBuffer(&conn->errorMessage,
>> +                     libpq_gettext("auth method \"%s\" required, but %s\n"),
>> +                     conn->require_auth, reason);
>> This one is going to make translation impossible.  One way to tackle
>> this issue is to use "auth method \"%s\" required: %s".
> 
> Yeah, I think I have a TODO somewhere about that somewhere. I'm
> confused about your suggested fix though; can you elaborate?

My suggestion is to reword the error message so that the reason and the
main error message can be treated as two independent things.  You are
right to apply libpq_gettext() twice, once to "reason" and once to
the main string.
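
Something along these lines, in other words (just a sketch of the rewording,
not the exact patch text; "reason" is assumed to have been built from a
string that already went through libpq_gettext()):

	appendPQExpBuffer(&conn->errorMessage,
					  libpq_gettext("auth method \"%s\" required: %s\n"),
					  conn->require_auth, reason);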

>> +/*
>> + * Convenience macro for checking the allowed_auth_methods bitmask. Caller 
>> must
>> + * ensure that type is not greater than 31 (high bit of the bitmask).
>> + */
>> +#define auth_allowed(conn, type) \
>> +   (((conn)->allowed_auth_methods & (1 << (type))) != 0)
>> Better to add a compile-time check with StaticAssertDecl() then?  Or
>> add a note about that in pqcomm.h?
> 
> If we only passed constants, that would work, but we also determine
> the type at runtime, from the server message. Or am I missing
> something?

My point would be to either register a max flag in the set of
AUTH_REQ_* in pqcomm.h, with an assertion to make sure that we never go
above 32 so the bitmask cannot overflow, or to add a comment in
pqcomm.h noting that we also do bitwise operations relying on the
number of AUTH_REQ_* flags, so we'd better be careful once the
number of flags gets higher than 32.  There is some margin, but it
could easily be forgotten.
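
A minimal sketch of such a guard, assuming AUTH_REQ_SASL_FIN is still the
highest AUTH_REQ_* code in pqcomm.h and using a hypothetical AUTH_REQ_MAX name:

/* Highest AUTH_REQ_* code currently defined (hypothetical name). */
#define AUTH_REQ_MAX		AUTH_REQ_SASL_FIN

/*
 * require_auth tracks one bit per AUTH_REQ_* code in a uint32 bitmask,
 * so complain at compile time if a new code ever goes past 31.
 */
StaticAssertDecl(AUTH_REQ_MAX < 32,
				 "AUTH_REQ_* codes must fit in a 32-bit bitmask");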

>> +   /*
>> +* Sanity check; a duplicated method probably indicates a typo 
>> in a
>> +* setting where typos are extremely risky.
>> +*/
>> Not sure why this is a problem.  Fine by me to check that, but this is
>> an OR list, so specifying one element twice means the same as once.
> 
> Since this is likely to be a set-and-forget sort of option, and it
> needs to behave correctly across server upgrades, I'd personally
> prefer that the client tell me immediately if I've made a silly
> mistake. Even for something relatively benign like this (but arguably,
> it's more important for the NOT case).

This is just a couple of lines.  Fine by me to keep it if you feel
that's better in the long run.
--
Michael




RE: Skipping schema changes in publication

2022-06-13 Thread houzj.f...@fujitsu.com
On Wednesday, June 8, 2022 7:04 PM Amit Kapila  wrote:
> 
> On Fri, Jun 3, 2022 at 3:37 PM vignesh C  wrote:
> >
> > Thanks for the comments, the attached v8 patch has the changes for the
> same.
> >
> 
> AFAICS, the summary of this proposal is that we want to support
> exclude of certain objects from publication with two kinds of
> variants. The first variant is to add support to exclude specific
> tables from ALL TABLES PUBLICATION. Without this feature, users need
> to manually add all tables for a database even when she wants to avoid
> only a handful of tables from the database say because they contain
> sensitive information or are not required. We have seen that other
> database like MySQL also provides similar feature [1] (See
> REPLICATE_WILD_IGNORE_TABLE). The proposed syntax for this is as
> follows:
> 
> CREATE PUBLICATION pub1 FOR ALL TABLES EXCEPT TABLE t1,t2;
> or
> ALTER PUBLICATION pub1 ADD ALL TABLES EXCEPT TABLE t1,t2;
> 
> This will allow us to publish all the tables in the current database
> except t1 and t2. Now, I see that pg_dump has a similar option
> provided by switch --exclude-table but that allows tables matching
> patterns which is not the case here. I am not sure if we need a
> similar variant here.
> 
> Then users will be allowed to reset the publication by:
> ALTER PUBLICATION pub1 RESET;
> 
> This will reset the publication to the default state which includes
> resetting the publication parameters, setting the ALL TABLES flag to
> false, and dropping the relations and schemas that are associated with
> the publication. I don't know if we want to go further with allowing
> to RESET specific parameters and if so which parameters and what would
> its syntax be?
> 
> The second variant is to add support to exclude certain columns of a
> table while publishing a particular table. Currently, users need to
> list all required columns' names even if they don't want to hide most
> of the columns in the table (for example Create Publication pub For
> Table t1 (c1, c2)). Consider user doesn't want to publish the 'salary'
> or other sensitive information of executives/employees but would like
> to publish all other columns. I feel in such cases it will be a lot of
> work for the user especially when the table has many columns. I see
> that Oracle has a similar feature [2]. I think without this it will be
> difficult for users to use this feature in some cases. The patch for
> this is not proposed but I would imagine syntax for it to be something
> like "Create Publication pub For Table t1 Except (c3)" and similar
> variants for Alter Publication.

I think the feature to exclude certain columns of a table would be useful.

In some production scenarios, we usually do not want to replicate
sensitive fields (columns) of a table. Although we can already achieve
this by specifying all the replicated columns in a column list [1], that
seems like hard work when the table has hundreds of columns.

[1]
CREATE TABLE test(a int, b int, c int,..., sensitive text);
CREATE PUBLICATION pub FOR TABLE test(a,b,c,...);

In addition, it's not easy to maintain a column list like the one above,
because we sometimes need to add or delete fields due to business needs.
Every time we add a column (or delete a column that is in the column
list), we need to update the column list.

If we support Except:
CREATE PUBLICATION pub FOR TABLE test EXCEPT (sensitive);

We don't need to update the column list in most cases.

Thanks for "hametan" for providing the use case off-list.

Best regards,
Hou zj





RE: tablesync copy ignores publication actions

2022-06-13 Thread shiy.f...@fujitsu.com
On Wed, Jun 8, 2022 12:10 PM Amit Kapila  wrote:
> 
> On Tue, Jun 7, 2022 at 7:08 PM Euler Taveira  wrote:
> >
> > On Tue, Jun 7, 2022, at 1:10 AM, Peter Smith wrote:
> >
> > The logical replication tablesync ignores the publication 'publish'
> > operations during the initial data copy.
> >
> > This is current/known PG behaviour (e.g. as recently mentioned [1])
> > but it was not documented anywhere.
> >
> > initial data synchronization != replication. The publish parameter is a
> > replication property; it is not an initial data synchronization property.
> > Maybe we should make it clear like you are suggesting.
> >
> 
> +1 to document it. We respect some other properties of publication
> like the publish_via_partition_root parameter, column lists, and row
> filters. So it is better to explain the 'publish' parameter, which we
> ignore during the initial sync.
> 

I also agree to add it to the document.

> > You could expand the
> > suggested sentence to make it clear what publish parameter is or even add
> a
> > link to the CREATE PUBLICATION synopsis (that explains about publish
> > parameter).
> >
> 
> +1. I suggest that we should add some text about the behavior of
> initial sync in CREATE PUBLICATION doc (along with the 'publish'
> parameter) or otherwise, we can explain it where we are talking about
> publications [1].
> 

I noticed that the CREATE SUBSCRIPTION doc mentions that a row filter affects
the initial sync along with the "copy_data" parameter. [1] Maybe we can add
some text for the "publish" parameter there.

[1] https://www.postgresql.org/docs/devel/sql-createsubscription.html

Regards,
Shi yu


Typo in ro.po file?

2022-06-13 Thread Peter Smith
Hi.

FWIW, I stumbled on this obscure possible typo (?) in src/pl/plperl/po/ro.po:

~~~

#: plperl.c:788
msgid "while parsing Perl initialization"
msgstr "în timpul parsing inițializării Perl"
#: plperl.c:793
msgid "while running Perl initialization"
msgstr "în timpul rulării intializării Perl"

~~~

(Notice the missing 'i' -  "inițializării" versus "intializării")

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: pltcl crash on recent macOS

2022-06-13 Thread Tom Lane
Peter Eisentraut  writes:
> Switching the order of -bundle_loader and -lc did not help.

Meh.  Well, it was worth a try.

I'd be okay with just dropping the -lc from pl/tcl/Makefile and seeing
what the buildfarm says.  The fact that we needed it in 1998 doesn't
mean that we still need it on supported versions of Tcl; nor was it
ever anything but a hack for us to be overriding what TCL_LIBS says.

As a quick check, I tried it on prairiedog's host (which has the oldest
Tcl installation I still have in captivity), and it seemed fine.

regards, tom lane




Re: "buffer too small" or "path too long"?

2022-06-13 Thread Bruce Momjian
On Mon, Jun 13, 2022 at 10:41:41PM -0400, Tom Lane wrote:
> * logging.c believes it should prefix every line of output with the
> program's name and so on.  This doesn't seem terribly appropriate
> for pg_upgrade's use --- at least, not unless we make pg_upgrade
> WAY less chatty.  Perhaps that'd be fine, I dunno.

pg_upgrade was designed to be chatty because it felt it could fail under
unpredictable circumstances --- I am not sure how true that is today.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson





Re: [v15 beta] pg_upgrade failed if earlier executed with -c switch

2022-06-13 Thread Michael Paquier
On Fri, Jun 10, 2022 at 05:45:11PM -0400, Andrew Dunstan wrote:
> The module is already a noop if there's a TAP test for pg_upgrade. So I
> don't understand the point of the PR at all.

Oh.  I thought that the old path was still taken as long as
--enable-tap-tests was not used.  I was wrong, then.  I'll go and
remove the pull request.
--
Michael




Re: "buffer too small" or "path too long"?

2022-06-13 Thread Tom Lane
Kyotaro Horiguchi  writes:
> At Mon, 13 Jun 2022 13:25:01 -0400, Tom Lane  wrote in 
>> I was about to question that, but now I remember that pg_upgrade has
>> its own logging facility with a different idea about who provides
>> the trailing newline than common/logging.[hc] has.  Undoubtedly
>> that's the source of this mistake.  We really need to get pg_upgrade
>> out of the business of having its own logging conventions.

> Yes... I don't find a written reason for excluding pg_upgrade in either
> commit 9a374b77fb or the thread [1].

Well, as far as 9a374b77fb went, Peter had left pg_upgrade out of the
mix in the original creation of common/logging.c, and I didn't think
that dealing with that was a reasonable part of my update patch.

> But I guess that we decided
> to first provide the facility in the best style, ignoring the current
> implementation in pg_upgrade, and then let pg_upgrade use it.  So I
> think it should happen in the next cycle?  I'll give it a shot if no
> one else is willing to do it for now. (I believe it is straightforward.)

Actually, I spent some time earlier today looking into that, and I can
see why Peter stayed away from it :-(.  There are a few issues:

* The inconsistency with the rest of the world about trailing newlines.
That aspect actually seems fixable fairly easily, and I have a patch
mostly done for it.

* logging.c believes it should prefix every line of output with the
program's name and so on.  This doesn't seem terribly appropriate
for pg_upgrade's use --- at least, not unless we make pg_upgrade
WAY less chatty.  Perhaps that'd be fine, I dunno.

* pg_upgrade's pg_log_v duplicates all (well, most) stdout messages
into the INTERNAL_LOG_FILE log file, something logging.c has no
provision for (and it'd not be too easy to do, because of the C
standard's restrictions on use of va_list).  Personally I'd be okay
with nuking the INTERNAL_LOG_FILE log file from orbit, but I bet
somebody will fight to keep it.

* pg_log_v has also got a bunch of specialized rules around how
to format PG_STATUS message traffic.  Again I wonder how useful
that whole behavior really is, but taking it out would be a big
user-visible change.

In short, it seems like pg_upgrade's logging habits are sufficiently
far out in left field that we couldn't rebase it on top of logging.c
without some seriously large user-visible behavioral changes.
I have better things to spend my time on than advocating for that.

However, the inconsistency in newline handling is a problem:
I found that there are already other bugs with missing or extra
newlines, and it will only get worse if we don't unify that
behavior.  So my inclination for now is to fix that and let the
other issues go.  Patch coming.

regards, tom lane




Re: Tightening behaviour for non-immutable behaviour in immutable functions

2022-06-13 Thread Michael Paquier
On Mon, Jun 13, 2022 at 06:41:17PM -0700, Peter Geoghegan wrote:
> On Mon, Jun 13, 2022 at 1:51 PM Greg Stark  wrote:
>> By "relatively common" I think we're talking "nigh universal". Afaics
>> there are no warnings in the docs about worrying about search_path on
>> IMMUTABLE functions. There is for SECURITY DEFINER but I have to admit
>> I wasn't aware myself of all the gotchas described there.
> 
> I didn't realize that it was that bad. Even if it's only 10% as bad as
> you say, it would still be very valuable to do something about it
> (ideally with an approach that is non-invasive).

Having checks implemented so that users cannot bite themselves is a
good idea in the long term, but I have also seen cases where abusing
immutable functions was useful:
- Enforcing the push down of function expressions to a remote server.
- Regression tests.  Just a few weeks ago I played with an
advisory lock within an index expression.

Perhaps I never should have done what the first point describes
anyway, but having a way to disable any of that, be it just a
developer option for the purpose of some regression tests, would be
nice.  Skimming quickly through the patch, none of the checks related
to search_path would apply to the fancy cases I saw, though.
--
Michael




Re: "buffer too small" or "path too long"?

2022-06-13 Thread Tom Lane
Michael Paquier  writes:
> I have noticed this thread and 4e54d23 as a result this morning.  If
> you want to spread this style more, wouldn't it be better to do that
> in all the places of pg_upgrade where we store paths to files?  I can
> see six code paths with log_opts.basedir that could do the same, as of
> the attached.  The hardcoded file names have various lengths, and some
> of them are quite long making the generated paths more exposed to
> being cut in the middle.

Well, I just fixed the ones in make_outputdirs because it seemed weird
that that part of the function was not doing something the earlier parts
did.  I didn't look around for more trouble.

I think that pg_fatal'ing on the grounds of path-too-long once we've
already started the upgrade isn't all that great.  Really we want to
fail on that early on --- so coding make_outputdirs like this is
fine, but maybe we need a different plan for files made later.
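
For example, an up-front probe along these lines could run before anything
has been modified (a hypothetical sketch, assuming pg_upgrade's usual
declarations; check_output_path_lengths() and the chosen file name are
only illustrative):

static void
check_output_path_lengths(void)
{
	char		path[MAXPGPATH];
	int			len;

	/* Probe with the longest hardcoded name appended anywhere later on. */
	len = snprintf(path, sizeof(path), "%s/%s",
				   log_opts.basedir,
				   "databases_with_datallowconn_false.txt");
	if (len >= (int) sizeof(path))
		pg_fatal("directory path for new cluster is too long\n");
}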

regards, tom lane




Re: "buffer too small" or "path too long"?

2022-06-13 Thread Michael Paquier
On Tue, Jun 14, 2022 at 09:52:52AM +0900, Kyotaro Horiguchi wrote:
> At Tue, 14 Jun 2022 09:48:26 +0900 (JST), Kyotaro Horiguchi 
>  wrote in 
>> Yeah, I feel so and it is what I wondered about recently when I saw
>> some complete error messages.  Is that because of the length of the
>> subject?
> 
> And I found that it is already done. Thanks!

I have noticed this thread and 4e54d23 as a result this morning.  If
you want to spread this style more, wouldn't it be better to do that
in all the places of pg_upgrade where we store paths to files?  I can
see six code paths with log_opts.basedir that could do the same, as of
the attached.  The hardcoded file names have various lengths, and some
of them are quite long making the generated paths more exposed to
being cut in the middle.
--
Michael
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index ace7387eda..e1dc031c24 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -704,12 +704,15 @@ check_proper_datallowconn(ClusterInfo *cluster)
 	FILE	   *script = NULL;
 	char		output_path[MAXPGPATH];
 	bool		found = false;
+	int			len;
 
 	prep_status("Checking database connection settings");
 
-	snprintf(output_path, sizeof(output_path), "%s/%s",
-			 log_opts.basedir,
-			 "databases_with_datallowconn_false.txt");
+	len = snprintf(output_path, sizeof(output_path), "%s/%s",
+   log_opts.basedir,
+   "databases_with_datallowconn_false.txt");
+	if (len >= sizeof(output_path))
+		pg_fatal("directory path for new cluster is too long\n");
 
 	conn_template1 = connectToServer(cluster, "template1");
 
@@ -823,6 +826,7 @@ check_for_isn_and_int8_passing_mismatch(ClusterInfo *cluster)
 	FILE	   *script = NULL;
 	bool		found = false;
 	char		output_path[MAXPGPATH];
+	int			len;
 
 	prep_status("Checking for contrib/isn with bigint-passing mismatch");
 
@@ -834,9 +838,11 @@ check_for_isn_and_int8_passing_mismatch(ClusterInfo *cluster)
 		return;
 	}
 
-	snprintf(output_path, sizeof(output_path), "%s/%s",
-			 log_opts.basedir,
-			 "contrib_isn_and_int8_pass_by_value.txt");
+	len = snprintf(output_path, sizeof(output_path), "%s/%s",
+   log_opts.basedir,
+   "contrib_isn_and_int8_pass_by_value.txt");
+	if (len >= sizeof(output_path))
+		pg_fatal("directory path for new cluster is too long\n");
 
 	for (dbnum = 0; dbnum < cluster->dbarr.ndbs; dbnum++)
 	{
@@ -909,12 +915,15 @@ check_for_user_defined_postfix_ops(ClusterInfo *cluster)
 	FILE	   *script = NULL;
 	bool		found = false;
 	char		output_path[MAXPGPATH];
+	int			len;
 
 	prep_status("Checking for user-defined postfix operators");
 
-	snprintf(output_path, sizeof(output_path), "%s/%s",
-			 log_opts.basedir,
-			 "postfix_ops.txt");
+	len = snprintf(output_path, sizeof(output_path), "%s/%s",
+   log_opts.basedir,
+   "postfix_ops.txt");
+	if (len >= sizeof(output_path))
+		pg_fatal("directory path for new cluster is too long\n");
 
 	/* Find any user defined postfix operators */
 	for (dbnum = 0; dbnum < cluster->dbarr.ndbs; dbnum++)
@@ -1009,12 +1018,15 @@ check_for_tables_with_oids(ClusterInfo *cluster)
 	FILE	   *script = NULL;
 	bool		found = false;
 	char		output_path[MAXPGPATH];
+	int			len;
 
 	prep_status("Checking for tables WITH OIDS");
 
-	snprintf(output_path, sizeof(output_path), "%s/%s",
-			 log_opts.basedir,
-			 "tables_with_oids.txt");
+	len = snprintf(output_path, sizeof(output_path), "%s/%s",
+   log_opts.basedir,
+   "tables_with_oids.txt");
+	if (len >= sizeof(output_path))
+		pg_fatal("directory path for new cluster is too long\n");
 
 	/* Find any tables declared WITH OIDS */
 	for (dbnum = 0; dbnum < cluster->dbarr.ndbs; dbnum++)
@@ -1265,12 +1277,15 @@ check_for_user_defined_encoding_conversions(ClusterInfo *cluster)
 	FILE	   *script = NULL;
 	bool		found = false;
 	char		output_path[MAXPGPATH];
+	int			len;
 
 	prep_status("Checking for user-defined encoding conversions");
 
-	snprintf(output_path, sizeof(output_path), "%s/%s",
-			 log_opts.basedir,
-			 "encoding_conversions.txt");
+	len = snprintf(output_path, sizeof(output_path), "%s/%s",
+   log_opts.basedir,
+   "encoding_conversions.txt");
+	if (len >= sizeof(output_path))
+		pg_fatal("directory path for new cluster is too long\n");
 
 	/* Find any user defined encoding conversions */
 	for (dbnum = 0; dbnum < cluster->dbarr.ndbs; dbnum++)
diff --git a/src/bin/pg_upgrade/function.c b/src/bin/pg_upgrade/function.c
index ea785df771..4f65726ed9 100644
--- a/src/bin/pg_upgrade/function.c
+++ b/src/bin/pg_upgrade/function.c
@@ -125,11 +125,14 @@ check_loadable_libraries(void)
 	FILE	   *script = NULL;
 	bool		found = false;
 	char		output_path[MAXPGPATH];
+	int			len;
 
 	prep_status("Checking for presence of required libraries");
 
-	snprintf(output_path, sizeof(output_path), "%s/%s",
-			 log_opts.basedir, "loadable_libraries.txt");
+	len = snprintf(output_path, sizeof(output_path), "%s/%s",
+   log_opts.basedir, "loadable_libraries.txt");
+	if (len >= 

Re: Tightening behaviour for non-immutable behaviour in immutable functions

2022-06-13 Thread Peter Geoghegan
On Mon, Jun 13, 2022 at 1:51 PM Greg Stark  wrote:
> By "relatively common" I think we're talking "nigh universal". Afaics
> there are no warnings in the docs about worrying about search_path on
> IMMUTABLE functions. There is for SECURITY DEFINER but I have to admit
> I wasn't aware myself of all the gotchas described there.

I didn't realize that it was that bad. Even if it's only 10% as bad as
you say, it would still be very valuable to do something about it
(ideally with an approach that is non-invasive).

-- 
Peter Geoghegan




Re: 2022-06-16 release announcement draft

2022-06-13 Thread Peter Geoghegan
On Mon, Jun 13, 2022 at 6:15 PM Jonathan S. Katz  wrote:
> > Perhaps it is also worth mentioning that you can use REINDEX without
> > CONCURRENTLY, even before upgrading.
>
> I'm hesitant on giving too many options. We did put out the "warning"
> announcement providing this as an option. I do think that folks who are
> running CIC/RIC are sensitive to locking, and a plain old "REINDEX" may
> be viable except in an emergency.

The locking implications for plain REINDEX are surprising IMV -- and
so I suggest sticking with what you have here.

In many cases using plain REINDEX is not meaningfully different to
taking a full AccessExclusiveLock on the table (we only acquire an AEL
on the index, but in practice that can be a distinction without a
difference). We at least went some way towards making the situation
with REINDEX locking clearer in a doc patch that recently became
commit 8ac700ac.

--
Peter Geoghegan




Re: 2022-06-16 release announcement draft

2022-06-13 Thread Jonathan S. Katz

On 6/13/22 1:38 PM, Tom Lane wrote:
> "Jonathan S. Katz"  writes:
>> Please review for technical accuracy and omissions.
> 
> A few minor thoughts:
> 
>> The PostgreSQL Global Development Group has released PostgreSQL 14.4 to fix an
>> issue that could cause silent data corruption when using the
>> [`CREATE INDEX CONCURRENTLY`](https://www.postgresql.org/docs/current/sql-createindex.html)
>> and [`REINDEX CONCURRENTLY`](https://www.postgresql.org/docs/current/sql-reindex.html)
>> commands.
> 
> Maybe s/and/or/ ?

Fixed.

>> PostgreSQL 14.4 fixes an issue with the
>> [`CREATE INDEX CONCURRENTLY`](https://www.postgresql.org/docs/current/sql-createindex.html)
>> and [`REINDEX CONCURRENTLY`](https://www.postgresql.org/docs/current/sql-reindex.html)
>> that could cause silent data corruption of indexes.
> 
> Either leave out "the" or add "commands".  That is, "the FOO and BAR
> commands" reads fine, "the FOO and BAR" less so.

Yeah, that was likely an edit-o. Fixed.

> Also, I'm inclined
> to be a bit more specific and say that the problem is missing index
> entries, so maybe like "... fixes an issue that could cause the [CIC]
> and [RIC] commands to omit index entries for some rows".

Agreed. Edited in attached.

>> Once you upgrade your system to PostgreSQL 14.4, you can fix any silent data
>> corruption using `REINDEX CONCURRENTLY`.
> 
> Perhaps it is also worth mentioning that you can use REINDEX without
> CONCURRENTLY, even before upgrading.

I'm hesitant on giving too many options. We did put out the "warning"
announcement providing this as an option. I do think that folks who are
running CIC/RIC are sensitive to locking, and a plain old "REINDEX" may
be viable except in an emergency.

>> * Report implicitly-created operator families (`CREATE OPERATOR CLASS`) to event
>> triggers.
> 
> Maybe "(generated by `CREATE OPERATOR CLASS`)"?  As-is, the parenthetical
> comment looks more like a mistake than anything else.

Fixed.

> The rest looks good.

Thanks for the review! Next version attached.

Jonathan
The PostgreSQL Global Development Group has released PostgreSQL 14.4 to fix an
issue that could cause silent data corruption when using the
[`CREATE INDEX CONCURRENTLY`](https://www.postgresql.org/docs/current/sql-createindex.html)
or [`REINDEX CONCURRENTLY`](https://www.postgresql.org/docs/current/sql-reindex.html)
commands. Please see the following section for information on how to detect and
correct silent data corruption in your indexes.

This release also fixes over 15 bugs since PostgreSQL 14.3 was released in May.
For the full list of changes, please review the
[release notes](https://www.postgresql.org/docs/release/).

This release is only for PostgreSQL 14. The PostgreSQL Global Development Group
will make a scheduled update release on August 11, 2022 for all supported
versions of PostgreSQL (10 - 14).

Detect and Fix `CREATE INDEX CONCURRENTLY` / `REINDEX CONCURRENTLY` Corruption
--

PostgreSQL 14.4 fixes an issue with
[`CREATE INDEX CONCURRENTLY`](https://www.postgresql.org/docs/current/sql-createindex.html)
and [`REINDEX CONCURRENTLY`](https://www.postgresql.org/docs/current/sql-reindex.html)
that could cause silent data corruption of indexes. Prior to the fix,
`CREATE INDEX CONCURRENTLY` and `REINDEX CONCURRENTLY` could build indexes that
would have missing entries, causing `SELECT` queries that used the index to not
find certain rows. This issue may not have corrupted your indexes, but if you
are unsure, we advise you to reindex using the instructions below.

You can detect if a B-tree index (the default index type) has data corruption
using the 
[`pg_amcheck`](https://www.postgresql.org/docs/current/app-pgamcheck.html)
command with the `--heapallindexed` flag. For example:

```
pg_amcheck --heapallindexed database
```

If `pg_amcheck` detects corruption or if you ran `CREATE INDEX CONCURRENTLY` or
`REINDEX CONCURRENTLY` on any other index type (e.g. GiST, GIN, etc.), please
follow the instructions below.

Once you upgrade your system to PostgreSQL 14.4, you can fix any silent data
corruption using `REINDEX CONCURRENTLY`. For example, if you have an index named
`elephant_idx` that has data corruption, you can run the following command on
PostgreSQL 14.4 to fix it:

```
REINDEX CONCURRENTLY elephant_idx;
```

You can use the
[`reindexdb`](https://www.postgresql.org/docs/current/app-reindexdb.html)
command to reindex all indexes across your cluster. `reindexdb` also has a
`--jobs` flag that lets you run reindex operations in parallel. For example, to
reindex your entire PostgreSQL cluster using `` parallel jobs, you can run
the following command:

```
reindexdb --all --concurrently --jobs 
```

Bug Fixes and Improvements
--

This update fixes over 15 bugs that were reported in the last several months.
The issues listed below affect PostgreSQL 14.

Included in this release:

* Several fixes for query plan memoization.
* Fix queries in which a 

Re: Collation version tracking for macOS

2022-06-13 Thread Peter Geoghegan
On Mon, Jun 13, 2022 at 5:41 PM Thomas Munro  wrote:
> It'd clearly be a terrible idea for us to try to use any of that, and
> Mac users should be very happy with the new support for ICU as DB
> default.

This suggests something that I already suspected: nobody particularly
expects the system libc to be authoritative for the OS as a whole, in
the way that Postgres supposes. At least in the case of macOS, which
is after all purely a desktop operating system.

-- 
Peter Geoghegan




Re: "buffer too small" or "path too long"?

2022-06-13 Thread Kyotaro Horiguchi
At Tue, 14 Jun 2022 09:48:26 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Mon, 13 Jun 2022 13:25:01 -0400, Tom Lane  wrote in 
> > +1, but I'm inclined to make it read "... is too long".
> 
> Yeah, I feel so and it is what I wondered about recently when I saw
> some complete error messages.  Is that because of the length of the
> subject?

And I found that it is already done. Thanks!

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: "buffer too small" or "path too long"?

2022-06-13 Thread Kyotaro Horiguchi
At Mon, 13 Jun 2022 13:25:01 -0400, Tom Lane  wrote in 
> Kyotaro Horiguchi  writes:
> > The root cause of the errors is that the user-provided directory path
> > of the new cluster's root was too long.  Whichever one of the four buffers
> > overflows, it doesn't make any difference for users and doesn't
> > offer any further detail to supporters/developers.  I see "output
> > directory path of new cluster too long" as clear enough.
> 
> +1, but I'm inclined to make it read "... is too long".

Yeah, I feel so and it is what I wondered about recently when I saw
some complete error messages.  Is that because of the length of the
subject?

> > # And the messages are missing trailing line breaks.
> 
> I was about to question that, but now I remember that pg_upgrade has
> its own logging facility with a different idea about who provides
> the trailing newline than common/logging.[hc] has.  Undoubtedly
> that's the source of this mistake.  We really need to get pg_upgrade
> out of the business of having its own logging conventions.

Yes... I don't find a written reason for excluding pg_upgrade in either
commit 9a374b77fb or the thread [1].  But I guess that we decided
to first provide the facility in the best style, ignoring the current
implementation in pg_upgrade, and then let pg_upgrade use it.  So I
think it should happen in the next cycle?  I'll give it a shot if no
one else is willing to do it for now. (I believe it is straightforward.)

[1] https://www.postgresql.org/message-id/941719.1645587865%40sss.pgh.pa.us

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Collation version tracking for macOS

2022-06-13 Thread Thomas Munro
On Thu, Jun 9, 2022 at 11:33 AM Thomas Munro  wrote:
> On Thu, Jun 9, 2022 at 5:42 AM Tom Lane  wrote:
> > I'm sure that Apple are indeed updating the UTF8 data behind
> > their proprietary i18n APIs, but the libc APIs are mostly getting benign
> > neglect.
>
> As for how exactly they might be doing that, I don't know, but a bit
> of light googling tells me that a private, headerless,
> please-don't-call-me-directly copy of ICU arrived back in macOS
> 10.3[1].  I don't see it on my 12.4 system, but I also know that 12.x
> started hiding system libraries completely (the linker is magic and
> pulls libraries from some parallel dimension, there is no
> /usr/lib/libSystem.B.dylib file on disk, and yet otool -L
>  references it).

The other thread about a macOS linking problem nerd-sniped me back
into here to find out how to see breadcrumbs between hidden libraries
on this super weird UNIX™ and confirm that they are indeed still
shipping a private ICU for use by their Core Foundation stuff that's
used by fancy ObjC/Swift/... etc GUI apps.  The following command was
an interesting discovery for me because otool -L can't see any of the
new kind of ghost libraries:

% dyld_info -dependents
/System/Library/Frameworks/Foundation.framework/Versions/C/Foundation

Though I can't get my hands on the hidden ICU library itself to
disassemble (without installing weird extra tools, apparently [1]),
that at least revealed its name, which I could then dlopen out of
curiosity.  It seems they jammed all the ICU sub-libraries into one,
and configured it with --disable-renaming so it doesn't have major
version suffixes on symbol names.

It'd clearly be a terrible idea for us to try to use any of that, and
Mac users should be very happy with the new support for ICU as DB
default.

[1] https://lapcatsoftware.com/articles/bigsur.html




Re: Finer grain log timestamps

2022-06-13 Thread David Fetter
On Mon, Jun 13, 2022 at 04:22:42PM -0400, Tom Lane wrote:
> Robert Haas  writes:
> > On Sun, May 8, 2022 at 4:45 PM David Fetter  wrote:
> >> Please find attached a patch to change the sub-second granularity of
> >> log timestamps from milliseconds to  microseconds.
> 
> > Why is this a good idea?
> 
> I can imagine that some people would have a use for microsecond
> resolution in log files, and I can also imagine that as machines
> get faster more people will want that.

Your imagination matches situations I've seen in production where
there was some ambiguity as to what happened when inside a millisecond
boundary, and I'm sure I'm not alone in this.  I've gotten this
request from at least three people who to my knowledge knew nothing
about each other, and as I recall, the first time someone brought it
up to me was over five years back.

> As against that, this will bloat log files by a non-microscopic
> amount, and it's pretty likely to break some log-scanning tools too.

Three bytes per line, and log-scanning parsers that finicky are
already breaking all the time, respectively.

> It's unclear to me that that's a tradeoff we should force on
> everyone.

The tradeoff we're forcing on people at the moment is a loss of
precision they didn't ask for, implemented by some extra instructions
they didn't ask us to execute in a part of the code that's a hot path
at exactly the times when the machine is busiest.

> I think a proposal less likely to have push-back would be to invent
> a different log_line_prefix %-escape to produce microseconds.
> Sadly, "%u" is already taken, but perhaps we could use "%U"?
> 
> A different line of thought is to extend %t to provide a precision
> field a la sprintf, so that for example "%.3t" is equivalent to
> "%m" and "%.6t" does what David wants, and we won't have to
> search for a new escape letter when the day arrives that
> somebody wants nanosecond resolution.  The same could be done
> with %n, avoiding the need to find a different escape letter
> for that.

I'll build this more sprintf-like thing if not doing so prevents the
change from happening, but frankly, I don't really see a point in it
because the next "log timestamps at some random negative power of 10
second granularity" requirement I see will be the first.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: Checking for missing heap/index files

2022-06-13 Thread Peter Geoghegan
On Mon, Jun 13, 2022 at 4:15 PM Bruce Momjian  wrote:
> I agree --- it would be nice, but might be hard to justify the
> engineering and overhead involved.  I guess I was just checking that I
> wasn't missing something obvious.

I suspect that the cost of being sloppy about this kind of thing
outweighs any benefit -- it's a false economy.

I believe we ought to eventually have crash-safe relation extension
and file allocation. Right now we're held back by concerns about
leaking a large number of empty pages (at least until the next
VACUUM). If leaking space was simply not possible in the first place,
we could afford to be more aggressive in code like
RelationAddExtraBlocks() -- it currently has a conservative cap of 512
pages per extension. This would require work in the FSM of
the kind I've been working on, on and off.

Each relation extension is bound to be more expensive when the process
is made crash safe, obviously -- but only if no other factor changes.
With larger batch sizes per relation extension, it could be very
different. Once you factor in lock contention, then having fewer
individual relation extensions for a fixed number of pages may make
all the difference.

-- 
Peter Geoghegan




Re: Checking for missing heap/index files

2022-06-13 Thread Bruce Momjian
On Mon, Jun 13, 2022 at 04:06:12PM -0400, Robert Haas wrote:
> One idea might be for each heap table to have a metapage and store the
> length - or an upper bound on the length - in the metapage. That'd
> probably be cheaper than updating pg_class, but might still be
> expensive in some scenarios, and it's a fairly large amount of
> engineering.

I agree --- it would be nice, but might be hard to justify the
engineering and overhead involved.  I guess I was just checking that I
wasn't missing something obvious.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson





Re: better page-level checksums

2022-06-13 Thread Peter Geoghegan
On Mon, Jun 13, 2022 at 3:06 PM Bruce Momjian  wrote:
> That is encryption done in a virtual file system independent of
> Postgres.  So, I guess the answer to your question is that this is not
> how EDB Advanced Server does it.

Okay, thanks for clearing that up. The term "block based" does appear
in the article I linked to, so you can see why I didn't understand it
that way initially.

Anyway, I can see how it would be useful to be able to know the offset
of a nonce or of a hash digest on any given page, without access to a
running server. But why shouldn't that be possible with other designs,
including designs closer to what I've outlined?

A known fixed offset in the special area already assumes that all
pages must have a value in the first place, even though that won't be
true for the majority of individual Postgres servers. There is
implicit information involved in a design like the one Robert has
proposed; your backup tool (or whatever) already has to understand to
expect something other than no encryption at all, or no checksum at
all. Tools like pg_filedump already rely on implicit information about
the special area.

I'm not against the idea of picking a handful of checksum/encryption
schemes, with the understanding that we'll be committing to those
particular schemes indefinitely -- it's not reasonable to expect
infinite flexibility here (and so I don't). But why should we accept
something that seems to me to be totally inflexible, and doesn't
compose with other things?

-- 
Peter Geoghegan




Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2022-06-13 Thread David Zhang




>> A little confused here, does this patch V3 intend to solve this problem
>> "record length 2145386550 at 0/360 too long"?
> 
> No, not once the record exists. But it does remove Postgres' ability
> to create such records, thereby solving the problem for all systems
> that generate WAL through Postgres' WAL writing APIs.

>> I set up a simple Primary and Standby stream replication environment, and
>> used the above query to run the test before and after patch v3. The error
>> message still exists, but with a different message.
>> 
>> Before patch v3, the error is shown below:
>> 
>> 2022-06-10 15:32:25.307 PDT [4253] LOG:  record length 2145386550 at 0/360 too long
>> 2022-06-10 15:32:47.763 PDT [4257] FATAL:  terminating walreceiver process due to administrator command
>> 2022-06-10 15:32:47.763 PDT [4253] LOG:  record length 2145386550 at 0/360 too long
>> 
>> After patch v3, the error displays differently:
>> 
>> 2022-06-10 15:53:53.397 PDT [12848] LOG:  record length 2145386550 at 0/360 too long
>> 2022-06-10 15:54:07.249 PDT [12852] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 00010045 has already been removed
>> 2022-06-10 15:54:07.275 PDT [12848] LOG:  record length 2145386550 at 0/360 too long
>> 
>> And once the error happens, the Standby can't continue the replication.
> 
> Did you initiate a new cluster or otherwise skip the invalid record
> you generated when running the instance based on master? It seems to
> me you're trying to replay the invalid record (len > MaxAllocSize),
> and this patch does not try to fix that issue. This patch just tries
> to forbid emitting records larger than MaxAllocSize, as per the check
> in XLogRecordAssemble, so that we won't emit unreadable records into
> the WAL anymore.
> 
> Reading unreadable records still won't be possible, but that's also
> not something I'm trying to fix.

Thanks a lot for the clarification. My testing environment is pretty
simple: initdb for the Primary, run basebackup and set the connection
string for the Standby, then run the "pg_logical_emit_message" query and
tail the log on the Standby side.


Best regards,

--
David

Software Engineer
Highgo Software Inc. (Canada)
www.highgo.ca




Re: better page-level checksums

2022-06-13 Thread Bruce Momjian
On Mon, Jun 13, 2022 at 03:03:17PM -0700, Peter Geoghegan wrote:
> On Mon, Jun 13, 2022 at 2:54 PM Bruce Momjian  wrote:
> > On Mon, Jun 13, 2022 at 02:44:41PM -0700, Peter Geoghegan wrote:
> > > Is that the how block-level encryption feature from EDB Advanced Server 
> > > does it?
> >
> > Uh, EDB Advanced Server doesn't have a block-level encryption feature.
> 
> Apparently there is something called "Vormetric Transparent Encryption
> (VTE) – Transparent block-level encryption with access controls":
> 
> https://www.enterprisedb.com/blog/enhanced-security-edb-postgres-advanced-server-vormetric-data-security-platform
> 
> Perhaps there is some kind of confusion around the terminology here?

That is encryption done in a virtual file system independent of
Postgres.  So, I guess the answer to your question is that this is not
how EDB Advanced Server does it.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson





Re: better page-level checksums

2022-06-13 Thread Peter Geoghegan
On Mon, Jun 13, 2022 at 2:54 PM Bruce Momjian  wrote:
> On Mon, Jun 13, 2022 at 02:44:41PM -0700, Peter Geoghegan wrote:
> > Is that the how block-level encryption feature from EDB Advanced Server 
> > does it?
>
> Uh, EDB Advanced Server doesn't have a block-level encryption feature.

Apparently there is something called "Vormetric Transparent Encryption
(VTE) – Transparent block-level encryption with access controls":

https://www.enterprisedb.com/blog/enhanced-security-edb-postgres-advanced-server-vormetric-data-security-platform

Perhaps there is some kind of confusion around the terminology here?

-- 
Peter Geoghegan




Re: better page-level checksums

2022-06-13 Thread Bruce Momjian
On Mon, Jun 13, 2022 at 02:44:41PM -0700, Peter Geoghegan wrote:
> On Fri, Jun 10, 2022 at 6:16 AM Robert Haas  wrote:
> > > My preference is for an approach that builds on that, or at least
> > > doesn't significantly complicate it. So a cryptographic hash or nonce
> > > can go in the special area proper (structs like BTPageOpaqueData don't
> > > need any changes), but at a page offset before the special area proper
> > > -- not after.
> > >
> > > What disadvantages does that approach have, if any, from your point of 
> > > view?
> >
> > I think it would be an extremely good idea to store the extended
> > checksum at the same offset in every page. Right now, code that wants
> > to compute checksums, or a tool like pg_checksums that wants to verify
> > them, can find the checksum without needing to interpret any of the
> > remaining page contents. Things get sticky if you have to interpret
> > the page contents to locate the checksum that's going to tell you
> > whether the page contents are messed up. Perhaps this could be worked
> > around if you tried hard enough, but I don't see what we get out of
> > it.
> 
> Is that the how block-level encryption feature from EDB Advanced Server does 
> it?

Uh, EDB Advanced Server doesn't have a block-level encryption feature.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson





Re: better page-level checksums

2022-06-13 Thread Peter Geoghegan
On Fri, Jun 10, 2022 at 6:16 AM Robert Haas  wrote:
> > My preference is for an approach that builds on that, or at least
> > doesn't significantly complicate it. So a cryptographic hash or nonce
> > can go in the special area proper (structs like BTPageOpaqueData don't
> > need any changes), but at a page offset before the special area proper
> > -- not after.
> >
> > What disadvantages does that approach have, if any, from your point of view?
>
> I think it would be an extremely good idea to store the extended
> checksum at the same offset in every page. Right now, code that wants
> to compute checksums, or a tool like pg_checksums that wants to verify
> them, can find the checksum without needing to interpret any of the
> remaining page contents. Things get sticky if you have to interpret
> the page contents to locate the checksum that's going to tell you
> whether the page contents are messed up. Perhaps this could be worked
> around if you tried hard enough, but I don't see what we get out of
> it.

Is that the how block-level encryption feature from EDB Advanced Server does it?

-- 
Peter Geoghegan




Re: pltcl crash on recent macOS

2022-06-13 Thread Thomas Munro
On Tue, Jun 14, 2022 at 8:21 AM Peter Eisentraut
 wrote:
> The difference is that I use CC=gcc-11.  I have change to CC=cc, then it
> works (nm output shows "from executable").  So it's gcc that gets thrown
> off by the -lc.

Hrmph, I changed my CC to "ccache gcc-mp-11" (what MacPorts calls GCC
11), and I still can't reproduce the problem.  I still get "(from
executable)".  In your original quote you showed "gcc", not "gcc-11",
which (assuming it is found as /usr/bin/gcc) is just a little binary
that redirects to clang... trying that, this time without ccache in
the mix... and still no cigar.  So something is different about GCC 11
from homebrew, or the linker invocation it produces under the covers,
or the linker it's using?




Re: better page-level checksums

2022-06-13 Thread Matthias van de Meent
On Fri, 10 Jun 2022 at 15:58, Robert Haas  wrote:
>
> On Thu, Jun 9, 2022 at 8:00 PM Matthias van de Meent
>  wrote:
> > Why so? We already dole out per-page space in 4-byte increments
> > through pd_linp, and I see no reason why we can't reserve some line
> > pointers for per-page metadata if we decide that we need extra
> > per-page ~overhead~ metadata.
>
> Hmm, that's an interesting approach. I was thinking that putting data
> after the PageHeaderData struct would be a non-starter because the
> code that looks up a line pointer by index is currently just
> multiply-and-add and complicating it seems bad for performance.
> However, if we treated the space there as overlapping the line pointer
> array and making some line pointers unusable rather than something
> inserted prior to the line pointer array, we could avoid that. I still
> think it would be kind of complicated, though, because we'd have to
> find every bit of code that loops over the line pointer array or
> accesses it by index and make sure that it doesn't try to access the
> low-numbered line pointers.
>
> > Isn't the goal of a checksum to find - and where possible, correct -
> > bit flips and other broken pages? I would suggest not to use
> > cryptographic hash functions for that, as those are rarely
> > error-correcting.
>
> I wasn't thinking of trying to do error correction, just error
> detection. See also my earlier reply to Peter Geoghegan.

The use of CRC in our current page format implies that we can correct
(some) bit errors, which is why I presumed that that was a goal of
page checksums. I stand corrected.

> > Isn't that expected for most of those places? With the current
> > bufpage.h description of Page, it seems obvious that all bytes on a
> > page except those in the "hole" and those in the page header are under
> > full control of the AM. Of course AMs will pre-calculate limits and
> > offsets during compilation, that saves recalculation cycles and/or
> > cache lines with constants to keep in L1.
>
> Yep.
>
> > Can't we add some extra fork that stores this extra per-page
> > information, and contains this extra metadata in a double-buffered
> > format, so that both before the actual page is written the metadata
> > too is written to disk, while the old metadata is available too for
> > recovery purposes. This allows us to maintain the current format with
> > its low per-page overhead, and only have extra overhead (up to 2x
> > writes for each page, but the writes for these metadata pages need not
> > be BLCKSZ in size) for those that opt-in to the more computationally
> > expensive features of larger checksums, nonces, and/or other non-AM
> > per-page ~overhead~ metadata.
>
> It's not impossible, I'm sure, but it doesn't seem very appealing to
> me. Those extra reads and writes could be expensive, and there's no
> place to cleanly integrate them into the code structure. A function
> like PageIsVerified() -- which is where we currently validate
> checksums -- only gets the page. It can't go off and read some other
> page from disk to perform the checksum calculation.

It could be part of the buffer IO code to provide
PageIsVerifiedExtended with a pointer to the block metadata buffer.

> I'm not exactly sure what you have in mind when you say that the
> writes need not be BLCKSZ in size.

What I meant was that when the extra metadata is stored separately
from the block itself, it could be written directly to the file offset
instead of having to track BLCKSZ data for N blocks, so the
metadata-write would be << BLCKSZ in length, while the block itself
would still be the normal BLCKSZ write.

> Technically I guess that's true,
> but then the checksums have to be crash safe, or they're not much
> good. If they're not part of the page, how do they get updated in a
> way that makes them crash safe? I guess it could be done: every time
> we write a FPW, enlarge the page image by the number of bytes that are
> stored in this location. When replaying an FPW, update those bytes
> too. And every time we read or write a page, also read or write those
> bytes. In essence, we'd be deciding that pages are 8192+n bytes, but
> the last n bytes are stored in a different file - and, in memory, a
> different buffer pool. I think that would be hugely invasive and
> unpleasant to make work and I think the performance would be poor,
> too.

I agree that this wouldn't be as performant from a R/W perspective as
keeping that metadata inside the block. But on the other hand, that is
only for block R/W operations, and not for in-memory block
manipulations.

> > I'd prefer if we didn't change the way pages are presented to AMs.
> > Currently, it is clear what area is available to you if you write an
> > AM that uses the bufpage APIs. Changing the page format to have the
> > buffer manager also touch / reserve space in the special areas seems
> > like a break of abstraction: Quoting from bufpage.h:
> >
> >  * AM-generic per-page information is kept in 

Re: pltcl crash on recent macOS

2022-06-13 Thread Peter Eisentraut

On 13.06.22 18:01, Tom Lane wrote:
> Having said that, I wonder whether the position of the -bundle_loader
> switch in the command line is relevant to which way the hash_search
> reference is resolved.  Seems like we could put it in front of the
> various -l options if that'd help.

Switching the order of -bundle_loader and -lc did not help.




Re: Tightening behaviour for non-immutable behaviour in immutable functions

2022-06-13 Thread Greg Stark
On Thu, 9 Jun 2022 at 16:12, Peter Geoghegan  wrote:
>
> Presumably there is still significant value in detecting cases where
> some user-defined code provably does the wrong thing. Especially by
> targeting mistakes that experience has shown are relatively common.
> That's what the search_path case seems like to me.

By "relatively common" I think we're talking "nigh universal". Afaics
there are no warnings in the docs about worrying about search_path on
IMMUTABLE functions. There is for SECURITY DEFINER but I have to admit
I wasn't aware myself of all the gotchas described there.

For that matter, the gotchas are a bit convoluted.

If you leave out pg_catalog from search_path that's fine but if you
leave out pg_temp that's a security disaster. If you put pg_catalog in
it better be first or else it might be ok or might be a security issue
but when you put pg_temp in it better be last or else it's
*definitely* a disaster. $user is in search_path by default and that's
fine for SECURITY DEFINER functions but it's a disaster for IMMUTABLE
functions...

I kind of feel like perhaps all the implicit stuff is unnecessary
baroque frills. We should just put pg_temp and pg_catalog into the
default postgresql.conf search_path and assume users will keep them
there. And I'm not sure why we put pg_temp *first* -- I mean it sort
of seems superficially sensible but it doesn't seem like there's any
real reason to name your temporary tables the same as your permanent
ones so why not just always add it last?


I've attached a very WIP patch that implements the checks I'm leaning
towards making (as elogs currently). They cause a ton of regression
failures so probably we need to think about how to reduce the pain for
users upgrading...

Perhaps we should automatically fix up the current search path and
attach it to functions where necessary for users, instead of just
whingeing at them.
From 5cfccb714e7d9fd8f623700701e960abd54af25c Mon Sep 17 00:00:00 2001
From: Greg Stark 
Date: Mon, 13 Jun 2022 16:47:49 -0400
Subject: [PATCH] Add checks on search_path for IMMUTABLE and SECURITY DEFINER
 functions

---
 src/backend/commands/functioncmds.c | 76 +
 src/backend/utils/misc/guc.c| 59 +-
 src/include/utils/guc.h |  1 +
 3 files changed, 135 insertions(+), 1 deletion(-)

diff --git a/src/backend/commands/functioncmds.c b/src/backend/commands/functioncmds.c
index 00a6d282cf..13d5c668b7 100644
--- a/src/backend/commands/functioncmds.c
+++ b/src/backend/commands/functioncmds.c
@@ -675,6 +675,78 @@ update_proconfig_value(ArrayType *a, List *set_items)
 	return a;
 }
 
+/* We only enforce constraints on IMMUTABLE functions (because they can be
+ * used in indexes and search_path hacking can corrupt indexes for everyone)
+ * or SECURITY DEFINER functions (for obvious reasons). For these functions we
+ * enforce that the proconfig GUC settings should include search_path and that
+ * that search path should follow the recommendations listed at
+ * https://www.postgresql.org/docs/current/sql-createfunction.html#SQL-CREATEFUNCTION-SECURITY
+ */ 
+static void
+verify_proconfig_value(ArrayType *a, char provolatile, bool prosecdef) {
+	const char *search_path = NULL;
+	
+	if (provolatile != PROVOLATILE_IMMUTABLE && !prosecdef) {
+		return;
+	}
+
+	if (a != NULL) {
+		search_path = GUCArrayLookup(a, "search_path");
+	}
+
+	if (!search_path)
+	{
+		elog(WARNING, "IMMUTABLE and SECURITY DEFINER functions must have search_path set");
+	}
+
+	/* XXX This logic should perhaps be moved to namespace.c XXX */
+
+	char *rawname;
+	List	   *namelist;
+	ListCell   *l;
+	/* Need a modifiable copy of namespace_search_path string */
+	rawname = pstrdup(namespace_search_path);
+
+	/* Parse string into list of identifiers */
+	if (!SplitIdentifierString(rawname, ',', ))
+	{
+		/* syntax error in name list */
+		/* this should not happen if GUC checked check_search_path */
+		elog(ERROR, "invalid list syntax in proconfig");
+	}
+
+	bool has_dollar_user = false;
+	bool pg_catalog_first = true;
+	bool pg_temp_last = false;
+	bool is_first = true;
+	foreach(l, namelist)
+	{
+		char	   *curname = (char *) lfirst(l);
+		Oid			namespaceId;
+
+		/* It's not last unless it's last */
+		pg_temp_last = false;
+
+		if (strcmp(curname, "$user") == 0)
+			has_dollar_user = true;
+		if (strcmp(curname, "pg_catalog") == 0 && !is_first)
+			pg_catalog_first = false;
+		if (strcmp(curname,"pg_temp") == 0)
+			pg_temp_last = true;
+		
+		is_first = false;
+	}
+
+	if (has_dollar_user)
+		elog(WARNING, "IMMUTABLE or SECURITY DEFINER functions should not have a search path including $user");
+
+	if (!pg_catalog_first)
+		elog(WARNING, "IMMUTABLE or SECURITY DEFINER functions should have pg_catalog first in search_path (or omit it implicitly placing it first)");
+
+	if (!pg_temp_last)
+		elog(WARNING, "IMMUTABLE or SECURITY DEFINER functions should explicitly place pg_temp last or else it's implicitly 

Re: pltcl crash on recent macOS

2022-06-13 Thread Tom Lane
Peter Eisentraut  writes:
> The difference is that I use CC=gcc-11.  If I change to CC=cc, then it
> works (nm output shows "from executable").  So it's gcc that gets thrown
> off by the -lc.

Hah, that makes sense.  So does changing the option order help?

regards, tom lane




Re: Finer grain log timestamps

2022-06-13 Thread Tom Lane
Robert Haas  writes:
> On Sun, May 8, 2022 at 4:45 PM David Fetter  wrote:
>> Please find attached a patch to change the sub-second granularity of
>> log timestamps from milliseconds to  microseconds.

> Why is this a good idea?

I can imagine that some people would have a use for microsecond
resolution in log files, and I can also imagine that as machines
get faster more people will want that.  As against that, this
will bloat log files by a non-microscopic amount, and it's pretty
likely to break some log-scanning tools too.  It's unclear to me
that that's a tradeoff we should force on everyone.

I think a proposal less likely to have push-back would be to invent
a different log_line_prefix %-escape to produce microseconds.
Sadly, "%u" is already taken, but perhaps we could use "%U"?

A different line of thought is to extend %t to provide a precision
field a la sprintf, so that for example "%.3t" is equivalent to
"%m" and "%.6t" does what David wants, and we won't have to
search for a new escape letter when the day arrives that
somebody wants nanosecond resolution.  The same could be done
with %n, avoiding the need to find a different escape letter
for that.
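
As a sketch, the two ideas would look something like this (neither the
%U escape nor the precision form exists today; these lines are purely
illustrative of the proposed syntax):

-- Existing behavior: %m gives millisecond resolution.
ALTER SYSTEM SET log_line_prefix = '%m [%p] ';
-- Hypothetical alternatives discussed above (not implemented):
ALTER SYSTEM SET log_line_prefix = '%U [%p] ';    -- new microsecond escape
ALTER SYSTEM SET log_line_prefix = '%.6t [%p] ';  -- precision field on %t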

regards, tom lane




Re: pltcl crash on recent macOS

2022-06-13 Thread Peter Eisentraut

On 13.06.22 13:27, Thomas Munro wrote:

On Mon, Jun 13, 2022 at 6:53 PM Peter Eisentraut
 wrote:

  frame #1: 0x7ff803a28751 libsystem_c.dylib`hash_search + 215
  frame #2: 0x000110357700
pltcl.so`compile_pltcl_function(fn_oid=16418, tgreloid=0,


Hmm, I can’t reproduce that….  although that symbol is present in my
libSystem.B.dylib according to dlsym() and callable from a simple
program not linked to anything else, pltcl.so is apparently reaching
postgres’s hash_search for me, based on the fact that make -C
src/pl/tcl check succeeds and nm -m on pltcl.so shows it as "from
executable".  It would be interesting to see what nm -m shows for you.


...
 (undefined) external _get_call_result_type (from executable)
 (undefined) external _getmissingattr (from executable)
 (undefined) external _hash_create (from libSystem)
 (undefined) external _hash_search (from libSystem)
...


I’m using tcl 8.6.12 installed by MacPorts on macOS 12.4, though, hmm,
SDK 12.3.  I see the explicit -lc when building pltcl.so, and I see
that libSystem.B.dylib is explicitly mentioned here, whether or not I
have -lc:

% otool -L ./tmp_install/Users/tmunro/install/lib/postgresql/pltcl.so
./tmp_install/Users/tmunro/install/lib/postgresql/pltcl.so:
/opt/local/lib/libtcl8.6.dylib (compatibility version 8.6.0, current
version 8.6.12)
/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current
version 1311.100.3)


Looks the same here:

pltcl.so:
	/usr/local/opt/tcl-tk/lib/libtcl8.6.dylib (compatibility version 8.6.0, 
current version 8.6.12)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current 
version 1311.100.3)



Here’s the complete link line:

ccache cc -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Werror=vla
-Werror=unguarded-availability-new -Wendif-labels
-Wmissing-format-attribute -Wcast-function-type -Wformat-security
-fno-strict-aliasing -fwrapv -Wno-unused-command-line-argument
-Wno-compound-token-split-by-macro -g -O0  -bundle -multiply_defined
suppress -o pltcl.so  pltcl.o -L../../../src/port
-L../../../src/common  -isysroot
/Library/Developer/CommandLineTools/SDKs/MacOSX12.3.sdk
-Wl,-dead_strip_dylibs   -L/opt/local/lib -ltcl8.6 -lz -lpthread
-framework CoreFoundation  -lc -bundle_loader
../../../src/backend/postgres


The difference is that I use CC=gcc-11.  If I change to CC=cc, then it
works (nm output shows "from executable").  So it's gcc that gets thrown
off by the -lc.





Re: Checking for missing heap/index files

2022-06-13 Thread Robert Haas
On Wed, Jun 8, 2022 at 8:46 AM Bruce Momjian  wrote:
> We currently can check for missing heap/index files by comparing
> pg_class with the database directory files.  However, I am not clear if
> this is safe during concurrent DDL.  I assume we create the file before
> the update to pg_class is visible, but do we always delete the file
> after the update to pg_class is visible?  I assume any external checking
> tool would need to lock the relation to prevent concurrent DDL.

If you see an entry in pg_class, then there should definitely be a
file present on disk. The reverse is not true: just because you don't
see an entry in pg_class for a file that's on disk doesn't mean it's
safe to remove that file.
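
Just to illustrate the kind of comparison being discussed, here is a
rough sketch (superuser only; it assumes the default tablespace and no
concurrent DDL, and it only checks that the first segment of the main
fork exists):

SELECT c.oid::regclass AS relation,
       pg_relation_filenode(c.oid) AS filenode
FROM pg_class AS c
WHERE c.relkind IN ('r', 'i', 'm', 't')
  AND c.relpersistence = 'p'
  AND pg_relation_filenode(c.oid) IS NOT NULL
  AND pg_relation_filenode(c.oid)::text NOT IN (
        SELECT split_part(f, '.', 1)
        FROM pg_ls_dir('base/' || (SELECT oid
                                   FROM pg_database
                                   WHERE datname = current_database())) AS f);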

> Also, how would it check if the number of extents is correct?  Seems we
> would need this value to be in pg_class, and have the same update
> protections outlined above.  Seems that would require heavier locking.

Yeah, and it's not just the number of extents but the length of the
last one. If the last extent is supposed to be 700MB and it gets
truncated to 200MB, it would be nice if we could notice that.

One idea might be for each heap table to have a metapage and store the
length - or an upper bound on the length - in the metapage. That'd
probably be cheaper than updating pg_class, but might still be
expensive in some scenarios, and it's a fairly large amount of
engineering.

> Is this something anyone has even needed or had requested?

Definitely. And also the reverse: figuring out which files on disk are
old garbage that can be safely nuked.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Finer grain log timestamps

2022-06-13 Thread Robert Haas
On Sun, May 8, 2022 at 4:45 PM David Fetter  wrote:
> Please find attached a patch to change the sub-second granularity of
> log timestamps from milliseconds to  microseconds.

Why is this a good idea?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: replacing role-level NOINHERIT with a grant-level option

2022-06-13 Thread Robert Haas
On Mon, Jun 13, 2022 at 2:42 PM David G. Johnston
 wrote:
> Agreed.  Moving the inherit flag to the many-to-many join relation provides 
> flexibility, while representing the present behavior is trivial - every row 
> for a given member role has the same value for said flag.

Precisely.

> One seemingly missing feature is to specify for a role that its privileges 
> cannot be inherited.  In this case associations where it is the group role 
> mustn't be flagged inherit.  Symmetrically, "inherit only" seems like a 
> plausible option for pre-defined group roles.

Yeah, I was kind of wondering about that, although I hadn't thought so
much of making it mandatory as having some kind of way of setting the
default. One could do either, but I think that can be left for a
future patch that builds on what I am proposing here. No sense trying
to do too many things all at once.

> I agree that granting membership makes the pg_auth_members record appear and 
> revoking membership makes it disappear.

Great.

> I dislike having GRANT do stuff when membership is already established.
>
> ALTER MEMBER role IN group ALTER [SET | ASSUME] [TO | =] [TRUE | FALSE]

I thought about this, too. We could definitely do something like that.
I imagine it would be called ALTER GRANT rather than ALTER MEMBER, but
other than that I don't see an issue. However, there's existing
precedent for the way I proposed it: if you repeat the same GRANT
command but write WITH ADMIN OPTION only the second time, it modifies
the existing grant and adds the admin option to it. If you repeat it
verbatim the second time, it instead gives you a warning that your
command was redundant. That to me establishes the precedent that the
way you modify the options associated with a GRANT is to issue a new
GRANT command. I don't find changing that existing behavior very
appealing, even if we might not pick the same thing if we were doing
it over. We could add something else alongside that to provide another
way of accomplishing the same thing, but that seemed more complicated
for not much benefit.
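
To spell out that existing precedent (role names invented):

GRANT alice TO bob;                    -- creates the membership
GRANT alice TO bob WITH ADMIN OPTION;  -- modifies the existing grant,
                                       -- adding the admin option
GRANT alice TO bob;                    -- repeated verbatim: reported as
                                       -- redundant, nothing changes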

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: replacing role-level NOINHERIT with a grant-level option

2022-06-13 Thread David G. Johnston
On Mon, Jun 13, 2022 at 11:01 AM Robert Haas  wrote:

> Some syntax would be a bit different on the new releases and that would
> unlock some new options we don't currently have, but there's no
> behavior that you can get today which you wouldn't be able to get any
> more under this proposal.

Agreed.  Moving the inherit flag to the many-to-many join relation provides
flexibility, while representing the present behavior is trivial - every row
for a given member role has the same value for said flag.

One seemingly missing feature is to specify for a role that its privileges
cannot be inherited.  In this case associations where it is the group role
mustn't be flagged inherit.  Symmetrically, "inherit only" seems like a
plausible option for pre-defined group roles.

I agree that granting membership makes the pg_auth_members record appear
and revoking membership makes it disappear.

I dislike having GRANT do stuff when membership is already established.

ALTER MEMBER role IN group ALTER [SET | ASSUME] [TO | =] [TRUE | FALSE]

David J.


Re: better page-level checksums

2022-06-13 Thread Robert Haas
On Mon, Jun 13, 2022 at 12:59 PM Aleksander Alekseev
 wrote:
> So, to clarify, what we are trying to achieve here is to reduce the
> probability of an event when a page gets corrupted but the checksum is
> accidentally the same as it was before the corruption, correct? And we
> also assume that neither file system nor hardware catched this
> corruption.

Yeah, I think so, although it also depends on what the filesystem and
the hardware would do if they did catch the corruption. If they would
have made our read() or write() operation fail, then any checksum
feature at the PostgreSQL level is superfluous. If they would have
noticed the operation but not caused a failure, and say just logged
something in the system log, a PostgreSQL check could still be useful,
because the PostgreSQL user might not be looking at the system log,
but will definitely notice if they get an ERROR rather than a query
result from PostgreSQL. And if the lower-level systems wouldn't have
caught the failure at all, then checksums are useful in that case,
too.

> If that's the case I would say that using something like SHA256 would
> be an overkill, not only because of the consumed disk space but also
> because SHA256 is expensive. Allowing the user to choose from 16-bit,
> 32-bit and maybe 64-bit checksums should be enough. I would also
> suggest that no matter how we do it, if the user chooses 16-bit
> checksums the performance and the disk consumption should remain as
> they currently are.

If the user wants 16-bit checksums, the feature we've already got
seems good enough -- and, as you say, it doesn't use any extra disk
space. This proposal is just about making people happy if they want a
bigger checksum.

On the topic of which algorithm to use, I'd be inclined to think that
it is going to be more useful to offer checksums that are 64 bits or
more, since IMHO 32 is not all that much more than 16, and I still
think there are going to be alignment issues. Beyond that I don't have
anything against your specific suggestions, but I'd like to hear what
other people think.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: replacing role-level NOINHERIT with a grant-level option

2022-06-13 Thread Robert Haas
On Fri, Jun 10, 2022 at 4:36 PM Peter Eisentraut
 wrote:
> I think this would create a conflict with what role-based access control
> normally means (outside of PostgreSQL specifically).  A role is a
> collection of privileges that you dynamically enable (e.g., with SET
> ROLE).  That corresponds to the NOINHERIT behavior.  It's also what the
> SQL standard specifies (which in term references other standards for
> RBAC).  The INHERIT behavior basically emulates the groups that used to
> exist in PostgreSQL.

I don't think this creates any more of a conflict than we've already
got. In fact, I'd go so far as to say it resolves a problem that we
currently have. As far as I can see, we are stuck with a situation
where we have to support both the INHERIT behavior and the NOINHERIT
behavior. Removing either one would be a pretty major compatibility
break. And even if some people were willing to endorse that, it seems
clear from previous discussions that there are people who like the
NOINHERIT behavior and would object to its removal, and other people
(like me!) who like the INHERIT behavior and would object to removing
that. If you think it's feasible to get rid of either of these
behaviors, I'd be interested in hearing your thoughts on that, but to
me it looks like we are stuck with supporting both. From my point of
view, the question is how to make the best of that situation.

Consider a user who in general prefers the NOINHERIT behavior but also
wants to use predefined roles. Perhaps user 'peter' is to be granted
both 'paul' and 'pg_execute_server_programs'. If role 'peter' is set
to INHERIT, Peter will be sad, because his love for NOINHERIT probably
means that he doesn't want to exercise Paul's privileges
automatically. However, he needs to inherit the privileges of
'pg_execute_server_programs' or they are of no use to him. Peter
presumably wants to use COPY TO/FROM program to put data into a table
owned by 'peter', not a table owned by 'pg_execute_server_programs'.
If so, being able to SET ROLE to 'pg_execute_server_programs' is of no
use to him at all, but inheriting the privilege is useful. What is
really needed here is to have the grant of 'paul' be NOINHERIT and the
grant of 'pg_execute_server_programs' be INHERIT, but there's no way
to do that presently. My proposal fixes that problem.
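
Spelled out as a sketch (the per-grant syntax in the second half is the
proposal, not something that exists today):

-- Today: the inheritance choice is a property of role "peter", so it
-- applies to every grant he receives.
CREATE ROLE paul LOGIN;
CREATE ROLE peter LOGIN NOINHERIT;
GRANT paul TO peter;                        -- not inherited, as Peter wants
GRANT pg_execute_server_programs TO peter;  -- not inherited either, which
                                            -- makes it useless to him

-- Proposed (hypothetical syntax): decide per grant instead.
GRANT paul TO peter WITH INHERIT FALSE;
GRANT pg_execute_server_programs TO peter WITH INHERIT TRUE;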

If you don't agree with that analysis, I'd like to know which part of
it seems wrong to you. To me, it seems like an open and shut case: a
grant of a predefined role, or of a group, should have the INHERIT
behavior. A grant of a login role could plausibly have either one,
depending on user preferences. As long as the INHERIT property is a
property of the role to which the permission is granted, you have
trouble as soon as you try to do one grant that should have INHERIT
behavior and another that should have NOINHERIT behavior targeting the
same role.

> There are also possibly other properties you could derive from that
> distinction.  For example, you could  argue that something like
> pg_hba.conf should have group matching but not (noinherit-)role
> matching, since those roles are not actually activated all the time.
>
> There are also more advanced concepts like roles that are mutually
> exclusive so that you can't activate them both at once.

I don't see how my proposal makes any of that any harder than it is
today. Or any easier, either. Seems pretty much orthogonal.

> Now, what PostgreSQL currently implements is a bit of a mess that's a
> somewhat incompatible subset of all that.  But you can basically get
> those standard behaviors somehow.  So I don't think we should break this
> and go into a totally opposite direction.

I don't think I'm proposing to break anything or go in a totally
opposite direction from anything, and to be honest I'm kind of
confused as to why you think that what I'm proposing would have that
effect. As far as I can see, the changes that I'm proposing are
upward-compatible and would permit easy migration to new releases via
either pg_dump or pg_upgrade with no behavior changes at all. Some
syntax would be a bit different on the new releases and that would
unlock some new options we don't currently have, but there's no
behavior that you can get today which you wouldn't be able to get any
more under this proposal.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: 2022-06-16 release announcement draft

2022-06-13 Thread Tom Lane
"Jonathan S. Katz"  writes:
> Please review for technical accuracy and omissions.

A few minor thoughts:

> The PostgreSQL Global Development Group has released PostgreSQL 14.4 to fix an
> issue that could cause silent data corruption when using the
> [`CREATE INDEX 
> CONCURRENTLY`](https://www.postgresql.org/docs/current/sql-createindex.html)
> and [`REINDEX 
> CONCURRENTLY`](https://www.postgresql.org/docs/current/sql-reindex.html)
> commands.

Maybe s/and/or/ ?

> PostgreSQL 14.4 fixes an issue with the
> [`CREATE INDEX 
> CONCURRENTLY`](https://www.postgresql.org/docs/current/sql-createindex.html)
> and [`REINDEX 
> CONCURRENTLY`](https://www.postgresql.org/docs/current/sql-reindex.html)
> that could cause silent data corruption of indexes.

Either leave out "the" or add "commands".  That is, "the FOO and BAR
commands" reads fine, "the FOO and BAR" less so.  Also, I'm inclined
to be a bit more specific and say that the problem is missing index
entries, so maybe like "... fixes an issue that could cause the [CIC]
and [RIC] commands to omit index entries for some rows".

> Once you upgrade your system to PostgreSQL 14.4, you can fix any silent data
> corruption using `REINDEX CONCURRENTLY`.

Perhaps it is also worth mentioning that you can use REINDEX without
CONCURRENTLY, even before upgrading.
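
For example (reusing the index name from the draft):

-- Plain REINDEX takes stronger locks than REINDEX CONCURRENTLY, but it
-- works on 14.3 as well, so it can be run before upgrading.
REINDEX INDEX elephant_idx;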

> * Report implicitly-created operator families (`CREATE OPERATOR CLASS`) to 
> event
> triggers.

Maybe "(generated by `CREATE OPERATOR CLASS`)"?  As-is, the parenthetical
comment looks more like a mistake than anything else.

The rest looks good.

regards, tom lane




Re: "buffer too small" or "path too long"?

2022-06-13 Thread Tom Lane
Kyotaro Horiguchi  writes:
> The root cause of the errors is that the user-provided directory path
> of the new cluster's root was too long.  Whichever of the four buffers
> overflows, it doesn't make any difference for users and doesn't
> offer any further detail to supporters/developers.  I see "output
> directory path of new cluster too long" as clear enough.

+1, but I'm inclined to make it read "... is too long".

> # And the messages are missing trailing line breaks.

I was about to question that, but now I remember that pg_upgrade has
its own logging facility with a different idea about who provides
the trailing newline than common/logging.[hc] has.  Undoubtedly
that's the source of this mistake.  We really need to get pg_upgrade
out of the business of having its own logging conventions.

regards, tom lane




Re: better page-level checksums

2022-06-13 Thread Aleksander Alekseev
Hi Robert,

> I don't think that a separate fork is a good option for reasons that I
> articulated previously: I think it will be significantly more complex
> to implement and add extra I/O.
>
> I am not completely opposed to the idea of making the algorithm
> pluggable but I'm not very excited about it either. Making the
> algorithm pluggable probably wouldn't be super-hard, but allowing a
> checksum of arbitrary size rather than one of a short list of fixed
> sizes might complicate efforts to ensure this doesn't degrade
> performance. And I'm not sure what the benefit is, either. This isn't
> like archive modules or custom backup targets where the feature
> proposes to interact with things outside the server and we don't know
> what's happening on the other side and so need to offer an interface
> that can accommodate what the user wants to do. Nor is it like a
> custom background worker or a custom data type which lives fully
> inside the database but the desired behavior could be anything. It's
> not even like column compression where I think that the same small set
> of strategies are probably fine for everybody but some people think
> that customizing the behavior by datatype would be a good idea. All
> it's doing is taking a fixed size block of data and checksumming it. I
> don't see that as being something where there's a lot of interesting
> things to experiment with from an extension point of view.

I see your point. Makes sense.

So, to clarify, what we are trying to achieve here is to reduce the
probability of an event when a page gets corrupted but the checksum is
accidentally the same as it was before the corruption, correct? And we
also assume that neither file system nor hardware catched this
corruption.

If that's the case I would say that using something like SHA256 would
be an overkill, not only because of the consumed disk space but also
because SHA256 is expensive. Allowing the user to choose from 16-bit,
32-bit and maybe 64-bit checksums should be enough. I would also
suggest that no matter how we do it, if the user chooses 16-bit
checksums the performance and the disk consumption should remain as
they currently are.

Regarding the particular choice of a hash function I would suggest the
MurmurHash family [1]. This is basically the industry standard (it's
good, it's fast, and relatively simple), and we already have
murmurhash32() in the core. We also have hash_bytes_extended() to get
64-bit checksums, but I have no strong opinion on whether this
particular hash function should be used for pages or not. I believe
some benchmarking is appropriate.

There is also a 128-bit version of MurmurHash. Personally I doubt that
it may be of value in practice, but it will not hurt to support it
either, while we are at it. (Probably not in the MVP, though). And if
we are going to choose this path, I see no reason not to support
SHA256 as well, for the paranoid users.

[1]: https://en.wikipedia.org/wiki/MurmurHash

-- 
Best regards,
Aleksander Alekseev




Re: pgcon unconference / impact of block size on performance

2022-06-13 Thread Tomas Vondra
On 6/13/22 17:42, Merlin Moncure wrote:
> On Sat, Jun 4, 2022 at 6:23 PM Tomas Vondra  wrote:
> 
> Hi,
> 
> At one of the pgcon unconference sessions a couple days ago, I presented
> a bunch of benchmark results comparing performance with different
> data/WAL block size. Most of the OLTP results showed significant gains
> (up to 50%) with smaller (4k) data pages.
> 
> 
> Wow.  Random numbers are fantastic.  Significant reduction in sequential
> throughput is a little painful though; I see 40% reduction in some cases
> if I'm reading that right.  Any thoughts on why that's the case?  Are
> there mitigations possible?
> 

I think you read that right - given a fixed I/O depth, the throughput
for sequential access gets reduced. Consider for example the attached
chart with sequential read/write results for the Optane 900P. The IOPS
increases for smaller blocks, but not enough to compensate for the
bandwidth drop.

Regarding the mitigations - I think prefetching (read-ahead) should do
the trick. Just going to iodepth=2 mostly makes up for the bandwidth
difference. You might argue prefetching would improve the random I/O
results too, but I don't think that's the same thing - read-ahead for
sequential workloads is much easier to implement (even transparently).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: pltcl crash on recent macOS

2022-06-13 Thread Tom Lane
Thomas Munro  writes:
> On Mon, Jun 13, 2022 at 6:53 PM Peter Eisentraut
>  wrote:
>> frame #1: 0x7ff803a28751 libsystem_c.dylib`hash_search + 215
>> frame #2: 0x000110357700
>> pltcl.so`compile_pltcl_function(fn_oid=16418, tgreloid=0,

> Hmm, I can’t reproduce that….

I can't either, although I'm using the macOS-provided Tcl code,
which still works fine for me.  (I grant that Apple might desupport
that someday, but they haven't yet.)  sifaka and longfin aren't
unhappy either; although sifaka is close to identical to my laptop.

Having said that, I wonder whether the position of the -bundle_loader
switch in the command line is relevant to which way the hash_search
reference is resolved.  Seems like we could put it in front of the
various -l options if that'd help.

regards, tom lane




Re: pgcon unconference / impact of block size on performance

2022-06-13 Thread Merlin Moncure
On Sat, Jun 4, 2022 at 6:23 PM Tomas Vondra 
wrote:

> Hi,
>
> At one of the pgcon unconference sessions a couple days ago, I presented
> a bunch of benchmark results comparing performance with different
> data/WAL block size. Most of the OLTP results showed significant gains
> (up to 50%) with smaller (4k) data pages.
>

Wow.  Random numbers are fantastic.  Significant reduction in sequential
throughput is a little painful though; I see 40% reduction in some cases if
I'm reading that right.  Any thoughts on why that's the case?  Are there
mitigations possible?

merlin


Re: better page-level checksums

2022-06-13 Thread Robert Haas
On Mon, Jun 13, 2022 at 9:23 AM Aleksander Alekseev
 wrote:
> Should it necessarily be a fixed list? Why not support plugable algorithms?
>
> An extension implementing a checksum algorithm is going to need:
>
> - several hooks: check_page_after_reading, calc_checksum_before_writing
> - register_checksum()/deregister_checksum()
> - an API to save the checksums to a seperate fork
>
> By knowing the block number and the hash size the extension knows
> exactly where to look for the checksum in the fork.

I don't think that a separate fork is a good option for reasons that I
articulated previously: I think it will be significantly more complex
to implement and add extra I/O.

I am not completely opposed to the idea of making the algorithm
pluggable but I'm not very excited about it either. Making the
algorithm pluggable probably wouldn't be super-hard, but allowing a
checksum of arbitrary size rather than one of a short list of fixed
sizes might complicate efforts to ensure this doesn't degrade
performance. And I'm not sure what the benefit is, either. This isn't
like archive modules or custom backup targets where the feature
proposes to interact with things outside the server and we don't know
what's happening on the other side and so need to offer an interface
that can accommodate what the user wants to do. Nor is it like a
custom background worker or a custom data type which lives fully
inside the database but the desired behavior could be anything. It's
not even like column compression where I think that the same small set
of strategies are probably fine for everybody but some people think
that customizing the behavior by datatype would be a good idea. All
it's doing is taking a fixed size block of data and checksumming it. I
don't see that as being something where there's a lot of interesting
things to experiment with from an extension point of view.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Using PQexecQuery in pipeline mode produces unexpected Close messages

2022-06-13 Thread Tom Lane
Álvaro Herrera  writes:
> Sadly, it looks like I won't be able to get this patched pushed for 14.4.

I think that's a good thing actually; this isn't urgent enough to
risk a last-minute commit.  Please wait till the release freeze
lifts.

regards, tom lane




Re: Add connection active, idle time to pg_stat_activity

2022-06-13 Thread Sergey Dudoladov
Hello,

I've updated the patch in preparation for the upcoming commitfest.
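
As a usage sketch, the new counters added by the attached patch would
be read like any other pg_stat_activity column, e.g.:

SELECT pid, state, active_time, idle_in_transaction_time
FROM pg_stat_activity
ORDER BY active_time DESC;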

Regards,
Sergey.
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 4549c2560e..cf00685c96 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -979,6 +979,26 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   11:34   0:00 postgres: ser
additional types.
   
  
+
+ 
+  
+   active_time double precision
+  
+  
+   Time in milliseconds this backend spent in active and
+   fastpath states.
+  
+ 
+
+ 
+  
+   idle_in_transaction_time double precision
+  
+  
+   Time in milliseconds this backend spent in idle in transaction
+   and idle in transaction (aborted) states.
+  
+ 
 

   
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fedaed533b..06cea5f01c 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -864,7 +864,9 @@ CREATE VIEW pg_stat_activity AS
 s.backend_xmin,
 S.query_id,
 S.query,
-S.backend_type
+S.backend_type,
+S.active_time,
+S.idle_in_transaction_time
 FROM pg_stat_get_activity(NULL) AS S
 LEFT JOIN pg_database AS D ON (S.datid = D.oid)
 LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index c7ed1e6d7a..dbb7a0aec6 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -468,6 +468,13 @@ pgstat_beshutdown_hook(int code, Datum arg)
 
 	beentry->st_procpid = 0;	/* mark invalid */
 
+	/*
+	 * Reset per-backend counters so that accumulated values for the current
+	 * backend are not used for future backends.
+	 */
+	beentry->st_total_active_time = 0;
+	beentry->st_total_transaction_idle_time = 0;
+
 	PGSTAT_END_WRITE_ACTIVITY(beentry);
 
 	/* so that functions can check if backend_status.c is up via MyBEEntry */
@@ -524,6 +531,8 @@ pgstat_report_activity(BackendState state, const char *cmd_str)
 	TimestampTz start_timestamp;
 	TimestampTz current_timestamp;
 	int			len = 0;
+	int64		active_time_diff = 0;
+	int64		transaction_idle_time_diff = 0;
 
 	TRACE_POSTGRESQL_STATEMENT_STATUS(cmd_str);
 
@@ -550,6 +559,9 @@ pgstat_report_activity(BackendState state, const char *cmd_str)
 			beentry->st_xact_start_timestamp = 0;
 			beentry->st_query_id = UINT64CONST(0);
 			proc->wait_event_info = 0;
+
+			beentry->st_total_active_time = 0;
+			beentry->st_total_transaction_idle_time = 0;
 			PGSTAT_END_WRITE_ACTIVITY(beentry);
 		}
 		return;
@@ -575,24 +587,31 @@ pgstat_report_activity(BackendState state, const char *cmd_str)
 	 * If the state has changed from "active" or "idle in transaction",
 	 * calculate the duration.
 	 */
-	if ((beentry->st_state == STATE_RUNNING ||
-		 beentry->st_state == STATE_FASTPATH ||
-		 beentry->st_state == STATE_IDLEINTRANSACTION ||
-		 beentry->st_state == STATE_IDLEINTRANSACTION_ABORTED) &&
+	if ((PGSTAT_IS_ACTIVE(beentry) || PGSTAT_IS_IDLEINTRANSACTION(beentry)) &&
 		state != beentry->st_state)
 	{
 		long		secs;
 		int			usecs;
+		int64		usecs_diff;
 
 		TimestampDifference(beentry->st_state_start_timestamp,
 			current_timestamp,
 			, );
+		usecs_diff = secs * 1000000 + usecs;
 
-		if (beentry->st_state == STATE_RUNNING ||
-			beentry->st_state == STATE_FASTPATH)
-			pgstat_count_conn_active_time((PgStat_Counter) secs * 1000000 + usecs);
+		/*
+		 * We update per-backend st_total_active_time and st_total_transaction_idle_time
+		 * separately from pgStatActiveTime and pgStatTransactionIdleTime
+		 * used in pg_stat_database to provide per-DB statistics because
+		 * 1. Changing the former values implies modifying beentry and thus
+		 * have to be wrapped into PGSTAT_*_WRITE_ACTIVITY macros (see below).
+		 * 2. The latter values are reset to 0 once the data has been sent
+		 * to the statistics collector.
+		 */
+		if (PGSTAT_IS_ACTIVE(beentry))
+			active_time_diff = usecs_diff;
 		else
-			pgstat_count_conn_txn_idle_time((PgStat_Counter) secs * 1000000 + usecs);
+			transaction_idle_time_diff = usecs_diff;
 	}
 
 	/*
@@ -618,6 +637,9 @@ pgstat_report_activity(BackendState state, const char *cmd_str)
 		beentry->st_activity_start_timestamp = start_timestamp;
 	}
 
+	beentry->st_total_active_time += active_time_diff;
+	beentry->st_total_transaction_idle_time += transaction_idle_time_diff;
+
 	PGSTAT_END_WRITE_ACTIVITY(beentry);
 }
 
@@ -1046,6 +1068,48 @@ pgstat_get_my_query_id(void)
 }
 
 
+/* --
+ * pgstat_get_my_active_time() -
+ *
+ * Return current backend's accumulated active time.
+ */
+uint64
+pgstat_get_my_active_time(void)
+{
+	if (!MyBEEntry)
+		return 0;
+
+	/*
+	 * There's no need for a lock around pgstat_begin_read_activity /
+	 * pgstat_end_read_activity here 

Re: Parse CE and BCE in dates and times

2022-06-13 Thread David Fetter
On Mon, Jun 13, 2022 at 09:11:56AM +0200, Erik Rijkers wrote:
> Op 13-06-2022 om 07:51 schreef David Fetter:
> > Folks,
> > 
> > Please find attached a patch to do $Subject. As dates in a fair number
> > of fields of endeavor are expressed this way, it seems reasonable to
> > ensure tha we can parse them on input. Making it possible to use them
> > in output is a more invasive patch, and would involve changes to
> > to_date and similar that would require careful consideration.
> 
> Hi David,
> 
> I find some unexpected results:
> 
> # select '112-04-30 BC'::date;
>  date
> ---
>  0112-04-30 BC
> (1 row)
> 
> but the same with the ' BCE' suffix seems broken:
> 
> # select '112-04-30 BCE'::date;
> ERROR:  invalid input syntax for type date: "112-04-30 BCE"
> LINE 1: select '112-04-30 BCE'::date;
> 
> The same goes for '112-04-30 AD' (works) and its CE version (errors out).
> 
> Or is this as expected?

It's not, and thanks for looking at this.  Will check to see what's
going on here.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: Skipping logical replication transactions on subscriber side

2022-06-13 Thread Robert Haas
On Sun, Apr 17, 2022 at 11:22 PM Noah Misch  wrote:
> > Yes, but it could be false positives in some cases. For instance, the
> > column {oid, bool, XLogRecPtr} should be okay on ALIGNOF_DOUBLE == 4
> > and 8 platforms but the new test fails.
>
> I'm happy with that, because the affected author should look for padding-free
> layouts before settling on your example layout.  If the padding-free layouts
> are all unacceptable, the author should update the expected sanity_check.out
> to show the one row where the test "fails".

I realize that it was necessary to get something committed quickly
here to unbreak the buildfarm, but this is really a mess. As I
understand it, the problem here is that typalign='d' is either 4 bytes
or 8 depending on how the 'double' type is aligned on that platform,
but we use that typalign value also for some other data types that may
not be aligned in the same way as 'double'. Consequently, it's
possible to have a situation where the behavior of the C compiler
diverges from the behavior of heap_form_tuple(). To avoid that, we
need every catalog column that uses typalign=='d' to begin on an
8-byte boundary. We also want all such columns to occur before the
first NameData column in the catalog, to guard against the possibility
that NAMEDATALEN has been redefined to an odd value. I think this set
of constraints is a nuisance and that it's mostly good luck we haven't
run into any really awkward problems here so far.

In many of our catalogs, the first member is an OID and the second
member of the struct is of type NameData: pg_namespace, pg_class,
pg_proc, etc. That common design pattern is in direct contradiction to
the desires of this test case. As soon as someone wants to add a
typalign='d' member to any of those system catalogs, the struct layout
is going to have to get shuffled around -- and then it will look
different from all the other ones. Or else we'd have to rearrange them
all to move all the NameData columns to the end. I feel like it's
weird to introduce a test case that so obviously flies in the face of
how catalog layout has been done up to this point, especially for the
sake of a hypothetical user who wants to set NAMEDATALEN to an odd
number. I doubt such scenarios have been thoroughly tested, or ever
will be. Perhaps instead we ought to legislate that NAMEDATALEN must
be a multiple of 8, or some such thing.

The other constraint, that typalign='d' fields must always fall on an
8 byte boundary, is probably less annoying in practice, but it's easy
to imagine a future catalog running into trouble. Let's say we want to
introduce a new catalog that has only an Oid column and a float8
column. Perhaps with 0-3 bool or uint8 columns as well, or with any
number of NameData columns as well. Well, the only way to satisfy this
constraint is to put the float8 column first and the Oid column after
it, which immediately makes it look different from every other catalog
we have. It's hard to feel like that would be a good solution here. I
think we ought to try to engineer a solution where heap_form_tuple()
is going to do the same thing as the C compiler without the sorts of
extra rules that this test case enforces.

AFAICS, we could do that by:

1. De-supporting platforms that have this problem, or
2. Introducing new typalign values, as Noah proposed back on April 2, or
3. Somehow forcing values that are sometimes 4-byte aligned and
sometimes 8-byte aligned to be 8-byte aligned on all platforms

I also don't like the fact that the test case doesn't even catch
exactly the problematic set of cases, but rather a superset, leaving
it up to future patch authors to make a correct judgment about whether
a certain new column can be listed as an expected output of the test
case or whether the catalog representation must be changed. The idea
that we'll reliably get that right might be optimistic. Again, I don't
mean to say that this is the fault of this test case since, without
the test case, we'd have no idea that there was even a potential
problem, which would not be better. But it feels to me like we're
hacking around the real problem instead of fixing it, and it seems to
me that we should try to do better.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




2022-06-16 release announcement draft

2022-06-13 Thread Jonathan S. Katz

Hi,

Please see the attached draft of the release announcement for the 
2022-06-16 release.


Please review for technical accuracy and omissions. If you have
feedback, please provide it no later than Thu, June 16, 2022 0:00 AoE[1].


Thanks,

Jonathan

[1] https://en.wikipedia.org/wiki/Anywhere_on_Earth
The PostgreSQL Global Development Group has released PostgreSQL 14.4 to fix an
issue that could cause silent data corruption when using the
[`CREATE INDEX 
CONCURRENTLY`](https://www.postgresql.org/docs/current/sql-createindex.html)
and [`REINDEX 
CONCURRENTLY`](https://www.postgresql.org/docs/current/sql-reindex.html)
commands. Please see the following section for information on how to detect and
correct silent data corruption in your indexes.

This release also fixes over 15 bugs since PostgreSQL 14.3 was released in May.
For the full list of changes, please review the
[release notes](https://www.postgresql.org/docs/release/).

This release is only for PostgreSQL 14. The PostgreSQL Global Development Group
will make a scheduled update release on August 11, 2022 for all supported
versions of PostgreSQL (10 - 14).

Detect and Fix `CREATE INDEX CONCURRENTLY` / `REINDEX CONCURRENTLY` Corruption
--

PostgreSQL 14.4 fixes an issue with the
[`CREATE INDEX 
CONCURRENTLY`](https://www.postgresql.org/docs/current/sql-createindex.html)
and [`REINDEX 
CONCURRENTLY`](https://www.postgresql.org/docs/current/sql-reindex.html)
that could cause silent data corruption of indexes. This issue may not have
corrupted your indexes, but if you are unsure, we advise you to reindex using
the instructions below.

You can detect if a B-tree index (the default index type) has data corruption
using the 
[`pg_amcheck`](https://www.postgresql.org/docs/current/app-pgamcheck.html)
command with the `--heapallindexed` flag. For example:

```
pg_amcheck --heapallindexed database
```

If `pg_amcheck` detects corruption or if you ran `CREATE INDEX CONCURRENTLY` or
`REINDEX CONCURRENTLY` on any other index type (e.g. GiST, GIN, etc.), please
follow the instructions below.

Once you upgrade your system to PostgreSQL 14.4, you can fix any silent data
corruption using `REINDEX CONCURRENTLY`. For example, if you have an index named
`elephant_idx` that has data corruption, you can run the following command on
PostgreSQL 14.4 to fix it:

```
REINDEX CONCURRENTLY elephant_idx;
```

You can use the
[`reindexdb`](https://www.postgresql.org/docs/current/app-reindexdb.html)
command to reindex all indexes across your cluster. `reindexdb` also has a
`--jobs` flag that lets you run reindex operations in parallel. For example, to
reindex your entire PostgreSQL cluster using `` parallel jobs, you can run
the following command:

```
reindexdb --all --concurrently --jobs 
```

Bug Fixes and Improvements
--

This update fixes over 15 bugs that were reported in the last several months.
The issues listed below affect PostgreSQL 14.

Included in this release:

* Several fixes for query plan memoization.
* Fix queries in which a "whole-row variable" references the result of a
function that returns a domain over composite type.
* Fix "variable not found in subplan target list" planner error using a
sub-`SELECT` that is referenced in a `GROUPING` function.
* Fix error checking in `COPY FROM` when the database encoding is `SQL_ASCII`
but the client encoding is a multi-byte encoding.
* Report implicitly-created operator families (`CREATE OPERATOR CLASS`) to event
triggers.
* Prevent triggering `wal_receiver_timeout` on a standby during logical
replication of large transactions.
* Remove incorrect TLS private key file ownership check in libpq.
* Prevent crash after server connection loss in `pg_amcheck`.

For the full list of changes available, please review the
[release notes](https://www.postgresql.org/docs/release/).

Updating


All PostgreSQL update releases are cumulative. As with other minor releases,
users are not required to dump and reload their database or use `pg_upgrade` in
order to apply this update release; you may simply shutdown PostgreSQL and
update its binaries.

However, if you ran [`CREATE INDEX 
CONCURRENTLY`](https://www.postgresql.org/docs/current/sql-createindex.html)
or [`REINDEX 
CONCURRENTLY`](https://www.postgresql.org/docs/current/sql-reindex.html)
on PostgreSQL 14, you may need to take additional steps. Please review the
"Detect and Fix `CREATE INDEX CONCURRENTLY` / `REINDEX CONCURRENTLY` Corruption"
section for more details.

Users who have skipped one or more update releases may need to run additional,
post-update steps; please see the release notes for earlier versions for
details.

For more details, please see the
[release notes](https://www.postgresql.org/docs/release/).

PostgreSQL 10 EOL Notice


PostgreSQL 10 will stop receiving fixes on November 10, 2022. If you are
running PostgreSQL 10 in a 

Re: Improve TAP tests of pg_upgrade for cross-version tests

2022-06-13 Thread Andrew Dunstan


On 2022-06-13 Mo 03:51, Michael Paquier wrote:
> On Sun, Jun 12, 2022 at 05:58:54PM -0400, Andrew Dunstan wrote:
>> On 2022-06-12 Su 10:14, Andrew Dunstan wrote:
>>> I tried in fb16d2c658 to avoid littering the mainline code with
>>> version-specific tests, and put that in the methods in the subclasses
>>> that override the mainline functions.
> Except that manipulating the diffs of pg_upgrade is not something that
> needs to be internal to the subclasses where we set up the nodes :)
>
>> I think I must have been insufficiently caffeinated when I wrote this,
>> because clearly I was not reading correctly.
>>
>> I have had another look and the patch looks fine, although I haven't
>> tested it.
> I have a question about the tests done in the buildfarm though.  Do
> the dumps from the older versions drop some of the objects that cause
> the diffs in the dumps?  To what extent is that a dump from
> installcheck?


See lines 324..347 (save stage) and 426..586 (upgrade stage) of


We save the cluster to be upgraded after all the installcheck stages
have run, so on crake here's the list of databases upgraded for HEAD:


postgres
template1
template0
contrib_regression_pgxml
contrib_regression_bool_plperl
contrib_regression_hstore_plperl
contrib_regression_jsonb_plperl
regression
contrib_regression_redis_fdw
contrib_regression_file_textarray_fdw
isolation_regression
pl_regression_plpgsql
contrib_regression_hstore_plpython3
pl_regression_plperl
pl_regression_plpython3
pl_regression_pltcl
contrib_regression_adminpack
contrib_regression_amcheck
contrib_regression_bloom
contrib_regression_btree_gin
contrib_regression_btree_gist
contrib_regression_citext
contrib_regression_cube
contrib_regression_dblink
contrib_regression_dict_int
contrib_regression_dict_xsyn
contrib_regression_earthdistance
contrib_regression_file_fdw
contrib_regression_fuzzystrmatch
contrib_regression_hstore
contrib_regression__int
contrib_regression_isn
contrib_regression_lo
contrib_regression_ltree
contrib_regression_pageinspect
contrib_regression_passwordcheck
contrib_regression_pg_surgery
contrib_regression_pg_trgm
contrib_regression_pgstattuple
contrib_regression_pg_visibility
contrib_regression_postgres_fdw
contrib_regression_seg
contrib_regression_tablefunc
contrib_regression_tsm_system_rows
contrib_regression_tsm_system_time
contrib_regression_unaccent
contrib_regression_pgcrypto
contrib_regression_uuid-ossp
contrib_regression_jsonb_plpython3
contrib_regression_ltree_plpython3
isolation_regression_summarization-and-inprogress-insertion
isolation_regression_delay_execution
contrib_regression_dummy_index_am
contrib_regression_dummy_seclabel
contrib_regression_plsample
contrib_regression_spgist_name_ops
contrib_regression_test_bloomfilter
contrib_regression_test_extensions
contrib_regression_test_ginpostinglist
contrib_regression_test_integerset
contrib_regression_test_parser
contrib_regression_test_pg_dump
contrib_regression_test_predtest
contrib_regression_test_rbtree
contrib_regression_test_regex
contrib_regression_test_rls_hooks
contrib_regression_test_shm_mq
contrib_regression_rolenames


cheers


andrew





--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

2022-06-13 Thread Nitin Jadhav
> > Have you measured the performance effects of this? On fast storage with 
> > large
> > shared_buffers I've seen these loops in profiles. It's probably fine, but 
> > it'd
> > be good to verify that.
>
> I am wondering if we could make the function inlined at some point.
> We could also play it safe and only update the counters every N loops
> instead.

The idea looks good, but based on the performance numbers shared above
it does not affect performance, so we can use the current approach,
which gives more accurate progress.
---

> > This view is depressingly complicated. Added up the view definitions for
> > the already existing pg_stat_progress* views add up to a measurable part of
> > the size of an empty database:
>
> Yeah.  I think that what's proposed could be simplified, and we had
> better remove the fields that are not that useful.  First, do we have
> any need for next_flags?

"next_flags" is removed in the v6 patch. Added a "new_requests" field
to get to know whether the current checkpoint is being throttled or
not. Please share your views on this.
---

> Second, is the start LSN really necessary
> for monitoring purposes?

IMO, the start LSN is necessary for debugging when a checkpoint takes
too long.
---

> Not all the information in the first
> parameter is useful, as well.  For example "shutdown" will never be
> seen as it is not possible to use a session at this stage, no?

I understand that "shutdown" and "end-of-recovery" will never be seen,
and I have removed them in the v6 patch.
---

> There
> is also no gain in having "immediate", "flush-all", "force" and "wait"
> (for this one if the checkpoint is requested the session doing the
> work knows this information already).

"immediate" is required to understand whether the current checkpoint
is throttled or not. I am not sure about other flags "flush-all",
"force" and "wait". I have just supported all the flags to match the
'checkpoint start' log message. Please share your views. If it is not
really required, I will remove it in the next patch.
---

> A last thing is that we may gain in visibility by having more
> attributes as an effect of splitting param2.  On thing that would make
> sense is to track the reason why the checkpoint was triggered
> separately (aka wal and time).  Should we use a text[] instead to list
> all the parameters instead?  Using a space-separated list of items is
> not intuitive IMO, and callers of this routine will likely parse
> that.

If I understand the above comment correctly, you are suggesting
introducing a new field, say "reason" (possible values: wal or time),
while the "flags" field continues to represent the other flags like
"immediate", etc. The idea looks good. We can introduce the new
"reason" field, and the "flags" field can be renamed to "throttled"
(true/false) if we decide not to support the other flags
"flush-all", "force" and "wait".
---

> +  WHEN 3 THEN 'checkpointing replication slots'
> +  WHEN 4 THEN 'checkpointing logical replication 
> snapshot files'
> +  WHEN 5 THEN 'checkpointing logical rewrite mapping 
> files'
> +  WHEN 6 THEN 'checkpointing replication origin'
> +  WHEN 7 THEN 'checkpointing commit log pages'
> +  WHEN 8 THEN 'checkpointing commit time stamp pages'
> There is a lot of "checkpointing" here.  All those terms could be
> shorter without losing their meaning.

I will try to shorten them in the next patch.
---

Please share your thoughts.

Thanks & Regards,
Nitin Jadhav

On Tue, Apr 5, 2022 at 3:15 PM Michael Paquier  wrote:
>
> On Fri, Mar 18, 2022 at 05:15:56PM -0700, Andres Freund wrote:
> > Have you measured the performance effects of this? On fast storage with 
> > large
> > shared_buffers I've seen these loops in profiles. It's probably fine, but 
> > it'd
> > be good to verify that.
>
> I am wondering if we could make the function inlined at some point.
> We could also play it safe and only update the counters every N loops
> instead.
>
> > This view is depressingly complicated. Added up the view definitions for
> > the already existing pg_stat_progress* views add up to a measurable part of
> > the size of an empty database:
>
> Yeah.  I think that what's proposed could be simplified, and we had
> better remove the fields that are not that useful.  First, do we have
> any need for next_flags?  Second, is the start LSN really necessary
> for monitoring purposes?  Not all the information in the first
> parameter is useful, as well.  For example "shutdown" will never be
> seen as it is not possible to use a session at this stage, no?  There
> is also no gain in having "immediate", "flush-all", "force" and "wait"
> (for this one if the checkpoint is requested the session doing the
> work knows this information already).
>
> A last thing is that we may gain in visibility by having more
> attributes as an effect of splitting param2.  On thing that 

Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

2022-06-13 Thread Nitin Jadhav
> Have you measured the performance effects of this? On fast storage with large
> shared_buffers I've seen these loops in profiles. It's probably fine, but it'd
> be good to verify that.

To understand the performance effects of the above, I have taken the
average of five checkpoints with the patch and without the patch in my
environment. Here are the results.
With patch: 269.65 s
Without patch: 269.60 s

It looks fine. Please share your views.

> This view is depressingly complicated. Added up the view definitions for
> the already existing pg_stat_progress* views add up to a measurable part of
> the size of an empty database:

Thank you so much for sharing the detailed analysis. We can remove a
few fields which are not so important to make it simple.

Thanks & Regards,
Nitin Jadhav

On Sat, Mar 19, 2022 at 5:45 AM Andres Freund  wrote:
>
> Hi,
>
> This is a long thread, sorry for asking if this has been asked before.
>
> On 2022-03-08 20:25:28 +0530, Nitin Jadhav wrote:
> >* Sort buffers that need to be written to reduce the likelihood of 
> > random
> > @@ -2129,6 +2132,8 @@ BufferSync(int flags)
> >   bufHdr = GetBufferDescriptor(buf_id);
> >
> >   num_processed++;
> > + 
> > pgstat_progress_update_param(PROGRESS_CHECKPOINT_BUFFERS_PROCESSED,
> > +  
> > num_processed);
> >
> >   /*
> >* We don't need to acquire the lock here, because we're only 
> > looking
> > @@ -2149,6 +2154,8 @@ BufferSync(int flags)
> >   TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
> >   
> > PendingCheckpointerStats.m_buf_written_checkpoints++;
> >   num_written++;
> > + 
> > pgstat_progress_update_param(PROGRESS_CHECKPOINT_BUFFERS_WRITTEN,
> > +   
> >num_written);
> >   }
> >   }
>
> Have you measured the performance effects of this? On fast storage with large
> shared_buffers I've seen these loops in profiles. It's probably fine, but it'd
> be good to verify that.
>
>
> > @@ -1897,6 +1897,112 @@ pg_stat_progress_basebackup| SELECT s.pid,
> >  s.param4 AS tablespaces_total,
> >  s.param5 AS tablespaces_streamed
> > FROM pg_stat_get_progress_info('BASEBACKUP'::text) s(pid, datid, relid, 
> > param1, param2, param3, param4, param5, param6, param7, param8, param9, 
> > param10, param11, param12, param13, param14, param15, param16, param17, 
> > param18, param19, param20);
> > +pg_stat_progress_checkpoint| SELECT s.pid,
> > +CASE s.param1
> > +WHEN 1 THEN 'checkpoint'::text
> > +WHEN 2 THEN 'restartpoint'::text
> > +ELSE NULL::text
> > +END AS type,
> > +(((
> > +CASE
> > +WHEN ((s.param2 & (1)::bigint) > 0) THEN 'shutdown '::text
> > +ELSE ''::text
> > +END ||
> > +CASE
> > +WHEN ((s.param2 & (2)::bigint) > 0) THEN 'end-of-recovery 
> > '::text
> > +ELSE ''::text
> > +END) ||
> > +CASE
> > +WHEN ((s.param2 & (4)::bigint) > 0) THEN 'immediate '::text
> > +ELSE ''::text
> > +END) ||
> > +CASE
> > +WHEN ((s.param2 & (8)::bigint) > 0) THEN 'force '::text
> > +ELSE ''::text
> > +END) ||
> > +CASE
> > +WHEN ((s.param2 & (16)::bigint) > 0) THEN 'flush-all '::text
> > +ELSE ''::text
> > +END) ||
> > +CASE
> > +WHEN ((s.param2 & (32)::bigint) > 0) THEN 'wait '::text
> > +ELSE ''::text
> > +END) ||
> > +CASE
> > +WHEN ((s.param2 & (128)::bigint) > 0) THEN 'wal '::text
> > +ELSE ''::text
> > +END) ||
> > +CASE
> > +WHEN ((s.param2 & (256)::bigint) > 0) THEN 'time '::text
> > +ELSE ''::text
> > +END) AS flags,
> > +(((
> > +CASE
> > +WHEN ((s.param3 & (1)::bigint) > 0) THEN 'shutdown '::text
> > +ELSE ''::text
> > +END ||
> > +CASE
> > +WHEN ((s.param3 & (2)::bigint) > 0) THEN 'end-of-recovery 
> > '::text
> > +ELSE ''::text
> > +END) ||
> > +CASE
> > +WHEN ((s.param3 & (4)::bigint) > 0) THEN 'immediate '::text
> > +ELSE ''::text
> > +END) ||
> > +CASE
> > +WHEN ((s.param3 & (8)::bigint) > 0) THEN 'force '::text
> > +ELSE ''::text
> > +END) ||
> > +CASE
> > +WHEN ((s.param3 & (16)::bigint) > 0) THEN 'flush-all '::text
> > +ELSE ''::text
> > +END) ||
> > +CASE
> > +WHEN ((s.param3 & (32)::bigint) > 0) THEN 'wait '::text
> > +ELSE ''::text
> > +END) ||

Re: better page-level checksums

2022-06-13 Thread Aleksander Alekseev
Hi hackers,

> > Can't we add some extra fork that stores this extra per-page
> > information, and contains this extra metadata
> >
> +1 for this approach. I had observed some painful corruption cases where
> block storage simply returned a stale version of a range of blocks. This
> went undetected only because the checksum is stored on the page itself.

That's very interesting, Andrey. Thanks for sharing.

> One of my questions is what algorithm(s) we'd want to support.

Should it necessarily be a fixed list? Why not support plugable algorithms?

An extension implementing a checksum algorithm is going to need:

- several hooks: check_page_after_reading, calc_checksum_before_writing
- register_checksum()/deregister_checksum()
- an API to save the checksums to a separate fork

By knowing the block number and the hash size the extension knows
exactly where to look for the checksum in the fork.
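
To make that shape a bit more concrete, here is a minimal sketch of what such
an extension-facing API could look like.  None of these types, hooks or
functions exist today; the names (CustomChecksumRoutine, register_checksum()
and so on) are purely hypothetical and only illustrate the idea:

#include "postgres.h"
#include "storage/block.h"
#include "storage/relfilenode.h"

typedef struct CustomChecksumRoutine
{
	uint16		checksum_size;	/* bytes stored per page in the checksum fork */

	/* recompute the checksum just before the page is written out */
	void		(*calc_checksum_before_writing) (RelFileNode rnode,
												 BlockNumber blkno,
												 const char *page,
												 char *checksum_out);

	/* verify the page against the stored checksum right after reading it */
	bool		(*check_page_after_reading) (RelFileNode rnode,
											 BlockNumber blkno,
											 const char *page,
											 const char *stored_checksum);
} CustomChecksumRoutine;

extern void register_checksum(const CustomChecksumRoutine *routine);
extern void deregister_checksum(void);

With a layout like that, the checksum for block N would simply live at offset
N * checksum_size within the checksum fork.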

-- 
Best regards,
Aleksander Alekseev




Re: Replica Identity check of partition table on subscriber

2022-06-13 Thread Amit Kapila
On Mon, Jun 13, 2022 at 1:03 PM houzj.f...@fujitsu.com
 wrote:
>
> On Monday, June 13, 2022 1:53 PM Amit Kapila  wrote:
>
> I have separated out the bug-fix for the subscriber side,
> fixed the typo and the function name,
> and attached the new version of the patch set.
>

The first patch looks good to me. I have slightly modified one of the
comments and the commit message. I think we need to backpatch this
through 13, where we introduced support for replicating into partitioned
tables (commit f1ac27bf). If you guys are fine, I'll push this once
the work for PG14.4 is done.

-- 
With Regards,
Amit Kapila.


v5-0001-Fix-cache-look-up-failures-while-applying-changes.patch
Description: Binary data


Re: pltcl crash on recent macOS

2022-06-13 Thread Thomas Munro
On Mon, Jun 13, 2022 at 6:53 PM Peter Eisentraut
 wrote:
>  frame #1: 0x7ff803a28751 libsystem_c.dylib`hash_search + 215
>  frame #2: 0x000110357700
> pltcl.so`compile_pltcl_function(fn_oid=16418, tgreloid=0,

Hmm, I can’t reproduce that…  Although that symbol is present in my
libSystem.B.dylib according to dlsym() and is callable from a simple
program not linked to anything else, pltcl.so is apparently reaching
postgres’s hash_search for me, based on the fact that make -C
src/pl/tcl check succeeds and nm -m on pltcl.so shows it as "from
executable".  It would be interesting to see what nm -m shows for you.

Archeological note: That hash_search stuff, header , seems
to have been copied from ancient FreeBSD before it was dropped
upstream for the crime of polluting the global symbol namespace with
junk[1].  It's been languishing in Apple's libc for at least 19
years[2], though, so I'm not sure why it's showing up suddenly as a
problem for you now.

> Note, I'm using the tcl-tk package from Homebrew.  The tcl installation
> provided by macOS itself no longer appears to work for linking against.

I’m using tcl 8.6.12 installed by MacPorts on macOS 12.4, though, hmm,
SDK 12.3.  I see the explicit -lc when building pltcl.so, and I see
that libSystem.B.dylib is explicitly mentioned here, whether or not I
have -lc:

% otool -L ./tmp_install/Users/tmunro/install/lib/postgresql/pltcl.so
./tmp_install/Users/tmunro/install/lib/postgresql/pltcl.so:
/opt/local/lib/libtcl8.6.dylib (compatibility version 8.6.0, current
version 8.6.12)
/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current
version 1311.100.3)

Here’s the complete link line:

ccache cc -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Werror=vla
-Werror=unguarded-availability-new -Wendif-labels
-Wmissing-format-attribute -Wcast-function-type -Wformat-security
-fno-strict-aliasing -fwrapv -Wno-unused-command-line-argument
-Wno-compound-token-split-by-macro -g -O0  -bundle -multiply_defined
suppress -o pltcl.so  pltcl.o -L../../../src/port
-L../../../src/common  -isysroot
/Library/Developer/CommandLineTools/SDKs/MacOSX12.3.sdk
-Wl,-dead_strip_dylibs   -L/opt/local/lib -ltcl8.6 -lz -lpthread
-framework CoreFoundation  -lc -bundle_loader
../../../src/backend/postgres

[1] 
https://github.com/freebsd/freebsd-src/commit/dc196afb2e58dd05cd66e2da44872bb3d619910f
[2] 
https://github.com/apple-open-source-mirror/Libc/blame/master/stdlib/FreeBSD/strhash.c




Re: Tracking notnull attributes inside Var

2022-06-13 Thread Andy Fan
I have thought about the strategy below over the past few days, and I think it
is better because it uses fewer cycles to get the same answer.  IIUC, the
related structs should be created during/after deconstruct_jointree rather
than at the join_search_xx stage.


> The schemes I've been toying with tend to look more like putting a
> PlaceHolderVar-ish wrapper around the Var or expression that represents
> the below-the-join value.  The wrapper node could carry any join ID
> info that we find necessary.


Just to confirm my understanding: the wrapper node should answer
questions like this one.

/*
 * rel_is_nullable_side
 *
 * For the given join ID joinrelids, return if the relid is in the nullable
 * side.
 */
static bool
rel_is_nullable_side(PlannerInfo *root, Relids joinrelids, Index relid)
{
Assert(bms_is_member(relid, joinrelids));
...
}


> The thing that I'm kind of stalled on is
> how to define this struct so that it's not a big headache for join
> strength reduction (which could remove the need for a wrapper altogether)

> or outer-join reordering (which makes it a bit harder to define which
> join we think is the one nulling the value).
>

Thinking about the outer-join reordering case: can we just rely on
SpecialJoinInfo.min_lefthand & min_righthand to get the answer?
The attached patch is based on that, and I did some tests in the patch
as well; the answer looks correct.
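
Spelled out in code, the check I have in mind looks roughly like this (my own
sketch of the idea, assuming min_lefthand/min_righthand are enough; the
attached patch may differ in details):

static bool
rel_is_nullable_side(PlannerInfo *root, Relids joinrelids, Index relid)
{
	ListCell   *lc;

	Assert(bms_is_member(relid, joinrelids));

	foreach(lc, root->join_info_list)
	{
		SpecialJoinInfo *sjinfo = lfirst_node(SpecialJoinInfo, lc);

		/* skip outer joins that are not contained in this join relation */
		if (!bms_is_subset(sjinfo->min_lefthand, joinrelids) ||
			!bms_is_subset(sjinfo->min_righthand, joinrelids))
			continue;

		/* the right-hand side of a LEFT JOIN can be nulled ... */
		if (sjinfo->jointype == JOIN_LEFT &&
			bms_is_member(relid, sjinfo->min_righthand))
			return true;

		/* ... and either side of a FULL JOIN */
		if (sjinfo->jointype == JOIN_FULL &&
			(bms_is_member(relid, sjinfo->min_lefthand) ||
			 bms_is_member(relid, sjinfo->min_righthand)))
			return true;
	}

	return false;
}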

What's more, if the above is correct and the number of calls to
rel_is_nullable_side is small, do we still need to think about a more
efficient data structure?

Thanks!

-- 
Best Regards
Andy Fan


not_null_based_on_special_joininfo.diff
Description: Binary data


Re: Using PQexecQuery in pipeline mode produces unexpected Close messages

2022-06-13 Thread Álvaro Herrera
On Fri, Jun 10, 2022, at 8:25 AM, Kyotaro Horiguchi wrote:
> 
> The current implementation of PQsendQueryInternal looks like the result of
> a misunderstanding of the doc.  In the regression tests, that path is
> exercised only for an error case, where no CloseComplete comes.
> 
> The attached adds a test for the normal-path of pipelined
> PQsendQuery() to simple_pipeline test then modifies that function not
> to send Close message. Without the fix, the test fails by "unexpected
> notice" even if the trace matches the "expected" content.

Hah, the patch I wrote is almost identical to yours, down to the notice 
processor counting the number of notices received.  The only difference is that 
I put my test in pipeline_abort.

Sadly, it looks like I won't be able to get this patch pushed for 14.4.



RE: Multi-Master Logical Replication

2022-06-13 Thread r.takahash...@fujitsu.com
Hi,


In addition to the use cases mentioned above, some users want to use n-way
replication for a partial database.

The following is the typical use case.

* There are several data centers.
  (ex. Japan and India)
* The database in each data center has its unique data.
  (ex. the database in Japan has the data related to Japan)
* There are some common data.
  (ex. the shipment data from Japan to India should be stored on both database)
* To replicate common data, users want to use n-way replication.


The current POC patch seems to support only n-way replication of an entire
database, but I think we should also support n-way replication of a partial
database to achieve the above use case.


> I don't know if it requires the kind of code you are thinking but I
> agree that it is worth considering implementing it as an extension.

I think the other advantage of implementing this as an extension is that users
could install the extension on older Postgres versions.

As mentioned in a previous email, one use case of n-way replication is migration
from an older Postgres version to a newer one.

If we implement it as an extension, users could use n-way replication for
migration from PG10 to PG16.


Regards,
Ryohei Takahashi


RE: Support logical replication of DDLs

2022-06-13 Thread houzj.f...@fujitsu.com
On Sunday, June 12, 2022 2:46 PM Zheng Li  wrote:
> 
> > > > > > I've not looked at these patches in-depth yet but with this
> > > > > > approach, what do you think we can handle the DDL syntax
> > > > > > differences between major versions? DDL syntax or behavior
> > > > > > could be changed by future changes and I think we need to
> > > > > > somehow deal with the differences. For
> > > > >
> > > > > > example, if the user uses logical replication for major
> > > > > > version upgrade, the publisher is older than the subscriber.
> > > > > > We might have to rewrite the DDL before applying to the
> > > > > > subscriber because the DDL executed on the publisher no longer
> > > > > > work on a new PostgreSQL version
> > > > >
> > > > > I don't think we will allow this kind of situation to happen in
> > > > > the first place for backward compatibility. If a DDL no longer
> > > > > works on a new version of PostgreSQL, the user will have to
> > > > > change the application code as well.
> > > > > So even if it happens for
> > > > > whatever reason, we could either 1. fail the apply worker and
> > > > > let the user fix such DDL because they'll have to fix the
> > > > > application code anyway when this happens.
> > > > > 2. add guard rail logic in the apply worker to automatically fix
> > > > > such DDL if possible, knowing the version of the source and
> > > > > target. Similar logic must have been implemented for
> pg_dump/restore/upgrade.
> > > > >
> > > > > > or we might have to add some options to the DDL before the
> > > > > > application in order to keep the same behavior. This seems to
> > > > > > require a different solution from what the patch does for the
> > > > > > problem you mentioned such
> > > > >
> > > > > > as "DDL involving multiple tables where only some tables are
> > > > > > replicated”.
> > > > >
> > > > > First of all, this case can only happen when the customer
> > > > > chooses to only replicate a subset of the tables in a database
> > > > > in which case table level DDL replication is chosen instead of
> > > > > database level DDL replication (where all tables and DDLs are
> > > > > replicated). I think the solution would be:
> > > > > 1. make best effort to detect such DDLs on the publisher and
> > > > > avoid logging of such DDLs in table level DDL replication.
> > > > > 2. apply worker will fail to replay such command due to missing
> > > > > objects if such DDLs didn't get filtered on the publisher for
> > > > > some reason. This should be rare and I think it's OK even if it
> > > > > happens, we'll find out why and fix it.
> > > > >
> > > >
> > > > FWIW, both these cases could be handled with the deparsing
> > > > approach, and the handling related to the drop of multiple tables
> > > > where only a few are published is already done in the last POC
> > > > patch shared by Ajin [1].
> > > >
> > >
> > > Right. So I'm inclined to think that deparsing approach is better
> > > from this point as well as the point mentioned by Álvaro before[1].
> >
> > I agree. One more point about deparsing approach is that it can also
> > help to replicate CREATE TABLE AS/SELECT INTO in a better way.
> >
> > The main idea of replicating the CREATE TABLE AS is that we deparse
> > the CREATE TABLE AS into a simple CREATE TABLE (without the subquery)
> > command and WAL log it after creating the table and before writing
> > data into the table, and replicate the incoming writes later as normal
> > INSERTs. In this approach, we don't execute the subquery on the
> > subscriber, so we don't need to make sure all the objects referenced in
> > the subquery also exist on the subscriber. This approach works for all
> > kinds of commands (e.g. CREATE TABLE AS [SELECT][EXECUTE][VALUES]).
> >
> > One problem with this approach is that we cannot use the current event
> > triggers to deparse or WAL log the CREATE TABLE, because none of the
> > event triggers is fired after creating the table and before inserting
> > the data. To solve this, one idea is that we could directly add some
> > code at the end of create_ctas_internal() to deparse and WAL log it.
> > Moreover, we could even introduce a new type of event
> > trigger (table_create) which would be fired at the expected timing so
> > that we can use the trigger function to deparse and WAL log. I am not
> > sure which way is better. I temporarily use the second idea, which
> > introduces a new type of event trigger, in the 0003 POC patch.
> 
> Hi, I agree that an intermediate structured format (with all objects schema
> qualified) makes it easier to handle syntax differences between the publisher
> and the subscriber. Such a structured format is also likely easier for other
> logical replication consumers to use.
> However, to make things more maintainable, would it be better to use the
> existing serialization/deserialization functions in out/readfuncs.c to
> generate the parsetree representation of the DDL command?
> 
> It turns out support for DDL commands is mostly missing in 

Re: Replica Identity check of partition table on subscriber

2022-06-13 Thread Amit Kapila
On Mon, Jun 13, 2022 at 2:20 PM Amit Langote  wrote:
>
> On Sat, Jun 11, 2022 at 10:36 AM Amit Kapila  wrote:
> > On Fri, Jun 10, 2022 at 2:26 PM Amit Langote  
> > wrote:
> > >
> > > +logicalrep_partmap_invalidate
> > >
> > > I wonder why not call this logicalrep_partmap_update() to go with
> > > logicalrep_relmap_update()?  It seems confusing to have
> > > logicalrep_partmap_invalidate() right next to
> > > logicalrep_partmap_invalidate_cb().
> > >
> >
> > I am thinking about why we need to update the relmap in this new
> > function logicalrep_partmap_invalidate()? I think it may be better to
> > do it in logicalrep_partition_open() when actually required,
> > otherwise, we end up doing a lot of work that may not be of use unless
> > the corresponding partition is accessed. Also, it seems awkward to me
> > that we do the same thing in this new function
> > logicalrep_partmap_invalidate() and then also in
> > logicalrep_partition_open() under different conditions.
>
> Both logicalrep_rel_open() and logicalrep_partition_open() only ever
> touch the local Relation, never the LogicalRepRelation.
>

We do make the copy of the remote rel in logicalrep_partition_open() when
the entry is not found. I feel the same should happen when the remote
relation is reset/invalidated by the RELATION message.

>  Updating the
> latter is the responsibility of logicalrep_relmap_update(), which is
> there to support handling of the RELATION message by
> apply_handle_relation().  Given that we make a separate copy of the
> parent's LogicalRepRelMapEntry for each partition to put into the
> corresponding LogicalRepPartMapEntry, those copies must be updated as
> well when a RELATION message targeting the parent's entry arrives.  So
> it seems fine that the patch is making it the
> logicalrep_relmap_update()'s responsibility to update the partition
> copies using the new logicalrep_partition_invalidate/update()
> subroutine.
>

I think we can do it that way as well, but do you see any benefit in it?
The way I am suggesting would avoid the effort of updating the remote
rel copy until we try to access that particular partition.

-- 
With Regards,
Amit Kapila.




Re: Replica Identity check of partition table on subscriber

2022-06-13 Thread Amit Langote
On Sat, Jun 11, 2022 at 10:36 AM Amit Kapila  wrote:
> On Fri, Jun 10, 2022 at 2:26 PM Amit Langote  wrote:
> >
> > +logicalrep_partmap_invalidate
> >
> > I wonder why not call this logicalrep_partmap_update() to go with
> > logicalrep_relmap_update()?  It seems confusing to have
> > logicalrep_partmap_invalidate() right next to
> > logicalrep_partmap_invalidate_cb().
> >
>
> I am thinking about why we need to update the relmap in this new
> function logicalrep_partmap_invalidate()? I think it may be better to
> do it in logicalrep_partition_open() when actually required,
> otherwise, we end up doing a lot of work that may not be of use unless
> the corresponding partition is accessed. Also, it seems awkward to me
> that we do the same thing in this new function
> logicalrep_partmap_invalidate() and then also in
> logicalrep_partition_open() under different conditions.

Both logicalrep_rel_open() and logicalrep_partition_open() only ever
touch the local Relation, never the LogicalRepRelation.  Updating the
latter is the responsibility of logicalrep_relmap_update(), which is
there to support handling of the RELATION message by
apply_handle_relation().  Given that we make a separate copy of the
parent's LogicalRepRelMapEntry for each partition to put into the
corresponding LogicalRepPartMapEntry, those copies must be updated as
well when a RELATION message targeting the parent's entry arrives.  So
it seems fine that the patch is making it the
logicalrep_relmap_update()'s responsibility to update the partition
copies using the new logicalrep_partition_invalidate/update()
subroutine.
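
For what it's worth, the shape I have in mind for that subroutine is roughly
the following hand-written sketch (not code lifted from the patch; the helper
that refills the entry from the new remote rel is hypothetical):

static void
logicalrep_partmap_reset_relmap(LogicalRepRelation *remoterel)
{
	HASH_SEQ_STATUS status;
	LogicalRepPartMapEntry *part_entry;

	if (LogicalRepPartMap == NULL)
		return;

	hash_seq_init(&status, LogicalRepPartMap);
	while ((part_entry = (LogicalRepPartMapEntry *) hash_seq_search(&status)) != NULL)
	{
		LogicalRepRelMapEntry *entry = &part_entry->relmapentry;

		/* only touch copies made from this remote relation */
		if (entry->remoterel.remoteid != remoterel->remoteid)
			continue;

		/* throw away the stale copy and refill it from the new message */
		logicalrep_relmap_free_entry(entry);
		logicalrep_relmap_fill_entry(entry, remoterel);		/* hypothetical helper */
	}
}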

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com




Any sense to get rid of known_assigned_xids_lck?

2022-06-13 Thread Michail Nikolaev
Hello, hackers.

While working on (1), I noticed that in commit
2871b4618af1acc85665eec0912c48f8341504c4 (2) from 2010 Simon Riggs was
already thinking about using memory barriers for KnownAssignedXids
access instead of spinlocks.

> We could dispense with the spinlock if we were to
> create suitable memory access barrier primitives and use those instead.

KnownAssignedXids is an array of xids with head/tail pointers. The array is
changed only by the startup process, but access to the head pointer is
protected by a spinlock to guarantee that new data in the array becomes
visible to other CPUs before the new head value does.

> To add XIDs to the array, we just insert
> them into slots to the right of the head pointer and then advance the head
> pointer.  This wouldn't require any lock at all, except that on machines
> with weak memory ordering we need to be careful that other processors
> see the array element changes before they see the head pointer change.
> We handle this by using a spinlock to protect reads and writes of the
> head/tail pointers.

Now that we have memory barriers, attached is a WIP patch to get rid of
`known_assigned_xids_lck`. The idea is pretty simple: issue
pg_write_barrier after updating the array, but before updating the head.
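
In code, the ordering the patch relies on looks roughly like this (illustrative
only; the variable names are hand-waved rather than copied from procarray.c):

	/* startup process: publish new xids without taking the spinlock */
	KnownAssignedXids[head] = xid;		/* fill the slot(s) first */
	pg_write_barrier();					/* slots become visible before the head */
	pArray->headKnownAssignedXids = head + 1;

	/* any backend: load the head before scanning the slots it covers */
	head = pArray->headKnownAssignedXids;
	pg_read_barrier();					/* pairs with the write barrier above */
	/* now KnownAssignedXids[tail .. head - 1] can be read safely */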

The first potential positive effect I could see is the locking in
(TransactionIdIsInProgress -> KnownAssignedXidsSearch), but that does not
seem to be on the standby hot path.

The second one is the locking in KnownAssignedXidsGetAndSetXmin (snapshot
building), but I was unable to measure any impact; it wasn’t visible
separately in the test from (3).

Maybe someone knows of a scenario that causes known_assigned_xids_lck or
TransactionIdIsInProgress to become a bottleneck on a standby?

Best regards,
Michail.

[1]: 
https://www.postgresql.org/message-id/flat/CANtu0ohzBFTYwdLtcanWo4%2B794WWUi7LY2rnbHyorJdE8_ZnGg%40mail.gmail.com#379c1be7b8134ada5a574078d51b64c6

[2]: 
https://github.com/postgres/postgres/commit/2871b4618af1acc85665eec0912c48f8341504c4#diff-8879f0173be303070ab7931db7c757c96796d84402640b9e386a4150ed97b179R2409

[3]: 
https://www.postgresql.org/message-id/flat/CANtu0ohzBFTYwdLtcanWo4%2B794WWUi7LY2rnbHyorJdE8_ZnGg%40mail.gmail.com#379c1be7b8134ada5a574078d51b64c6


0001-memory-barrier-instead-of-spinlock.patch
Description: Binary data


Re: Improve TAP tests of pg_upgrade for cross-version tests

2022-06-13 Thread Michael Paquier
On Sun, Jun 12, 2022 at 05:58:54PM -0400, Andrew Dunstan wrote:
> On 2022-06-12 Su 10:14, Andrew Dunstan wrote:
>> I tried in fb16d2c658 to avoid littering the mainline code with
>> version-specific tests, and put that in the methods in the subclasses
>> that override the mainline functions.

Except that manipulating the diffs of pg_upgrade is not something that
needs to be internal to the subclasses where we set up the nodes :)

> I think I must have been insufficiently caffeinated when I wrote this,
> because clearly I was not reading correctly.
> 
> I have had another look and the patch looks fine, although I haven't
> tested it.

I have a question about the tests done in the buildfarm though.  Do
the dumps from the older versions drop some of the objects that cause
the diffs in the dumps?  To what extent is that a dump from
installcheck?
--
Michael


signature.asc
Description: PGP signature


Re: Add header support to text format and matching feature

2022-06-13 Thread Michael Paquier
On Mon, Jun 13, 2022 at 10:32:13AM +0800, Julien Rouhaud wrote:
> On Sun, Jun 12, 2022 at 09:36:13AM -0400, Andrew Dunstan wrote:
> I'm fine with it.  I added such a check and mentioned it in the documentation.

An error looks like the right call at this stage of the game.  I am
not sure what the combination of MATCH with COPY TO would mean,
actually.  And with the concept of SELECT queries on top of it, the
whole idea gets blurrier.

> I think it would still be problematic if the target table has dropped columns.
> Fortunately, as I initially thought the problem is only due to a thinko in the
> original commit which used a wrong variable for the raw_fields offset.  Once
> fixed (attached v1) I didn't see any other problem in the rest of the logic 
> and
> all the added regression tests work as expected.

Interesting catch.  One thing that I've always found useful when it
comes to tests that stress dropped columns is to have tests where we
reduce the number of total columns that still exist.  An extra thing
is to look after pg.dropped.N a bit, and I would put
something in one of the headers.

>   if (pg_strcasecmp(sval, "match") == 0)
> + {
> + /* match is only valid for COPY FROM */
> + if (!is_from)
> + ereport(ERROR,
> + (errcode(ERRCODE_SYNTAX_ERROR),
> +  errmsg("%s match is only valid for COPY FROM",
> + def->defname)));

Some nits.  I would suggest to reword this error message, like "cannot
use \"match\" with HEADER in COPY TO".  There is no need for the extra
comment, and the error code should be ERRCODE_FEATURE_NOT_SUPPORTED.
--
Michael
From af4e06ca5f0a13ad5922abae446bd6716bbdf3c1 Mon Sep 17 00:00:00 2001
From: Julien Rouhaud 
Date: Mon, 13 Jun 2022 09:49:23 +0800
Subject: [PATCH v2] Fix processing of header match option for COPY.

Thinko in 072132f04e, which used the attnum offset to access the raw_fields
array, leading to incorrect results or a crash.  Use the correct variable, and
add some regression tests to cover a few more scenarios for the HEADER MATCH
option.

While at it, disallow HEADER MATCH in COPY TO as there is no validation that
can be done in that case.

Author: Julien Rouhaud
Discussion: https://postgr.es/m/20220607154744.vvmitnqhyxrne5ms@jrouhaud
---
 src/backend/commands/copy.c  | 11 +--
 src/backend/commands/copyfromparse.c |  5 ++--
 src/test/regress/expected/copy.out   | 43 ++-
 src/test/regress/sql/copy.sql| 44 ++--
 doc/src/sgml/ref/copy.sgml   |  2 ++
 5 files changed, 97 insertions(+), 8 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index f448d39c7e..e2870e3c11 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -318,7 +318,7 @@ DoCopy(ParseState *pstate, const CopyStmt *stmt,
  * defGetBoolean() but also accepts the special value "match".
  */
 static CopyHeaderChoice
-defGetCopyHeaderChoice(DefElem *def)
+defGetCopyHeaderChoice(DefElem *def, bool is_from)
 {
 	/*
 	 * If no parameter given, assume "true" is meant.
@@ -360,7 +360,14 @@ defGetCopyHeaderChoice(DefElem *def)
 if (pg_strcasecmp(sval, "off") == 0)
 	return COPY_HEADER_FALSE;
 if (pg_strcasecmp(sval, "match") == 0)
+{
+	if (!is_from)
+		ereport(ERROR,
+(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot use \"%s\" with HEADER in COPY TO",
+		sval)));
 	return COPY_HEADER_MATCH;
+}
 			}
 			break;
 	}
@@ -452,7 +459,7 @@ ProcessCopyOptions(ParseState *pstate,
 			if (header_specified)
 errorConflictingDefElem(defel, pstate);
 			header_specified = true;
-			opts_out->header_line = defGetCopyHeaderChoice(defel);
+			opts_out->header_line = defGetCopyHeaderChoice(defel, is_from);
 		}
 		else if (strcmp(defel->defname, "quote") == 0)
 		{
diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c
index e06534943f..57813b3458 100644
--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
@@ -789,11 +789,12 @@ NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields)
 			foreach(cur, cstate->attnumlist)
 			{
 int			attnum = lfirst_int(cur);
-char	   *colName = cstate->raw_fields[attnum - 1];
+char	   *colName;
 Form_pg_attribute attr = TupleDescAttr(tupDesc, attnum - 1);
 
-fldnum++;
+Assert(fldnum < cstate->max_fields);
 
+colName = cstate->raw_fields[fldnum++];
 if (colName == NULL)
 	ereport(ERROR,
 			(errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
diff --git a/src/test/regress/expected/copy.out b/src/test/regress/expected/copy.out
index e8d6b4fc13..7f2f4ae4ae 100644
--- a/src/test/regress/expected/copy.out
+++ b/src/test/regress/expected/copy.out
@@ -182,9 +182,21 @@ create table header_copytest (
 	b int,
 	c text
 );
+-- Make sure it works with with 

Re: Add header support to text format and matching feature

2022-06-13 Thread Peter Eisentraut

On 13.06.22 04:32, Julien Rouhaud wrote:

I think it makes more sense to have a sanity check to prevent HEADER
MATCH with COPY TO.


I'm fine with it.  I added such a check and mentioned it in the documentation.



I think it would still be problematic if the target table has dropped columns.
Fortunately, as I initially thought the problem is only due to a thinko in the
original commit which used a wrong variable for the raw_fields offset.  Once
fixed (attached v1) I didn't see any other problem in the rest of the logic and
all the added regression tests work as expected.


Thanks for this patch.  I'll check it in detail in a bit.  It looks good 
to me at first glance.





RE: Replica Identity check of partition table on subscriber

2022-06-13 Thread houzj.f...@fujitsu.com
On Monday, June 13, 2022 1:53 PM Amit Kapila  wrote:
> 
> On Sat, Jun 11, 2022 at 2:36 PM houzj.f...@fujitsu.com
>  wrote:
> >
> > On Saturday, June 11, 2022 9:36 AM Amit Kapila 
> wrote:
> > >
> > > On Fri, Jun 10, 2022 at 2:26 PM Amit Langote
> > > 
> > > wrote:
> > > >
> > > > +logicalrep_partmap_invalidate
> > > >
> > > > I wonder why not call this logicalrep_partmap_update() to go with
> > > > logicalrep_relmap_update()?  It seems confusing to have
> > > > logicalrep_partmap_invalidate() right next to
> > > > logicalrep_partmap_invalidate_cb().
> > > >
> > >
> > > I am thinking about why we need to update the relmap in this new
> > > function logicalrep_partmap_invalidate()? I think it may be better
> > > to do it in
> > > logicalrep_partition_open() when actually required, otherwise, we
> > > end up doing a lot of work that may not be of use unless the
> > > corresponding partition is accessed. Also, it seems awkward to me
> > > that we do the same thing in this new function
> > > logicalrep_partmap_invalidate() and then also in
> > > logicalrep_partition_open() under different conditions.
> > >
> > > One more point about the 0001, it doesn't seem to have a test that
> > > validates
> > > logicalrep_partmap_invalidate_cb() functionality. I think for that
> > > we need to Alter the local table (table on the subscriber side). Can we 
> > > try to
> write a test for it?
> >
> >
> > Thanks to Amit L. and Amit K. for your comments! I agree with this point.
> > Here is the new version of the patch set, which tries to address all these comments.
> >
> > In addition, when reviewing the code, I found some other related
> > problems in the code.
> >
> > 1)
> > entry->attrmap = make_attrmap(map->maplen);
> > for (attno = 0; attno < entry->attrmap->maplen; attno++)
> > {
> > AttrNumber  root_attno =
> map->attnums[attno];
> >
> > entry->attrmap->attnums[attno] =
> attrmap->attnums[root_attno - 1];
> > }
> >
> > In this case, It’s possible that 'attno' points to a dropped column in
> > which case the root_attno would be '0'. I think in this case we should
> > just set the
> > entry->attrmap->attnums[attno] to '-1' instead of accessing the
> > attrmap->attnums[]. I included this change in 0001 because the
> > testcase which can reproduce these problems is related (we need to ALTER
> > the partition on the subscriber to reproduce it).
> >
> 
> Hmm, this appears to be a different issue. Can we separate out the bug-fix
> code for the subscriber-side issue caused by the DDL on the subscriber?
> 
> Few other comments:
> + * Note that we don't update the remoterel information in the entry
> +here,
> + * we will update the information in logicalrep_partition_open to save
> + * unnecessary work.
> + */
> +void
> +logicalrep_partmap_invalidate(LogicalRepRelation *remoterel)
> 
> /to save/to avoid
> 
> Also, I agree with Amit L. that it is confusing to have
> logicalrep_partmap_invalidate() right next to
> logicalrep_partmap_invalidate_cb() and both have somewhat different kinds of
> logic. So, we can either name it as
> logicalrep_partmap_reset_relmap() or logicalrep_partmap_update() unless you
> have any other better suggestions? Accordingly, change the comment atop
> this function.

Thanks for the comments.

I have separated out the bug-fix for the subscriber side,
fixed the typo and the function name,
and attached the new version of the patch set.

Best regards,
Hou zj


v4-0004-fix-memory-leak-about-attrmap.patch
Description: v4-0004-fix-memory-leak-about-attrmap.patch


v4-0001-Fix-partition-map-cache-invalidation-on-subscriber.patch
Description:  v4-0001-Fix-partition-map-cache-invalidation-on-subscriber.patch


v4-0003-Check-partition-table-replica-identity-on-subscriber.patch
Description:  v4-0003-Check-partition-table-replica-identity-on-subscriber.patch


v4-0002-Reset-partition-map-cache-when-receiving-new-relatio.patch
Description:  v4-0002-Reset-partition-map-cache-when-receiving-new-relatio.patch


Re: Parse CE and BCE in dates and times

2022-06-13 Thread Erik Rijkers

Op 13-06-2022 om 07:51 schreef David Fetter:

Folks,

Please find attached a patch to do $Subject. As dates in a fair number
of fields of endeavor are expressed this way, it seems reasonable to
ensure that we can parse them on input. Making it possible to use them
in output is a more invasive patch, and would involve changes to
to_date and similar that would require careful consideration.


Hi David,

I find some unexpected results:

# select '112-04-30 BC'::date;
 date
---
 0112-04-30 BC
(1 row)

but the same with the ' BCE' suffix seems broken:

# select '112-04-30 BCE'::date;
ERROR:  invalid input syntax for type date: "112-04-30 BCE"
LINE 1: select '112-04-30 BCE'::date;

The same goes for '112-04-30 AD' (works) and its CE version (errors out).

Or is this as expected?


Erik Rijkers

Best,
David.





Re: Finer grain log timestamps

2022-06-13 Thread David Fetter
On Mon, May 09, 2022 at 11:21:26AM +0100, Dagfinn Ilmari Mannsåker wrote:
> David Fetter  writes:
> 
> > diff --git src/backend/utils/error/elog.c src/backend/utils/error/elog.c
> > index 55ee5423af..4698e32ab7 100644
> > --- src/backend/utils/error/elog.c
> > +++ src/backend/utils/error/elog.c
> > @@ -2295,7 +2295,7 @@ char *
> >  get_formatted_log_time(void)
> >  {
> > pg_time_t   stamp_time;
> > -   charmsbuf[13];
> > +   charmsbuf[16];
> 
> Now that it holds microseconds (µs), not milliseconds (ms), should it
> not be renamed to `usbuf`?

Good point.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
>From 9283d9ae4d8d10876aee1da0753e7db4257e7a11 Mon Sep 17 00:00:00 2001
From: David Fetter 
Date: Mon, 13 Jun 2022 00:01:04 -0700
Subject: [PATCH v3] Change log timestamps from milli- to microseconds

---
 doc/src/sgml/config.sgml   | 12 ++--
 doc/src/sgml/file-fdw.sgml |  2 +-
 src/backend/utils/error/elog.c | 14 +++---
 3 files changed, 14 insertions(+), 14 deletions(-)

diff --git doc/src/sgml/config.sgml doc/src/sgml/config.sgml
index 5b7ce6531d..adc4fc090a 100644
--- doc/src/sgml/config.sgml
+++ doc/src/sgml/config.sgml
@@ -7174,17 +7174,17 @@ local0.*/var/log/postgresql
 
 
  %t
- Time stamp without milliseconds
+ Time stamp at second resolution
  no
 
 
  %m
- Time stamp with milliseconds
+ Time stamp with microseconds
  no
 
 
  %n
- Time stamp with milliseconds (as a Unix epoch)
+ Time stamp with microseconds (as a Unix epoch)
  no
 
 
@@ -7526,7 +7526,7 @@ log_line_prefix = '%m [%p] %q%u@%d/%a '
 This option emits log lines in comma-separated-values
 (CSV) format,
 with these columns:
-time stamp with milliseconds,
+time stamp with microseconds,
 user name,
 database name,
 process ID,
@@ -7557,7 +7557,7 @@ log_line_prefix = '%m [%p] %q%u@%d/%a '
 
 CREATE TABLE postgres_log
 (
-  log_time timestamp(3) with time zone,
+  log_time timestamp(6) with time zone,
   user_name text,
   database_name text,
   process_id integer,
@@ -7682,7 +7682,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
 
  timestamp
  string
- Time stamp with milliseconds
+ Time stamp with microseconds
 
 
  user
diff --git doc/src/sgml/file-fdw.sgml doc/src/sgml/file-fdw.sgml
index 5b98782064..c3efcbd679 100644
--- doc/src/sgml/file-fdw.sgml
+++ doc/src/sgml/file-fdw.sgml
@@ -242,7 +242,7 @@ CREATE SERVER pglog FOREIGN DATA WRAPPER file_fdw;
 
 
 CREATE FOREIGN TABLE pglog (
-  log_time timestamp(3) with time zone,
+  log_time timestamp(6) with time zone,
   user_name text,
   database_name text,
   process_id integer,
diff --git src/backend/utils/error/elog.c src/backend/utils/error/elog.c
index 55ee5423af..1ee49ce2be 100644
--- src/backend/utils/error/elog.c
+++ src/backend/utils/error/elog.c
@@ -2295,7 +2295,7 @@ char *
 get_formatted_log_time(void)
 {
 	pg_time_t	stamp_time;
-	char		msbuf[13];
+	char		usbuf[16];
 
 	/* leave if already computed */
 	if (formatted_log_time[0] != '\0')
@@ -2315,13 +2315,13 @@ get_formatted_log_time(void)
 	 * nonempty or CSV mode can be selected.
 	 */
 	pg_strftime(formatted_log_time, FORMATTED_TS_LEN,
-	/* leave room for milliseconds... */
-"%Y-%m-%d %H:%M:%S %Z",
+	/* leave room for microseconds... */
+"%Y-%m-%d %H:%M:%S%Z",
 pg_localtime(_time, log_timezone));
 
 	/* 'paste' milliseconds into place... */
-	sprintf(msbuf, ".%03d", (int) (saved_timeval.tv_usec / 1000));
-	memcpy(formatted_log_time + 19, msbuf, 4);
+	sprintf(usbuf, ".%06d", saved_timeval.tv_usec );
+	memcpy(formatted_log_time + 19, usbuf, 7);
 
 	return formatted_log_time;
 }
@@ -2652,9 +2652,9 @@ log_line_prefix(StringInfo buf, ErrorData *edata)
 		saved_timeval_set = true;
 	}
 
-	snprintf(strfbuf, sizeof(strfbuf), "%ld.%03d",
+	snprintf(strfbuf, sizeof(strfbuf), "%ld.%06d",
 			 (long) saved_timeval.tv_sec,
-			 (int) (saved_timeval.tv_usec / 1000));
+			 saved_timeval.tv_usec);
 
 	if (padding != 0)
 		appendStringInfo(buf, "%*s", padding, strfbuf);
-- 
2.36.1



pltcl crash on recent macOS

2022-06-13 Thread Peter Eisentraut
A little while ago, the pltcl tests started crashing for me on macOS. 
I don't know what had changed, but I suspect it was either an operating 
system update or something like an Xcode update.


Here is a backtrace:

  * frame #0: 0x7ff7b0e61853
frame #1: 0x7ff803a28751 libsystem_c.dylib`hash_search + 215
frame #2: 0x000110357700 
pltcl.so`compile_pltcl_function(fn_oid=16418, tgreloid=0, 
is_event_trigger=false, pltrusted=true) at pltcl.c:1418:13
frame #3: 0x000110355d50 
pltcl.so`pltcl_func_handler(fcinfo=0x7fb6f1817028, 
call_state=0x7ff7b0e61b80, pltrusted=true) at pltcl.c:814:12

...

Note that the hash_search call goes into some system library, not postgres.

The command to link pltcl is:

gcc ... -ltcl8.6 -lz -lpthread -framework CoreFoundation  -lc 
-bundle_loader ../../../src/backend/postgres


Notice the -lc in there.  If I remove that, it works again.

The -lc is explicitly added in src/pl/tcl/Makefile, so it's our own 
doing.  I tracked this back, and it's been moved and rearranged in that 
makefile a number of times.  The original addition was


commit e3909672f12e0ddf3e202b824fda068ad2195ef2
Author: Tom Lane 
Date:   Mon Dec 14 00:46:49 1998

Build pltcl.so correctly on platforms that want dependent
shared libraries to be listed in the link command.

Has anyone else seen this?

Note, I'm using the tcl-tk package from Homebrew.  The tcl installation 
provided by macOS itself no longer appears to work for linking against.