Re: Support logical replication of DDLs

2023-04-20 Thread Amit Kapila
On Thu, Apr 20, 2023 at 2:28 PM shveta malik  wrote:
>
>  Few comments for ddl_json.c in the patch dated April17:
>
...
>
> 3) expand_jsonb_array()
> arrayelem is allocated as below, but not freed.
> initStringInfo()
>
> 4) expand_jsonb_array(),
> we initialize iterator as below which internally does palloc
> it = JsonbIteratorInit(container);
> Shall this be freed at the end? I see usage of this function in other files. 
> At a few places, it is freed while not freed at other places.
>

Normally, it is a good idea to free whenever the allocation is done in
a long-lived context. However, in some places, we free just for the
sake of cleanliness. I think we don't need to bother doing retail free
in this case unless it is allocated in some long-lived context.
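
To illustrate the distinction with a generic sketch (not code from the patch):
an allocation made in the current, short-lived context is reclaimed wholesale
when that context is reset, whereas one placed explicitly in a long-lived
context has to be retail-freed:

#include "postgres.h"
#include "utils/memutils.h"

/* Illustration only. */
static void
alloc_example(const char *src)
{
	/* goes away when the current (e.g. per-query) context is reset */
	char	   *tmp = pstrdup(src);

	/* survives until process exit unless we free it ourselves */
	char	   *kept = MemoryContextStrdup(TopMemoryContext, src);

	/* ... use tmp and kept ... */

	pfree(kept);				/* retail free needed for the long-lived copy */
	/* no pfree(tmp) required; the context reset reclaims it */
}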


> 5) deparse_ddl_json_to_string(char *json_str, char** owner)
> str = palloc(value->val.string.len + 1);
> we do  palloc here and return allocated memory to caller as 'owner'. Caller 
> sets this 'owner' using SetConfigOption() which internally allocates new 
> memory and copies 'owner' to that. So the memory allocated in 
> deparse_ddl_json_to_string is never freed. Better way should be the caller 
> passing this allocated memory to deparse_ddl_json_to_string() and freeing it 
> when done. Thoughts?
>

I think that will complicate the code. I don't see the need to
allocate this in any longer-lived context, so we shouldn't be bothered
to retail pfree it.

> 6)expand_fmt_recursive():
> value = findJsonbValueFromContainer(container, JB_FOBJECT, );
> Should this 'value' be freed at the end like we do at all other places in 
> this file?
>

Yeah, we can do this for the sake of consistency.

Few comments on 0001 patch:
=
1.
+ case 'O':
+ specifier = SpecOperatorName;
+ break;
...
+ case 'R':
+ specifier = SpecRole;
+ break;
+ default:

Neither of these specifiers seems to be used in the patch, so we
can remove them. I see some references to 'O' in the comments; those
also need to be modified.

2.
+ /* For identity column, find the sequence owned by column in order
+ * to deparse the column definition */

In multi-line comments, the first and last lines should be empty.
Refer to multi-line comments at other places.
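
i.e. something like:

	/*
	 * For identity column, find the sequence owned by column in order to
	 * deparse the column definition.
	 */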

3.
+ return object_name.data;
+
+}

An extra empty line before } is not required.

-- 
With Regards,
Amit Kapila.




RE: Test failures of 100_bugs.pl

2023-04-20 Thread Yu Shi (Fujitsu)
On Tue, Jan 24, 2023 7:41 PM Amit Kapila  wrote:
> 
> On Tue, Jan 24, 2023 at 8:53 AM Andres Freund  wrote:
> >
> > cfbot, the buildfarm and locally I have seen 100_bugs.pl fail
> > occasionally. Just rarely enough that I never got around to looking into it
> > for real.
> >
> ...
> >
> > We see t2 added to the publication:
> > 2023-01-24 00:57:30.099 UTC [73654][client backend] [100_bugs.pl][7/5:0] LOG:  statement: ALTER PUBLICATION testpub ADD TABLE t2
> >
> > And that *then* "t" was synchronized:
> > 2023-01-24 00:57:30.102 UTC [73640][logical replication worker] LOG:  logical replication table synchronization worker for subscription "testsub", table "t" has finished
> >
> > and then that the refresh was issued:
> > 2023-01-24 00:57:30.128 UTC [73657][client backend] [100_bugs.pl][5/10:0] LOG:  statement: ALTER SUBSCRIPTION testsub REFRESH PUBLICATION
> >
> > And we see a walsender starting and the query to get the new tables being executed:
> > 2023-01-24 00:57:30.139 UTC [73660][walsender] [testsub][6/8:0] LOG:  statement: SELECT DISTINCT t.schemaname, t.tablename
> > , t.attnames
> > FROM pg_catalog.pg_publication_tables t
> >  WHERE t.pubname IN ('testpub')
> >
> >
> > And that's it, the rest of the time is just polling.
> >
> >
> > Perhaps wait_for_subscription_sync() should dump the set of rel states to make something like this more debuggable?
> >
> >
> > The fact that the synchronization for t finished just before the refresh makes me wonder if a wakeup or a cache invalidation got lost?
> >
> 
> From the LOGs, the only thing one could draw is lost invalidation
> because the nap time of the apply worker is 1s, so it should process
> invalidation during the time we are polling. Also, the rel should be
> added to pg_subscription_rel because the test is still polling for
> rels to be in 'ready' or 'done' state.
> 
> I think we can do three things to debug (a) as you suggest dump the
> rel state in wait_for_subscription_sync; (b) add some DEBUG log in
> invalidate_syncing_table_states() to ensure that invalidation has been
> processed; (c) print rel states and relids from table_states_not_ready
> in process_syncing_tables_for_apply() to see if t2 has ever appeared
> in that list.
> 

There was a similar buildfarm failure on francolin recently[1]. I think the
problem is that after REFRESH PUBLICATION, the table sync worker for the new
table test_mismatching_types was not started. So, the test timed out while
waiting for an ERROR message that should have been reported by the table sync
worker.

--
regress_log_014_binary:
timed out waiting for match: (?^:ERROR: ( [A-Z0-9]+:)? incorrect binary data format) at /home/bf/bf-build/francolin/HEAD/pgsql/src/test/subscription/t/014_binary.pl line 269.

014_binary_subscriber.log:
2023-04-16 18:18:38.455 UTC [3079482] 014_binary.pl LOG:  statement: ALTER SUBSCRIPTION tsub REFRESH PUBLICATION;
2023-04-16 18:21:39.219 UTC [3079474] ERROR:  could not receive data from WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
--

I wrote a patch to dump rel state in wait_for_subscription_sync() as suggested.
Please see the attached patch.
I will try to add some debug logs in code later.
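
For (b), the kind of DEBUG log suggested would be roughly the following (a
sketch only; the exact message and placement within
invalidate_syncing_table_states() are still to be decided):

	elog(DEBUG1, "invalidate_syncing_table_states: invalidation received, "
		 "table sync states will be rebuilt");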

[1]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=francolin=2023-04-16%2018%3A17%3A09

Regards,
Shi Yu


v1-0001-dump-rel-state-in-wait_for_subscription_sync.patch
Description:  v1-0001-dump-rel-state-in-wait_for_subscription_sync.patch


Minor code de-duplication in fe-connect.c

2023-04-20 Thread Gurjeet Singh
Commit [1] implements Fisher-Yates shuffling algorithm to shuffle
connection addresses, in two places.

The attached patch moves the duplicated code to a function, and calls
it in those 2 places.
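
For reference, the duplicated logic is the classic Fisher-Yates loop. A
generic, self-contained sketch of the shape of such a helper (the real one in
the patch operates on libpq's internal address/host entries and random-number
source, so everything below is illustrative only):

#include <stdlib.h>

static void
shuffle_ints(int *array, size_t n)
{
	if (n < 2)
		return;

	for (size_t i = n - 1; i > 0; i--)
	{
		/* rand() is only a stand-in; any uniform draw in [0, i] works */
		size_t		j = (size_t) rand() % (i + 1);
		int			tmp = array[j];

		array[j] = array[i];
		array[i] = tmp;
	}
}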

[1]: Support connection load balancing in libpq
7f5b19817eaf38e70ad1153db4e644ee9456853e

Best regards,
Gurjeet https://Gurje.et
Postgres Contributors Team, http://aws.amazon.com


v0-0001-Reduce-code-duplication-in-fe-connect.c.patch
Description: Binary data


Re: Make message strings in fe-connect.c consistent

2023-04-20 Thread Gurjeet Singh
On Thu, Apr 20, 2023 at 9:31 PM Tom Lane  wrote:
>
> Gurjeet Singh  writes:
> > When reviewing a recently committed patch [1] I noticed the odd usage
> > of a format specifier:
>
> > +   libpq_append_conn_error(conn, "invalid %s value: \"%s\"",
> > +   "load_balance_hosts",
> > +   conn->load_balance_hosts);
>
> > The oddity is that the first %s is unnecessary, since the value we
> > want there is a constant. Typically a format specifier is used to get the
> > value stored in a variable.
>
> This is actually intentional, on the grounds that it reduces the
> number of format strings that require translation.

That's the only reason I too could come up with.

> > There's just one exception to this pattern, though.
>
> >> libpq_append_conn_error(conn, "invalid require_auth method: \"%s\"",
> >> method);
>
> Yup, this one did not get the memo.

That explains why I could not find any translation for this error message.

Best regards,
Gurjeet http://Gurje.et
Postgres Contributors Team, http://aws.amazon.com




Re: Make message strings in fe-connect.c consistent

2023-04-20 Thread Tom Lane
Gurjeet Singh  writes:
> When reviewing a recently committed patch [1] I noticed the odd usage
> of a format specifier:

> +   libpq_append_conn_error(conn, "invalid %s value: \"%s\"",
> +   "load_balance_hosts",
> +   conn->load_balance_hosts);

> The oddity is that the first %s is unnecessary, since the value we
> want there is a constant. Typically a format specifier is used to get the
> value stored in a variable.

This is actually intentional, on the grounds that it reduces the
number of format strings that require translation.

> There's just one exception to this pattern, though.

>> libpq_append_conn_error(conn, "invalid require_auth method: \"%s\"",
>> method);

Yup, this one did not get the memo.

regards, tom lane




Make message strings in fe-connect.c consistent

2023-04-20 Thread Gurjeet Singh
When reviewing a recently committed patch [1] I noticed the odd usage
of a format specifier:

+   libpq_append_conn_error(conn, "invalid %s value: \"%s\"",
+   "load_balance_hosts",
+   conn->load_balance_hosts);

The oddity is that the first %s is unnecessary, since the value we
want there is a constant. Typically a format specifier is used to get the
value stored in a variable.

Upon closer look, it looks like this is a common pattern in
fe-connect.c; there are many places where a %s format specifier is
being used in the format string, where the name of the parameter would
have sufficed.

Upon some research, the only explanation I could come up with was that
this pattern of specifying error messages helps with message
translations. This way there's just one message to be translated. For
example:

.../libpq/po/es.po-#: fe-connect.c:1268 fe-connect.c:1294
fe-connect.c:1336 fe-connect.c:1345
.../libpq/po/es.po-#: fe-connect.c:1378 fe-connect.c:1422
.../libpq/po/es.po-#, c-format
.../libpq/po/es.po:msgid "invalid %s value: \"%s\"\n"

There's just one exception to this pattern, though.

> libpq_append_conn_error(conn, "invalid require_auth method: \"%s\"",
> method);

So, to make it consistent throughout the file, we should either
replace all such %s format specifiers with the respective strings, or
use the same pattern for the message string used for require_auth,
as well. Attached patch [2] does the former, and patch [3] does the
latter.

Pick your favorite one.

[1]: Support connection load balancing in libpq
7f5b19817eaf38e70ad1153db4e644ee9456853e

[2]: Replace placeholders with known strings
v1-0001-Replace-placeholders-with-known-strings.patch

[3]: Make require_auth error message similar to surrounding messages
v1-0001-Make-require_auth-error-message-similar-to-surrou.patch

Best regards,
Gurjeet http://Gurje.et
Postgres Contributors Team, http://aws.amazon.com


v1-0001-Replace-placeholders-with-known-strings.patch
Description: Binary data


v1-0001-Make-require_auth-error-message-similar-to-surrou.patch
Description: Binary data


Re: New committers: Nathan Bossart, Amit Langote, Masahiko Sawada

2023-04-20 Thread Ashutosh Bapat
Great news! Hearty congratulations to Nathan, Amit and Masahiko.

On Thu, Apr 20, 2023 at 11:10 PM Tom Lane  wrote:
>
> The Core Team would like to extend our congratulations to
> Nathan Bossart, Amit Langote, and Masahiko Sawada, who have
> accepted invitations to become our newest Postgres committers.
>
> Please join me in wishing them much success and few reverts.
>
> regards, tom lane
>
>


-- 
Best Wishes,
Ashutosh Bapat




Re: New committers: Nathan Bossart, Amit Langote, Masahiko Sawada

2023-04-20 Thread Thomas Munro
On Fri, Apr 21, 2023 at 5:40 AM Tom Lane  wrote:
> The Core Team would like to extend our congratulations to
> Nathan Bossart, Amit Langote, and Masahiko Sawada, who have
> accepted invitations to become our newest Postgres committers.
>
> Please join me in wishing them much success and few reverts.

+100.  Great news.




Re: New committers: Nathan Bossart, Amit Langote, Masahiko Sawada

2023-04-20 Thread Amit Kapila
On Thu, Apr 20, 2023 at 11:10 PM Tom Lane  wrote:
>
> The Core Team would like to extend our congratulations to
> Nathan Bossart, Amit Langote, and Masahiko Sawada, who have
> accepted invitations to become our newest Postgres committers.
>
> Please join me in wishing them much success and few reverts.
>

Congratulations to the new committers!

-- 
With Regards,
Amit Kapila.




Re: Testing autovacuum wraparound (including failsafe)

2023-04-20 Thread John Naylor
I agree having the new functions in the tree is useful. I also tried
running the TAP tests in v2, but 001 and 002 fail to run:

Odd number of elements in hash assignment at [...]/Cluster.pm line 2010.
Can't locate object method "pump_nb" via package
"PostgreSQL::Test::BackgroundPsql" at [...]

It seems to be complaining about

+my $in  = '';
+my $out = '';
+my $timeout = IPC::Run::timer($PostgreSQL::Test::Utils::timeout_default);
+my $background_psql = $node->background_psql('postgres', \$in, \$out,
$timeout);

...that call to background_psql doesn't look like other ones that have "key
=> value". Is there something I'm missing?

--
John Naylor
EDB: http://www.enterprisedb.com


Re: New committers: Nathan Bossart, Amit Langote, Masahiko Sawada

2023-04-20 Thread Kyotaro Horiguchi
At Thu, 20 Apr 2023 13:40:49 -0400, Tom Lane  wrote in 
> The Core Team would like to extend our congratulations to
> Nathan Bossart, Amit Langote, and Masahiko Sawada, who have
> accepted invitations to become our newest Postgres committers.
> 
> Please join me in wishing them much success and few reverts.

Congratulations to all, from me, too, wishing much success!

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: New committers: Nathan Bossart, Amit Langote, Masahiko Sawada

2023-04-20 Thread vignesh C
On Thu, 20 Apr 2023 at 23:10, Tom Lane  wrote:
>
> The Core Team would like to extend our congratulations to
> Nathan Bossart, Amit Langote, and Masahiko Sawada, who have
> accepted invitations to become our newest Postgres committers.
>
> Please join me in wishing them much success and few reverts.

Very deserved. Congratulations Nathan, Amit and Masahiko.

Regards,
Vignesh




Re: New committers: Nathan Bossart, Amit Langote, Masahiko Sawada

2023-04-20 Thread John Naylor
On Fri, Apr 21, 2023 at 12:40 AM Tom Lane  wrote:
>
> The Core Team would like to extend our congratulations to
> Nathan Bossart, Amit Langote, and Masahiko Sawada, who have
> accepted invitations to become our newest Postgres committers.
>
> Please join me in wishing them much success and few reverts.

Congratulations to all!

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Support logical replication of DDLs

2023-04-20 Thread Peter Smith
Here are some more review comments for the patch 0002-2023_04_07-2

This was a WIP review in parts because the patch was quite large:

WIP part 1 [1] was posted 17/4.
WIP part 2 [2] was posted 17/4.
WIP part 3 [3] was posted 19/4.
WIP part 4 is this post. (This is my final WIP part for this 0002 patch)

==
contrib/test_decoding/sql/ddl.sql

1.
+SELECT 'ddl msg2' FROM pg_logical_emit_ddl_message('ddl msg2', 16394,
1, '{"fmt": "CREATE SCHEMA %{if_not_exists}s %{name}I
%{authorization}s", "name": "foo", "authorization": {"fmt":
"AUTHORIZATION %{authorization_role}I", "present": false,
"authorization_role": null}, "if_not_exists": ""}');

I wasn't entirely sure what these tests are showing. It seems to do
nothing but hardwire a bunch of random stuff and then print it out
again. Am I missing something?

And are those just bogus content payloads? Maybe they are valid JSON
but AFAICT nobody is using them. What is the advantage of using this
bogus payload data instead of just a simple string like "DDL message
content goes here"?

==
contrib/test_decoding/test_decoding.c

2. _PG_output_plugin_init

  cb->message_cb = pg_decode_message;
+ cb->ddl_cb = pg_decode_ddl_message;
  cb->filter_prepare_cb = pg_decode_filter_prepare;

Where is the 'stream_ddl_cb' to match this?

~~~

3. pg_decode_ddl_message

+static void
+pg_decode_ddl_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+   XLogRecPtr message_lsn, const char *prefix, Oid relid,
+   DeparsedCommandType cmdtype, Size sz, const char *message)
+{
+ OutputPluginPrepareWrite(ctx, true);
+ appendStringInfo(ctx->out, "message: prefix: %s, relid: %u, ",
+ prefix, relid);

Should the appendStringInfo say "DDL message:" instead of "message"? I
can't tell if this was deliberate or a cut/paste error from the
existing code.

~~~

4. pg_decode_ddl_message

+ appendStringInfo(ctx->out, "sz: %zu content:", sz);
+ appendBinaryStringInfo(ctx->out, message, sz);
+ OutputPluginWrite(ctx, true);

4a.
Should there be a whitespace after that last 'content:' before the
binary content?

~

4b.
Is it necessary to say this is 'Binary'? I thought this payload was
only JSON text data.

==
src/backend/replication/logical/ddltrigger.c

5.
+/*-
+ *
+ * ddltrigger.c
+ *   Logical DDL messages.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *   src/backend/replication/logical/ddltrigger.c
+ *
+ * NOTES
+ *
+ * Deparse the ddl command and log it.
+ *
+ * ---
+ */

~

5a.
Just saying "Logical DDL messages" is the same header comment as in
the other new file ddlmessges.c, so it looks like a cut/paste issue.

~

5b.
Should say 2023.

~~~

6. publication_deparse_ddl_command_start

+/*
+ * Deparse the ddl command and log it prior to
+ * execution. Currently only used for DROP TABLE command
+ * so that catalog can be accessed before being deleted.
+ * This is to check if the table is part of the publication
+ * or not.
+ */
+Datum
+publication_deparse_ddl_command_start(PG_FUNCTION_ARGS)
+{
+ EventTriggerData *trigdata;
+ char*command = psprintf("Drop table command start");

Since information about this only being for DROP is hardcoded and in
the function comment, shouldn't this whole function be renamed to
something DROP-specific? e.g
publication_deparse_ddl_drop_command_start()

~~~

7. publication_deparse_ddl_command_start

+ if (relation)
+ table_close(relation, NoLock);

I thought this check was not needed because the relation was already
checked earlier in this function so it cannot be NULL here.

~~~

8. publication_deparse_table_rewrite

+ char relpersist;
+ CollectedCommand *cmd;
+ char*json_string;

The json_string can be declared later within the scope that uses it,
instead of here at the top.

~~~

9. publication_deparse_ddl_command_end

+ ListCell   *lc;
+ slist_iter iter;
+ DeparsedCommandType type;
+ Oid relid;
+ char relkind;

9a.
Some of these variable declarations seem misplaced. I think the
'json_string' and the 'type' can be at a lower scope, can't they?

~

9b.
Also IMO it is better to call 'type' as 'cmdtype', so it has the same
name as the variable in the other slist_foreach loop.

~~~

10. publication_deparse_ddl_command_end

+ foreach(lc, currentEventTriggerState->commandList)
+ {
+ char relpersist = RELPERSISTENCE_PERMANENT;
+ CollectedCommand *cmd = lfirst(lc);
+ char*json_string;

This json_string can be declared later only in the scope that uses it.

~~~

11. publication_deparse_ddl_command_end

+ if (cmd->type == SCT_Simple &&
+ !OidIsValid(cmd->d.simple.address.objectId))
+ continue;
+
+ if (cmd->type == SCT_AlterTable)
+ {
+ relid = cmd->d.alterTable.objectId;
+ type = DCT_TableAlter;
+ }
+ else
+ {
+ /* Only SCT_Simple for now */
+ relid = cmd->d.simple.address.objectId;
+ type = DCT_SimpleCmd;
+ }

This code seemed structured a bit strangely to me; 

Re: [PATCH] Extend the length of BackgroundWorker.bgw_library_name

2023-04-20 Thread Nathan Bossart
On Wed, Mar 15, 2023 at 10:38:34AM +0100, Daniel Gustafsson wrote:
> While here, I wonder if we should document what BGW_MAXLEN is defined as in
> bgworker.sgml?

I am -0.5 for this.  If you are writing a new background worker, it's
probably reasonable to expect that you can locate the definition of
BGW_MAXLEN.  Also, I think there's a good chance that we'd forget to update
such documentation the next time we adjust it.
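
For context, BGW_MAXLEN only comes up when filling the fixed-length fields of
BackgroundWorker; a minimal registration sketch (names are made up, and this
thread is of course about whether bgw_library_name should stay at that length):

#include "postgres.h"
#include "postmaster/bgworker.h"

void
register_example_worker(void)
{
	BackgroundWorker worker;

	memset(&worker, 0, sizeof(worker));
	worker.bgw_flags = BGWORKER_SHMEM_ACCESS;
	worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
	worker.bgw_restart_time = BGW_NEVER_RESTART;

	/* these fields are currently all char[BGW_MAXLEN] */
	snprintf(worker.bgw_name, BGW_MAXLEN, "example worker");
	snprintf(worker.bgw_type, BGW_MAXLEN, "example worker");
	snprintf(worker.bgw_library_name, BGW_MAXLEN, "example_extension");
	snprintf(worker.bgw_function_name, BGW_MAXLEN, "example_worker_main");

	RegisterBackgroundWorker(&worker);
}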

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Should we remove vacuum_defer_cleanup_age?

2023-04-20 Thread Nathan Bossart
On Fri, Apr 14, 2023 at 03:07:37PM -0400, Jonathan S. Katz wrote:
> +1.

+1.  I agree with the upthread discussion and support removing
vacuum_defer_cleanup_age.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Memory leak from ExecutorState context?

2023-04-20 Thread Melanie Plageman
On Thu, Apr 20, 2023 at 12:42 PM Konstantin Knizhnik  wrote:
> On 11.04.2023 8:14 PM, Jehan-Guillaume de Rorthais wrote:
> > On Sat, 8 Apr 2023 02:01:19 +0200
> > Jehan-Guillaume de Rorthais  wrote:
> >
> >> On Fri, 31 Mar 2023 14:06:11 +0200
> >> Jehan-Guillaume de Rorthais  wrote:
> >>
> >>   [...]
> >>
> >> After rebasing Tomas' memory balancing patch, I did some memory measures
> >> to answer some of my questions. Please, find in attachment the resulting
> >> charts "HJ-HEAD.png" and "balancing-v3.png" to compare memory consumption
> >> between HEAD and Tomas' patch. They shows an alternance of numbers
> >> before/after calling ExecHashIncreaseNumBatches (see the debug patch). I
> >> didn't try to find the exact last total peak of memory consumption during 
> >> the
> >> join phase and before all the BufFiles are destroyed. So the last number
> >> might be underestimated.
> > I did some more analysis about the total memory consumption in filecxt of 
> > HEAD,
> > v3 and v4 patches. My previous debug numbers only prints memory metrics 
> > during
> > batch increments or hash table destruction. That means:
> >
> > * for HEAD: we miss the batches consumed during the outer scan
> > * for v3: adds twice nbatch in spaceUsed, which is a rough estimation
> > * for v4: batches are tracked in spaceUsed, so they are reflected in 
> > spacePeak
> >
> > Using a breakpoint in ExecHashJoinSaveTuple to print 
> > "filecxt->mem_allocated"
> > from there, here are the maximum allocated memory for bufFile context for 
> > each
> > branch:
> >
> >  batches   max bufFiles   total  spaceAllowed rise
> >HEAD16384  199966960  ~194MB
> >v3   4096   65419456   ~78MB
> >v4(*3)   2048   3427328048MB  nbatch*sizeof(PGAlignedBlock)*3
> >v4(*4)   1024   17170160  60.6MB  nbatch*sizeof(PGAlignedBlock)*4
> >v4(*5)   2048   34273280  42.5MB  nbatch*sizeof(PGAlignedBlock)*5
> >
> > It seems account for bufFile in spaceUsed allows a better memory balancing 
> > and
> > management. The precise factor to rise spaceAllowed is yet to be defined. 
> > *3 or
> > *4 looks good, but this is based on a single artificial test case.
> >
> > Also, note that HEAD is currently reporting ~4MB of memory usage. This is by
> > far wrong with the reality. So even if we don't commit the balancing memory
> > patch in v16, maybe we could account for filecxt in spaceUsed as a bugfix?
> >
> > Regards,
>
> Thank you for the patch.
> I  faced with the same problem (OOM caused by hash join).
> I tried to create simplest test reproducing the problem:
>
> create table t(pk int, val int);
> insert into t values (generate_series(1,1),0);
> set work_mem='64kB';
> explain (analyze,buffers) select count(*) from t t1 join t t2 on
> (t1.pk=t2.pk);
>
>
> There are three workers and size of each exceeds 1.3Gb.
>
> Plan is the following:
>
>   Finalize Aggregate  (cost=355905977972.87..355905977972.88 rows=1
> width=8) (actual time=2
> 12961.033..226097.513 rows=1 loops=1)
> Buffers: shared hit=32644 read=852474 dirtied=437947 written=426374,
> temp read=944407 w
> ritten=1130380
> ->  Gather  (cost=355905977972.65..355905977972.86 rows=2 width=8)
> (actual time=212943.
> 505..226097.497 rows=3 loops=1)
>   Workers Planned: 2
>   Workers Launched: 2
>   Buffers: shared hit=32644 read=852474 dirtied=437947
> written=426374, temp read=94
> 4407 written=1130380
>   ->  Partial Aggregate (cost=355905976972.65..355905976972.66
> rows=1 width=8) (ac
> tual time=212938.410..212940.035 rows=1 loops=3)
> Buffers: shared hit=32644 read=852474 dirtied=437947
> written=426374, temp r
> ead=944407 written=1130380
> ->  Parallel Hash Join (cost=1542739.26..303822614472.65
> rows=2083334502 width=0) (actual time=163268.274..207829.524
> rows= loops=3)
>   Hash Cond: (t1.pk = t2.pk)
>   Buffers: shared hit=32644 read=852474
> dirtied=437947 written=426374, temp read=944407 written=1130380
>   ->  Parallel Seq Scan on t t1
> (cost=0.00..859144.78 rows=4178 width=4) (actual
> time=0.045..30828.051 rows= loops=3)
> Buffers: shared hit=16389 read=426089 written=87
>   ->  Parallel Hash (cost=859144.78..859144.78
> rows=4178 width=4) (actual time=82202.445..82202.447 rows=
> loops=3)
> Buckets: 4096 (originally 4096)  Batches:
> 32768 (originally 8192)  Memory Usage: 192kB
> Buffers: shared hit=16095 read=426383
> dirtied=437947 written=426287, temp read=267898 written=737164
> ->  Parallel Seq Scan on t t2
> (cost=0.00..859144.78 rows=4178 width=4) (actual
> time=0.054..12647.534 rows= loops=3)
>   Buffers: shared hit=16095 read=426383
> dirtied=437947 writ
> ten=426287
>   Planning:
> 

Re: Fix typos and inconsistencies for v16

2023-04-20 Thread David Rowley
On Wed, 19 Apr 2023 at 07:00, Alexander Lakhin  wrote:
> please look at the similar list for v15+ (596b5af1d..HEAD).

I've now pushed most of these but didn't include the following ones:

> 3. BufFileOpenShared -> BufFileOpenFileSet // see dcac5e7ac

Maybe I need to spend longer, but I just didn't believe the comment
that claimed that "BufFiles opened using BufFileOpenFileSet() are
read-only by definition". Looking at the code for that, it seems to
depend on whether O_RDONLY is included in the mode flags.

> 19. multidimensional-aware -> multidimension-aware // sync with gistbuild.c

I didn't change this as I didn't think it was an improvement.  I'd
probably have written "multidimensionally aware", but I didn't feel
strongly enough to want to change it.

David




Re: Remove io prefix from pg_stat_io columns

2023-04-20 Thread Michael Paquier
On Thu, Apr 20, 2023 at 11:38:42AM +0900, Michael Paquier wrote:
> No worries, committers should take care of that.

Done as of 0ecb87e, as I can keep an eye on the buildfarm today, with
a catversion bump.
--
Michael


signature.asc
Description: PGP signature


Re: New committers: Nathan Bossart, Amit Langote, Masahiko Sawada

2023-04-20 Thread David Rowley
On Fri, 21 Apr 2023 at 05:40, Tom Lane  wrote:
>
> The Core Team would like to extend our congratulations to
> Nathan Bossart, Amit Langote, and Masahiko Sawada, who have
> accepted invitations to become our newest Postgres committers.
>
> Please join me in wishing them much success and few reverts.

Great news! Congratulations to all three!

David




Re: New committers: Nathan Bossart, Amit Langote, Masahiko Sawada

2023-04-20 Thread Michael Paquier
On Thu, Apr 20, 2023 at 01:40:49PM -0400, Tom Lane wrote:
> The Core Team would like to extend our congratulations to
> Nathan Bossart, Amit Langote, and Masahiko Sawada, who have
> accepted invitations to become our newest Postgres committers.
> 
> Please join me in wishing them much success and few reverts.

Congratulations to all of you!

May the buildfarm show kindness to your commits.
--
Michael


signature.asc
Description: PGP signature


Re: Incremental sort for access method with ordered scan support (amcanorderbyop)

2023-04-20 Thread David Rowley
On Thu, 20 Apr 2023 at 18:46, Richard Guo  wrote:
>
>
> On Thu, Apr 20, 2023 at 6:38 AM David Rowley  wrote:
>>
>> That function is pretty new and was exactly added so we didn't have to
>> write list_truncate(list_copy(...), n) anymore.  That gets pretty
>> wasteful when the input List is long and we only need a small portion
>> of it.
>
> I searched the codes and found some other places where the manipulation
> of lists can be improved in a similar way.

I'd be happy to discuss our thought about List inefficiencies, but I
think to be fair to Miroslav, we should do that somewhere else. The
list_copy_head() discussion was directly related to his patch due to
the list of list_truncate(list_copy(..), ..).  The other things you've
mentioned are not.  Feel free to start a thread and copy me in.

David




Re: Incremental sort for access method with ordered scan support (amcanorderbyop)

2023-04-20 Thread Ranier Vilela
>I searched the codes and found some other places where the manipulation
>of lists can be improved in a similar way.

>* lappend(list_copy(list), datum) as in get_required_extension().
>This is not very efficient as after list_copy it would need to enlarge
>the list immediately. It can be improved by inventing a new function,
>maybe called list_append_copy, that do the copy and append all together.

>* lcons(datum, list_copy(list)) as in get_query_def().
>This is also not efficient. Immediately after list_copy, we'd need to
>enlarge the list and move all the entries. It can also be improved by
>doing all these things all together in one function.

>* lcons(datum, list_delete_nth_cell(list_copy(list), n)) as in
>sort_inner_and_outer.
>It'd need to copy all the elements, and then delete the n'th entry which
>would cause all following entries be moved, and then move all the
>remaining entries for lcons. Maybe we can invent a new function for it?

>So is it worthwhile to improve these places?

I think yes. All that unnecessary copying and moving is inefficient.

Perhaps, like the attached patch?

lcons_copy_delete needs a careful review.


>I wonder if we can invent function list_nth_xid to do it, to keep
>consistent with list_nth/list_nth_int/list_nth_oid.

Perhaps list_nth_xid(const List *list, int n)?
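
A sketch of what that might look like, mirroring list_nth_oid() in pg_list.h
(assuming the XidList accessors follow the same pattern as the Oid ones):

static inline TransactionId
list_nth_xid(const List *list, int n)
{
	Assert(IsA(list, XidList));
	return lfirst_xid(list_nth_cell(list, n));
}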

regards,

Ranier Vilela


0001-add-new-list-functions.patch
Description: Binary data


Re: LLVM strip -x fails

2023-04-20 Thread Tom Lane
Andres Freund  writes:
> On 2023-04-20 12:43:48 -0400, Tom Lane wrote:
>> The previous complaint about this [1] suggested we use --strip-unneeded
>> for all cases with GNU strip, same as we've long done for shared libs.
>> It's an easy enough change, but I wonder if anyone will complain.

> I doubt anybody wants to strip symbols and keep debug information, so I doubt
> there's much ground for complaints?

Agreed.  It doesn't look like --strip-unneeded is a worse choice than -x.

> Oddly the output of llvm-strip confuses binutils objdump enough that it claims
> that "file format not recognized". Not sure which side is broken there.

Not our problem, I'd say ...

> llvm-strip's output is a lot larger than gnu strip's:

... nor that.  These things do suggest that llvm-strip isn't all that
close to being ready for prime time, but if FreeBSD wants to push the
envelope on toolchain usage, who are we to stand in the way?

I'll go make it so.

regards, tom lane




Re: Improve logging when using Huge Pages

2023-04-20 Thread Nathan Bossart
On Thu, Mar 23, 2023 at 07:23:28AM +0900, Michael Paquier wrote:
> On Wed, Mar 22, 2023 at 05:18:28PM -0500, Justin Pryzby wrote:
>> Wow, good point.  I think to make it work we'd need put
>> huge_pages_active into BackendParameters and handle it in
>> save_backend_variables().  If so, that'd be strong argument for using a
>> GUC, which already has all the necessary infrastructure for exposing the
>> server's state.
> 
> I am afraid so, duplicating an existing infrastructure for a need like
> the one of this thread is not really appealing.

AFAICT this would involve adding a bool to BackendParameters and using it
in save_backend_variables() and restore_backend_variables(), which is an
additional 3 lines of code.  That doesn't sound too bad to me, but perhaps
I am missing something.
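
Roughly speaking (a sketch only, using the field name mentioned upthread; not
the actual patch):

	/* in BackendParameters */
	bool		huge_pages_active;

	/* in save_backend_variables() */
	param->huge_pages_active = huge_pages_active;

	/* in restore_backend_variables() */
	huge_pages_active = param->huge_pages_active;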

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Non-superuser subscription owners

2023-04-20 Thread Robert Haas
On Thu, Apr 20, 2023 at 1:08 AM Amit Kapila  wrote:
> Pushed. I noticed that we didn't display this new subscription option
> 'password_required' in \dRs+:
>
> postgres=# \dRs+
>
>   List of subscriptions
>  Name |  Owner   | Enabled | Publication | Binary | Streaming |
> Two-phase commit | Disable on error | Origin | Run as Owner? |
> Synchronous commit |Conninfo | Skip LSN
>
> Is that intentional? Sorry, if it was discussed previously because I
> haven't followed this discussion in detail.

No, I don't think that's intentional. I just didn't think about it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Tab completion for GRANT MAINTAIN

2023-04-20 Thread Nathan Bossart
On Wed, Apr 19, 2023 at 10:54:05AM -0400, Tom Lane wrote:
> (One could wish that it didn't take touching three or so places in
> tab-complete.c to add a privilege, especially when a naive hacker might
> think he was done after touching Privilege_options_of_grant_and_revoke.
> I didn't see any easy way to improve that situation though.)

Sorry, I think this was my fault.  Thanks for fixing.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Request for comment on setting binary format output per session

2023-04-20 Thread Dave Cramer
On Tue, 18 Apr 2023 at 12:31, Dave Cramer  wrote:

>
>
> On Tue, 18 Apr 2023 at 12:24, Robert Haas  wrote:
>
>> On Tue, Apr 18, 2023 at 11:51 AM Tom Lane  wrote:
>> > Robert Haas  writes:
>> > > One thing I think we should do in this area is introduce #defines for
>> > > all the message type codes and use those instead of having hard-coded
>> > > constants everywhere.
>> >
>> > +1, but I wonder where we should put those exactly.  My first thought
>> > was postgres_ext.h, but the charter for that is
>> >
>> >  * This file contains declarations of things that are visible
>> everywhere
>> >  *  in PostgreSQL *and* are visible to clients of frontend interface
>> libraries.
>> >  *  For example, the Oid type is part of the API of libpq and other
>> libraries.
>> >
>> > so picayune details of the wire protocol probably don't belong there.
>> > Maybe we need a new header concerned with the wire protocol?
>>
>> Yeah. I sort of thought maybe one of the files in src/include/libpq
>> would be the right place, but it doesn't look like it.
>>
>> If we at least created the defines and replaced occurrences with the
> same, then we can litigate where to put them later.
>
> I think I'd prefer this in a different patch, but I'd be willing to take a
> run at it.
>

As promised here is a patch with defines for all of the protocol messages.
I created a protocol.h file and put it in src/include
I'm fairly sure that some of the names I used may need to be changed but
the grunt work of finding and replacing everything is done.
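
To give an idea of the shape of the file, it is essentially a list of defines
like the following (names here are placeholders; the ones in the attached
patch may differ, and note that some frontend and backend codes overlap, e.g.
'E' is Execute from the frontend and ErrorResponse from the backend):

/* frontend -> backend */
#define PqMsg_Query				'Q'
#define PqMsg_Parse				'P'
#define PqMsg_Bind				'B'
#define PqMsg_Execute			'E'
#define PqMsg_Terminate			'X'

/* backend -> frontend */
#define PqMsg_RowDescription	'T'
#define PqMsg_DataRow			'D'
#define PqMsg_CommandComplete	'C'
#define PqMsg_ErrorResponse		'E'
#define PqMsg_ReadyForQuery		'Z'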

Dave Cramer

>
> Dave
>


0001-Created-protocol.h.patch
Description: Binary data


Re: LLVM strip -x fails

2023-04-20 Thread Andres Freund
Hi,

On 2023-04-20 12:43:48 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > Afaict the only safe thing to use when stripping static libs is
> > -g/--strip-debug.
>
> The previous complaint about this [1] suggested we use --strip-unneeded
> for all cases with GNU strip, same as we've long done for shared libs.
> It's an easy enough change, but I wonder if anyone will complain.

--strip-unneeded output is smaller than -x output:

19M  src/interfaces/libpq/libpq.a
364K src/interfaces/libpq/libpq.a.strip.gnu.g
352K src/interfaces/libpq/libpq.a.strip.gnu.unneeded
356K src/interfaces/libpq/libpq.a.strip.gnu.x
352K src/interfaces/libpq/libpq.a.strip.gnu.x.g

strip --version
GNU strip (GNU Binutils for Debian) 2.40

Partially that's because --strip-unneeded implies -g. Interestingly -x -g
output isn't quite the same as --strip-unneeded. The latter also removes the
_GLOBAL_OFFSET_TABLE_ symbol.


I doubt anybody wants to strip symbols and keep debug information, so I doubt
there's much ground for complaints?


Oddly the output of llvm-strip confuses binutils objdump enough that it claims
that "file format not recognized". Not sure which side is broken there.

llvm-strip's output is a lot larger than gnu strip's:
 19M src/interfaces/libpq/libpq.a
 19M src/interfaces/libpq/libpq.a.strip.llvm.X
908K src/interfaces/libpq/libpq.a.strip.llvm.g
892K src/interfaces/libpq/libpq.a.strip.llvm.unneeded
892K src/interfaces/libpq/libpq.a.strip.llvm.unneeded.g
364K src/interfaces/libpq/libpq.a.strip.gnu.g
356K src/interfaces/libpq/libpq.a.strip.gnu.x
352K src/interfaces/libpq/libpq.a.strip.gnu.x.g
352K src/interfaces/libpq/libpq.a.strip.gnu.unneeded

Greetings,

Andres Freund




Re: New committers: Nathan Bossart, Amit Langote, Masahiko Sawada

2023-04-20 Thread Andrew Dunstan


On 2023-04-20 Th 14:12, Peter Geoghegan wrote:
> On Thu, Apr 20, 2023 at 10:56 AM Pavel Borisov  wrote:
>> It's much deserved! Congratulations, Nathan, Amit and Masahiko!
>
> Congratulations to all three!


+3


cheers


andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com


Re: Incremental sort for access method with ordered scan support (amcanorderbyop)

2023-04-20 Thread Miroslav Bendik
> I've just pushed a fix to master for this. See [1].  If you base your
> patch atop of that you should be able to use list_copy_head() without
> any issues.

Thanks for this fix. Now the version
am_orderbyop_incremental_sort_v3.1.patch [1] works without issues
using the master branch.

[1] 
https://www.postgresql.org/message-id/attachment/146450/am_orderbyop_incremental_sort_v3.1.patch




Re: New committers: Nathan Bossart, Amit Langote, Masahiko Sawada

2023-04-20 Thread Peter Geoghegan
On Thu, Apr 20, 2023 at 10:56 AM Pavel Borisov  wrote:
> It's much deserved! Congratulations, Nathan, Amit and Masahiko!

Congratulations to all three!

-- 
Peter Geoghegan




Re: New committers: Nathan Bossart, Amit Langote, Masahiko Sawada

2023-04-20 Thread Pavel Borisov
On Thu, 20 Apr 2023 at 21:40, Tom Lane  wrote:
>
> The Core Team would like to extend our congratulations to
> Nathan Bossart, Amit Langote, and Masahiko Sawada, who have
> accepted invitations to become our newest Postgres committers.
>
> Please join me in wishing them much success and few reverts.
>
> regards, tom lane
Great news!
It's much deserved! Congratulations, Nathan, Amit and Masahiko!

Pavel Borisov




New committers: Nathan Bossart, Amit Langote, Masahiko Sawada

2023-04-20 Thread Tom Lane
The Core Team would like to extend our congratulations to
Nathan Bossart, Amit Langote, and Masahiko Sawada, who have
accepted invitations to become our newest Postgres committers.

Please join me in wishing them much success and few reverts.

regards, tom lane




Re: LLVM strip -x fails

2023-04-20 Thread Tom Lane
Andres Freund  writes:
> Afaict the only safe thing to use when stripping static libs is
> -g/--strip-debug.

The previous complaint about this [1] suggested we use --strip-unneeded
for all cases with GNU strip, same as we've long done for shared libs.
It's an easy enough change, but I wonder if anyone will complain.

regards, tom lane

[1] 
https://www.postgresql.org/message-id/17898-5308d09543463266%40postgresql.org




Re: Memory leak from ExecutorState context?

2023-04-20 Thread Konstantin Knizhnik




On 11.04.2023 8:14 PM, Jehan-Guillaume de Rorthais wrote:

On Sat, 8 Apr 2023 02:01:19 +0200
Jehan-Guillaume de Rorthais  wrote:


On Fri, 31 Mar 2023 14:06:11 +0200
Jehan-Guillaume de Rorthais  wrote:

  [...]

After rebasing Tomas' memory balancing patch, I did some memory measures
to answer some of my questions. Please, find in attachment the resulting
charts "HJ-HEAD.png" and "balancing-v3.png" to compare memory consumption
between HEAD and Tomas' patch. They shows an alternance of numbers
before/after calling ExecHashIncreaseNumBatches (see the debug patch). I
didn't try to find the exact last total peak of memory consumption during the
join phase and before all the BufFiles are destroyed. So the last number
might be underestimated.

I did some more analysis about the total memory consumption in filecxt of HEAD,
v3 and v4 patches. My previous debug numbers only prints memory metrics during
batch increments or hash table destruction. That means:

* for HEAD: we miss the batches consumed during the outer scan
* for v3: adds twice nbatch in spaceUsed, which is a rough estimation
* for v4: batches are tracked in spaceUsed, so they are reflected in spacePeak

Using a breakpoint in ExecHashJoinSaveTuple to print "filecxt->mem_allocated"
from there, here are the maximum allocated memory for bufFile context for each
branch:

 batches   max bufFiles   total  spaceAllowed rise
   HEAD16384  199966960  ~194MB
   v3   4096   65419456   ~78MB
   v4(*3)   2048   3427328048MB  nbatch*sizeof(PGAlignedBlock)*3
   v4(*4)   1024   17170160  60.6MB  nbatch*sizeof(PGAlignedBlock)*4
   v4(*5)   2048   34273280  42.5MB  nbatch*sizeof(PGAlignedBlock)*5

It seems account for bufFile in spaceUsed allows a better memory balancing and
management. The precise factor to rise spaceAllowed is yet to be defined. *3 or
*4 looks good, but this is based on a single artificial test case.

Also, note that HEAD is currently reporting ~4MB of memory usage. This is by
far wrong with the reality. So even if we don't commit the balancing memory
patch in v16, maybe we could account for filecxt in spaceUsed as a bugfix?

Regards,


Thank you for the patch.
I faced the same problem (OOM caused by hash join).
I tried to create the simplest test reproducing the problem:

create table t(pk int, val int);
insert into t values (generate_series(1,1),0);
set work_mem='64kB';
explain (analyze,buffers) select count(*) from t t1 join t t2 on 
(t1.pk=t2.pk);



There are three workers and the size of each exceeds 1.3GB.

Plan is the following:

 Finalize Aggregate  (cost=355905977972.87..355905977972.88 rows=1 
width=8) (actual time=2

12961.033..226097.513 rows=1 loops=1)
   Buffers: shared hit=32644 read=852474 dirtied=437947 written=426374, 
temp read=944407 w

ritten=1130380
   ->  Gather  (cost=355905977972.65..355905977972.86 rows=2 width=8) 
(actual time=212943.

505..226097.497 rows=3 loops=1)
 Workers Planned: 2
 Workers Launched: 2
 Buffers: shared hit=32644 read=852474 dirtied=437947 
written=426374, temp read=94

4407 written=1130380
 ->  Partial Aggregate (cost=355905976972.65..355905976972.66 
rows=1 width=8) (ac

tual time=212938.410..212940.035 rows=1 loops=3)
   Buffers: shared hit=32644 read=852474 dirtied=437947 
written=426374, temp r

ead=944407 written=1130380
   ->  Parallel Hash Join (cost=1542739.26..303822614472.65 
rows=2083334502 width=0) (actual time=163268.274..207829.524 
rows= loops=3)

 Hash Cond: (t1.pk = t2.pk)
 Buffers: shared hit=32644 read=852474 
dirtied=437947 written=426374, temp read=944407 written=1130380
 ->  Parallel Seq Scan on t t1 
(cost=0.00..859144.78 rows=4178 width=4) (actual 
time=0.045..30828.051 rows= loops=3)

   Buffers: shared hit=16389 read=426089 written=87
 ->  Parallel Hash (cost=859144.78..859144.78 
rows=4178 width=4) (actual time=82202.445..82202.447 rows= 
loops=3)
   Buckets: 4096 (originally 4096)  Batches: 
32768 (originally 8192)  Memory Usage: 192kB
   Buffers: shared hit=16095 read=426383 
dirtied=437947 written=426287, temp read=267898 written=737164
   ->  Parallel Seq Scan on t t2 
(cost=0.00..859144.78 rows=4178 width=4) (actual 
time=0.054..12647.534 rows= loops=3)
 Buffers: shared hit=16095 read=426383 
dirtied=437947 writ

ten=426287
 Planning:
   Buffers: shared hit=69 read=38
 Planning Time: 2.819 ms
 Execution Time: 226113.292 ms
(22 rows)



-

So we have increased the number of batches to 32k.
I applied your patches 0001-0004 but unfortunately they have not reduced 
memory consumption - the size of each backend is still more than 1.3GB.


I wonder what the preferred solution to the problem could be?
We 

Re: Wrong results from Parallel Hash Full Join

2023-04-20 Thread Justin Pryzby
On Wed, Apr 19, 2023 at 08:47:07PM -0400, Melanie Plageman wrote:
> On Wed, Apr 19, 2023 at 8:41 PM Justin Pryzby  wrote:
> >
> > On Wed, Apr 19, 2023 at 12:20:51PM -0700, Andres Freund wrote:
> > > On 2023-04-19 12:16:24 -0500, Justin Pryzby wrote:
> > > > On Wed, Apr 19, 2023 at 11:17:04AM -0400, Melanie Plageman wrote:
> > > > > Ultimately this is probably fine. If we wanted to modify one of the
> > > > > existing tests to cover the multi-batch case, changing the select
> > > > > count(*) to a select * would do the trick. I imagine we wouldn't want 
> > > > > to
> > > > > do this because of the excessive output this would produce. I wondered
> > > > > if there was a pattern in the tests for getting around this.
> > > >
> > > > You could use explain (ANALYZE).  But the output is machine-dependant in
> > > > various ways (which is why the tests use "explain analyze so rarely).
> > >
> > > I think with sufficient options it's not machine specific.
> >
> > It *can* be machine specific depending on the node type..
> >
> > In particular, for parallel workers, it shows "Workers Launched: ..",
> > which can vary even across executions on the same machine.  And don't
> > forget about "loops=".
> >
> > Plus:
> > src/backend/commands/explain.c: "Buckets: %d  Batches: %d  Memory Usage: 
> > %ldkB\n",
> >
> > > We have a bunch of
> > >  EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF) ..
> > > in our tests.
> >
> > There's 81 uses of "timing off", out of a total of ~1600 explains.  Most
> > of them are in partition_prune.sql.  explain analyze is barely used.
> >
> > I sent a patch to elide the machine-specific parts, which would make it
> > easier to use.  But there was no interest.
> 
> While I don't know about other use cases, I would have used that here.
> Do you still have that patch laying around? I'd be interested to at
> least review it.

https://commitfest.postgresql.org/41/3409/




Re: Wrong results from Parallel Hash Full Join

2023-04-20 Thread Melanie Plageman
On Wed, Apr 19, 2023 at 8:43 PM Melanie Plageman
 wrote:
> On Wed, Apr 19, 2023 at 3:20 PM Andres Freund  wrote:
>> On 2023-04-19 12:16:24 -0500, Justin Pryzby wrote:
>> > On Wed, Apr 19, 2023 at 11:17:04AM -0400, Melanie Plageman wrote:
>> > > Ultimately this is probably fine. If we wanted to modify one of the
>> > > existing tests to cover the multi-batch case, changing the select
>> > > count(*) to a select * would do the trick. I imagine we wouldn't want to
>> > > do this because of the excessive output this would produce. I wondered
>> > > if there was a pattern in the tests for getting around this.
>> >
>> > You could use explain (ANALYZE).  But the output is machine-dependant in
>> > various ways (which is why the tests use "explain analyze so rarely).
>>
>> I think with sufficient options it's not machine specific. We have a bunch of
>>  EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF) ..
>> in our tests.
>
>
> Cool. Yea, so ultimately these options are almost enough but memory
> usage changes from execution to execution. There are some tests which do
> regexp_replace() on the memory usage part of the EXPLAIN ANALYZE output
> to allow us to still compare the plans. However, I figured if I was
> already going to go to the trouble of using regexp_replace(), I might as
> well write a function that returns the "Actual Rows" field from the
> EXPLAIN ANALYZE output.
>
> The attached patch does that. I admittedly mostly copy-pasted the
> plpgsql function from similar examples in other tests, and I suspect it
> may be overkill and also poorly written.

I renamed the function to join_hash_actual_rows to avoid potentially
affecting other tests. Nothing about the function is specific to a hash
join plan, so I think it is more clear to prefix the function with the
test file name. v2 attached.

- Melanie
From d8f6946cf127092913d8f1b7a9741b97f3215bbd Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Wed, 19 Apr 2023 20:09:37 -0400
Subject: [PATCH v2] Test multi-batch PHJ match bit initialization

558c9d75 fixed a bug with initialization of the hash join match status
bit and added test coverage for serial hash join and single batch
parallel hash join. Add a test for multi-batch parallel hash join.

To test this with the existing relations in the join_hash test file,
this commit modifies some of the test queries to use SELECT * instead of
SELECT COUNT(*), which was needed to ensure that the tuples being
inserted into the hashtable had not already been made into virtual
tuples and lost their HOT status bit.

These modifications made the original tests introduced in 558c9d75
redundant.

Discussion: https://postgr.es/m/ZEAh6CewmARbDVuN%40telsasoft.com
---
 src/test/regress/expected/join_hash.out | 134 +++-
 src/test/regress/sql/join_hash.sql  |  72 +++--
 2 files changed, 100 insertions(+), 106 deletions(-)

diff --git a/src/test/regress/expected/join_hash.out b/src/test/regress/expected/join_hash.out
index e892e7cccb..d4a70f6754 100644
--- a/src/test/regress/expected/join_hash.out
+++ b/src/test/regress/expected/join_hash.out
@@ -49,10 +49,35 @@ begin
   end loop;
 end;
 $$;
+-- Return the number of actual rows returned by a query. This is useful when a
+-- count(*) would not execute the desired codepath but a SELECT * would produce
+-- an excessive number of rows.
+create or replace function join_hash_actual_rows(query text)
+returns jsonb language plpgsql
+as
+$$
+declare
+  elements jsonb;
+begin
+  execute 'explain (analyze, costs off, summary off, timing off, format ''json'') ' || query into strict elements;
+  if jsonb_array_length(elements) > 1 then
+raise exception 'Cannot return actual rows from more than one plan';
+  end if;
+  return elements->0->'Plan'->'Actual Rows';
+end;
+$$;
 -- Make a simple relation with well distributed keys and correctly
 -- estimated size.
-create table simple as
-  select generate_series(1, 2) AS id, 'aa';
+create table simple (id int, value text);
+insert into simple values (1, 'aa');
+-- Hash join reuses the HOT status bit to indicate match status. This can only
+-- be guaranteed to produce correct results if all the hash join tuple match
+-- bits are reset before reuse. This is done upon loading them into the
+-- hashtable. To test this, update simple, creating a HOT tuple. If this status
+-- bit isn't cleared, we won't correctly emit the NULL-extended unmatching
+-- tuple in full hash join.
+update simple set id = 1;
+insert into simple select generate_series(2, 2), 'aa';
 alter table simple set (parallel_workers = 2);
 analyze simple;
 -- Make a relation whose size we will under-estimate.  We want stats
@@ -273,6 +298,7 @@ set local max_parallel_workers_per_gather = 2;
 set local work_mem = '192kB';
 set local hash_mem_multiplier = 1.0;
 set local enable_parallel_hash = on;
+set local parallel_tuple_cost = 0;
 explain 

Re: LLVM strip -x fails

2023-04-20 Thread Andres Freund
Hi,

Moving this to -hackers.

On 2023-04-20 11:49:23 +0200, Palle Girgensohn wrote:
> I was recently made aware of a problem building postgresql using LLVM 
> binutils.
> 
> A summary:
> 
> --
> 
> pgsql's build has requested to strip all non-global symbols (strip -x), but
> there is at least one non-global symbol that in fact cannot be stripped
> because it is referenced by a relocation.

An important detail here is that this happens when stripping static
libraries. It's not too surprising that one needs the symbols referenced by
relocations until after linking with the static lib, even if they're not
globally visible symbols.


> Both GNU strip and ELF Tool Chain strip silently handle this case (and just 
> retain the local symbol), but LLVM strip is stricter and emits an error upon 
> request to strip a non-removable local symbol.
> 
> There is an LLVM ticket open for this at 
> https://github.com/llvm/llvm-project/issues/47468, and it may make sense for 
> LLVM strip to behave the same as GNU and ELF Tool Chain strip. That said, 
> pgsql should just not use strip -x when there are symbols that cannot be 
> stripped.

Personally I'd say stripping symbols is something that should just not be done
anymore, but ...

> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=270769
> 
> https://reviews.freebsd.org/D39590
> 
> 
> Any toughts about this? Should send the suggested patch upstreams? Or do we 
> just consider LLVM to behave badly?

Peter, it's unlikely given the timeframe, but do you happen to remember why
you specified -x when stripping static libs? This seems to be all the way back
from

commit 563673e15db995b6f531b44be7bb162330ac157a
Author: Peter Eisentraut 
Date:   2002-04-10 16:45:25 +

Add make install-strip target.


Afaict the only safe thing to use when stripping static libs is
-g/--strip-debug.

Greetings,

Andres Freund




Re: check_strxfrm_bug()

2023-04-20 Thread Jeff Davis
On Thu, 2023-04-20 at 13:34 +1200, Thomas Munro wrote:
> I could write a patch to remove the libc strxfrm support, but since
> Jeff recently wrote new code in 16 to abstract that stuff, he might
> prefer to look at it?

+1 to removing it.

As far as how it's removed, we could directly check:

  if (!collate_c && !(locale && locale->provider == COLLPROVIDER_ICU))
abbreviate = false;

as it was before, or we could still try to hide it as a detail behind a
function. I don't have a strong opinion there, though I thought it
might be good for varlena.c to not know those internal details.
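
For example, a helper along these lines (the name is invented here for
illustration; the real one would live with the locale code so varlena.c only
sees the boolean):

static inline bool
pg_collation_abbrev_allowed(pg_locale_t locale, bool collate_c)
{
	return collate_c ||
		(locale != NULL && locale->provider == COLLPROVIDER_ICU);
}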

Regards,
Jeff Davis





memory context mismatch in LockMethodLocalHash entries.

2023-04-20 Thread Ashutosh Bapat
Hi All,
LockMethodLocalHash hash table is allocated in memory context ""LOCALLOCK hash".
LockAcquireExtended() fetches an entry from this hash table and
allocates memory for locallock->lockOwners in TopMemoryContext. Thus
the entry locallock and an array pointed from this entry are allocated
in two different memory contexts.

Theoretically if LockMethodLocalHash was destroyed with some entries
in it, it would leave some dangling pointers in TopMemoryContext. It
looks to me that the array lockOwners should be allocated in same
context as LOCALLOCK hash or its child. Something like below

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 42595b38b2..32804b1c2c 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -843,7 +843,7 @@ LockAcquireExtended(const LOCKTAG *locktag,
locallock->maxLockOwners = 8;
locallock->lockOwners = NULL;   /* in case next line fails */
locallock->lockOwners = (LOCALLOCKOWNER *)
-   MemoryContextAlloc(TopMemoryContext,
+   MemoryContextAlloc(GetMemoryChunkContext(locallock),

locallock->maxLockOwners * sizeof(LOCALLOCKOWNER));
}
else

LockMethodLocalHash is destroyed with hash_destroy() in InitLocks(). The comment
there mentions that possibly there are no entries in that hash table
at that time. So the problem described above is rare or non-existent
as of now. But it's possibly worth fixing while it's not too serious.

There were some untraceable bugs related to locks reported earlier
[1]. Those may be linked to this. But we couldn't establish the
connection.

[1] 
https://www.postgresql.org/message-id/flat/5227.1315428924%40sss.pgh.pa.us#00116525613b7ddb82669d2ba358b31e

-- 
Best Wishes,
Ashutosh Bapat




Re: Should we put command options in alphabetical order in the doc?

2023-04-20 Thread Daniel Gustafsson
> On 20 Apr 2023, at 14:40, David Rowley  wrote:

> I see "man grep" categorises the command line options and then sorts
> alphabetically within the category.


On FreeBSD and macOS "man grep" lists all options alphabetically.

> FWIW, vacuumdb --help has its options in alphabetical order using the
> abbreviated form of the option.

It does (as most of our binaries do) group "Connection options" separately
though, and in initdb --help and pg_dump --help we have other groupings as
well.

--
Daniel Gustafsson





[BUG] FK broken after DETACHing referencing part

2023-04-20 Thread Jehan-Guillaume de Rorthais
Hi,

Considering two partitioned tables with a FK between them:

  DROP TABLE IF EXISTS p, c, c_1 CASCADE;

  --
  -- Parent table + partition + data
  CREATE TABLE p (
id bigint PRIMARY KEY
  )
  PARTITION BY list (id);

  CREATE TABLE p_1 PARTITION OF p FOR VALUES IN (1);

  INSERT INTO p VALUES (1);

  
  -- Child table + partition + data
  CREATE TABLE c (
idbigint PRIMARY KEY,
p_id bigint NOT NULL,
FOREIGN KEY (p_id) REFERENCES p (id)
  )
  PARTITION BY list (id);

  CREATE TABLE c_1 PARTITION OF c FOR VALUES IN (1);

  INSERT INTO c VALUES (1,1);

After DETACHing the "c_1" partition, the current implementation makes sure it
keeps the FK inherited from its previous top table "c":

  ALTER TABLE c DETACH PARTITION c_1;
  \d c_1
  -- outputs:
  -- [...]
  -- Foreign-key constraints:
  --   "c_p_id_fkey" FOREIGN KEY (p_id) REFERENCES p(id)

However, because the referenced side is partitioned, this FK is half-baked, with
only the referencing (insert/update on c_1) side enforced, but not the
referenced side (update/delete on p):

  INSERT INTO c_1 VALUES (2,2); -- fails as EXPECTED
  -- ERROR:  insert or update on table "child_1" violates foreign key [...]

  DELETE FROM p; -- should actually fail
  -- DELETE 1

  SELECT * FROM c_1;
  --  id | parent_id 
  -- +---
  --   1 | 1
  -- (1 row)

  SELECT * FROM p;
  --  id 
  -- 
  -- (0 rows)

When detaching "c_1", current implementation adds two triggers to enforce
UPDATE/DELETE on "p" are restricted if "c_1" keeps referencing the
related rows... But it forgets to add them on partitions of "p_1", where the
triggers should actually fire.

To make it clear, the FK c_1 -> p constraint and triggers after DETACHING c_1
are:

  SELECT c.oid AS conid, c.conname, c.conparentid AS conparent,
 r2.relname AS pkrel,
 t.tgrelid::regclass AS tgrel,
 p.proname
  FROM pg_constraint c 
  JOIN pg_class r ON c.conrelid = r.oid
  JOIN pg_class r2 ON c.confrelid = r2.oid
  JOIN pg_trigger t ON t.tgconstraint = c.oid
  JOIN pg_proc p ON p.oid = t.tgfoid
  WHERE r.relname = 'c_1' AND r2.relname LIKE 'p%'
  ORDER BY r.relname, c.conname, t.tgrelid::regclass::text, p.proname;

  --  conid |   conname   | conparent | pkrel | tgrel |   proname
  -- ---+-+---+---+---+--
  --  18454 | c_p_id_fkey | 0 | p | c_1   | RI_FKey_check_ins
  --  18454 | c_p_id_fkey | 0 | p | c_1   | RI_FKey_check_upd
  --  18454 | c_p_id_fkey | 0 | p | p | RI_FKey_noaction_del
  --  18454 | c_p_id_fkey | 0 | p | p | RI_FKey_noaction_upd

Where they should be:

  --  conid |   conname| conparent | pkrel | tgrel |   proname
  -- ---+--+---+---+---+--
  --  18454 | c_p_id_fkey  | 0 | p | c_1   | RI_FKey_check_ins
  --  18454 | c_p_id_fkey  | 0 | p | c_1   | RI_FKey_check_upd
  --  18454 | c_p_id_fkey  | 0 | p | p | RI_FKey_noaction_del
  --  18454 | c_p_id_fkey  | 0 | p | p | RI_FKey_noaction_upd
  --  NEW!! | c_p_id_fkey1 | 18454 | p_1   | p_1   | RI_FKey_noaction_del
  --  NEW!! | c_p_id_fkey1 | 18454 | p_1   | p_1   | RI_FKey_noaction_upd

I poked around DetachPartitionFinalize() to try to find a way to fix this, but
it looks like it would duplicate a bunch of code from other code paths (eg.
from CloneFkReferenced).

Instead of tweaking the existing FK, keeping the old constraint name (wouldn't
"c_1_p_id_fkey" be better after detach?) and duplicating some code around, what
about cleaning up the FK constraints from the detached table and recreating
cleaner ones using the known code path ATAddForeignKeyConstraint()?
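
For reference, that is roughly what one gets today by fixing things up by hand
right after the DETACH (while "p" still holds the referenced row) -- only a
sketch, reusing the names from the example above:

  -- Drop the half-baked FK and recreate it through the regular
  -- ALTER TABLE ... ADD FOREIGN KEY path, which installs the action
  -- triggers on "p" and on its partition "p_1" as well.
  ALTER TABLE c_1 DROP CONSTRAINT c_p_id_fkey;
  ALTER TABLE c_1 ADD CONSTRAINT c_1_p_id_fkey
    FOREIGN KEY (p_id) REFERENCES p (id);

  DELETE FROM p;  -- now fails with an RI violation, as expected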

Thanks for reading me down to here!

++




Re: Should we put command options in alphabetical order in the doc?

2023-04-20 Thread David Rowley
On Wed, 19 Apr 2023 at 22:04, Alvaro Herrera  wrote:
>
> On 2023-Apr-18, Peter Geoghegan wrote:
>
> > While I'm certain that nobody will agree with me on every little
> > detail, I have to imagine that most would find my preferred ordering
> > quite understandable and unsurprising, at a high level -- this is not
> > a hopelessly idiosyncratic ranking, that could just as easily have
> > been generated by a PRNG. People may not easily agree that "apples are
> > more important than oranges, or vice-versa", but what does it matter?
> > I've really only put each option into buckets of items with *roughly*
> > the same importance. All of the details beyond that don't matter to
> > me, at all.
>
> I agree with you that roughly bucketing items is a good approach.
> Within each bucket we can then sort alphabetically.

If these "buckets" were subcategories, then it might be ok. I see "man
grep" categorises the command line options and then sorts
alphabetically within the category. If we could come up with a way of
categorising the options then this would satisfy what Melanie
mentioned about having the argument types listed separately. However,
I'm really not sure which categories we could have.  I really don't
have any concrete ideas here, but I'll attempt to at least start
something:

Behavioral:
ANALYZE
DISABLE_PAGE_SKIPPING
FREEZE
FULL
INDEX_CLEANUP
ONLY_DATABASE_STATS
PROCESS_MAIN
PROCESS_TOAST
SKIP_DATABASE_STATS
SKIP_LOCKED
TRUNCATE

Resource Usage:
BUFFER_USAGE_LIMIT
PARALLEL

Informational:
VERBOSE

Option Parameters:
boolean
column_name
integer
size
table_name

I'm just not sure if we have enough options to have a need to
categorise them.  Also, going by the categories I attempted to come up
with, it just feels like "Behavioral" contains too many and
"Informational" is likely only ever going to contain VERBOSE. So I'm
not very happy with them.

I'm not really feeling excited enough about this to even come up with
a draft patch. I thought I'd send this out anyway to see if anyone can
think of anything better.

FWIW, vacuumdb --help has its options in alphabetical order using the
abbreviated form of the option.

David




Re: Initial Schema Sync for Logical Replication

2023-04-20 Thread Kumar, Sachin
I am working on a prototype of the idea discussed above; I think I will send it
for initial review by Monday.

Regards
Sachin



Re: Support logical replication of DDLs

2023-04-20 Thread shveta malik
On Thu, Apr 20, 2023 at 2:28 PM shveta malik  wrote:
>
> On Thu, Apr 20, 2023 at 9:11 AM shveta malik  wrote:
>>
>> On Mon, Apr 17, 2023 at 5:32 PM Zhijie Hou (Fujitsu)
>>  wrote:
>> >
>> > Attach the new version patch set which include the following changes:
>> Few comments for ddl_deparse.c in patch dated April17:
>>
>  Few comments for ddl_json.c in the patch dated April17:
>

A few more comments, mainly for event_trigger.c, in the patch dated April 17:

1) EventTriggerCommonSetup()
+pub_funcoid = LookupFuncName(pubfuncname, 0, NULL, true);

This is the code where we have special handling for ddl triggers. Here
we are passing 'missing_ok' as true, so shouldn't we check pub_funcoid
against 'InvalidOid'? (A rough sketch of such a guard is at the end of
this mail.)


2) EventTriggerTableInitWriteEnd()

+ if (stmt->objtype == OBJECT_TABLE)
+ {
+parent = currentEventTriggerState->currentCommand->parent;
+pfree(currentEventTriggerState->currentCommand);
+currentEventTriggerState->currentCommand = parent;
+ }
+ else
+ {
+MemoryContext oldcxt;
+oldcxt = MemoryContextSwitchTo(currentEventTriggerState->cxt);
+currentEventTriggerState->currentCommand->d.simple.address = address;
+currentEventTriggerState->commandList =
+ lappend(currentEventTriggerState->commandList,
+ currentEventTriggerState->currentCommand);
+
+ MemoryContextSwitchTo(oldcxt);
+ }
+}

It would be good to add some comments in the 'else' part. It is not
very clear what exactly we are doing here and for which scenario.


3) EventTriggerTableInitWrite()

+ if (!currentEventTriggerState)
+ return;
+
+ /* Do nothing if no command was collected. */
+ if (!currentEventTriggerState->currentCommand)
+  return;

Is it possible that no command has been collected yet when we reach
here? I think this can happen only when commandCollectionInhibited is
true. If so, the above can be modified to:

if (!currentEventTriggerState ||
currentEventTriggerState->commandCollectionInhibited)
return;
Assert(currentEventTriggerState->currentCommand != NULL);

This will make the behaviour and checks consistent across similar
functions in this file.


4) EventTriggerTableInitWriteEnd()
Here as well, we can have the same assert as below:
 Assert(currentEventTriggerState->currentCommand != NULL);
'currentEventTriggerState' and 'commandCollectionInhibited' checks are
already present.

5) logical_replication.sgml:
 +  'This is especially useful if you have lots schema changes over
time that need replication.'

 lots schema --> lots of schema
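
For comment 1, a rough sketch of the kind of guard I have in mind (a sketch
only -- whether a missing function should raise an error or just skip the
special handling is for the patch author to decide; the error path below is
an assumption):

pub_funcoid = LookupFuncName(pubfuncname, 0, NULL, true);

/* With missing_ok = true, a missing function comes back as InvalidOid. */
if (!OidIsValid(pub_funcoid))
    ereport(ERROR,
            errmsg("function %s does not exist",
                   NameListToString(pubfuncname)));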

thanks
Shveta




Re: Should we put command options in alphabetical order in the doc?

2023-04-20 Thread Melanie Plageman
On Wed, Apr 19, 2023 at 05:33:47PM -0400, Melanie Plageman wrote:
> On Wed, Apr 19, 2023 at 2:39 PM Peter Geoghegan  wrote:
> >
> > On Wed, Apr 19, 2023 at 3:04 AM Alvaro Herrera  
> > wrote:
> > > > While I'm certain that nobody will agree with me on every little
> > > > detail, I have to imagine that most would find my preferred ordering
> > > > quite understandable and unsurprising, at a high level -- this is not
> > > > a hopelessly idiosyncratic ranking, that could just as easily have
> > > > been generated by a PRNG. People may not easily agree that "apples are
> > > > more important than oranges, or vice-versa", but what does it matter?
> > > > I've really only put each option into buckets of items with *roughly*
> > > > the same importance. All of the details beyond that don't matter to
> > > > me, at all.
> > >
> > > I agree with you that roughly bucketing items is a good approach.
> > > Within each bucket we can then sort alphabetically.
> >
> > I think of these buckets as working at a logarithmic scale. The FULL,
> > FREEZE, VERBOSE, and ANALYZE options are multiple orders of magnitude
> > more important than most of the other options, and maybe one order of
> > magnitude more important than the PARALLEL, TRUNCATE, and
> > INDEX_CLEANUP options. With differences that big, you have a structure
> > that generalizes across all users quite well. This doesn't seem
> > particularly subjective.
> 
> I actually favor query/command order followed by alphabetical order for
> most of the commands David included in his patch.
> 
> Of course the parameter argument types, like boolean and integer, should
> be grouped together separate from the main parameters. David fit this
> into the alphabetical paradigm by doing uppercase alphabetical followed
> by lowercase alphabetical. There are some specific cases where I think
> this isn't working quite as intended in his patch. I've called those out
> in my command-by-command code review below.
> 
> I actually think we should consider having a single location which
> defines all argument types for all SQL command parameters. Then we
> wouldn't need to define them for each command. We could simply link to
> the definition from the synopsis. That would clean up these lists quite
> a bit. Perhaps there is some variation from command to command in the
> actual definitions, though (I haven't checked). I would be happy to try
> and write this patch if folks are interested in the idea.

I looked into this and it isn't a good idea. Out of the 183 SQL
commands, really only ANALYZE, VACUUM, COPY, CLUSTER, EXPLAIN, and
REINDEX have parameter argument types that are context-independent. And
out of those, boolean is the only type shared by all. VACUUM is the only
one with more than one parameter argument "type". So, it is basically
just a bad idea. Oh well...

- Melanie




Re: [PATCH] Allow Postgres to pick an unused port to listen

2023-04-20 Thread Yurii Rashkovskii
Aleksander,

On Thu, Apr 20, 2023 at 1:22 PM Aleksander Alekseev <
aleksan...@timescale.com> wrote:

> Hi,
>
> > Also, I don't think there's a case for distributed systems here because
> we're only managing a single computer's resource: the allocation of local
> ports.
>
> I don't suggest building a distributed system but rather using
> well-known solutions from this area. For the described case the
> "resource manager" will be as simple a single Consul instance (a
> single binary file, since Consul is written in Go) running locally.
> The "complexity" would be for the test framework to use a few extra
> REST queries. Arguably not that complicated.
>

Bringing in a process that works over a REST API (which itself requires a
port, by the way) and complicating the rollout of such an environment is
antithetical to simplicity and thus will only make things worse. If this is
the alternative, I'd rather have a few retries or some other small hacks.

Bringing in a new dependency via Python is also a heavyweight solution I'd
rather avoid. This is a problem with a relatively small surface. If the
patch doesn't go through, I'll just find a workaround to live with, but I'd
rather stay away from blowing the development environment out of proportion
for something so minuscule.


>
> > using a KV store to lease a port does not guarantee the port's
> availability
>
> I assume you don't have random processes doing random things (like
> listening random ports) on a CI machine. You know that certain ports
> are reserved for the tests and are going to be used only for this
> purpose using the same leasing protocol.
>

This is intended to be used on CI and development workstations, where all
bets are kind of off.


>
> > For example, I'd suggest adding an option to Postgres to receive sockets
> it should listen [...]
>
> Not sure if I fully understood the idea, but it looks like you are
> suggesting to build in certain rather complicated functionality for an
> arguably rare use case so that a QA engineer didn't have one extra
> small dependency to worry about in this rare case. I'm quite skeptical
> that this is going to happen.
>

My suggestion was simply to allow listening on a wildcard port (port 0) and
to be able to read the assigned port back in some way. Nothing particularly
complicated. The fact that the process may die before it is connected to is a
rather strange argument, as the same can happen outside of this use case.
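
For reference, a minimal standalone sketch of the underlying OS mechanism
(bind to port 0, then read the kernel-assigned port back with getsockname());
this only illustrates the idea, it is not the patch itself:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int
main(void)
{
    struct sockaddr_in addr;
    socklen_t addrlen = sizeof(addr);
    int sock = socket(AF_INET, SOCK_STREAM, 0);

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;          /* 0 = let the kernel pick an unused port */

    if (bind(sock, (struct sockaddr *) &addr, sizeof(addr)) == 0 &&
        getsockname(sock, (struct sockaddr *) &addr, &addrlen) == 0)
        printf("assigned port: %d\n", ntohs(addr.sin_port));

    close(sock);
    return 0;
}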


-- 
Y.


Re: Note new NULLS NOT DISTINCT on unique index tutorial page

2023-04-20 Thread David Rowley
On Thu, 20 Apr 2023 at 12:04, David Gilman  wrote:
> The revised patch is good. Please go ahead and commit whatever
> phrasing you or the other committers find acceptable. I don't really
> have any preferences in how this is exactly phrased, I just think it
> should be mentioned in the docs.

Thanks.  With that, I admit to further adjusting the wording before I
pushed the result.

David




Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-04-20 Thread vignesh C
On Thu, 20 Apr 2023 at 11:01, Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Vignesh,
>
> Thank you for reviewing! PSA new patchset.
>
> > > Additionally, I added a checking functions in 0003
> > > According to pg_resetwal and other functions, the length of
> > CHECKPOINT_SHUTDOWN
> > > record seems (SizeOfXLogRecord + SizeOfXLogRecordDataHeaderShort +
> > sizeof(CheckPoint)).
> > > Therefore, the function ensures that the difference between current insert
> > position
> > > and confirmed_flush_lsn is less than (above + page header).
> >
> > Logical replication slots can be created only if wal_level >= logical,
> > currently we do not have any check to see if wal_level >= logical if
> > "--include-logical-replication-slots" option is specified. If
> > include-logical-replication-slots is specified with pg_upgrade, we
> > will be creating replication slots after a lot of steps like
> > performing prechecks, analyzing, freezing, deleting, restoring,
> > copying, setting related objects and then create replication slot,
> > where we will be erroring out after a lot of time(Many cases
> > pg_upgrade takes a lot of hours to perform these operations). I feel
> > it would be better to add a check in the beginning itself somewhere in
> > check_new_cluster to see if wal_level is set appropriately in case of
> > include-logical-replication-slot option to detect and throw an error
> > early itself.
>
> I see your point. Moreover, I think max_replication_slots != 0 must be also 
> checked.
> I added a checking function and related test in 0001.

Thanks for the updated patch.
A few comments:
1) If the verbose option is enabled, we should print the new slot
information. We could add a function print_slot_infos, similar to
print_rel_infos, which could print the slot name and whether two_phase
is enabled (a rough sketch is at the end of these comments).
+   end_progress_output();
+   check_ok();
+
+   /* update new_cluster info now that we have objects in the databases */
+   get_db_and_rel_infos(&new_cluster);
+}

2) Since we will be using this option with pg_upgrade, should we use
this along with the --binary-upgrade option only?
+   if (dopt.logical_slots_only && dopt.dataOnly)
+   pg_fatal("options --logical-replication-slots-only and
-a/--data-only cannot be used together");
+   if (dopt.logical_slots_only && dopt.schemaOnly)
+   pg_fatal("options --logical-replication-slots-only and
-s/--schema-only cannot be used together");

3) Since two_phase is a boolean, can we use the bool data type instead of a string:
+   slotinfo[i].dobj.objType = DO_LOGICAL_REPLICATION_SLOT;
+
+   slotinfo[i].dobj.catId.tableoid = InvalidOid;
+   slotinfo[i].dobj.catId.oid = InvalidOid;
+   AssignDumpId(&slotinfo[i].dobj);
+
+   slotinfo[i].dobj.name = pg_strdup(PQgetvalue(res, i,
i_slotname));
+
+   slotinfo[i].plugin = pg_strdup(PQgetvalue(res, i, i_plugin));
+   slotinfo[i].twophase = pg_strdup(PQgetvalue(res, i,
i_twophase));

We can change it to something like:
if (strcmp(PQgetvalue(res, i, i_twophase), "t") == 0)
slotinfo[i].twophase = true;
else
slotinfo[i].twophase = false;

4) The comments are inconsistent; some end with punctuation and some
don't. We can keep them consistent:
+# Can be changed to test the other modes.
+my $mode = $ENV{PG_TEST_PG_UPGRADE_MODE} || '--copy';
+
+# Initialize old publisher node
+my $old_publisher = PostgreSQL::Test::Cluster->new('old_publisher');
+$old_publisher->init(allows_streaming => 'logical');
+$old_publisher->start;
+
+# Initialize subscriber node
+my $subscriber = PostgreSQL::Test::Cluster->new('subscriber');
+$subscriber->init(allows_streaming => 'logical');
+$subscriber->start;
+
+# Schema setup
+$old_publisher->safe_psql('postgres',
+   "CREATE TABLE tbl AS SELECT generate_series(1,10) AS a");
+$subscriber->safe_psql('postgres', "CREATE TABLE tbl (a int)");
+
+# Initialize new publisher node

5) Should we use free instead of pfree, as is used in other functions like
dumpForeignServer:
+   appendPQExpBuffer(query, ");");
+
+   ArchiveEntry(fout, slotinfo->dobj.catId, slotinfo->dobj.dumpId,
+ARCHIVE_OPTS(.tag = slotname,
+
.description = "REPLICATION SLOT",
+
.section = SECTION_POST_DATA,
+
.createStmt = query->data));
+
+   pfree(slotname);
+   destroyPQExpBuffer(query);
+   }
+}
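
For comment 1, a rough sketch of what I have in mind, loosely modeled on
print_rel_infos(); the struct and field names below are assumptions based on
this patch and should be adjusted to whatever it actually defines:

static void
print_slot_infos(LogicalSlotInfoArr *slot_arr)
{
    /* Assumed type/field names; pg_log(PG_VERBOSE, ...) is the existing API. */
    for (int slotnum = 0; slotnum < slot_arr->nslots; slotnum++)
    {
        LogicalSlotInfo *slot_info = &slot_arr->slots[slotnum];

        pg_log(PG_VERBOSE, "slot_name: \"%s\", plugin: \"%s\", two_phase: %d",
               slot_info->slotname, slot_info->plugin, slot_info->two_phase);
    }
}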

Regards,
Vignesh




Re: [PATCH] Allow Postgres to pick an unused port to listen

2023-04-20 Thread Aleksander Alekseev
Hi,

> Also, I don't think there's a case for distributed systems here because we're 
> only managing a single computer's resource: the allocation of local ports.

I don't suggest building a distributed system but rather using
well-known solutions from this area. For the described case the
"resource manager" would be as simple as a single Consul instance (a
single binary file, since Consul is written in Go) running locally.
The "complexity" would be for the test framework to use a few extra
REST queries. Arguably not that complicated.

> using a KV store to lease a port does not guarantee the port's availability

I assume you don't have random processes doing random things (like
listening on random ports) on a CI machine. You know that certain ports
are reserved for the tests and are going to be used only for this
purpose, using the same leasing protocol.

If there are random things happening on CI you have no control of, you
are having a problem with the CI infrastructure, not with Postgres.

> For example, I'd suggest adding an option to Postgres to receive sockets it 
> should listen [...]

Not sure if I fully understood the idea, but it looks like you are
suggesting building rather complicated functionality into the server for
an arguably rare use case, so that a QA engineer doesn't have one extra
small dependency to worry about in this rare case. I'm quite skeptical
that this is going to happen.

> I am curious, Yurii, is Postgres the only service that need an unused
> port for listening in your testing/application environment? Otherwise,
> how is this handled in other software?

That's a very good point.

To clarify, there is nothing wrong with the patch per se. It's merely
an unreliable solution to the problem it is supposed to address. I
don't think we should encourage users to build unreliable systems.

-- 
Best regards,
Aleksander Alekseev




Re: Initial Schema Sync for Logical Replication

2023-04-20 Thread Amit Kapila
On Mon, Apr 17, 2023 at 9:12 AM Masahiko Sawada  wrote:
>
> On Fri, Apr 7, 2023 at 6:37 PM Amit Kapila  wrote:
> >
> > On Thu, Apr 6, 2023 at 6:57 PM Masahiko Sawada  
> > wrote:
> > >
> > >
> > > While writing a PoC patch, I found some difficulties in this idea.
> > > First, I tried to add schemaname+relname to pg_subscription_rel but I
> > > could not define the primary key of pg_subscription_rel. The primary
> > > key on (srsubid, srrelid) doesn't work since srrelid could be NULL.
> > > Similarly, the primary key on (srsubid, srrelid, schemaname, relname)
> > > also doesn't work.
> > >
> >
> > Can we think of having a separate catalog table say
> > pg_subscription_remote_rel for this? You can have srsubid,
> > remote_schema_name, remote_rel_name, etc. We may need some other state
> > to be maintained during the initial schema sync where this table can
> > be used. Basically, this can be used to maintain the state till the
> > initial schema sync is complete because we can create a relation entry
> > in pg_subscritption_rel only after the initial schema sync is
> > complete.
>
> It might not be ideal but I guess it works. But I think we need to
> modify the name of replication slot for initial sync as it currently
> includes OID of the table:
>
> void
> ReplicationSlotNameForTablesync(Oid suboid, Oid relid,
> char *syncslotname, Size szslot)
> {
> snprintf(syncslotname, szslot, "pg_%u_sync_%u_" UINT64_FORMAT, suboid,
>  relid, GetSystemIdentifier());
> }
>
> If we use both schema name and table name, it's possible that slot
> names are duplicated if schema and/or table names are long. Another
> idea is to use the hash value of schema+table names, but it cannot
> completely eliminate that possibility, and probably would make
> investigation and debugging hard in case of any failure. Probably we
> can use the OID of each entry in pg_subscription_remote_rel instead,
> but I'm not sure it's a good idea, mainly because we will end up using
> twice as many OIDs as before.
>

The other possibility is to use the worker_pid. To make debugging easier,
we may want to LOG the schema_name+rel_name to slot_name mapping at the
DEBUG1 log level.
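
A minimal sketch of that naming scheme, purely for illustration (MyProcPid
comes from miscadmin.h; the exact format is of course up for discussion):

snprintf(syncslotname, szslot, "pg_%u_sync_%d_" UINT64_FORMAT,
         suboid, MyProcPid, GetSystemIdentifier());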

-- 
With Regards,
Amit Kapila.




Re: Support logical replication of DDLs

2023-04-20 Thread Amit Kapila
On Thu, Apr 20, 2023 at 9:11 AM shveta malik  wrote:
>
>
> Few comments for ddl_deparse.c in patch dated April17:
>
> 1) append_format_string()
> I think we need to have 'Assert(sub_fmt)' here like we have it in all
> other similar functions (append_bool_object, append_object_object,
> ...)
>

+1.

> 2) append_object_to_format_string()
> here we have code piece :
> if (sub_fmt == NULL || tree->fmtinfo == NULL)
> return sub_fmt;
> but sub_fmt will never be null when we reach this function as all its
> callers assert for null sub_fmt. So that means when tree->fmtinfo is
> null, we end up returning sub_fmt as it is, instead of extracting
> object name from that. Is this intended?
>
> 3) We can remove extra spaces after full-stop in the comment below
> /*
>  * Deparse a ColumnDef node within a typed table creation. This is simpler
>  * than the regular case, because we don't have to emit the type declaration,
>  * collation, or default.  Here we only return something if the column is 
> being
>  * declared NOT NULL.
>  ...
>  deparse_ColumnDef_typed()
>

Which full stop are you referring to here, the first or the second? I see
there is a tab after the first one which should be changed to a space.

>
> 4) These functions are not being used, do we need to retain these in this 
> patch?
> deparse_Type_Storage()
> deparse_Type_Receive()
> deparse_Type_Send()
> deparse_Type_Typmod_In()
> deparse_Type_Typmod_Out()
> deparse_Type_Analyze()
> deparse_Type_Subscript()
>

I don't think we need to retain these. And, it seems they are already removed.

> 5) deparse_AlterRelation()
> We have below variable initialized to false in the beginning
> 'boolistype = false;'
> And then we have many conditional codes using the above, eg: istype ?
> "ATTRIBUTE" : "COLUMN".  We are not changing 'istype' anywhere and it
> is hard-coded in the beginning. It means there are parts of code in
> this function which will never be htt (written for 'istype=true' case)
> , so why do we need this variable and conditional code around it?
>

I don't see any usage of istype. We should simply remove the use of
istype and use "COLUMN" directly.

>
> 6) There are plenty of places where we use 'append_not_present'
> without using 'append_null_object'.
> Do we need to have 'append_null_object' along with
> 'append_not_present' at these places?
>

What difference does it make whether or not we add a null object before not_present?

-- 
With Regards,
Amit Kapila.




RE: Initial Schema Sync for Logical Replication

2023-04-20 Thread Kumar, Sachin
I am working on a prototype with the above idea, and will send it for review by
Sunday/Monday.

Regards
Sachin


Re: [PATCH] Allow Postgres to pick an unused port to listen

2023-04-20 Thread Denis Laxalde

Hi,

Yurii Rashkovskii wrote:

On Wed, Apr 19, 2023 at 11:44 PM Aleksander Alekseev <
aleksan...@timescale.com> wrote:

I would like to suggest a patch against master (although it may be worth
backporting it) that makes it possible to listen on any unused port.

[...]

A bullet-proof approach would be (approximately) for the test
framework to lease the ports on the given machine, for instance by
using a KV value with CAS support like Consul or etcd (or another
PostgreSQL instance), as this is done for leader election in
distributed systems (so called leader lease). After leasing the port
the framework knows no other testing process on the given machine will
use it (and also it keeps telling the KV storage that the port is
still leased) and specifies it in postgresql.conf as usual.



The approach you suggest introduces a significant amount of complexity but
seemingly fails to address one of the core issues: using a KV store to
lease a port does not guarantee the port's availability. I don't believe
this is a sound way to address this issue, let alone a bulletproof one.

Also, I don't think there's a case for distributed systems here because
we're only managing a single computer's resource: the allocation of local
ports.


For this (local computer) use case, a tool such as 
https://github.com/kmike/port-for/ would do the job if I understand 
correctly (the lease thing, locally). And it would work for "anything", 
not just Postgres.


I am curious, Yurii, is Postgres the only service that needs an unused
port for listening in your testing/application environment? Otherwise,
how is this handled in other software?


Cheers,
Denis




Re: Support logical replication of DDLs

2023-04-20 Thread shveta malik
On Thu, Apr 20, 2023 at 9:11 AM shveta malik  wrote:

> On Mon, Apr 17, 2023 at 5:32 PM Zhijie Hou (Fujitsu)
>  wrote:
> >
> > Attach the new version patch set which include the following changes:
> >
>
> Few comments for ddl_deparse.c in patch dated April17:
>
>
 Few comments for ddl_json.c in the patch dated April17:

1) expand_jsonval_string()
I think we need to assert if jsonval is neither jbvString nor jbvBinary.

2) expand_jsonval_strlit()
same here, assert if not jbvString (like we do in expand_jsonval_number and
expand_jsonval_identifier etc)

3) expand_jsonb_array()
arrayelem is allocated as below, but not freed.
initStringInfo()

4) expand_jsonb_array(),
we initialize iterator as below which internally does palloc
it = JsonbIteratorInit(container);
Shall this be freed at the end? I see usage of this function in other
files. At a few places, it is freed while not freed at other places.

5) deparse_ddl_json_to_string(char *json_str, char** owner)
str = palloc(value->val.string.len + 1);
we do  palloc here and return allocated memory to caller as 'owner'. Caller
sets this 'owner' using SetConfigOption() which internally allocates new
memory and copies 'owner' to that. So the memory allocated in
deparse_ddl_json_to_string is never freed. Better way should be the caller
passing this allocated memory to deparse_ddl_json_to_string() and freeing
it when done. Thoughts?

6)expand_fmt_recursive():
value = findJsonbValueFromContainer(container, JB_FOBJECT, );
Should this 'value' be freed at the end like we do at all other places in
this file?


thanks
Shveta


Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-04-20 Thread Julien Rouhaud
Hi,

On Thu, Apr 20, 2023 at 05:31:16AM +, Hayato Kuroda (Fujitsu) wrote:
> Dear Vignesh,
>
> Thank you for reviewing! PSA new patchset.
>
> > > Additionally, I added a checking functions in 0003
> > > According to pg_resetwal and other functions, the length of
> > CHECKPOINT_SHUTDOWN
> > > record seems (SizeOfXLogRecord + SizeOfXLogRecordDataHeaderShort +
> > sizeof(CheckPoint)).
> > > Therefore, the function ensures that the difference between current insert
> > position
> > > and confirmed_flush_lsn is less than (above + page header).

I think that this test should be different when just checking the
prerequisites (live_check / --check) compared to actually doing the upgrade,
as it's almost guaranteed that the slots won't have sent everything while the
source server is up and running.

Maybe simply check that all logical slots are currently active when running
the live check, so at least there's a good chance that they will still be
active at shutdown and will therefore send all the data to the subscribers?
Having a regression test for that scenario would also be a good idea.  Having
an uncommitted write transaction should be enough to cover it.
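
For reference, something along these lines could be used to spot problematic
slots during the live check (only a sketch of the idea):

SELECT slot_name, plugin, active, confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_type = 'logical' AND NOT active;

Any row returned there would be a hint that a subscriber is not connected and
that data could be left behind at shutdown.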




Re: Incremental sort for access method with ordered scan support (amcanorderbyop)

2023-04-20 Thread Richard Guo
On Thu, Apr 20, 2023 at 6:38 AM David Rowley  wrote:

> That function is pretty new and was exactly added so we didn't have to
> write list_truncate(list_copy(...), n) anymore.  That gets pretty
> wasteful when the input List is long and we only need a small portion
> of it.


I searched the code and found some other places where list manipulation
can be improved in a similar way.

* lappend(list_copy(list), datum) as in get_required_extension().
This is not very efficient, as right after list_copy it would need to
enlarge the list.  It can be improved by inventing a new function,
maybe called list_append_copy, that does the copy and the append
together (see the sketch a bit further below).

* lcons(datum, list_copy(list)) as in get_query_def().
This is also not efficient.  Immediately after list_copy, we'd need to
enlarge the list and move all the entries.  It can also be improved by
doing all of this in one function.

* lcons(datum, list_delete_nth_cell(list_copy(list), n)) as in
sort_inner_and_outer.
It'd need to copy all the elements, then delete the n'th entry, which
causes all following entries to be moved, and then move all the
remaining entries again for the lcons.  Maybe we can invent a new
function for it?

So is it worthwhile to improve these places?
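
As a starting point, here is a rough sketch of what list_append_copy could
look like as a new function in src/backend/nodes/list.c (the name, and the
reuse of the file-local new_list(), are assumptions):

List *
list_append_copy(const List *oldlist, void *datum)
{
    List   *newlist;

    Assert(IsPointerList(oldlist));     /* pointer lists only, like lappend() */

    if (oldlist == NIL)
        return list_make1(datum);

    /* Allocate the copy with one extra cell, so no later enlargement. */
    newlist = new_list(T_List, oldlist->length + 1);
    memcpy(newlist->elements, oldlist->elements,
           oldlist->length * sizeof(ListCell));
    lfirst(&newlist->elements[oldlist->length]) = datum;

    check_list_invariants(newlist);
    return newlist;
}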

Besides, I found one place that can be improved the same way as what we
did in 9d299a49.

--- a/src/backend/rewrite/rewriteSearchCycle.c
+++ b/src/backend/rewrite/rewriteSearchCycle.c
@@ -523,7 +523,7 @@ rewriteSearchAndCycle(CommonTableExpr *cte)

fexpr = makeFuncExpr(F_INT8INC, INT8OID, list_make1(fs),
InvalidOid, InvalidOid, COERCE_EXPLICIT_CALL);

-   lfirst(list_head(search_col_rowexpr->args)) = fexpr;
+   linitial(search_col_rowexpr->args) = fexpr;


Also, in applyparallelworker.c we have this usage:

TransactionId xid_tmp = lfirst_xid(list_nth_cell(subxactlist, i));

I wonder if we can invent a function list_nth_xid for it, to keep it
consistent with list_nth/list_nth_int/list_nth_oid.
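
Something like the following, mirroring list_nth_oid() in pg_list.h (just a
sketch of the idea):

static inline TransactionId
list_nth_xid(const List *list, int n)
{
    Assert(IsA(list, XidList));
    return lfirst_xid(list_nth_cell(list, n));
}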

Thanks
Richard


Re: Logical replication failed with SSL SYSCALL error

2023-04-20 Thread vignesh C
On Wed, 19 Apr 2023 at 17:26, shaurya jain <12345shau...@gmail.com> wrote:
>
> Hi Team,
>
> Could you please help me with this, It's urgent for the production 
> environment.
>
> On Wed, Apr 19, 2023 at 3:44 PM shaurya jain <12345shau...@gmail.com> wrote:
>>
>> Hi Team,
>>
>> Could you please help, It's urgent for the production env?
>>
>> On Sun, Apr 16, 2023 at 2:40 AM shaurya jain <12345shau...@gmail.com> wrote:
>>>
>>> Hi Team,
>>>
>>> Postgres Version:- 13.8
>>> Issue:- Logical replication failing with SSL SYSCALL error
>>> Priority:-High
>>>
>>> We are migrating our database through logical replications, and all of 
>>> sudden below error pops up in the source and target logs which leads us to 
>>> nowhere.
>>>
>>> Logs from Source:-
>>> LOG:  could not send data to client: Connection reset by peer
>>> STATEMENT:  COPY public.test TO STDOUT
>>> FATAL:  connection to client lost
>>> STATEMENT:  COPY public.test TO STDOUT
>>>
>>> Logs from Target:-
>>> 2023-04-15 19:07:02 UTC::@:[1250]:ERROR: could not receive data from WAL 
>>> stream: SSL SYSCALL error: Connection timed out
>>> 2023-04-15 19:07:02 UTC::@:[1250]:CONTEXT: COPY test, line 365326932
>>> 2023-04-15 19:07:03 UTC::@:[505]:LOG: background worker "logical 
>>> replication worker" (PID 1250) exited with exit code 1
>>> 2023-04-15 19:07:03 UTC::@:[7155]:LOG: logical replication table 
>>> synchronization worker for subscription " sub_tables_2_180", table "test" 
>>> has started
>>> 2023-04-15 19:12:05 
>>> UTC:10.144.19.34(33276):postgres@webadmit_staging:[7112]:WARNING: there is 
>>> no transaction in progress
>>> 2023-04-15 19:14:08 
>>> UTC:10.144.19.34(33324):postgres@webadmit_staging:[6052]:LOG: could not 
>>> receive data from client: Connection reset by peer
>>> 2023-04-15 19:17:23 UTC::@:[2112]:ERROR: could not receive data from WAL 
>>> stream: SSL SYSCALL error: Connection timed out
>>> 2023-04-15 19:17:23 UTC::@:[1089]:ERROR: could not receive data from WAL 
>>> stream: SSL SYSCALL error: Connection timed out
>>> 2023-04-15 19:17:23 UTC::@:[2556]:ERROR: could not receive data from WAL 
>>> stream: SSL SYSCALL error: Connection timed out
>>> 2023-04-15 19:17:23 UTC::@:[505]:LOG: background worker "logical 
>>> replication worker" (PID 2556) exited with exit code 1
>>> 2023-04-15 19:17:23 UTC::@:[505]:LOG: background worker "logical 
>>> replication worker" (PID 2112) exited with exit code 1
>>> 2023-04-15 19:17:23 UTC::@:[505]:LOG: background worker "logical 
>>> replication worker" (PID 1089) exited with exit code 1
>>> 2023-04-15 19:17:23 UTC::@:[7287]:LOG: logical replication apply worker for 
>>> subscription "sub_tables_2_180" has started
>>> 2023-04-15 19:17:23 UTC::@:[7288]:LOG: logical replication apply worker for 
>>> subscription "sub_tables_3_192" has started
>>> 2023-04-15 19:17:23 UTC::@:[7289]:LOG: logical replication apply worker for 
>>> subscription "sub_tables_1_180" has started
>>>
>>> Just after this error, all other replication slots get disabled for some 
>>> time and come back online along with COPY command with the new PID in 
>>> pg_stat_activity.
>>>
>>> I have a few queries regarding this:-
>>>
>>> The exact reason for disconnection (Few articles claim memory and few 
>>> network)
This might be because of a network failure. Did you notice any network
instability? You could check the TCP settings, in particular the
tcp_keepalives_idle, tcp_keepalives_interval and tcp_keepalives_count
configurations. With these, a keepalive probe is sent once the connection
has been idle for tcp_keepalives_idle seconds; if the peer does not respond
within tcp_keepalives_interval seconds the probe is retransmitted, and the
connection is considered gone after tcp_keepalives_count lost probes.
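
As an illustration only (the values below are arbitrary and should be tuned
for your network), these can be set on the server like this, with a reload
for them to take effect:

ALTER SYSTEM SET tcp_keepalives_idle = 60;      -- idle seconds before the first probe
ALTER SYSTEM SET tcp_keepalives_interval = 10;  -- seconds between unanswered probes
ALTER SYSTEM SET tcp_keepalives_count = 5;      -- lost probes before the connection is dropped
SELECT pg_reload_conf();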

>>> Will it lead to data inconsistency?
It will not lead to inconsistency. In case of failure the failed
transaction will be rolled back.

>>> Does this new PID COPY command again migrate the whole data of the test 
>>> table once again?
Yes, in case of a failure it will copy the whole table's data again.

Regards,
Vignesh