Re: Adminpack removal
Le 27/06/2024 à 10:38, Matthias van de Meent a écrit : On Thu, 27 Jun 2024, 07:34 Philippe BEAUDOIN, wrote: Hi, I have just tested PG17 beta1 with the E-Maj solution I maintain. The only issue I found is the removal of the adminpack contrib. In the emaj extension, which is the heart of the solution, and which is written in plpgsql, the code just uses the pg_file_unlink() function to automatically remove files produced by COPY TO statements when no rows have been written. In some specific use cases, it spares the user from having to pick out a few interesting files among numerous empty files in a directory. I have found a workaround. It is a little bit ugly, but it works. So this is not blocking for me. FYI, the project's repo is on github (https://github.com/dalibo/emaj), which was supposed to be scanned to detect potential adminpack usages. The extension at first glance doesn't currently seem to depend on adminpack: it is not included in the control file as a dependency, and has not been included as a dependency since the creation of that file. You are right. Even before the adminpack usage removal, the extension was not listed as a prerequisite in the control file. In fact, I introduced a new E-Maj feature in last autumn's version that used the adminpack extension in one specific case. But the user may not install adminpack. In such a case, the feature was limited and a warning message told the user why it reached the limitation. I was waiting for some feedback before possibly adding adminpack as a real prerequisite. Where else would you expect us to search for dependencies? The word "adminpack" can be found in the sql source file (sql/emaj--4.4.0.sql), and in 2 documentation source files (in docs/en/*.rst). The pg_file_unlink() function name can be found in the same sql source file. But I understand that looking for simple strings in all types of files across a lot of repos is costly and may report a lot of noise.
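As a client-side illustration only (emaj does this server-side in plpgsql via adminpack's pg_file_unlink(); the function and file names below are made up for the sketch), the cleanup pattern described above amounts to:

```python
import os

def prune_empty_outputs(paths):
    """Delete zero-byte files left behind by a bulk export (e.g. COPY TO
    statements that wrote no rows), returning the files that actually
    contain data."""
    kept = []
    for path in paths:
        if os.path.getsize(path) == 0:
            os.unlink(path)  # analogous to pg_file_unlink() on the server
        else:
            kept.append(path)
    return kept
```

This is just a sketch of the logic; the extension itself has to do it server-side, which is why the loss of pg_file_unlink() needed a workaround.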
More broadly, my feeling is that just looking at public repositories is not enough. Usage of Postgres features can be found in: - public tools, visible in repos (on github, gitlab and some other platforms) ; - software from commercial vendors, so in closed source ; - and a huge number of applications developed inside all kinds of organizations, which are not public. So just looking at public repos probably covers less than 1% of the code. However, this may give a first idea, especially if a feature's use is already detected. In this "adminpack" case, it may be interesting to distinguish the pg_logdir_ls() function, which covers a very specific administration feature, from the other functions, which are of general interest. It wouldn't be surprising if pg_logdir_ls() were really obsolete now that it is not used by pgAdmin anymore, and thus it could be removed if nobody complains. Maybe the other functions could be integrated directly into core (or left in adminpack, with the pgAdmin reference removed from the documentation). Kind Regards. Philippe.
Re: Document use of ldapurl with LDAP simple bind
On 24.05.24 20:54, Jacob Champion wrote: Our documentation implies that the ldapurl setting in pg_hba is used for search+bind mode only. It was pointed out to me recently that this is not true, and if you're dealing with simple bind on a non-standard scheme or port, then ldapurl makes the HBA easier to read: ... ldap ldapurl="ldaps://ldap.example.net:49151" ldapprefix="cn=" ldapsuffix=", dc=example, dc=net" 0001 tries to document this helpful behavior a little better, and 0002 pins it with a test. WDYT? Yes, this looks correct. Since ldapurl is really just a shorthand that is expanded to various other parameters, it makes sense that it would work for simple bind as well. hba.c has this error message: "cannot use ldapbasedn, ldapbinddn, ldapbindpasswd, ldapsearchattribute, ldapsearchfilter, or ldapurl together with ldapprefix" This appears to imply that specifying ldapurl is only applicable for search+bind. Maybe that whole message should be simplified to something like "configuration mixes arguments for simple bind and search+bind" (The old wording also ignores that the error might arise via "ldapsuffix".)
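For illustration, a sketch of the two spellings side by side (the host/database/user fields are placeholders, and the LDAP host and port come from Jacob's example above; the hand-expanded form is my reading of how ldapurl decomposes into the individual settings):

```
# simple bind, with scheme and port folded into ldapurl
host all all 0.0.0.0/0 ldap ldapurl="ldaps://ldap.example.net:49151" ldapprefix="cn=" ldapsuffix=", dc=example, dc=net"

# presumed equivalent using the individual settings
host all all 0.0.0.0/0 ldap ldapscheme=ldaps ldapserver=ldap.example.net ldapport=49151 ldapprefix="cn=" ldapsuffix=", dc=example, dc=net"
```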
Re: Support "Right Semi Join" plan shapes
On Fri, Jun 28, 2024 at 2:54 PM Richard Guo wrote: > On Mon, Jun 24, 2024 at 5:59 PM Richard Guo wrote: > > I noticed that this patch changes the plan of a query in join.sql from > > a semi join to right semi join, compromising the original purpose of > > this query, which was to test the fix for neqjoinsel's behavior for > > semijoins (see commit 7ca25b7d). > > > > -- > > -- semijoin selectivity for <> > > -- > > explain (costs off) > > select * from int4_tbl i4, tenk1 a > > where exists(select * from tenk1 b > > where a.twothousand = b.twothousand and a.fivethous <> > > b.fivethous) > > and i4.f1 = a.tenthous; > > > > So I've changed this test case a bit so that it is still testing what it > > is supposed to test. > > I've refined this test case further to make it more stable by using an > additional filter 'a.tenthous < 5000'. Besides, I noticed a surplus > blank line in ExecHashJoinImpl(). I've removed it in the v7 patch. BTW, I've also verified the empty-rel optimization for hash join and AFAICT it works correctly for the new right-semi join. Thanks Richard
Re: walsender.c comment with no context is hard to understand
On Fri, Jun 28, 2024 at 4:18 PM Amit Kapila wrote: > > On Fri, Jun 28, 2024 at 5:15 AM Peter Smith wrote: > > > > On Thu, Jun 27, 2024 at 3:44 PM Michael Paquier wrote: > > > > > > On Wed, Jun 26, 2024 at 02:30:26PM +0530, vignesh C wrote: > > > > On Mon, 3 Jun 2024 at 11:21, Peter Smith wrote: > > > >> Perhaps the comment should say something like it used to: > > > >> /* Fail if there is not enough WAL available. This can happen during > > > >> shutdown. */ > > > > > > > > Agree with this, +1 for this change. > > > > > > That would be an improvement. Would you like to send a patch with all > > > the areas you think could stand for improvements? > > > -- > > > > OK, I attached a patch equivalent to the suggestion in this thread. > > > > Shouldn't the check for flushptr (if (flushptr < targetPagePtr + > reqLen)) be moved immediately after the call to WalSndWaitForWal()? > The comment seems to suggest the same: "Make sure we have enough WAL > available before retrieving the current timeline. .." > Yes, as I wrote in the first post, those lines were once adjacent in logical_read_xlog_page. I also wondered if they still belonged together, but I opted for the safest option of fixing only the comment instead of refactoring old code when no problem had been reported. AFAICT these lines became separated due to a 2023 patch [1], and you were one of the reviewers of that patch, so I assumed the code separation was deemed OK at that time. Unless it was some mistake that slipped past multiple reviewers? == [1] https://github.com/postgres/postgres/commit/0fdab27ad68a059a1663fa5ce48d76333f1bd74c#diff-da7052ce339ec037f5c721e08a9532b1fcfb4405966cf6a78d0c811550fd7b43 Kind Regards, Peter Smith. Fujitsu Australia
Re: Flush pgstats file during checkpoints
On 18/06/2024 9:01 am, Michael Paquier wrote: Hi all, On HEAD, xlog.c has the following comment, which has been on my own TODO list for a couple of weeks now: * TODO: With a bit of extra work we could just start with a pgstat file * associated with the checkpoint redo location we're starting from. Please find a patch series to implement that, giving the possibility to keep statistics after a crash rather than discard them. I have been looking at the code for a while, before settling down on: - Forcing the flush of the pgstats file to happen during non-shutdown checkpoints and restart points, after updating the control file's redo LSN and the critical sections in the area. - Leaving the current before_shmem_exit() callback around, as a matter of delaying the flush of the stats for as long as possible in a shutdown sequence. This also makes the single-user mode shutdown simpler. - Adding the redo LSN to the pgstats file, with a bump of PGSTAT_FILE_FORMAT_ID, cross-checked with the redo LSN. This change is independently useful when loading stats after a clean startup, as well. - The crash recovery case is simplified, as there is no more need for the "discard" code path. - Not using logic where I would need to stick an LSN into the stats file name, implying that we would need a potential lookup at the contents of pg_stat/ to clean up past files at crash recovery. These should not be costly, but I'd rather not add more of these. 7ff23c6d277d, which removed the last call of CreateCheckPoint() from the startup process, is older than 5891c7a8ed8f; still, it seems to me that pgstats relies on some areas of the code that don't make sense on HEAD (see locking mentioned at the top of the write routine, for example).
The logic gets straightforward to think about, as restart points and checkpoints always run from the checkpointer, implying that pgstat_write_statsfile() is already called only from the postmaster in single-user mode or the checkpointer itself, at shutdown. Attached is a patch set, with the last one being the actual feature, with some stuff prior to that: - 0001 adds the redo LSN to the pgstats file flushed. - 0002 adds an assertion in pgstat_write_statsfile(), to check from where it is called. - 0003 with more debugging. - 0004 is the meat of the thread. I am adding that to the next CF. Thoughts and comments are welcome. Thanks, -- Michael Hi Michael. I am working mostly on the same problem - persisting pgstat state in Neon (because of the separation of storage and compute, it has no local files). I have two questions concerning this PR and the whole strategy for saving pgstat state between sessions. 1. Right now pgstat.stat is discarded after abnormal Postgres termination. And in your PR we store the LSN in pgstat.stat to check that it matches the checkpoint redo LSN. I wonder if having an outdated version of pgstat.stat is worse than not having it at all? The comment in xlog.c says: "When starting with crash recovery, reset pgstat data - it might not be valid." But why might it be invalid? We write it first to a temp file and then rename it. Maybe fsync() should be added here and durable_rename() should be used instead of rename(). But that seems better than losing statistics, and it should not add significant overhead (because it happens only at shutdown). In your case we check the LSN of the pgstat.stat file. But once again - why is it better to discard the file than to load the version from the previous checkpoint? 2. The second question is related to 1). So we saved pgstat.stat on a checkpoint, then did some updates, and then Postgres crashed. As far as I understand, with your patch we will successfully restore the pgstat.stat file.
But it is not actually up-to-date: it doesn't reflect recent updates. If it were so critical to keep pgstat.stat up-to-date, then why do we allow restoring state from the most recent checkpoint? Thanks, Konstantin
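For illustration, the write-temp-then-rename sequence Konstantin describes only becomes crash-safe once both the file and its containing directory are fsync'ed, which is roughly what durable_rename() adds over a bare rename(). A minimal sketch of that pattern in Python (names are illustrative, not the server code):

```python
import os

def durable_rename(tmp_path: str, final_path: str) -> None:
    """Write-then-rename that survives a crash: fsync the data file
    before the rename, then fsync the containing directory so the
    directory entry itself reaches stable storage."""
    # flush the temp file's contents to disk
    fd = os.open(tmp_path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
    # atomically replace the destination
    os.replace(tmp_path, final_path)
    # persist the rename by fsyncing the parent directory
    dirfd = os.open(os.path.dirname(final_path) or ".", os.O_RDONLY)
    try:
        os.fsync(dirfd)
    finally:
        os.close(dirfd)
```

Without the two fsync calls, a crash shortly after rename() can leave either an empty/partial file or no directory entry at all, which is exactly why an un-fsynced stats file cannot be trusted after crash recovery.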
Re: Converting README documentation to Markdown
On 15.05.24 14:26, Daniel Gustafsson wrote: Another aspect of platform/flavour was to make the markdown version easy to maintain for hackers writing content. Requiring the minimum amount of markup seems like the developer-friendly way here to keep productivity as well as document quality high. Most importantly though, I targeted reading the files as plain text without any rendering. We keep these files in text format close to the code for a reason, and maintaining readability as text was a north star. I've been thinking about this some more. I think the most value here would be to just improve the plain-text formatting, so that there are consistent list styles, header styles, indentation, some of the ambiguities cleared up -- much of which your 0001 patch does. You might as well be targeting markdown-like conventions with this; they are mostly reasonable. I tend to think that actually converting all the README files to README.md could be a net negative for maintainability. Because now you are requiring everyone who potentially wants to edit those to be aware of Markdown syntax and manually check the rendering. With things like DocBook, if you make a mess, you get error messages from the build step. If you make a mess in Markdown, you have to visually find it yourself. There are many READMEs that contain nested lists and code snippets and diagrams and such all mixed together. Getting that right in Markdown can be quite tricky. I'm also foreseeing related messes of trailing whitespace, spaces-vs-tab confusion, gitattributes violations, etc. It can be a lot of effort. It's okay to do this for prominent files like the top-level one, but I suggest that for the rest we can keep it simple and just use plain text.
Re: remaining sql/json patches
Hi Amit, 28.06.2024 09:15, Amit Langote wrote: Hi Alexander, Thanks for the report. Yeah, those comments that got added in 7081ac46ace are obsolete. Thanks for paying attention to that! Could you also look at the comments for transformJsonObjectAgg() and transformJsonArrayAgg()? Aren't they obsolete too? Best regards, Alexander
Re: Improve EXPLAIN output for multicolumn B-Tree Index
On Fri, 28 Jun 2024 at 00:41, Peter Geoghegan wrote: > Typically, no, it won't be. But there's really no telling for sure. > The access patterns for a composite index on '(a, b)' with a qual > "WHERE b = 5" are identical to a qual explicitly written "WHERE a = > any() AND b = 5". Hmm, that's true. But in that case the explain plan gives a clear hint that something like that might be going on, because you'll see: Index Cond: a = any() AND b = 5 That does make me think of another, maybe more "correct", way of representing Masahiro's situation than adding a new "Skip Scan Cond" row to the EXPLAIN output. We could explicitly include a comparison to all prefix columns in the Index Cond: Index Cond: ((test.id1 = 1) AND (test.id2 = ANY) AND (test.id3 = 101)) Or, if you want to go with syntactically correct SQL, we could do the following: Index Cond: ((test.id1 = 1) AND ((test.id2 IS NULL) OR (test.id2 IS NOT NULL)) AND (test.id3 = 101)) An additional benefit this provides is that you now know which additional column you should use a more specific filter on to speed up the query. In this case, test.id2. OTOH, maybe it's not a great way, because actually running that puts the IS NULL + IS NOT NULL condition in the Filter clause (which honestly surprises me, because I had expected this "always true expression" to be optimized away by the planner).

> EXPLAIN (VERBOSE, ANALYZE) SELECT id1, id2, id3 FROM test WHERE id1 = 1 AND
> (id2 IS NULL OR id2 IS NOT NULL) AND id3 = 101;
>                              QUERY PLAN
> ─
> Index Only Scan using test_idx on public.test (cost=0.42..12809.10 rows=1 width=12) (actual time=0.027..11.234 rows=1 loops=1)
>   Output: id1, id2, id3
>   Index Cond: ((test.id1 = 1) AND (test.id3 = 101))
>   Filter: ((test.id2 IS NULL) OR (test.id2 IS NOT NULL))

> What about cases where we legitimately have to vary our strategy > during the same index scan? Would my new suggestion above address this?
> In fact, that'd make sense even today, without skip scan (just with > the 17 work on nbtree SAOP scans). Even with regular SAOP nbtree index > scans, the number of primitive scans is hard to predict, and quite > indicative of what's really going on with the scan. *googles nbtree SAOP scans and finds the very helpful [1]* Yes, I feel like this should definitely be part of the ANALYZE output. Seeing how Lukas has to look at pg_stat_user_tables to get this information seems quite annoying [2], and that is only possible on systems that have no concurrent queries. So it sounds like we'd want a "Primitive Index Scans" counter in ANALYZE too, in addition to the number of rows filtered, which, if we go with my proposal above, should probably be called "Rows Removed by Index Cond". [1]: https://www.youtube.com/watch?v=jg2KeSB5DR8 [2]: https://youtu.be/jg2KeSB5DR8?t=188
Re: walsender.c comment with no context is hard to understand
On Fri, Jun 28, 2024 at 12:55 PM Peter Smith wrote: > > On Fri, Jun 28, 2024 at 4:18 PM Amit Kapila wrote: > > > > On Fri, Jun 28, 2024 at 5:15 AM Peter Smith wrote: > > > > > > On Thu, Jun 27, 2024 at 3:44 PM Michael Paquier > > > wrote: > > > > > > > > On Wed, Jun 26, 2024 at 02:30:26PM +0530, vignesh C wrote: > > > > > On Mon, 3 Jun 2024 at 11:21, Peter Smith > > > > > wrote: > > > > >> Perhaps the comment should say something like it used to: > > > > >> /* Fail if there is not enough WAL available. This can happen during > > > > >> shutdown. */ > > > > > > > > > > Agree with this, +1 for this change. > > > > > > > > That would be an improvement. Would you like to send a patch with all > > > > the areas you think could stand for improvements? > > > > -- > > > > > > OK, I attached a patch equivalent to the suggestion in this thread. > > > > > > > Shouldn't the check for flushptr (if (flushptr < targetPagePtr + > > reqLen)) be moved immediately after the call to WalSndWaitForWal()? > > The comment seems to suggest the same: "Make sure we have enough WAL > > available before retrieving the current timeline. .." > > > > Yes, as I wrote in the first post, those lines were once > adjacent in logical_read_xlog_page. > > I also wondered if they still belonged together, but I opted for the > safest option of fixing only the comment instead of refactoring old > code when no problem had been reported. > > AFAICT these lines became separated due to a 2023 patch [1], and you > were one of the reviewers of that patch, so I assumed the code > separation was deemed OK at that time. Unless it was some mistake that > slipped past multiple reviewers? > I don't know whether your assumption is correct. AFAICS, those two lines should be together. Let us see if Bertrand remembers anything. -- With Regards, Amit Kapila.
Re: Converting README documentation to Markdown
On Fri, 28 Jun 2024 at 09:38, Peter Eisentraut wrote: > Getting that right in Markdown can be quite tricky. I agree that in some cases it's tricky. But what's the worst case that can happen when you get it wrong? It renders weird on github.com. Luckily there's a "code" button to go to the plain text format[1]. In all other cases (which I expect will be most) the doc will be easier to read. Forcing plaintext, just because sometimes we might make a mistake in the syntax seems like an overcorrection imho. Especially because these docs are (hopefully) read more often than written. [1]: https://github.com/postgres/postgres/blob/master/README.md?plain=1
Re: recoveryCheck/008_fsm_truncation is failing on dodo in v14- (due to slow fsync?)
On Sun, 23 Jun 2024 at 22:30, Alexander Lakhin wrote: > Unfortunately, the buildfarm log doesn't contain regress_log_002_limits, > but I managed to reproduce the failure on my device, when its storage > is as slow as: > $ dd if=/dev/zero of=./test count=1024 oflag=dsync bs=128k > 1024+0 records in > 1024+0 records out > 134217728 bytes (134 MB, 128 MiB) copied, 33.9446 s, 4.0 MB/s > > For the past ~1 week, I have tried to space out all other tasks on the machine, so as to ensure that 1-min CPU is mostly <2 (and thus not many things hammering the disk), and with that I see 0 failures these past few days. This isn't conclusive by any means, but it does seem that reducing IO contention has helped remove the errors, much as Alexander suspects / reproduces here. Just a note that I've reverted some of those recent changes now, so if the theory holds true, I wouldn't be surprised if some of these errors restarted on dodo. - robins
Re: pg_createsubscriber: drop pre-existing subscriptions from the converted node
On Thu, Jun 27, 2024 at 11:47 AM Hayato Kuroda (Fujitsu) wrote: > > > It seems disabling subscriptions on the primary can make the primary > > stop functioning for some duration of time. I feel we need some > > solution where after converting to subscriber, we disable and drop > > pre-existing subscriptions. One idea could be that we use the list of > > new subscriptions created by the tool such that any subscription not > > existing in that list will be dropped. > > Previously I avoided coding like yours, because there was a possibility that the > converted node could connect to another publisher. But per off-list discussion, we > can skip it by setting max_logical_replication_workers = 0. I refactored with that > approach. > Note that the GUC is checked at the verification phase, so an attribute is added to > start_standby_server() to select the workload. > Thanks, this is a better approach. I have changed a few comments and made some other cosmetic changes. See attached. Euler, Peter E., and others, do you have any comments/suggestions? BTW, why have you created a separate test file for this test? I think we should add a new test to one of the existing tests in 040_pg_createsubscriber. You can create a dummy subscription on node_p and do a test similar to what we are doing in "# Create failover slot to test its removal". -- With Regards, Amit Kapila. v3-0001-pg_createsubscriber-Drop-pre-existing-subscriptio.patch Description: Binary data
Re: Showing applied extended statistics in explain Part 2
Hi Tomas and All, The attached file is a new patch including: 6) Add stats option to explain command 7) The patch really needs some docs (partly) > 4) Add new node (resolve errors in cfbot and prepared statement) I tried adding a new node in pathnode.h, but it doesn't work well. So it needs more time to implement successfully, because this is my first time adding a new node there. > 8) Add regression test (stats_ext.sql) Actually, I am not yet able to add new test cases to stats_ext.sql. Instead, I created a simple test (test.sql) and have attached it. Also, output.txt is the test result. To add new test cases to stats_ext.sql, I'd like to decide on a strategy for modifying it. In particular, there are 381 places where the check_estimated_rows function is used, so should I include the same number of tests, or should we include the bare minimum of tests that cover the code path? I think only the latter would be fine. Any advice is appreciated. :-D P.S. I'm going to investigate how to use CI this weekend, hopefully.
Regards,
Tatsuro Yamada

CREATE FUNCTION
CREATE FUNCTION
CREATE TABLE
INSERT 0 10

QUERY PLAN
---
Seq Scan on t  (cost=0.00..1944.77 rows=3 width=8) (actual time=0.049..37.182 rows=1 loops=1)
  Filter: ((a = 1) AND (b = 1))
  Rows Removed by Filter: 9
Planning Time: 0.380 ms
Execution Time: 37.751 ms
(5 rows)

clause | extstats
+--
(0 rows)

QUERY PLAN
---
HashAggregate  (cost=1944.77..2044.89 rows=10012 width=12)
  Group Key: a, b
  ->  Seq Scan on t  (cost=0.00..1444.18 rows=100118 width=8)
(3 rows)

clause | extstats
+--
(0 rows)

CREATE STATISTICS
ANALYZE

List of extended statistics
Schema | Name | Definition  | Ndistinct | Dependencies | MCV
+--+-+---+--+-
public | s    | a, b FROM t | defined   | defined      | defined
(1 row)

QUERY PLAN
--
Seq Scan on t  (cost=0.00..1943.00 rows=9947 width=8) (actual time=0.021..26.203 rows=1 loops=1)
  Filter: ((a = 1) AND (b = 1))
  Rows Removed by Filter: 9
  Ext Stats: public.s Clauses: ((a = 1) AND (b = 1))
Planning Time: 0.169 ms
Execution Time: 26.958 ms
(6 rows)

clause | extstats
---+--
Filter: ((a = 1) AND (b = 1)) | public.s Clauses: ((a = 1) AND (b = 1))
(1 row)

QUERY PLAN
---
HashAggregate  (cost=1943.00..1943.10 rows=10 width=12) (actual time=96.912..96.918 rows=10 loops=1)
  Group Key: a, b
  Batches: 1  Memory Usage: 24kB
  ->  Seq Scan on t  (cost=0.00..1443.00 rows=10 width=8) (actual time=0.026..24.280 rows=10 loops=1)
  Ext Stats: public.s Clauses: a, b
Planning Time: 0.353 ms
Execution Time: 96.974 ms
(7 rows)

clause | extstats
-+-
Group Key: a, b | public.s Clauses: a, b
(1 row)

SET

QUERY PLAN
---
HashAggregate  (cost=1943.00..1943.10 rows=10 width=12) (actual time=89.360..89.366 rows=10 loops=1)
  Group Key: a, b
  Batches: 1  Memory Usage: 24kB
  ->  Seq Scan on t  (cost=0.00..1443.00 rows=10 width=8) (actual time=0.017..19.903 rows=10 loops=1)
  Ext Stats: public.s Clauses: a, b
  Ext Stats: public.s Clauses: a, b
Planning Time: 0.094 ms
Execution Time: 89.409 ms
(8 rows)

clause | extstats
-+-
Group Key: a, b | public.s Clauses: a, b
(1 row)

PREPARE

QUERY PLAN
--
Seq Scan on t  (cost=0.00..1943.00 rows=9850 width=8) (actual time=0.024..27.818 rows=1 loops=1)
  Filter: ((a = 5) AND (b = 5))
  Rows Removed by Filter: 9
Planning Time: 0.108 ms
Execution Time: 28.510 ms
(5 rows)

DROP TABLE
DROP FUNCTION

List of extended statistics
Schema
Re: pgsql: Add more SQL/JSON constructor functions
On Fri, Jun 28, 2024 at 3:14 PM jian he wrote: > On Thu, Jun 27, 2024 at 7:48 PM Amit Langote wrote: > > > > > > I've attempted that in the attached 0001, which removes > > > JsonExpr.coercion_expr and a bunch of code around it. > > > > > > 0002 is now the original patch minus the changes to make > > > JSON_EXISTS(), JSON_QUERY(), and JSON_VALUE() behave as we would like, > > > because the changes in 0001 cover them. The changes for JsonBehavior > > > expression coercion as they were in the last version of the patch are > > > still needed, but I decided to move those into 0001 so that the > > > changes for query functions are all in 0001 and those for constructors > > > in 0002. It would be nice to get rid of that coerce_to_target_type() > > > call to coerce the "behavior expression" to RETURNING type, but I'm > > > leaving that as a task for another day. > > > > Updated 0001 to remove outdated references, remove some more unnecessary > > code. > > > > I found some remaining references to "coercion_expr" that should be removed. > > src/include/nodes/primnodes.h > /* JsonExpr's collation, if coercion_expr is NULL. */ > > > src/include/nodes/execnodes.h > /* > * Address of the step to coerce the result value of jsonpath evaluation > * to the RETURNING type using JsonExpr.coercion_expr. -1 if no coercion > * is necessary or if either JsonExpr.use_io_coercion or > * JsonExpr.use_json_coercion is true. > */ > int jump_eval_coercion; > > src/backend/jit/llvm/llvmjit_expr.c > /* coercion_expr code */ > LLVMPositionBuilderAtEnd(b, b_coercion); > if (jsestate->jump_eval_coercion >= 0) > LLVMBuildBr(b, opblocks[jsestate->jump_eval_coercion]); > else > LLVMBuildUnreachable(b); > > > src/backend/executor/execExprInterp.c > /* > * Checks if an error occurred either when evaluating JsonExpr.coercion_expr > or > * in ExecEvalJsonCoercion(). If so, this sets JsonExprState.error to trigger > * the ON ERROR handling steps.
> */ > void > ExecEvalJsonCoercionFinish(ExprState *state, ExprEvalStep *op) > { > } > > if (jbv == NULL) > { > /* Will be coerced with coercion_expr, if any. */ > *op->resvalue = (Datum) 0; > *op->resnull = true; > } > > > src/backend/executor/execExpr.c > /* > * Jump to coerce the NULL using coercion_expr if present. Coercing NULL > * is only interesting when the RETURNING type is a domain whose > * constraints must be checked. jsexpr->coercion_expr containing a > * CoerceToDomain node must have been set in that case. > */ > > /* > * Jump to coerce the NULL using coercion_expr if present. Coercing NULL > * is only interesting when the RETURNING type is a domain whose > * constraints must be checked. jsexpr->coercion_expr containing a > * CoerceToDomain node must have been set in that case. > */ Thanks for checking. Will push the attached 0001 and 0002 shortly. -- Thanks, Amit Langote v5-0001-SQL-JSON-Fix-coercion-of-constructor-outputs-to-t.patch Description: Binary data v5-0002-SQL-JSON-Always-coerce-JsonExpr-result-at-runtime.patch Description: Binary data
Re: Improve EXPLAIN output for multicolumn B-Tree Index
On Fri, 28 Jun 2024 at 10:59, Jelte Fennema-Nio wrote: > > On Fri, 28 Jun 2024 at 00:41, Peter Geoghegan wrote: > > Typically, no, it won't be. But there's really no telling for sure. > > The access patterns for a composite index on '(a, b)' with a qual > > "WHERE b = 5" are identical to a qual explicitly written "WHERE a = > > any() AND b = 5". > > Hmm, that's true. But in that case the explain plan gives a clear hint > that something like that might be going on, because you'll see: > > Index Cond: a = any() AND b = 5 > > That does make me think of another way, and maybe more "correct" way, > of representing Masahiros situation than adding a new "Skip Scan Cond" > row to the EXPLAIN output. We could explicitly include a comparison to > all prefix columns in the Index Cond: > > Index Cond: ((test.id1 = 1) AND (test.id2 = ANY) AND (test.id3 = 101)) > > Or if you want to go with syntactically correct SQL we could do the following: > > Index Cond: ((test.id1 = 1) AND ((test.id2 IS NULL) OR (test.id2 IS > NOT NULL)) AND (test.id3 = 101)) > > An additional benefit this provides is that you now know which > additional column you should use a more specific filter on to speed up > the query. In this case test.id2 > > OTOH, maybe it's not a great way because actually running that puts > the IS NULL+ IS NOT NULL query in the Filter clause (which honestly > surprises me because I had expected this "always true expression" > would have been optimized away by the planner). > > > EXPLAIN (VERBOSE, ANALYZE) SELECT id1, id2, id3 FROM test WHERE id1 = 1 AND > > (id2 IS NULL OR id2 IS NOT NULL) AND id3 = 101; >QUERY PLAN > ─ > Index Only Scan using test_idx on public.test (cost=0.42..12809.10 > rows=1 width=12) (actual time=0.027..11.234 rows=1 loops=1) >Output: id1, id2, id3 >Index Cond: ((test.id1 = 1) AND (test.id3 = 101)) >Filter: ((test.id2 IS NULL) OR (test.id2 IS NOT NULL)) > > > What about cases where we legitimately have to vary our strategy > > during the same index scan? 
> > Would my new suggestion above address this? > > > In fact, that'd make sense even today, without skip scan (just with > > the 17 work on nbtree SAOP scans). Even with regular SAOP nbtree index > > scans, the number of primitive scans is hard to predict, and quite > > indicative of what's really going on with the scan. > > *googles nbtree SAOP scans and finds the very helpful[1]* > > Yes, I feel like this should definitely be part of the ANALYZE output. > Seeing how Lukas has to look at pg_stat_user_tables to get this > information seems quite annoying[2] and only possible on systems that > have no concurrent queries. > > So it sounds like we'd want a "Primitive Index Scans" counter in > ANALYZE too. In addition to the number of filtered rows by, which if > we go with my proposal above should probably be called "Rows Removed > by Index Cond". This all just made me more confident that this shows a need to enable index AMs to provide output for EXPLAIN: The knowledge about how index scankeys are actually used is exclusively known to the index AM, because the strategies are often unique to the index AM (or even chosen operator classes), and sometimes can only be applied at runtime: while the index scankeys' sizes, attribute numbers and operators are known in advance (even if not all arguments are filled in; `FROM a JOIN b ON b.id = ANY (a.ref_array)`), the AM can at least show what strategy it is likely going to choose, and how (in)efficient that strategy could be. Right now, Masahiro-san's patch tries to do that with an additional field in IndexPath populated (by proxy) exclusively in btcostestimate. I think that design is wrong, because it wires explain- and btree-specific data through the planner, adding overhead everywhere which is only used for btree- and btree-compatible indexes. I think the better choice would be adding an IndexAmRoutine->amexplain support function, which would get called in e.g. 
explain.c's ExplainIndexScanDetails to populate a new "Index Scan Details" (name to be bikeshed) subsection of explain plans. This would certainly be possible, as the essentials for outputting things to EXPLAIN are readily available in the explain.h header. Kind regards, Matthias van de Meent Neon (https://neon.tech)
Re: Parallel heap vacuum
On Fri, Jun 28, 2024 at 9:44 AM Masahiko Sawada wrote: > > # Benchmark results > > * Test-1: parallel heap scan on the table without indexes > > I created a 20GB table, made garbage on the table, and ran vacuum while > changing the parallel degree: > > create unlogged table test (a int) with (autovacuum_enabled = off); > insert into test select generate_series(1, 6); --- 20GB table > delete from test where a % 5 = 0; > vacuum (verbose, parallel 0) test; > > Here are the results (total time and heap scan time): > > PARALLEL 0: 21.99 s (single process) > PARALLEL 1: 11.39 s > PARALLEL 2: 8.36 s > PARALLEL 3: 6.14 s > PARALLEL 4: 5.08 s > > * Test-2: parallel heap scan on the table with one index > > I used a similar table to test case 1 but created one btree index on it: > > create unlogged table test (a int) with (autovacuum_enabled = off); > insert into test select generate_series(1, 6); --- 20GB table > create index on test (a); > delete from test where a % 5 = 0; > vacuum (verbose, parallel 0) test; > > I've measured the total execution time as well as the time of each > vacuum phase (from left: heap scan time, index vacuum time, and heap > vacuum time): > > PARALLEL 0: 45.11 s (21.89, 16.74, 6.48) > PARALLEL 1: 42.13 s (12.75, 22.04, 7.23) > PARALLEL 2: 39.27 s (8.93, 22.78, 7.45) > PARALLEL 3: 36.53 s (6.76, 22.00, 7.65) > PARALLEL 4: 35.84 s (5.85, 22.04, 7.83) > > Overall, I can see that the parallel heap scan in lazy vacuum has decent > scalability; in both test-1 and test-2, the execution time of the heap > scan got ~4x faster with 4 parallel workers. On the other hand, when > it comes to the total vacuum execution time, I could not see much > performance improvement in test-2 (45.11 vs. 35.84). Looking at the > results of PARALLEL 0 vs. PARALLEL 1 in test-2, the heap scan got faster > (21.89 vs. 12.75) whereas index vacuum got slower (16.74 vs. 22.04), > and the heap scan in case 2 was not as fast as in case 1 with 1 parallel > worker (12.75 vs. 11.39).
> > I think the reason is the shared TidStore is not very scalable since > we have a single lock on it. In all cases in test-1, we don't use > the shared TidStore since all dead tuples are removed during heap > pruning. So the scalability was better overall than in test-2. In > the parallel 0 case in test-2, we use the local TidStore, and from > parallel degree 1 in test-2, we use the shared TidStore and > parallel workers concurrently update it. Also, I guess that the lookup > performance of the local TidStore is better than the shared TidStore's > lookup performance because of the differences between a bump context > and a DSA area. I think that this difference contributed to the fact > that index vacuuming got slower (16.74 vs. 22.04). > > There are two obvious improvement ideas to improve overall vacuum > execution time: (1) improve the shared TidStore scalability and (2) > support parallel heap vacuum. For (1), several ideas are proposed by > the ART authors[1]. I've not tried these ideas but they might be > applicable to our ART implementation. But I prefer to start with (2) > since it would be easier. Feedback is very welcome. > Starting with (2) sounds like a reasonable approach. We should study a few more things, like (a) the performance results where there are 3-4 indexes, and (b) the reason for the performance improvement seen with only heap scans. We normally get benefits from parallelism by using multiple CPUs, but parallelizing scans (I/O) shouldn't give much benefit. Is it possible that you are seeing benefits because most of the data is either in shared_buffers or in memory? We can probably try vacuuming tables after restarting the nodes to ensure the data is not in memory. -- With Regards, Amit Kapila.
Re: SQL:2011 application time
On Thu, Jun 27, 2024 at 5:56 PM Paul Jungwirth wrote: > I did add a relperiods column, but I have a mostly-complete branch here (not > included in the > patches) that does without. Not maintaining that new column is simpler for > sure. The consequence is > that the relcache must scan for WITHOUT OVERLAPS constraints on every table. > That seems like a high > performance cost for a feature most databases won't use. Since we try hard to > avoid that kind of > thing (e.g. [1]), I thought adding relperiods would be preferred. If that's > the wrong tradeoff I can > change it. I'm sure that you are right that nobody is going to like an extra index scan just to find periods. So, suppose we do as you propose and add relperiods. In the situation where we are adding the first period (or whatever the right term is) to the table, what kind of lock are we holding on the table? Conversely, when we drop the last period, what kind of lock are we holding on the table? If, hypothetically, both answers were AccessExclusiveLock, this might not be too bad, but if you say "ShareLock" then we've got a lot of problems; that's not even self-exclusive. > These patches still add some if-clauses to psql and pg_dump that say `if > (fout->remoteVersion >= > 17)`. But if I change them to 18 I get failures in e.g. the pg_dump > tests. What do other > people do here before a release is cut? Sometimes I make a commit that bumps the version number (update major version in src/tools/version_stamp.pl, then run it, then run autoconf, then commit). Then I build my patch set on top of that. Once the actual major release bump happens, I just drop that commit from the stack. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Test to dump and restore objects left behind by regression
Sorry for delay, but here's next version of the patchset for review. On Thu, Jun 6, 2024 at 5:07 AM Michael Paquier wrote: > On Wed, Jun 05, 2024 at 05:09:58PM +0530, Ashutosh Bapat wrote: > > Thanks for the suggestion. I didn't understand the dependency with the > > buildfarm module. Will the new module be used in buildfarm separately? I > > will work on this soon. Thanks for changing CF entry to WoA. > > I had some doubts about PGBuild/Modules/TestUpgradeXversion.pm, but > after double-checking it loads dynamically AdjustUpgrade from the core > tree based on the base path where all the modules are: > # load helper module from source tree > unshift(@INC, "$srcdir/src/test/perl"); > require PostgreSQL::Test::AdjustUpgrade; > PostgreSQL::Test::AdjustUpgrade->import; > shift(@INC); > It would be annoying to tweak the buildfarm code more to have a > different behavior depending on the branch of Postgres tested. > Anyway, from what I can see, you could create a new module with the > dump filtering rules that AdjustUpgrade requires without having to > update the buildfarm code. > The two filtering rules that I picked from AdjustUpgrade() are a. use unix style newline b. eliminate blank lines. I think we could copy those rule into the new module (as done in the patch) without creating any dependency between modules. There's little gained by creating another perl function just for those two sed commands. There's no way to do that otherwise. If we keep those two modules independent, we will be free to change each module as required in future. Do we need to change buildfarm code to load the AdjustDump module like above? I am not familiar with the buildfarm code. Here's a description of patches and some notes 0001 --- 1. Per your suggestion the logic to handle dump output differences is externalized in PostgreSQL::Test::AdjustDump. 
Instead of eliminating those differences altogether from both dump outputs, the corresponding DDL in the original dump output is adjusted to look like that from the restored database. Thus we retain full knowledge of what differences to expect.
2. I have changed the name filter_dump to filter_dump_for_upgrade so as to differentiate between the two adjustments: 1. for upgrade and 2. for dump/restore. Ideally the name should have been adjust_dump_for_upgrade(), since it's more of an adjustment than filtering, as indicated by the function it calls. But I haven't changed that. The new function to adjust dumps for the dump and restore test is named adjust_dump_for_restore(), however.
3. As suggested by Daniel upthread, the test for dump and restore happens before the upgrade, which might change the old cluster, thus changing the state of objects left behind by regression. The test is not executed if regression is not used to create the old cluster.
4. The code to compare two dumps and report any differences is moved to its own function compare_dumps(), which is used for both the upgrade and dump/restore tests. The test uses the custom dump format for dumping and restoring the database.
0002 -- This commit expands the previous test to cover all dump formats. But as expected that increases the time taken by this test. On my laptop 0001 takes approx 28 seconds to run the test and with 0002 it takes approx 35 seconds. But there's not much impact on the duration of running all the tests (2m30s vs 2m40s). The code which creates the DDL statements in the dump is independent of the dump format, so usually we shouldn't need to test all the formats in this test. But each format stores the dependencies between dumped objects in a different manner, which would be tested with the changes in this patch. I think this patch is also useful. If we decide to keep this test, the patch is intended to be merged into 0001.
-- Best Wishes, Ashutosh Bapat From dea7d55a8b938c1b670eebe7662a0dace5077a0d Mon Sep 17 00:00:00 2001 From: Ashutosh Bapat Date: Thu, 27 Jun 2024 15:17:29 +0530 Subject: [PATCH 3/3] Test dump and restore in all formats Expanding on the previous commit, this commit modifies the test to dump and restore regression database using all dump formats one by one. But this changes increases the time to run test from 51s to 78s on my laptop. If that's acceptable this commit should be squashed into the previous commit before committing upstream. Ashutosh Bapat --- src/bin/pg_upgrade/t/002_pg_upgrade.pl | 84 +- 1 file changed, 54 insertions(+), 30 deletions(-) diff --git a/src/bin/pg_upgrade/t/002_pg_upgrade.pl b/src/bin/pg_upgrade/t/002_pg_upgrade.pl index 0e181b294d..5093b2bcaa 100644 --- a/src/bin/pg_upgrade/t/002_pg_upgrade.pl +++ b/src/bin/pg_upgrade/t/002_pg_upgrade.pl @@ -576,41 +576,65 @@ sub test_regression_dump_restore { my ($src_node, %node_params) = @_; my $dst_node = PostgreSQL::Test::Cluster->new('dst_node'); - my $dump3_file = "$tempdir/dump3.custom"; - my $dump4_file = "$tempdir/dump4.sql"; - my $dump5_file = "$tempdir/dump5.sql"; - - - #
Re: pgindent exit status if a file encounters an error
On Wed, Jun 26, 2024 at 8:54 PM Tom Lane wrote: > Ashutosh Bapat writes: > > The usage help mentions exit code 2 specifically while describing --check > > option but it doesn't mention exit code 1. Neither does the README. So I > > don't think we need to document exit code 3 anywhere. Please let me know > if > > you think otherwise. > > I think we should have at least a code comment summarizing the > possible exit codes, along the lines of > > # Exit codes: > # 0 -- all OK > # 1 -- could not invoke pgindent, nothing done > # 2 -- --check mode and at least one file requires changes > # 3 -- pgindent failed on at least one file > Thanks. Here's a patch with these lines. In an offline chat Andrew mentioned that he expects more changes in the patch and he would take care of those. Will review and test his patch. -- Best Wishes, Ashutosh Bapat From 2a61830f0a2b6559d04bef75c6aa224e30ed1609 Mon Sep 17 00:00:00 2001 From: Ashutosh Bapat Date: Wed, 26 Jun 2024 15:46:53 +0530 Subject: [PATCH 1/2] pgindent exit status on error When pg_bsd_indent exits with a non-zero status or reports an error, make pgindent exit with non-zero status 3. The program does not exit on the first instance of the error. Instead it continues to process remaining files as long as some other exit condition is encountered, in which case exit code 3 is reported. This helps to detect errors automatically when pgindent is run in shells which interpret non-zero exit status as failure. Author: Ashutosh Bapat Reviewed by: Andrew Dunstan Discussion: https://www.postgresql.org/message-id/caexhw5sprsifeldp-u1fa5ba7ys2f0gvljmkoobopkadjwq...@mail.gmail.com --- src/tools/pgindent/pgindent | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/src/tools/pgindent/pgindent b/src/tools/pgindent/pgindent index 48d83bc434..03b319c9c3 100755 --- a/src/tools/pgindent/pgindent +++ b/src/tools/pgindent/pgindent @@ -1,5 +1,12 @@ #!/usr/bin/perl +# Program to maintain uniform layout style in our C and Perl code. 
+# Exit codes: +# 0 -- all OK +# 1 -- could not invoke pgindent, nothing done +# 2 -- --check mode and at least one file requires changes +# 3 -- pgindent failed on at least one file + # Copyright (c) 2021-2024, PostgreSQL Global Development Group use strict; @@ -409,6 +416,7 @@ foreach my $source_filename (@files) if ($source eq "") { print STDERR "Failure in $source_filename: " . $error_message . "\n"; + $status = 3; next; } @@ -429,7 +437,7 @@ foreach my $source_filename (@files) if ($check) { -$status = 2; +$status ||= 2; last unless $diff; } } -- 2.34.1
Re: Converting README documentation to Markdown
> I've been thinking about this some more. I think the most value here > would be to just improve the plain-text formatting, so that there are > consistent list styles, header styles, indentation, some of the > ambiguities cleared up -- much of which your 0001 patch does. You > might as well be targeting markdown-like conventions with this; they > are mostly reasonable. > > I tend to think that actually converting all the README files to > README.md could be a net negative for maintainability. Because now > you are requiring everyone who potentially wants to edit those to be > aware of Markdown syntax and manually check the rendering. With > things like DocBook, if you make a mess, you get error messages from > the build step. If you make a mess in Markdown, you have to visually > find it yourself. There are many READMEs that contain nested lists > and code snippets and diagrams and such all mixed together. Getting > that right in Markdown can be quite tricky. I'm also foreseeing > related messes of trailing whitespace, spaces-vs-tab confusion, > gitattributes violations, etc. It can be a lot of effort. It's okay > to do this for prominent files like the top-level one, but I suggest > that for the rest we can keep it simple and just use plain text. Agreed. Best regards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese: http://www.sraoss.co.jp
Re: walsender.c comment with no context is hard to understand
Hi, On Fri, Jun 28, 2024 at 03:15:22PM +0530, Amit Kapila wrote: > On Fri, Jun 28, 2024 at 12:55 PM Peter Smith wrote: > > > > On Fri, Jun 28, 2024 at 4:18 PM Amit Kapila wrote: > > > > > > On Fri, Jun 28, 2024 at 5:15 AM Peter Smith wrote: > > > > > > > > On Thu, Jun 27, 2024 at 3:44 PM Michael Paquier > > > > wrote: > > > > > > > > > > On Wed, Jun 26, 2024 at 02:30:26PM +0530, vignesh C wrote: > > > > > > On Mon, 3 Jun 2024 at 11:21, Peter Smith > > > > > > wrote: > > > > > >> Perhaps the comment should say something like it used to: > > > > > >> /* Fail if there is not enough WAL available. This can happen > > > > > >> during > > > > > >> shutdown. */ > > > > > > > > > > > > Agree with this, +1 for this change. > > > > > > > > > > That would be an improvement. Would you like to send a patch with all > > > > > the areas you think could stand for improvements? > > > > > -- > > > > > > > > OK, I attached a patch equivalent to the suggestion in this thread. > > > > > > > > > > Shouldn't the check for flushptr (if (flushptr < targetPagePtr + > > > reqLen)) be moved immediately after the call to WalSndWaitForWal()? > > > The comment seems to suggest the same: "Make sure we have enough WAL > > > available before retrieving the current timeline. .." > > > > > > > Yes, as I wrote in the first post, those lines once used to be > > adjacent in logical_read_xlog_page. > > > > I also wondered if they still belonged together, but I opted for the > > safest option of fixing only the comment instead of refactoring old > > code when no problem had been reported. > > > > AFAICT these lines became separated due to a 2023 patch [1], and you > > were one of the reviewers of that patch, so I assumed the code > > separation was deemed OK at that time. Unless it was some mistake that > > slipped past multiple reviewers? > > > > I don't know whether your assumption is correct. AFAICS, those two > lines should be together. Let us see if Bertrand remembers anything? 
> IIRC the WalSndWaitForWal() call has been moved to ensure that we can determine the timeline accurately. I agree with Amit that it would make more sense to move the (flushptr < targetPagePtr + reqLen) check just after the flushptr assignment. I don't recall that we discussed any reason for not doing so. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
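Concretely, the reordering under discussion would look roughly like this (a sketch pieced together from the fragments quoted in the thread, not the actual walsender.c code):

```
/* Make sure we have enough WAL available before retrieving the
 * current timeline. */
flushptr = WalSndWaitForWal(targetPagePtr + reqLen);

/* Fail if there is not enough WAL available. This can happen
 * during shutdown. */
if (flushptr < targetPagePtr + reqLen)
	return -1;

/* ... only afterwards retrieve the current timeline ... */
```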
Re: cfbot update: Using GitHub for patch review
On Fri, Jun 21, 2024 at 8:06 PM Jelte Fennema-Nio wrote: > I recently got write access to the cfbot repo[1] and machine from > Thomas. And I deployed a few improvements this week. The most > significant one is that it is now much easier to use GitHub as part of > your patch review workflow. > > On the cfbot website[2] there's now a "D" (diff) link next to each > commit fest entry. A good example of such a link would be the one for > my most recent commitfest entry[3]. There is a separate commit for > each patch file and those commits contain the "git format-patch" > metadata. (this is not done using git am, but using git mailinfo + > patch + sed, because git am is horrible at resolving conflicts) > > The killer feature (imho) of GitHub diffs over looking at patch files: > You can press the "Expand up"/"Expand down" buttons on the left of the > diff to see some extra context that the patch file doesn't contain. > > You can also add the cfbot repo as a remote to your local git > repository. That way you don't have to manually download patches and > apply them to your local checkout anymore: > > # Add the remote > git remote add -f cfbot https://github.com/postgresql-cfbot/postgresql.git > # make future git pulls much quicker (optional) > git maintenance start > # check out a commitfest entry > git checkout cf/5065 > > P.S. Suggestions for further improvements are definitely appreciated. > We're currently already working on better integration between the > commitfest app website and the cfbot website. > > P.P.S The "D" links don't work for patches that need to be rebased > since before I deployed this change, but that problem should fix > itself with time. > Thanks. Very helpful. Will it be possible to make it send an email containing the review comments? Better even if a reply to that email adds comments/responses back to PR. I need to sign in to github to add my review comments. So those who do not have a github account can not use it for review. 
But I don't think that can be fixed. We need a way to know who left review comments. There was some discussion at pgconf.dev about using gitlab instead of github. How easy is it to use gitlab if we decide to go that way? -- Best Wishes, Ashutosh Bapat
Re: remaining sql/json patches
Hi Alexander, On Fri, Jun 28, 2024 at 5:00 PM Alexander Lakhin wrote: > > Hi Amit, > > 28.06.2024 09:15, Amit Langote wrote: > > Hi Alexander, > > > > > > Thanks for the report. Yeah, those comments that got added in > > 7081ac46ace are obsolete. > > > > Thanks for paying attention to that! > > Could you also look at comments for transformJsonObjectAgg() and > transformJsonArrayAgg(), aren't they obsolete too? You're right. I didn't think they needed to be similarly fixed, because I noticed code like the following in transformJsonObjectAgg(), which sets the OID of the function to call from, again, JsonConstructorExpr:

{
    if (agg->absent_on_null)
        if (agg->unique)
            aggfnoid = F_JSONB_OBJECT_AGG_UNIQUE_STRICT;
        else
            aggfnoid = F_JSONB_OBJECT_AGG_STRICT;
    else if (agg->unique)
        aggfnoid = F_JSONB_OBJECT_AGG_UNIQUE;
    else
        aggfnoid = F_JSONB_OBJECT_AGG;
    aggtype = JSONBOID;
}

So, yes, the comments for them should be fixed too, like the other two, to also mention JsonConstructorExpr. Updated patch attached. Wonder if Alvaro has any thoughts on this. -- Thanks, Amit Langote v2-0001-SQL-JSON-Fix-some-obsolete-comments.patch Description: Binary data
Re: Why is infinite_recurse test suddenly failing?
Hello hackers, 15.08.2019 10:17, Thomas Munro wrote: On Thu, Aug 15, 2019 at 5:49 PM Tom Lane wrote: So that leads to the thought that "the infinite_recurse test is fine if it runs by itself, but it tends to fall over if there are concurrently-running backends". I have absolutely no idea how that would happen on anything that passes for a platform built in this century. Still, it's a place to start, which we hadn't before. Hmm. mereswin's recent failure on REL_11_STABLE was running the serial schedule. I read about 3 ways to get SEGV from stack-related faults: you can exceed RLIMIT_STACK (the total mapping size) and then you'll get SEGV (per man pages), you can access a page that is inside the mapping but is beyond the stack pointer (with some tolerance, exact details vary by arch), and you can fail to allocate a page due to low memory. The first kind of failure doesn't seem right -- we carefully set max_stack_size based on RLIMIT_STACK minus some slop, so that theory would require child processes to have different stack limits than the postmaster as you said (perhaps OpenStack, Docker, related tooling or concurrent activity on the host system is capable of changing it?), or a bug in our slop logic. The second kind of failure would imply that we have a bug -- we're accessing something below the stack pointer -- but that doesn't seem right either -- I think various address sanitising tools would have told us about that, and it's hard to believe there is a bug in the powerpc and arm implementation of the stack pointer check (see Linux arch/{powerpc,arm}/mm/fault.c). That leaves the third explanation, except then I'd expect to see other kinds of problems like OOM etc before you get to that stage, and why just here? Confused. Also notable is that we now have a couple of hits on ARM, not only ppc64. Don't know what to make of that. Yeah, that is indeed interesting. 
Excuse me for reviving this ancient thread, but the aforementioned mereswine animal has failed again recently [1]: 002_pg_upgrade_old_node.log contains:

2024-06-26 02:49:06.742 PDT [29121:4] LOG: server process (PID 30908) was terminated by signal 9: Killed
2024-06-26 02:49:06.742 PDT [29121:5] DETAIL: Failed process was running: select infinite_recurse();

I believe this time it's caused by an OOM condition, and I think this issue occurs on the armv7 mereswine because 1) armv7 uses the stack very efficiently (thanks to 32-bitness and maybe also the Link Register) and 2) such old machines are usually tight on memory. I've analyzed buildfarm logs and found from the check stage of that failed run:

wget [2] -O log
grep 'SQL function "infinite_recurse" statement 1' log | wc -l
5818

(that is, the nesting depth is 5818 levels for a successful run of the test) For comparison, mereswine on HEAD [3], [4] shows 5691 levels; alimoche (aarch64) on HEAD [5] — 3535; lapwing (i686) on HEAD [6] — 5034; alligator (x86_64) on HEAD [7] — 3965; So it seems to me that unlike [9] this failure may be explained by reaching an OOM condition. I have an armv7 device with 2GB RAM that doesn't pass `make check`, nor even `TESTS=infinite_recurse make -s check-tests`, from time to time, due to:

2024-06-28 12:40:49.947 UTC postmaster[20019] LOG: server process (PID 20078) was terminated by signal 11: Segmentation fault
2024-06-28 12:40:49.947 UTC postmaster[20019] DETAIL: Failed process was running: select infinite_recurse();
...
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
Core was generated by `postgres: android regression [local] SELECT '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 downcase_identifier (ident=0xa006d837 "infinite_recurse", len=16, warn=true, truncate=truncate@entry=true) at scansup.c:52
52 result = palloc(len + 1);
(gdb) p $sp
$1 = (void *) 0xbe9b0020

It looks more like [9], but I think the OOM effect is OS/kernel dependent. 
The test passes reliably with lower max_stack_depth values, though, so I've analyzed how much memory the backend consumes (total size and the size of its largest segment) depending on the value:

1500kB:           adfe1000 220260K rw--- [ anon ]   total 419452K
1600kB:           ac7e5000 234748K rw--- [ anon ]   total 434040K
1700kB:           acf61000 249488K rw--- [ anon ]   total 448880K
default (2048kB): aac65000 300528K rw--- [ anon ]   total 501424K

(roughly, increasing max_stack_depth by 100kB increases the backend memory consumption by 15MB during the test) So I think reducing max_stack_depth for mereswine, say to 1000kB, should prevent such failures in the future. [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mereswine&dt=2024-06-26%2002%3A10%3A45 [2] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=mereswine&dt=2024-06-26%2002%3A10%3A45&stg=check&raw=1 [3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mereswine&dt=2024-06-26%2016%3A48%3A07 [4] https://buildfarm.postgresql.o
Re: Check lateral references within PHVs for memoize cache keys
On 6/18/24 08:47, Richard Guo wrote: On Mon, Mar 18, 2024 at 4:36 PM Richard Guo wrote: Here is another rebase over master so it applies again. I also added a commit message to help review. Nothing else has changed. AFAIU currently we do not add Memoize nodes on top of join relation paths. This is because the ParamPathInfos for join relation paths do not maintain ppi_clauses, as the set of relevant clauses varies depending on how the join is formed. In addition, joinrels do not maintain lateral_vars. So we do not have a way to extract cache keys from joinrels. (Besides, there are places where the code doesn't cope with a Memoize path on top of a joinrel path, such as in get_param_path_clause_serials.) Therefore, when extracting lateral references within PlaceHolderVars, there is no need to consider those that are due to be evaluated at joinrels. Hence, here is the v7 patch for that. In passing, this patch also includes a comment explaining that Memoize nodes are currently not added on top of join relation paths (maybe we should have a separate patch for this?). Hi, I have reviewed v7 of the patch. This improvement is good enough to be applied, though. Here are some notes: The comment may be rewritten for clarity: "Determine if the clauses in param_info and innerrel's lateral_vars" - I'd replace lateral_vars with 'lateral references' to combine in one phrase the PHVs coming from both the rel and root->placeholder_list sources. I wonder if we can add the whole PHV expression instead of the Var (as discussed above) just under some condition:

if (!bms_intersect(pull_varnos(root, (Node *) phinfo->ph_var->phexpr),
                   innerrelids))
{
    // Add whole PHV
}
else
{
    // Add only pulled vars
}

I got the point about Memoize over a join, but as a join still calls replace_nestloop_params to replace parameters in its clauses, why not invent something similar to find Memoize keys inside a specific JoinPath node? It is not an issue for this patch, though - but is it doable? 
IMO, the code: if (bms_nonempty_difference(outerPlan->chgParam, node->keyparamids)) cache_purge_all(node); is a good place to check an assertion: is it really the parent query parameters that make a difference between memoize keys and node list of parameters? Generally, this patch looks good for me to be committed. -- regards, Andrei Lepikhov
Re: Improve EXPLAIN output for multicolumn B-Tree Index
On Thu, Jun 27, 2024 at 11:06 PM wrote: > OK. I would like to understand more about your proposed patch. I > have also registered as a reviewer in the commitfests entry. Great! > Although I haven't looked on your patch yet, if it's difficult to know > how it can optimize during the planning phase, it's enough for me to just > show "Skip Scan Cond (or Non-Key Filter)". This is because users can > understand that inefficient index scans *may* occur. That makes sense. The goal of your patch is to highlight when an index scan is using an index that is suboptimal for a particular query (a query that the user runs through EXPLAIN or EXPLAIN ANALYZE). The underlying rules that determine "access predicate vs. filter predicate" are not very complicated -- they're intuitive, even. But even an expert can easily make a mistake on a bad day. It seems to me that all your patch really needs to do is to give the user a friendly nudge in that direction, when it makes sense to. You want to subtly suggest to the user "hey, are you sure that the index the plan uses is exactly what you expected?". Fortunately, even when skip scan works well that should still be a useful nudge. If we assume that the query that the user is looking at is much more important than other queries, then the user really shouldn't be using skip scan in the first place. Even a good skip scan is a little suspicious (it's okay if it "stands out" a bit). > In terms of the concept of EXPLAIN output, I thought that runtime partition > pruning is similar. "EXPLAIN without ANALYZE" only shows the possibilities and > "EXPLAIN ANALYZE" shows the actual results. That seems logical. -- Peter Geoghegan
Re: Converting README documentation to Markdown
On 28.06.24 11:56, Jelte Fennema-Nio wrote: On Fri, 28 Jun 2024 at 09:38, Peter Eisentraut wrote: Getting that right in Markdown can be quite tricky. I agree that in some cases it's tricky. But what's the worst case that can happen when you get it wrong? It renders weird on github.com. I have my "less" set up so that "less somefile.md" automatically renders the markdown. That's been pretty useful. But if that now keeps making a mess out of PostgreSQL's README files, then I'm going to have to keep fixing things, and I might get really mad. That's the worst that could happen. ;-) So I don't agree with "aspirational markdown". If we're going to do it, then I expect that the files are marked up correctly at all times. Conversely, what's the best that could happen?
Restart pg_usleep when interrupted
Attachment protected by Amazon: [0001-Handle-Sleep-interrupts.patch] https://us-east-1.secure-attach.amazon.com/fcdc82ce-7887-4aa1-af9e-c1161a6b1d6f/bc81fa24-41de-48f9-a767-a6d15801754b Amazon has replaced the attachment with a download link. Downloads will be available until July 28, 2024, 19:59 (UTC+00:00). Hi, In the proposal by Bertrand [1] to implement vacuum cost delay tracking in pg_stat_progress_vacuum, it was discovered that the vacuum cost delay ends early on the leader process of a parallel vacuum due to parallel workers reporting progress on index vacuuming, which was introduced in 17 with commit [2]. With that change, every time a parallel worker completes an index vacuum, it sends a completion message to the leader. The facility that allows a parallel worker to report progress to the leader was introduced in commit [3]. In the case of the patch being proposed by Bertrand, the interrupts will be much more frequent, as parallel workers would send a message to the leader to update the vacuum delay counters on every vacuum_delay_point call. Looking into this, the ideal solution appears to be to provide the ability for a pg_usleep call to restart after being interrupted. Thanks to the usage of nanosleep as of commit [4], this should be trivial to do, as nanosleep provides a remaining time, which can be used to restart the sleep. This idea was also mentioned in [5]. I am attaching a POC patch that does the $SUBJECT. Rather than changing the existing pg_usleep, the patch introduces a function, pg_usleep_handle_interrupt, that takes in the sleep duration and a boolean flag to force handling of the interrupt, or not. This function can replace pg_usleep inside vacuum_delay_point, and could be useful in other cases in which it's important to handle interrupts. 
For Windows, the “force” = “true” flag of the new function uses SleepEx which from what I can tell based on the code comments is a non-interruptible sleep. Thoughts? [1] https://www.postgresql.org/message-id/ZnbIIUQpFJKAJx2W%40ip-10-97-1-34.eu-west-3.compute.internal [2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=46ebdfe164c61fbac961d1eb7f40e9a684289ae6 [3] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=f1889729dd3ab0352dc0ccc2ffcc1b1901f8e39f [4] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=a948e49e2ef11815be0b211723bfc5b53b7f75a8 [5] https://www.postgresql.org/message-id/24848.1473607575%40sss.pgh.pa.us Regards, Sami Imseih Amazon Web Services (AWS)
Re: Track the amount of time waiting due to cost_delay
> 46ebdfe164 will interrupt the leader's sleep every time a parallel worker > reports > progress, and we currently don't handle interrupts by restarting the sleep > with > the remaining time. nanosleep does provide the ability to restart with the > remaining > time [1], but I don't think it's worth the effort to ensure more accurate > vacuum delays for the leader process. After discussing offline with Bertrand, it may be better to have a solution that deals with the interrupts and allows the sleep to continue to completion. This will simplify this patch and will be useful for other cases in which parallel workers need to send a message to the leader. This is the thread [1] for that discussion. [1] https://www.postgresql.org/message-id/01000190606e3d2a-116ead16-84d2-4449-8d18-5053da66b1f4-00%40email.amazonses.com Regards, Sami
Re: Restart pg_usleep when interrupted
Hi, I think you need to find a way to disable this "Attachment protected by Amazon" thing: http://postgr.es/m/01000190606e3d2a-116ead16-84d2-4449-8d18-5053da66b1f4-000...@email.amazonses.com We want patches to be in the pgsql-hackers archives, not temporarily accessible via some link. ...Robert
Re: Restart pg_usleep when interrupted
> I think you need to find a way to disable this "Attachment protected > by Amazon" thing: Yes, I am looking into that. I only noticed after sending the email. Sorry about that. Sami
Doc: Move standalone backup section, mention -X argument
A documentation comment came in [1], prompting me to review some of our backup documentation, and I felt the current content and location of the standalone backups section were odd. I propose to move it to a better place, under file system backups. Adding to commitfest. David J. [1] https://www.postgresql.org/message-id/CAKFQuwZ%3DWxdWJ6O66yQ9dnWTLO12p7h3HpfhowCj%2B0U_bNrzdg%40mail.gmail.com v1-0001-docs-move-standalone-pg_basebackup-docs-to-file-syst.patch Description: Binary data
Re: Restart pg_usleep when interrupted
> We want patches to be in the pgsql-hackers archives, not temporarily > accessible via some link. > > …Robert > Moving to another email going forward. Reattaching the patch. 0001-Handle-Sleep-interrupts.patch Description: Binary data Regards, Sami Imseih Amazon Web Services
Re: Restart pg_usleep when interrupted
Sami Imseih writes: > Reattaching the patch. I feel like this is fundamentally a wrong solution, for the reasons cited in the comment for pg_usleep: long sleeps are a bad idea because of the resulting uncertainty about whether we'll respond to interrupts and such promptly. An example here is that if we get a query cancel interrupt, we should probably not insist on finishing out the current sleep before responding. Therefore, rather than "improving" pg_usleep (and uglifying its API), the right answer is to fix parallel vacuum leaders to not depend on pg_usleep in the first place. A better idea might be to use pg_sleep() or equivalent code. regards, tom lane
Re: Confine vacuum skip logic to lazy_scan_skip
On Sun, Mar 17, 2024 at 2:53 AM Thomas Munro wrote: > > On Tue, Mar 12, 2024 at 10:03 AM Melanie Plageman > wrote: > > I've rebased the attached v10 over top of the changes to > > lazy_scan_heap() Heikki just committed and over the v6 streaming read > > patch set. I started testing them and see that you are right, we no > > longer pin too many buffers. However, the uncached example below is > > now slower with streaming read than on master -- it looks to be > > because it is doing twice as many WAL writes and syncs. I'm still > > investigating why that is. --snip-- > 4. For learning/exploration only, I rebased my experimental vectored > FlushBuffers() patch, which teaches the checkpointer to write relation > data out using smgrwritev(). The checkpointer explicitly sorts > blocks, but I think ring buffers should naturally often contain > consecutive blocks in ring order. Highly experimental POC code pushed > to a public branch[2], but I am not proposing anything here, just > trying to understand things. The nicest looking system call trace was > with BUFFER_USAGE_LIMIT set to 512kB, so it could do its writes, reads > and WAL writes 128kB at a time: > > pwrite(32,...,131072,0xfc6000) = 131072 (0x2) > fdatasync(32) = 0 (0x0) > pwrite(27,...,131072,0x6c) = 131072 (0x2) > pread(27,...,131072,0x73e000) = 131072 (0x2) > pwrite(27,...,131072,0x6e) = 131072 (0x2) > pread(27,...,131072,0x75e000) = 131072 (0x2) > pwritev(27,[...],3,0x77e000) = 131072 (0x2) > preadv(27,[...],3,0x77e000) = 131072 (0x2) > > That was a fun experiment, but... I recognise that efficient cleaning > of ring buffers is a Hard Problem requiring more concurrency: it's > just too late to be flushing that WAL. But we also don't want to > start writing back data immediately after dirtying pages (cf. 
OS > write-behind for big sequential writes in traditional Unixes), because > we're not allowed to write data out without writing the WAL first and > we currently need to build up bigger WAL writes to do so efficiently > (cf. some other systems that can write out fragments of WAL > concurrently so the latency-vs-throughput trade-off doesn't have to be > so extreme). So we want to defer writing it, but not too long. We > need something cleaning our buffers (or at least flushing the > associated WAL, but preferably also writing the data) not too late and > not too early, and more in sync with our scan than the WAL writer is. > What that machinery should look like I don't know (but I believe > Andres has ideas). I've attached a WIP v11 streaming vacuum patch set here that is rebased over master (by Thomas), so that I could add a CF entry for it. It still has the problem with the extra WAL write and fsync calls investigated by Thomas above. Thomas has some work in progress doing streaming write-behind to alleviate the issues with the buffer access strategy and streaming reads. When he gets a version of that ready to share, he will start a new "Streaming Vacuum" thread. - Melanie From 050b6c3fa73c9153aeef58fcd306533c1008802e Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Fri, 26 Apr 2024 08:32:44 +1200 Subject: [PATCH v11 2/3] Refactor tidstore.c memory management. Previously, TidStoreIterateNext() would expand the set of offsets for each block into a buffer that it overwrote each time. In order to be able to collect the offsets for multiple blocks before working with them, change the contract. Now, the offsets are obtained by a separate call to TidStoreGetBlockOffsets(), which can be called at a later time, and TidStoreIteratorResult objects are safe to copy and store in a queue. This will be used by a later patch, to avoid the need for expensive extra copies of offset array and associated memory management. 
--- src/backend/access/common/tidstore.c | 68 +-- src/backend/access/heap/vacuumlazy.c | 9 ++- src/include/access/tidstore.h | 12 ++-- .../modules/test_tidstore/test_tidstore.c | 9 ++- 4 files changed, 53 insertions(+), 45 deletions(-) diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c index fb3949d69f6..c3c1987204b 100644 --- a/src/backend/access/common/tidstore.c +++ b/src/backend/access/common/tidstore.c @@ -147,9 +147,6 @@ struct TidStoreIter TidStoreIterResult output; }; -static void tidstore_iter_extract_tids(TidStoreIter *iter, BlockNumber blkno, - BlocktableEntry *page); - /* * Create a TidStore. The TidStore will live in the memory context that is * CurrentMemoryContext at the time of this call. The TID storage, backed @@ -486,13 +483,6 @@ TidStoreBeginIterate(TidStore *ts) iter = palloc0(sizeof(TidStoreIter)); iter->ts = ts; - /* - * We start with an array large enough to contain at least the offsets - * from one completely full bitmap element. - */ - iter->output.max_offset = 2 * BITS_PER_BITMAPWORD; - iter->output.offsets = palloc(sizeof(OffsetNumber) * iter
Re: Add LSN <-> time conversion functionality
On Wed, Jun 26, 2024 at 10:04 PM Melanie Plageman wrote: > > I've implemented these review points in the attached v4. I realized the docs had a compilation error. Attached v5 fixes that as well as three bugs I found while using this patch set more intensely today. I see Michael has been working on some crash safety for stats here [1]. I wonder if that would be sufficient for the LSNTimeStream. I haven't examined his patch functionality yet, though. I also had an off-list conversation with Robert where he suggested I could perhaps change the user-facing functions for estimating an LSN/time conversion to instead return a floor and a ceiling -- instead of linearly interpolating a guess. This would be a way to keep users from misunderstanding the accuracy of the functions to translate LSN <-> time. I'm interested in what others think of this. I like this idea a lot because it allows me to worry less about how I decide to compress the data and whether or not it will be accurate for use cases different than my own (the opportunistic freezing heuristic). If I can provide a floor and a ceiling that are definitely accurate, I don't have to worry about misleading people. - Melanie [1] https://www.postgresql.org/message-id/ZnEiqAITL-VgZDoY%40paquier.xyz From f492af31c1b9917aa27ba3ad76560e59f3fd5c9b Mon Sep 17 00:00:00 2001 From: Melanie Plageman Date: Wed, 27 Dec 2023 16:40:27 -0500 Subject: [PATCH v5 2/6] Add LSNTimeStream for converting LSN <-> time Add a new structure, LSNTimeStream, consisting of LSNTimes -- each an LSN, time pair. The LSNTimeStream is fixed size, so when a new LSNTime is inserted to a full LSNTimeStream, an LSNTime is dropped and the new LSNTime is inserted. We drop the LSNTime whose absence would cause the least error when interpolating between its adjoining points. LSN <-> time conversions can be done using linear interpolation with two LSNTimes on the LSNTimeStream. This commit does not add a global instance of LSNTimeStream. 
It adds the structures and functions needed to maintain and access such a stream. --- src/backend/utils/activity/pgstat_wal.c | 233 src/include/pgstat.h| 32 src/tools/pgindent/typedefs.list| 2 + 3 files changed, 267 insertions(+) diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c index 0e374f133a9..cef9429994c 100644 --- a/src/backend/utils/activity/pgstat_wal.c +++ b/src/backend/utils/activity/pgstat_wal.c @@ -17,8 +17,11 @@ #include "postgres.h" +#include "access/xlog.h" #include "executor/instrument.h" +#include "utils/builtins.h" #include "utils/pgstat_internal.h" +#include "utils/timestamp.h" PgStat_PendingWalStats PendingWalStats = {0}; @@ -32,6 +35,11 @@ PgStat_PendingWalStats PendingWalStats = {0}; static WalUsage prevWalUsage; +static void lsntime_insert(LSNTimeStream *stream, TimestampTz time, XLogRecPtr lsn); + +XLogRecPtr estimate_lsn_at_time(const LSNTimeStream *stream, TimestampTz time); +TimestampTz estimate_time_at_lsn(const LSNTimeStream *stream, XLogRecPtr lsn); + /* * Calculate how much WAL usage counters have increased and update * shared WAL and IO statistics. @@ -184,3 +192,228 @@ pgstat_wal_snapshot_cb(void) sizeof(pgStatLocal.snapshot.wal)); LWLockRelease(&stats_shmem->lock); } + +/* + * Given three LSNTimes, calculate the area of the triangle they form were they + * plotted with time on the X axis and LSN on the Y axis. + */ +static int +lsn_ts_calculate_error_area(LSNTime *left, LSNTime *mid, LSNTime *right) +{ + int rectangle_all = (right->time - left->time) * (right->lsn - left->lsn); + int triangle1 = rectangle_all / 2; + int triangle2 = (mid->lsn - left->lsn) * (mid->time - left->time) / 2; + int triangle3 = (right->lsn - mid->lsn) * (right->time - mid->time) / 2; + int rectangle_part = (right->lsn - mid->lsn) * (mid->time - left->time); + + return rectangle_all - triangle1 - triangle2 - triangle3 - rectangle_part; +} + +/* + * Determine which LSNTime to drop from a full LSNTimeStream. 
Once the LSNTime + * is dropped, points between it and either of its adjacent LSNTimes will be + * interpolated between those two LSNTimes instead. To keep the LSNTimeStream + * as accurate as possible, drop the LSNTime whose absence would have the least + * impact on future interpolations. + * + * We determine the error that would be introduced by dropping a point on the + * stream by calculating the area of the triangle formed by the LSNTime and its + * adjacent LSNTimes. We do this for each LSNTime in the stream (except for the + * first and last LSNTimes) and choose the LSNTime with the smallest error + * (area). We avoid extrapolation by never dropping the first or last points. + */ +static int +lsntime_to_drop(LSNTimeStream *stream) +{ + int min_area = INT_MAX; + int target_point = stream->length - 1; + + /* Don't drop points if free space available */ + Assert(stream->length == LSNTIMESTREA
Re: Restart pg_usleep when interrupted
On Sat, Jun 29, 2024 at 9:34 AM Tom Lane wrote: > Therefore, rather than "improving" pg_usleep (and uglifying its API), > the right answer is to fix parallel vacuum leaders to not depend on > pg_usleep in the first place. A better idea might be to use > pg_sleep() or equivalent code. In case it's useful for someone looking into that, in earlier discussions we figured out that it is possible to have high resolution timeouts AND support latch multiplexing on Linux, FreeBSD, macOS: https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BhC9mFx8tEcBsyo7-cAfWgtbRy1eDizeFuff2K7T%3D4bA%40mail.gmail.com
Re: Restart pg_usleep when interrupted
Thanks for the feedback! > On Jun 28, 2024, at 4:34 PM, Tom Lane wrote: > > Sami Imseih writes: >> Reattaching the patch. > > I feel like this is fundamentally a wrong solution, for the reasons > cited in the comment for pg_usleep: long sleeps are a bad idea > because of the resulting uncertainty about whether we'll respond to > interrupts and such promptly. An example here is that if we get > a query cancel interrupt, we should probably not insist on finishing > out the current sleep before responding. The case that brought up this discussion is the pg_usleep called within vacuum_delay_point being interrupted. When I read the same code comment you cited, it sounded to me that "long sleeps" are those that run for seconds or minutes. The longest vacuum delay allowed is 100ms. > Therefore, rather than "improving" pg_usleep (and uglifying its API), > the right answer is to fix parallel vacuum leaders to not depend on > pg_usleep in the first place. A better idea might be to use > pg_sleep() or equivalent code. Yes, that is a good idea to explore, and it will not require introducing an awkward new API. I will look into using something similar to pg_sleep. Regards, Sami
Re: cfbot update: Using GitHub for patch review
On Sat, Jun 29, 2024 at 1:10 AM Ashutosh Bapat wrote: > I need to sign in to github to add my review comments. So those who do not > have a github account can not use it for review. But I don't think that can > be fixed. We need a way to know who left review comments. I don't think Jelte was talking about moving review discussion to Github, just providing a link to *view* the patches there. Now I'm wondering if there is a way to disable comments on commits in the postgresql-cfbot GH account. I guess they'd be lost after 48 hours anyway when the branch gets force-pushed and commit hash changes? I don't want people to start posting comments there that no one is looking at. > There was some discussion at pgconf.dev about using gitlab instead of github. > How easy is it to use gitlab if we decide to go that way? cfbot could certainly be configured to push (ie mirror) the same branches to gitlab too (I don't have much experience with Gitlab, but if it's just a matter of registering an account + ssh key, adding it as a remote and pushing...). Then there could be [View on Github] [View on Gitlab] buttons, if people think that's useful (note "View", not "Review"!). The Cirrus CI system is currently only capable of testing stuff pushed to Github, though, so cfbot would continue to push stuff there. If memory serves, Cirrus used to say that they were planning to add support for testing code in public Gitlab next, but now their FAQ says their next public git host will be Bitbucket: https://cirrus-ci.org/faq/#only-github-support Given that cfbot is currently only using Github because we have to to reach Cirrus CI, not because we actually want Github features like issue tracking or pull requests with review discussion, it hardly matters if it's Github, Gitlab or any other public git host. And if we eventually decide to move our whole workflow to one of those systems and shut down the CF app, then cfbot will be retired, and you'll just create PRs on that system. 
But so far, we continue to prefer the CF app + email. The reason we liked Cirrus so much despite the existence of many other CI systems, including the ones built into GH, GL, etc., and many 3rd party ones, was because it was the only provider that allowed enough compute minutes for our needs, supported lots of operating systems, and had public links to log files suitable for sharing on our mailing list or cfbot's web interface (if you click to see the log, it doesn't say "Roll up, roll up, welcome to Foo Corporation, get your tickets here!"). I still don't know of any other CI system that would be as good for us, other than building our own. I would say it's been a very good choice so far. The original cfbot goal was "feed the mailing list to a CI system", with Github just a necessary part of the plumbing. It is a nice way to view patches though.
Re: race condition in pg_class
On Fri, Jun 28, 2024 at 01:17:22AM -0400, Tom Lane wrote: > Noah Misch writes: > > Pushed. Buildfarm member prion is failing the new inplace-inval.spec, > > almost > > surely because prion uses -DCATCACHE_FORCE_RELEASE and inplace-inval.spec is > > testing an extant failure to inval a cache entry. Naturally, inexorable > > inval > > masks the extant bug. Ideally, I'd just skip the test under any kind of > > cache > > clobber option. I don't know a pleasant way to do that, so these are > > known-feasible things I'm considering: > > > 1. Neutralize the test in all branches, probably by having it just not > > report > >the final answer. Undo in the later fix patch. > > > 2. v14+ has pg_backend_memory_contexts. In the test, run some plpgsql that > >uses heuristics on that to deduce whether caches are getting released. > >Have a separate expected output for the cache-release scenario. Perhaps > >also have the test treat installcheck like cache-release, since > >installcheck could experience sinval reset with similar consequences. > >Neutralize the test in v12 & v13. > > > 3. Add a test module with a C function that reports whether any kind of > > cache > >clobber is active. Call it in this test. Have a separate expected > > output > >for the cache-release scenario. > > > Preferences or other ideas? I'm waffling between (1) and (2). I'll give it > > more thought over the next day. > > I'd just go for (1). We were doing fine without this test case. > I can't see expending effort towards hiding its result rather > than actually fixing anything. Good point, any effort on (2) would be wasted once the fixes get certified. I pushed (1). I'm attaching the rebased fix patches. Author: Noah Misch Commit: Noah Misch Warn if LOCKTAG_TUPLE is held at commit, under debug_assertions. The current use always releases this locktag. A planned use will continue that intent. It will involve more areas of code, making unlock omissions easier. 
Warn under debug_assertions, like we do for various resource leaks. Back-patch to v12 (all supported versions), the plan for the commit of the new use. Reviewed by FIXME. Discussion: https://postgr.es/m/20240512232923.aa.nmi...@google.com diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c index 0400a50..461d925 100644 --- a/src/backend/storage/lmgr/lock.c +++ b/src/backend/storage/lmgr/lock.c @@ -2256,6 +2256,11 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks) locallock->numLockOwners = 0; } +#ifdef USE_ASSERT_CHECKING + if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_TUPLE && !allLocks) + elog(WARNING, "tuple lock held at commit"); +#endif + /* * If the lock or proclock pointers are NULL, this lock was taken via * the relation fast-path (and is not known to have been transferred). Author: Noah Misch Commit: Noah Misch Fix data loss at inplace update after heap_update(). As previously-added tests demonstrated, heap_inplace_update() could instead update an unrelated tuple of the same catalog. It could lose the update. Losing relhasindex=t was a source of index corruption. Inplace-updating commands like VACUUM will now wait for heap_update() commands like GRANT TABLE and GRANT DATABASE. That isn't ideal, but a long-running GRANT already hurts VACUUM progress more just by keeping an XID running. The VACUUM will behave like a DELETE or UPDATE waiting for the uncommitted change. For implementation details, start at the heap_inplace_update_scan() header comment and README.tuplock. Back-patch to v12 (all supported versions). In back branches, retain a deprecated heap_inplace_update(), for extensions. Reviewed by FIXME. 
Discussion: https://postgr.es/m/CAMp+ueZQz3yDk7qg42hk6-9gxniYbp-=bg2mgqecerqr5gg...@mail.gmail.com diff --git a/src/backend/access/heap/README.tuplock b/src/backend/access/heap/README.tuplock index 6441e8b..dbfa2b7 100644 --- a/src/backend/access/heap/README.tuplock +++ b/src/backend/access/heap/README.tuplock @@ -153,3 +153,56 @@ The following infomask bits are applicable: We currently never set the HEAP_XMAX_COMMITTED when the HEAP_XMAX_IS_MULTI bit is set. + +Locking to write inplace-updated tables +--- + +[This is the plan, but LOCKTAG_TUPLE acquisition is not yet here.] + +If IsInplaceUpdateRelation() returns true for a table, the table is a system +catalog that receives heap_inplace_update_scan() calls. Preparing a +heap_update() of these tables follows additional locking rules, to ensure we +don't lose the effects of an inplace update. In particular, consider a moment +when a backend has fetched the old tuple to modify, not yet having called +heap_up
Re: Inval reliability, especially for inplace updates
On Mon, Jun 17, 2024 at 04:58:54PM -0700, Noah Misch wrote: > attached v2 patch stack. Rebased. This applies on top of three patches from https://postgr.es/m/20240629024251.03.nmi...@google.com. I'm attaching those to placate cfbot, but this thread is for review of the last patch only. Author: Noah Misch Commit: Noah Misch Warn if LOCKTAG_TUPLE is held at commit, under debug_assertions. The current use always releases this locktag. A planned use will continue that intent. It will involve more areas of code, making unlock omissions easier. Warn under debug_assertions, like we do for various resource leaks. Back-patch to v12 (all supported versions), the plan for the commit of the new use. Reviewed by FIXME. Discussion: https://postgr.es/m/20240512232923.aa.nmi...@google.com diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c index 0400a50..461d925 100644 --- a/src/backend/storage/lmgr/lock.c +++ b/src/backend/storage/lmgr/lock.c @@ -2256,6 +2256,11 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks) locallock->numLockOwners = 0; } +#ifdef USE_ASSERT_CHECKING + if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_TUPLE && !allLocks) + elog(WARNING, "tuple lock held at commit"); +#endif + /* * If the lock or proclock pointers are NULL, this lock was taken via * the relation fast-path (and is not known to have been transferred). Author: Noah Misch Commit: Noah Misch Fix data loss at inplace update after heap_update(). As previously-added tests demonstrated, heap_inplace_update() could instead update an unrelated tuple of the same catalog. It could lose the update. Losing relhasindex=t was a source of index corruption. Inplace-updating commands like VACUUM will now wait for heap_update() commands like GRANT TABLE and GRANT DATABASE. That isn't ideal, but a long-running GRANT already hurts VACUUM progress more just by keeping an XID running. The VACUUM will behave like a DELETE or UPDATE waiting for the uncommitted change. 
For implementation details, start at the heap_inplace_update_scan() header comment and README.tuplock. Back-patch to v12 (all supported versions). In back branches, retain a deprecated heap_inplace_update(), for extensions. Reviewed by FIXME. Discussion: https://postgr.es/m/CAMp+ueZQz3yDk7qg42hk6-9gxniYbp-=bg2mgqecerqr5gg...@mail.gmail.com diff --git a/src/backend/access/heap/README.tuplock b/src/backend/access/heap/README.tuplock index 6441e8b..dbfa2b7 100644 --- a/src/backend/access/heap/README.tuplock +++ b/src/backend/access/heap/README.tuplock @@ -153,3 +153,56 @@ The following infomask bits are applicable: We currently never set the HEAP_XMAX_COMMITTED when the HEAP_XMAX_IS_MULTI bit is set. + +Locking to write inplace-updated tables +--- + +[This is the plan, but LOCKTAG_TUPLE acquisition is not yet here.] + +If IsInplaceUpdateRelation() returns true for a table, the table is a system +catalog that receives heap_inplace_update_scan() calls. Preparing a +heap_update() of these tables follows additional locking rules, to ensure we +don't lose the effects of an inplace update. In particular, consider a moment +when a backend has fetched the old tuple to modify, not yet having called +heap_update(). Another backend's inplace update starting then can't conclude +until the heap_update() places its new tuple in a buffer. We enforce that +using locktags as follows. While DDL code is the main audience, the executor +follows these rules to make e.g. "MERGE INTO pg_class" safer. Locking rules +are per-catalog: + + pg_class heap_inplace_update_scan() callers: before the call, acquire + LOCKTAG_RELATION in mode ShareLock (CREATE INDEX), ShareUpdateExclusiveLock + (VACUUM), or a mode with strictly more conflicts. If the update targets a + row of RELKIND_INDEX (but not RELKIND_PARTITIONED_INDEX), that lock must be + on the table. Locking the index rel is optional. (This allows VACUUM to + overwrite per-index pg_class while holding a lock on the table alone.) 
We + could allow weaker locks, in which case the next paragraph would simply call + for stronger locks for its class of commands. heap_inplace_update_scan() + acquires and releases LOCKTAG_TUPLE in InplaceUpdateTupleLock, an alias for + ExclusiveLock, on each tuple it overwrites. + + pg_class heap_update() callers: before copying the tuple to modify, take a + lock that conflicts with at least one of those from the preceding paragraph. + SearchSysCacheLocked1() is one convenient way to acquire LOCKTAG_TUPLE. + After heap_update(), release any LOCKTAG_TUPLE. Most of these callers opt + to acquire just the LOCKTAG_RELATION. + + pg_database: before copying the
Re: Backporting BackgroundPsql
On Thu, Jun 27, 2024 at 10:05 PM Heikki Linnakangas wrote: > > Ok, I pushed commits to backport BackgroundPsql down to v12. I used > "option 2", i.e. I changed background_psql() to return the new > BackgroundPsql object. > > Don't we need to add install and uninstall rules for the new module, like we did in https://git.postgresql.org/pg/commitdiff/a4c17c86176cfa712f541b81b2a026ae054b275e and https://git.postgresql.org/pg/commitdiff/7039c7cff6736780c3bbb41a90a6dfea0f581ad2 ? Thanks, Pavan
Incremental backup from a streaming replication standby
I played around with incremental backup yesterday and tried $subject. The WAL summarizer is running on the standby server, but when I try to take an incremental backup, I get an error that I understand to mean that WAL summarization hasn't caught up yet. I am not sure whether that is working as designed, but if it is, I think it should be documented. Yours, Laurenz Albe