Re: Add SPLIT PARTITION/MERGE PARTITIONS commands

2024-04-28 Thread Alexander Korotkov
On Sun, Apr 28, 2024 at 2:00 PM Alexander Lakhin  wrote:
> 28.04.2024 03:59, Alexander Korotkov wrote:
> > The revised patchset is attached.  I'm going to push it if there are
> > no objections.
>
> I have one additional question regarding security, if you don't mind:
> What permissions should a user have to perform split/merge?
>
> When we deal with mixed ownership, say, bob is an owner of a
> partitioned table, but not an owner of a partition, should we
> allow him to perform merge with that partition?
> Consider the following script:
> CREATE ROLE alice;
> GRANT CREATE ON SCHEMA public TO alice;
>
> SET SESSION AUTHORIZATION alice;
> CREATE TABLE t (i int PRIMARY KEY, t text, u text) PARTITION BY RANGE (i);
> CREATE TABLE tp_00 PARTITION OF t FOR VALUES FROM (0) TO (10);
> CREATE TABLE tp_10 PARTITION OF t FOR VALUES FROM (10) TO (20);
>
> CREATE POLICY p1 ON tp_00 USING (u = current_user);
> ALTER TABLE tp_00 ENABLE ROW LEVEL SECURITY;
>
> INSERT INTO t(i, t, u)  VALUES (0, 'info for bob', 'bob');
> INSERT INTO t(i, t, u)  VALUES (1, 'info for alice', 'alice');
> RESET SESSION AUTHORIZATION;
>
> CREATE ROLE bob;
> GRANT CREATE ON SCHEMA public TO bob;
> ALTER TABLE t OWNER TO bob;
> GRANT SELECT ON TABLE tp_00 TO bob;
>
> SET SESSION AUTHORIZATION bob;
> SELECT * FROM tp_00;
> -- here bob can see only his info
> \d
>  Schema | Name  |       Type        | Owner
> --------+-------+-------------------+-------
>  public | t     | partitioned table | bob
>  public | tp_00 | table             | alice
>  public | tp_10 | table             | alice
>
> -- but then bob can do:
> ALTER TABLE t MERGE PARTITIONS (tp_00, tp_10) INTO tp_00;
> -- (yes, he can also detach the partition tp_00, but then he could
> -- neither re-attach nor read it)
>
> \d
>  Schema | Name  |       Type        | Owner
> --------+-------+-------------------+-------
>  public | t     | partitioned table | bob
>  public | tp_00 | table             | bob
>
> Thus bob has effectively captured the partition along with its data.
>
> What do you think, does this create a new security risk?

Alexander, thank you for discovering this.  I believe that the one who
merges partitions should have permissions on all the partitions
being merged.  I'll recheck this and provide a patch.
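
For illustration, under that rule the merge in the script above should
fail for bob, since he owns t but not tp_00 and tp_10.  This is a sketch
of the intended behavior only; the exact error wording is an assumption,
not the actual patch output:

```sql
-- Continuing the script above as bob, who owns t but not its partitions.
SET SESSION AUTHORIZATION bob;
ALTER TABLE t MERGE PARTITIONS (tp_00, tp_10) INTO tp_00;
-- With the ownership check in place, this should fail with something like:
-- ERROR:  must be owner of table tp_00
```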

--
Regards,
Alexander Korotkov




Re: Add SPLIT PARTITION/MERGE PARTITIONS commands

2024-04-27 Thread Alexander Korotkov
Hi Justin,

Thank you for your review.  Please check v9 of the patchset [1].

On Wed, Apr 24, 2024 at 11:26 PM Justin Pryzby  wrote:
> This patch also/already fixes the schema issue I reported.  Thanks.
>
> If you wanted to include a test case for that:
>
> begin;
> CREATE SCHEMA s;
> CREATE SCHEMA t;
> CREATE TABLE p(i int) PARTITION BY RANGE(i);
> CREATE TABLE s.c1 PARTITION OF p FOR VALUES FROM (1)TO(2);
> CREATE TABLE s.c2 PARTITION OF p FOR VALUES FROM (2)TO(3);
> ALTER TABLE p MERGE PARTITIONS (s.c1, s.c2) INTO s.c1;
> -- misbehaves if merging into the same name as an existing partition
> \d+ p
> ...
> Partitions: c1 FOR VALUES FROM (1) TO (3)

There is already a test which checks merging into the same name as an
existing partition.  And there are tests with schema-qualified names.
I'm not yet convinced we need a test with both these properties
together.

> > 0002
> > The persistence of the new partition is copied as suggested in [1].
> > But the checks are in-place, because search_path could influence new
> > table persistence.  Per review [2], commit message typos are fixed,
> > documentation is revised, revised tests to cover schema-qualification,
> > usage of search_path.
>
> Subject: [PATCH v8 2/7] Make new partitions with parent's persistence during 
> MERGE/SPLIT operations
>
> This patch adds documentation saying:
> +  Any indexes, constraints and user-defined row-level triggers that exist
> +  in the parent table are cloned on new partitions [...]
>
> Which is good to say, and addresses part of my message [0]
> [0] ZiJW1g2nbQs9ekwK@pryzbyj2023
>
> But it doesn't have anything to do with "creating new partitions with
> parent's persistence".  Maybe there was a merge conflict and the docs
> ended up in the wrong patch ?

Makes sense.  Extracted this into a separate patch in v10.

> Also, defaults, storage options, compression are also copied.  As will
> be anything else from LIKE.  And since anything added in the future will
> also be copied, maybe it's better to just say that the tables will be
> created the same way as "LIKE .. INCLUDING ALL EXCLUDING ..", or
> similar.  Otherwise, the next person who adds a new option for LIKE
> would have to remember to update this paragraph...

Reworded that way.  Thank you.

> Also, extended stats objects are currently cloned to new child tables.
> But I suggested in [0] that they probably shouldn't be.

I will explore this.  Do we copy extended stats when we do CREATE
TABLE ... PARTITION OF?  I think we need to do the same here.
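
As a quick way to check (a sketch only; the table and statistics names
are made up):

```sql
-- Does CREATE TABLE ... PARTITION OF clone extended statistics objects?
CREATE TABLE parent (a int, b int) PARTITION BY RANGE (a);
CREATE STATISTICS parent_stats (dependencies) ON a, b FROM parent;
CREATE TABLE child PARTITION OF parent FOR VALUES FROM (0) TO (10);
-- If extended stats were cloned, a row for the child would show up here:
SELECT stxname, stxrelid::regclass FROM pg_statistic_ext;
```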

> > 007 – doc review by Justin [3]
>
> I suggest to drop this patch for now.  I'll send some more minor fixes to
> docs and code comments once the other patches are settled.

Your edits are welcome.  I've dropped this patch for now and will wait
for the next revision from you.

Links.
1. 
https://www.postgresql.org/message-id/CAPpHfduYuYECrqpHMgcOsNr%2B4j3uJK%2BJPUJ_zDBn-tqjjh3p1Q%40mail.gmail.com

--
Regards,
Alexander Korotkov
Supabase




Re: New committers: Melanie Plageman, Richard Guo

2024-04-26 Thread Alexander Korotkov
On Fri, Apr 26, 2024 at 2:54 PM Jonathan S. Katz  wrote:
>
> The Core Team would like to extend our congratulations to Melanie
> Plageman and Richard Guo, who have accepted invitations to become our
> newest PostgreSQL committers.
>
> Please join us in wishing them much success and few reverts!

Many congratulations!  Well deserved!

--
Regards,
Alexander Korotkov




Re: [PATCH] Improve amcheck to also check UNIQUE constraint in btree index.

2024-04-24 Thread Alexander Korotkov
On Wed, Apr 17, 2024 at 9:38 AM Peter Eisentraut  wrote:
> On 24.10.23 22:13, Alexander Korotkov wrote:
> > On Wed, Sep 28, 2022 at 11:44 AM Aleksander Alekseev
> >  wrote:
> >>> I think this patch was marked as "Waiting on Author", probably by
> >>> mistake, since the recent changes were done without any significant code
> >>> changes and the CF bot is now happy again.
> >>>
> >>> I'm going to move it to RfC, may I? If not, please tell me why.
> >>
> >> I restored the "Ready for Committer" state. I don't think it's a good
> >> practice to change the state every time the patch has a slight
> >> conflict or something. This is not helpful at all. Such things happen
> >> quite regularly and typically are fixed in a couple of days.
> >
> > This patch seems useful to me.  I went through the thread, it seems
> > that all the critics are addressed.
> >
> > I've rebased this patch.   Also, I've run perltidy for tests, split
> > long errmsg() into errmsg(), errdetail() and errhint(), and made other
> > minor enhancements.
> >
> > I think this patch is ready to go.  I'm going to push it if there are
> > no objections.
>
> I just found the new pg_amcheck option --checkunique in PG17-to-be.
> Could we rename this to --check-unique?  Seems friendlier.  Maybe also
> rename the bt_index_check function argument to check_unique.

+1 from me
Let's do so if nobody objects.

--
Regards,
Alexander Korotkov




Re: [PATCH] Improve amcheck to also check UNIQUE constraint in btree index.

2024-04-24 Thread Alexander Korotkov
On Wed, Apr 17, 2024 at 6:41 PM Pavel Borisov  wrote:
>>  I did notice (I meant to point out) that I have concerns about this
>> part of the new uniqueness check code:
>> "
>> if (P_IGNORE(topaque) || !P_ISLEAF(topaque))
>> break;
>> "
>>
>> My concern here is with the !P_ISLEAF(topaque) test -- it shouldn't be
>> required
>
> I agree. But I didn't see the need to check for uniqueness constraint
> violations in internal pages. Furthermore, such a violation there would mean
> not just a constraint violation, but major index corruption. I agree that
> checking and reporting this type of corruption separately is a possible thing.

I think we could just throw an error in case of an unexpected internal
page.  It doesn't seem reasonable to continue the check with this type
of corruption detected.  If the tree linkage is corrupted we may enter
an endless loop or something.
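
A sketch of what that could look like in the uniqueness-check loop.  The
variable names follow the snippet quoted above and the reported block
number is an assumption; this is not the actual patch:

```c
/* Sketch only: report corruption instead of silently stopping the check. */
if (!P_ISLEAF(topaque))
    ereport(ERROR,
            (errcode(ERRCODE_INDEX_CORRUPTED),
             errmsg("right sibling of leaf block %u is not a leaf page",
                    targetblock)));
if (P_IGNORE(topaque))
    break;
```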

>> Separately, I dislike the way the target block changes within
>> bt_target_page_check(). The general idea behind verify_nbtree.c's
>> target block is that every block becomes the target exactly once, in a
>> clearly defined place. All corruption (in the index structure itself)
>> is formally considered to be a problem with that particular target
>> block. I want to be able to clearly distinguish between the target and
>> target's right sibling here, to explain my concerns, but they're kinda
>> both the target, so that's a lot harder than it should be. (Admittedly
>> directly blaming the target block has always been a little bit
>> arbitrary, at least in certain cases, but even there it provides
>> structure that makes things much easier to describe unambiguously.)
>
> The possible way to load the target block only once is to get rid of the
> cross-page uniqueness violation check. I introduced it to catch more possible
> cases of uniqueness violations. Though they are expected to be extremely
> rare, and anyway the algorithm doesn't give any guarantee; it just does its
> best to catch what is possible. I don't object to this change.

I think we could probably just avoid setting state->target during
cross-page check.  Just save that into a local variable and pass as an
argument where needed.

Skipping the visibility checks for the "only one tuple per scan key" case
looks like a very valuable optimization [1].

I also think we should wrap the lVis_* variables into a struct.  That would
make the way we pass them to functions more elegant.

Links.
1. https://www.postgresql.org/message-id/20240325020323.fd.nmisch%40google.com

--
Regards,
Alexander Korotkov




Re: POC: GROUP BY optimization

2024-04-22 Thread Alexander Korotkov
Hi!

On Thu, Apr 18, 2024 at 1:57 PM Alexander Korotkov  wrote:
> Thank you for the fixes you've proposed.  I didn't look much into
> details yet, but I think the main concern Tom expressed in [1] is
> whether the feature is reasonable at all.  I think at this stage the
> most important thing is to come up with convincing examples showing
> how huge performance benefits it could cause.  I will return to this
> later today and will try to provide some convincing examples.

I took a fresh look at 0452b461b, and have the following thoughts:
1) Previously, preprocess_groupclause() tried to reorder GROUP BY
clauses to match the required ORDER BY order.  It only reordered if the
GROUP BY pathkeys were a prefix of the ORDER BY pathkeys or vice versa,
so both of them had to be satisfied by one ordering.  0452b461b also
tries to match GROUP BY clauses to ORDER BY clauses, but takes
incremental sort into account.  Actually, instead of removing
preprocess_groupclause(), we could just teach it to take incremental
sort into account.
2) The get_useful_group_keys_orderings() function takes into account 3
orderings of pathkeys and clauses: original order as written in GROUP
BY, matching ORDER BY clauses as much as possible, and matching the
input path as much as possible.  Given that even before 0452b461b
preprocess_groupclause() could change the original order of GROUP BY
clauses, we don't need to distinguish the first two.  We could just
consider output of new preprocess_groupclause() taking into account an
incremental sort and the ordering matching the input path as much as
possible.  This seems like significant simplification.

Let me also describe my thoughts about the justification of the
feature itself.  As Tom pointed out upthread, Sort + Grouping is generally
unlikely to be faster than Hash Aggregate.  The main point of this feature
is being able to match the order of GROUP BY clauses to the order of
the input path.  That input path could be Index Scan or Subquery.
Let's concentrate on Index Scan.  Index Scan can give us the required
order, so we can do grouping without Sort or with significantly
cheaper Incremental Sort.  That could be much faster than Hash
Aggregate.  But if we scan the full table (or its significant
fraction), Index Scan is typically much more expensive than Sequential
Scan because of random page access.  However, there are cases when
Index Scan could be cheaper.
1) If the heap row is wide and the index contains all the required
columns, Index Only Scan can be cheaper than Sequential Scan because
of lesser volume.
2) If the query predicate is selective enough and matches the index,
Index Scan might be significantly cheaper.  One may say that if the
query predicate is selective enough, there are not that many
matching rows, so aggregating them either way isn't a big deal anyway.
However, given that data volumes are growing tremendously, it's not
hard to imagine an index selecting a small fraction of table
rows that is still large enough to make aggregation costly.
Therefore, I think there are significant cases where matching GROUP BY
clauses to the order of the input path could give a substantial
improvement over Hash Aggregate.
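
A hypothetical example of case (1): a covering index on a wide table,
where reordering GROUP BY to match the index column order lets grouping
run without a full Sort.  The table and index names are made up for
illustration:

```sql
CREATE TABLE events (user_id int, day date, payload text);
CREATE INDEX events_user_day_idx ON events (user_id, day);
-- Written order is (day, user_id); reordered to (user_id, day) it matches
-- the index, so an Index Only Scan can feed grouping with no Sort, or
-- at worst a cheap Incremental Sort.
EXPLAIN (COSTS OFF)
SELECT day, user_id, count(*)
FROM events
GROUP BY day, user_id;
```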

While Jian He has provided some particular use cases, I hope the
above gives some general rationale.

--
Regards,
Alexander Korotkov




Re: POC: GROUP BY optimization

2024-04-18 Thread Alexander Korotkov
Hi, Andrei!

On Thu, Apr 18, 2024 at 11:54 AM Andrei Lepikhov
 wrote:
> On 4/12/24 06:44, Tom Lane wrote:
> > * Speaking of pathnodes.h, PathKeyInfo is a pretty uninformative node
> > type name, and the comments provided for it are not going to educate
> > anybody.  What is the "association" between the pathkeys and clauses?
> > I guess the clauses are supposed to be SortGroupClauses, but not all
> > pathkeys match up to SortGroupClauses, so what then?  This is very
> > underspecified, and fuzzy thinking about data structures tends to lead
> > to bugs.
> I'm not the best in English documentation and naming style. So, feel
> free to check my version.
> >
> > So I'm quite afraid that there are still bugs lurking here.
> > What's more, given that the plans this patch makes apparently
> > seldom win when you don't put a thumb on the scales, it might
> > take a very long time to isolate those bugs.  If the patch
> > produces a faulty plan (e.g. not sorted properly) we'd never
> > notice if that plan isn't chosen, and even if it is chosen
> > it probably wouldn't produce anything as obviously wrong as
> > a crash.
> I added checks on the proper order of pathkeys and clauses.
>   If you really care about that, we should spend additional time and
> rewrite the code to generate an order of clauses somewhere in the plan
> creation phase. For example, during the create_group_plan call, we could
> use the processed_groupClause list, cycle through subpath->pathkeys, set
> the order, and detect whether the pathkeys list corresponds to the
> group-by or is enough to build a grouping plan.
> Anyway, making this part of the code more resistant to blunders is another story.

Thank you for the fixes you've proposed.  I didn't look much into
details yet, but I think the main concern Tom expressed in [1] is
whether the feature is reasonable at all.  I think at this stage the
most important thing is to come up with convincing examples showing
how huge performance benefits it could cause.  I will return to this
later today and will try to provide some convincing examples.

Links
1. https://www.postgresql.org/message-id/266850.1712879082%40sss.pgh.pa.us

--
Regards,
Alexander Korotkov




Re: Add SPLIT PARTITION/MERGE PARTITIONS commands

2024-04-18 Thread Alexander Korotkov
Hi, Dmitry!

On Mon, Apr 15, 2024 at 6:26 PM Dmitry Koval  wrote:
>
> Hi!
>
> > Please find my version of this fix attached.
>
> Is it possible to make a small addition to the file v6-0001 ... .patch
> (see attachment)?
>
> Most important:
> 1) Line 19:
>
> + mergePartName = makeRangeVar(cmd->name->schemaname, tmpRelName, -1);
>
> (temporary table should use the same schema as the partition);
>
> 2) Lines 116-123:
>
> +RESET search_path;
> +
> +-- Can't merge persistent partitions into a temporary partition
> +ALTER TABLE t MERGE PARTITIONS (tp_0_1, tp_1_2) INTO pg_temp.tp_0_2;
> +
> +SET search_path = pg_temp, public;
>
> (Alexander Lakhin's test for explicit use of the pg_temp schema).
>
>
> The rest of the changes in v6_afterfix.diff are not very important and
> can be ignored.

Thank you.  I've integrated your changes.

The revised patchset is attached.
1) I've split the fix for the CommandCounterIncrement() issue and the
fix for the relation persistence issue into separate patches.
2) I've validated that the lock on the new partition is held in
createPartitionTable() after ProcessUtility() as pointed out by
Robert.  So, no need to place the lock again.
3) Added fix for problematic error message as a separate patch [1].
4) Added rename "salemans" => "salesmen" for tests as a separate patch.

I think these fixes are reaching committable shape, but I'd like
someone to check them before I push.

Links.
1. 
https://postgr.es/m/20240408.152402.1485994009160660141.horikyota.ntt%40gmail.com

--
Regards,
Alexander Korotkov


v6-0002-Verify-persistence-of-new-partitions-during-MERGE.patch
Description: Binary data


v6-0004-Grammar-fix-for-tests-of-partition-MERGE-SPLIT-op.patch
Description: Binary data


v6-0003-Fix-error-message-in-check_partition_bounds_for_s.patch
Description: Binary data


v6-0001-Add-missing-CommandCounterIncrement-to-ATExecMerg.patch
Description: Binary data


Re: Stack overflow issue

2024-04-17 Thread Alexander Korotkov
On Wed, Apr 17, 2024 at 2:37 PM Alexander Korotkov  wrote:
> On Tue, Apr 16, 2024 at 7:42 PM Alexander Korotkov  
> wrote:
> > On Tue, Apr 16, 2024 at 6:35 PM Andres Freund  wrote:
> > > On 2024-04-16 15:45:42 +0300, Alexander Korotkov wrote:
> > > > On Tue, Apr 16, 2024 at 1:48 AM Andres Freund  
> > > > wrote:
> > > > > On 2024-03-06 14:17:23 +0200, Alexander Korotkov wrote:
> > > > > > 0001 Turn tail recursion into iteration in 
> > > > > > CommitTransactionCommand()
> > > > > > I did minor revision of comments and code blocks order to improve 
> > > > > > the
> > > > > > readability.
> > > > >
> > > > > After sending
> > > > > https://www.postgresql.org/message-id/20240414223305.m3i5eju6zylabvln%40awork3.anarazel.de
> > > > > I looked some more at important areas where changes didn't have code
> > > > > coverage. One thing I noticed was that the "non-internal" part of
> > > > > AbortCurrentTransaction() is uncovered:
> > > > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/transam/xact.c.gcov.html#L3403
> > > > >
> > > > > Which made me try to understand fefd9a3fed2.  I'm a bit confused 
> > > > > about why
> > > > > some parts are handled in 
> > > > > CommitCurrentTransaction()/AbortCurrentTransaction()
> > > > > and others are in the *Internal functions.
> > > > >
> > > > > I understand that fefd9a3fed2 needed to remove the recursion in
> > > > > CommitTransactionCommand()/AbortCurrentTransaction(). But I don't 
> > > > > understand
> > > > > why that means having some code in the non-internal and some in the
> > > > > internal functions?  Wouldn't it be easier to just have all the state 
> > > > > handling
> > > > > code in the Internal() function and just break after the
> > > > > CleanupSubTransaction() calls?
> > > >
> > > > I'm not sure I correctly get what you mean.  Do you think the attached
> > > > patch matches the direction you're pointing?  The patch itself is not
> > > > final, it requires cleanup and comments revision, just to check the
> > > > direction.
> > >
> > > Something like that, yea. The split does seem less confusing that way to 
> > > me,
> > > but also not 100% certain.
> >
> > Thank you for your feedback.  I'm going to go ahead and polish this patch.
>
> I've invested more time into polishing this.  I intend to push
> it.  Could you please take a look first?

Just after sending this, I spotted a typo: s/untill/until/.  The updated
version is attached.

--
Regards,
Alexander Korotkov


v3-0001-Refactoring-for-CommitTransactionCommand-AbortCur.patch
Description: Binary data


Re: Stack overflow issue

2024-04-17 Thread Alexander Korotkov
On Tue, Apr 16, 2024 at 7:42 PM Alexander Korotkov  wrote:
>
> On Tue, Apr 16, 2024 at 6:35 PM Andres Freund  wrote:
> > On 2024-04-16 15:45:42 +0300, Alexander Korotkov wrote:
> > > On Tue, Apr 16, 2024 at 1:48 AM Andres Freund  wrote:
> > > > On 2024-03-06 14:17:23 +0200, Alexander Korotkov wrote:
> > > > > 0001 Turn tail recursion into iteration in CommitTransactionCommand()
> > > > > I did minor revision of comments and code blocks order to improve the
> > > > > readability.
> > > >
> > > > After sending
> > > > https://www.postgresql.org/message-id/20240414223305.m3i5eju6zylabvln%40awork3.anarazel.de
> > > > I looked some more at important areas where changes didn't have code
> > > > coverage. One thing I noticed was that the "non-internal" part of
> > > > AbortCurrentTransaction() is uncovered:
> > > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/transam/xact.c.gcov.html#L3403
> > > >
> > > > Which made me try to understand fefd9a3fed2.  I'm a bit confused about 
> > > > why
> > > > some parts are handled in 
> > > > CommitCurrentTransaction()/AbortCurrentTransaction()
> > > > and others are in the *Internal functions.
> > > >
> > > > I understand that fefd9a3fed2 needed to remove the recursion in
> > > > CommitTransactionCommand()/AbortCurrentTransaction(). But I don't 
> > > > understand
> > > > why that means having some code in the non-internal and some in the
> > > > internal functions?  Wouldn't it be easier to just have all the state 
> > > > handling
> > > > code in the Internal() function and just break after the
> > > > CleanupSubTransaction() calls?
> > >
> > > I'm not sure I correctly get what you mean.  Do you think the attached
> > > patch matches the direction you're pointing?  The patch itself is not
> > > final, it requires cleanup and comments revision, just to check the
> > > direction.
> >
> > Something like that, yea. The split does seem less confusing that way to me,
> > but also not 100% certain.
>
> Thank you for your feedback.  I'm going to go ahead and polish this patch.

I've invested more time into polishing this.  I intend to push
it.  Could you please take a look first?

--
Regards,
Alexander Korotkov


v2-0001-Refactoring-for-CommitTransactionCommand-AbortCur.patch
Description: Binary data


Re: Stack overflow issue

2024-04-16 Thread Alexander Korotkov
On Tue, Apr 16, 2024 at 6:35 PM Andres Freund  wrote:
> On 2024-04-16 15:45:42 +0300, Alexander Korotkov wrote:
> > On Tue, Apr 16, 2024 at 1:48 AM Andres Freund  wrote:
> > > On 2024-03-06 14:17:23 +0200, Alexander Korotkov wrote:
> > > > 0001 Turn tail recursion into iteration in CommitTransactionCommand()
> > > > I did minor revision of comments and code blocks order to improve the
> > > > readability.
> > >
> > > After sending
> > > https://www.postgresql.org/message-id/20240414223305.m3i5eju6zylabvln%40awork3.anarazel.de
> > > I looked some more at important areas where changes didn't have code
> > > coverage. One thing I noticed was that the "non-internal" part of
> > > AbortCurrentTransaction() is uncovered:
> > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/transam/xact.c.gcov.html#L3403
> > >
> > > Which made me try to understand fefd9a3fed2.  I'm a bit confused about why
> > > some parts are handled in 
> > > CommitCurrentTransaction()/AbortCurrentTransaction()
> > > and others are in the *Internal functions.
> > >
> > > I understand that fefd9a3fed2 needed to remove the recursion in
> > > CommitTransactionCommand()/AbortCurrentTransaction(). But I don't 
> > > understand
> > > why that means having some code in the non-internal and some in the
> > > internal functions?  Wouldn't it be easier to just have all the state 
> > > handling
> > > code in the Internal() function and just break after the
> > > CleanupSubTransaction() calls?
> >
> > I'm not sure I correctly get what you mean.  Do you think the attached
> > patch matches the direction you're pointing?  The patch itself is not
> > final, it requires cleanup and comments revision, just to check the
> > direction.
>
> Something like that, yea. The split does seem less confusing that way to me,
> but also not 100% certain.

Thank you for your feedback.  I'm going to go ahead and polish this patch.

--
Regards,
Alexander Korotkov




Re: Stack overflow issue

2024-04-16 Thread Alexander Korotkov
On Tue, Apr 16, 2024 at 1:48 AM Andres Freund  wrote:
> On 2024-03-06 14:17:23 +0200, Alexander Korotkov wrote:
> > 0001 Turn tail recursion into iteration in CommitTransactionCommand()
> > I did minor revision of comments and code blocks order to improve the
> > readability.
>
> After sending
> https://www.postgresql.org/message-id/20240414223305.m3i5eju6zylabvln%40awork3.anarazel.de
> I looked some more at important areas where changes didn't have code
> coverage. One thing I noticed was that the "non-internal" part of
> AbortCurrentTransaction() is uncovered:
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/transam/xact.c.gcov.html#L3403
>
> Which made me try to understand fefd9a3fed2.  I'm a bit confused about why
> some parts are handled in CommitCurrentTransaction()/AbortCurrentTransaction()
> and others are in the *Internal functions.
>
> I understand that fefd9a3fed2 needed to remove the recursion in
> CommitTransactionCommand()/AbortCurrentTransaction(). But I don't understand
> why that means having some code in the non-internal and some in the
> internal functions?  Wouldn't it be easier to just have all the state handling
> code in the Internal() function and just break after the
> CleanupSubTransaction() calls?

I'm not sure I correctly get what you mean.  Do you think the attached
patch matches the direction you're pointing at?  The patch itself is not
final; it requires cleanup and comment revision, and is just meant to
check the direction.

--
Regards,
Alexander Korotkov


abort_commit_transaction.patch
Description: Binary data


Re: Table AM Interface Enhancements

2024-04-16 Thread Alexander Korotkov
On Mon, Apr 15, 2024 at 11:17 PM Andres Freund  wrote:
> On 2024-04-15 16:02:00 -0400, Robert Haas wrote:
> > Do you want that patch applied, not applied, or applied with some set of
> > modifications?
>
> I think we should apply Alexander's proposed revert and then separately
> discuss what we should do about 041b96802ef.

Taking a closer look at acquire_sample_rows(), I think it would be
good if the table AM implementation took care of block-level (or
whatever-level) sampling, so that acquire_sample_rows() just fetches
tuples one by one from the table AM implementation without any care
about blocks.  Possibly, table_beginscan_analyze() could take an argument
with the target number of tuples; then those tuples are just fetched with
table_scan_analyze_next_tuple().  What do you think?
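
A rough sketch of the loop shape I have in mind.  The function
signatures and the TableAnalyzeScanDesc type are hypothetical, not an
existing API:

```c
/* Hypothetical: the table AM decides internally how to sample. */
TableAnalyzeScanDesc scan;

scan = table_beginscan_analyze(onerel, targrows);
while (table_scan_analyze_next_tuple(scan, slot))
{
    /*
     * acquire_sample_rows() just collects the returned tuples, knowing
     * nothing about blocks or the on-disk layout.
     */
}
table_endscan_analyze(scan);
```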

--
Regards,
Alexander Korotkov




Re: Table AM Interface Enhancements

2024-04-16 Thread Alexander Korotkov
On Mon, Apr 15, 2024 at 11:11 PM Andres Freund  wrote:
> On 2024-04-12 01:04:03 +0300, Alexander Korotkov wrote:
> > 1) If we just apply my revert patch and leave c6fc50cb4028 and
> > 041b96802ef in the tree, then we get our table AM API narrowed.  As
> > you expressed the current API requires block numbers to be 1:1 with
> > the actual physical on-disk location [2].  Not a secret I think the
> > current API is quite restrictive.  And we're getting the ANALYZE
> > interface narrower than it was since 737a292b5de.  Frankly speaking, I
> > don't think this is acceptable.
>
> As others already pointed out, c6fc50cb4028 was committed quite a while
> ago. I'm fairly unhappy about c6fc50cb4028, fwiw, but didn't realize that
> until it was too late.

+1

> > In token of all of the above, is the in-tree state that bad? (if we
> > abstract the way 27bc1772fc and dd1f6b0c17 were committed).
>
> To me the 27bc1772fc doesn't make much sense on its own. You added calls
> directly to heapam internals to a file in src/backend/commands/, that just
> doesn't make sense.
>
> Leaving that aside, I think the interface isn't good on its own:
> table_relation_analyze() doesn't actually do anything, it just sets callbacks,
> that then later are called from analyze.c, which doesn't at all fit to the
> name of the callback/function.  I realize that this is kinda cribbed from the
> FDW code, but I don't think that is a particularly good excuse.
>
> I don't think dd1f6b0c17 improves the situation, at all. It sets global
> variables to redirect how an individual acquire_sample_rows invocation
> works:
> void
> block_level_table_analyze(Relation relation,
>   AcquireSampleRowsFunc *func,
>   BlockNumber *totalpages,
>   BufferAccessStrategy 
> bstrategy,
>   ScanAnalyzeNextBlockFunc 
> scan_analyze_next_block_cb,
>   ScanAnalyzeNextTupleFunc 
> scan_analyze_next_tuple_cb)
> {
> *func = acquire_sample_rows;
> *totalpages = RelationGetNumberOfBlocks(relation);
> vac_strategy = bstrategy;
> scan_analyze_next_block = scan_analyze_next_block_cb;
> scan_analyze_next_tuple = scan_analyze_next_tuple_cb;
> }
>
> Notably it does so within the ->relation_analyze tableam callback, which does
> *NOT* not actually do anything other than returning a callback.  So if
> ->relation_analyze() for another relation is called, the acquire_sample_rows()
> for the earlier relation will do something different.  Note that this isn't a
> theoretical risk, acquire_inherited_sample_rows() actually collects the
> acquirefunc for all the inherited relations before calling acquirefunc.

You're right.  No sense trying to fix this.  Reverted.

--
Regards,
Alexander Korotkov




Re: Add SPLIT PARTITION/MERGE PARTITIONS commands

2024-04-13 Thread Alexander Korotkov
Hi, Dmitry!

On Fri, Apr 12, 2024 at 10:59 PM Dmitry Koval  wrote:
>
> Thanks, Alexander!
>
> > Still now we're able to create a partition in the pg_temp schema
> > explicitly.
>
> Attached patches with fix.

Please find my version of this fix attached.  I think we need to
check relpersistence in a similar way to what ATTACH PARTITION and CREATE
TABLE ... PARTITION OF do.  I'm going to polish this a little bit more.

--
Regards,
Alexander Korotkov


v6-0001-Fix-for-SPLIT-MERGE-partitions-of-temporary-table.patch
Description: Binary data


Re: Table AM Interface Enhancements

2024-04-13 Thread Alexander Korotkov
Hi, Melanie!

On Fri, Apr 12, 2024 at 8:48 PM Melanie Plageman
 wrote:
>
> On Thu, Apr 11, 2024 at 6:04 PM Alexander Korotkov  
> wrote:
> >
> > On Fri, Apr 12, 2024 at 12:04 AM Andres Freund  wrote:
> > > On 2024-04-11 20:46:02 +0300, Alexander Korotkov wrote:
> > > > I hope this work is targeting pg18.
> > >
> > > I think anything of the scope discussed by Melanie would be very clearly
> > > targeting 18. For 17, I don't know yet whether we should revert the the
> > > ANALYZE streaming read user (041b96802ef), just do a bit of comment 
> > > polishing,
> > > or some other small change.
> > >
> > > One oddity is that before 041b96802ef, the opportunities for making the
> > > interface cleaner were less apparent, because c6fc50cb4028 increased the
> > > coupling between analyze.c and the way the table storage works.
> >
> > Thank you for pointing this out about c6fc50cb4028, I've missed this.
> >
> > > > Otherwise, do I get this right that this post feature-freeze works on
> > > > designing a new API?  Yes, 27bc1772fc masked the problem.  But it was
> > > > committed on Mar 30.
> > >
> > > Note that there were versions of the patch that were targeting the
> > > pre-27bc1772fc interface.
> >
> > Sure, I've checked this before writing.  It looks quite similar to the
> > result of applying my revert patch [1] to the head.
> >
> > Let me describe my view over the current situation.
> >
> > 1) If we just apply my revert patch and leave c6fc50cb4028 and
> > 041b96802ef in the tree, then we get our table AM API narrowed.  As
> > you expressed the current API requires block numbers to be 1:1 with
> > the actual physical on-disk location [2].  Not a secret I think the
> > current API is quite restrictive.  And we're getting the ANALYZE
> > interface narrower than it was since 737a292b5de.  Frankly speaking, I
> > don't think this is acceptable.
> >
> > 2) Pushing down the read stream and prefetch to heap am is related to
> > difficulties [3], [4].  That's quite a significant piece of work to be
> > done post FF.
>
> I had operated under the assumption that we needed to push the
> streaming read code into heap AM because that is what we did for
> sequential scan, but now that I think about it, I don't see why we
> would have to. Bilal's patch pre-27bc1772fc did not do this. But I
> think the code in acquire_sample_rows() isn't more tied to heap AM
> after 041b96802ef than it was before it. Are you of the opinion that
> the code with 041b96802ef ties acquire_sample_rows() more closely to
> heap format?

Yes, I think so.  The table AM API deals with TIDs and block numbers,
but doesn't dictate what they actually mean.  For example, in ZedStore
[1], data is stored in per-column B-trees, where the TID used in the
table AM is just a logical key into those B-trees.  Similarly, a block
number there denotes a range of those keys rather than a physical
block.

c6fc50cb4028 and 041b96802ef bake into acquire_sample_rows() the
assumption that we are sampling physical blocks as they are stored in
the data files.  Block numbers can no longer be "logical" values whose
meaning only the table AM implementation knows.  That was pointed out
by Andres [2].  I'm not sure whether ZedStore is still alive, but
there could be other table AM implementations like this, or others in
development.  Anyway, I don't feel good about narrowing an API that
has been there since pg12.

Links.
1. 
https://www.pgcon.org/events/pgcon_2020/sessions/session/44/slides/13/Zedstore-PGCon2020-Virtual.pdf
2. 
https://www.postgresql.org/message-id/20240410212117.mxsldz2w6htrl36v%40awork3.anarazel.de

--
Regards,
Alexander Korotkov




Re: post-freeze damage control

2024-04-12 Thread Alexander Korotkov
On Wed, Apr 10, 2024 at 5:27 PM Robert Haas  wrote:
>
> On Mon, Apr 8, 2024 at 10:12 PM Tom Lane  wrote:
> > I have another one that I'm not terribly happy about:
> >
> > Author: Alexander Korotkov 
> > Branch: master [72bd38cc9] 2024-04-08 01:27:52 +0300
> >
> > Transform OR clauses to ANY expression
>
> I realize that this has been reverted now, but what's really
> frustrating about this case is that I reviewed this patch before and
> gave feedback similar to some of the feedback you gave, and it just
> didn't matter, and the patch was committed anyway.
>
> > I don't know that I'd call it scary exactly, but I do think it
> > was premature.  A week ago there was no consensus that it was
> > ready to commit, but Alexander pushed it (or half of it, anyway)
> > despite that.  A few concrete concerns:
> >
> > * Yet another planner GUC.  Do we really need or want that?
>
> IMHO, no, and I said so in
> https://www.postgresql.org/message-id/CA%2BTgmob%3DebuCHFSw327b55DJzE3JtOuZ5owxob%2BMgErb4me_Ag%40mail.gmail.com
>
> > * What the medical community would call off-label usage of
> > query jumbling.  I'm not sure this is even correct as-used,
> > and for sure it's using that code for something never intended.
> > Nor is the added code adequately (as in, at all) documented.
>
> And I raised this point here:
> https://www.postgresql.org/message-id/CA%2BTgmoZCgP6FrBQEusn4yaWm02XU8OPeoEMk91q7PRBgwaAkFw%40mail.gmail.com
>
> > * Patch refuses to group anything but Consts into the SAOP
> > transformation.  I realize that if you want to produce an
> > array Const you need Const inputs, but I wonder why it
> > wasn't considered to produce an ARRAY[] construct if there
> > are available clauses with pseudo-constant (eg Param)
> > comparison values.
> >
> > * I really, really dislike jamming this logic into prepqual.c,
> > where it has no business being.  I note that it was shoved
> > into process_duplicate_ors without even the courtesy of
> > expanding the header comment:
> >
> >  * process_duplicate_ors
> >  *Given a list of exprs which are ORed together, try to apply
> >  *the inverse OR distributive law.
> >
> > Another reason to think this wasn't a very well chosen place is
> > that the file's list of #include's went from 4 entries to 11.
> > Somebody should have twigged to the idea that this was off-topic
> > for prepqual.c.
>
> All of this seems like it might be related to my comments in the above
> email about the transformation being done too early.
>
> > * OrClauseGroupKey is not a Node type, so why does it have
> > a NodeTag?  I wonder what value will appear in that field,
> > and what will happen if the struct is passed to any code
> > that expects real Nodes.
>
> I don't think I raised this issue.
>
> > I could probably find some other nits if I spent more time
> > on it, but I think these are sufficient to show that this
> > was not commit-ready.
>
> Just imagine if someone had taken time to give similar feedback before
> the commit.

FWIW, my conclusion is that it isn't worth committing stuff like this
without explicit consent from Tom, just as it isn't worth committing
table AM changes without explicit consent from Andres.  And it isn't
worth postponing large features to the last CF (it's better to
postpone them to the next release instead).

--
Regards,
Alexander Korotkov




Re: Add SPLIT PARTITION/MERGE PARTITIONS commands

2024-04-11 Thread Alexander Korotkov
On Thu, Apr 11, 2024 at 8:00 PM Alexander Lakhin  wrote:
> 11.04.2024 16:27, Dmitry Koval wrote:
> >
> > Added correction (and test), see 
> > v3-0001-Fix-for-SPLIT-MERGE-partitions-of-temporary-table.patch.
> >
>
> Thank you for the correction, but may be an attempt to merge into implicit
> pg_temp should fail just like CREATE TABLE ... PARTITION OF ... does?
>
> Please look also at another anomaly with schemas:
> CREATE SCHEMA s1;
> CREATE TABLE t (i int) PARTITION BY RANGE (i);
> CREATE TABLE tp_0_2 PARTITION OF t
>FOR VALUES FROM (0) TO (2);
> ALTER TABLE t SPLIT PARTITION tp_0_2 INTO
>(PARTITION s1.tp0 FOR VALUES FROM (0) TO (1), PARTITION s1.tp1 FOR VALUES 
> FROM (1) TO (2));
> results in:
> \d+ s1.*
> Did not find any relation named "s1.*"
> \d+ tp*
>Table "public.tp0"
> ...
> Table "public.tp1"

+1
I think we shouldn't unconditionally copy the schema name and
relpersistence from the parent table.  Instead, we should throw an
error on a mismatch, like CREATE TABLE ... PARTITION OF does.  I'm
working on revising this fix.
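To illustrate the intended behavior (a sketch, not the committed
semantics): either the explicitly specified schema is honored for the
new partitions, or the command raises an error, instead of the
partitions silently landing in the parent's schema:

```sql
CREATE SCHEMA s1;
CREATE TABLE t (i int) PARTITION BY RANGE (i);
CREATE TABLE tp_0_2 PARTITION OF t FOR VALUES FROM (0) TO (2);

ALTER TABLE t SPLIT PARTITION tp_0_2 INTO
  (PARTITION s1.tp0 FOR VALUES FROM (0) TO (1),
   PARTITION s1.tp1 FOR VALUES FROM (1) TO (2));

-- With the fix, \d+ s1.* should list s1.tp0 and s1.tp1, rather than
-- the partitions silently ending up as public.tp0 and public.tp1.
```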

--
Regards,
Alexander Korotkov




Re: Table AM Interface Enhancements

2024-04-11 Thread Alexander Korotkov
On Thu, Apr 11, 2024 at 11:27 PM Robert Haas  wrote:
> On Thu, Apr 11, 2024 at 1:46 PM Alexander Korotkov  
> wrote:
> > I understand that I'm the bad guy of this release, not sure if my
> > opinion counts.
> >
> > But what is going on here?  I hope this work is targeting pg18.
> > Otherwise, do I get this right that this post feature-freeze works on
> > designing a new API?  Yes, 27bc1772fc masked the problem.  But it was
> > committed on Mar 30.  So that couldn't justify why the proper API
> > wasn't designed in time.  Are we judging different commits with the
> > same criteria?
>
> I mean, Andres already said that the cleanup was needed possibly in
> 17, and possibly in 18.
>
> As far as fairness is concerned, you'll get no argument from me if you
> say the streaming read stuff was all committed far later than it
> should have been. I said that in the very first email I wrote on the
> "post-feature freeze cleanup" thread. But if you're going to argue
> that there's no opportunity for anyone to adjust patches that were
> sideswiped by the reverts of your patches, and that if any such
> adjustments seem advisable we should just revert the sideswiped
> patches entirely, I don't agree with that, and I don't see why anyone
> would agree with that. I think it's fine to have the discussion, and
> if the result of that discussion is that somebody says "hey, we want
> to do X in 17 for reason Y," then we can discuss that proposal on its
> merits, taking into account the answers to questions like "why wasn't
> this done before the freeze?" and "is that adjustment more or less
> risky than just reverting?" and "how about we just leave it alone for
> now and deal with it next release?".

I don't think 041b96802e was sideswiped by 27bc1772fc.  The "Use
streaming I/O in ANALYZE" patch had the same issue before 27bc1772fc,
which was committed on Mar 30.  So, in the worst case, 27bc1772fc cost
a week of work.  I can imagine that without 27bc1772fc, a new API
could have been proposed days before FF.  This means I saved the patch
authors from what, in my case, you call a "desperate rush".  Huh!

> > IMHO, 041b96802e should be just reverted.
>
> IMHO, it's too early to decide that, because we don't know what change
> concretely is going to be proposed, and there has been no discussion
> of why that change, whatever it is, belongs in this release or next
> release.
>
> I understand that you're probably not feeling great about being asked
> to revert a bunch of stuff here, and I do think it is a fair point to
> make that we need to be even-handed and not overreact. Just because
> you had some patches that had some problems doesn't mean that
> everything that got touched by the reverts can or should be whacked
> around a whole bunch more post-freeze, especially since that stuff was
> *also* committed very late, in haste, way closer to feature freeze
> than it should have been. At the same time, it's also important to
> keep in mind that our goal here is not to punish people for being bad,
> or to reward them for being good, or really to make any moral
> judgements at all, but to produce a quality release. I'm sure that,
> where possible, you'd prefer to fix bugs in a patch you committed
> rather than revert the whole thing as soon as anyone finds any
> problem. I would also prefer that, both for your patches, and for
> mine. And everyone else deserves that same consideration.

I expressed my thoughts about producing a better release without a
desperate rush post-FF in my reply to Andres [2].

Links.
1. 
https://www.postgresql.org/message-id/CA%2BTgmobZUnJQaaGkuoeo22Sydf9%3DmX864W11yZKd6sv-53-aEQ%40mail.gmail.com
2. 
https://www.postgresql.org/message-id/CAPpHfdt%2BcCj6j6cR5AyBThP6SyDf6wxAz4dU-0NdXjfpiFca7Q%40mail.gmail.com

--
Regards,
Alexander Korotkov




Re: Table AM Interface Enhancements

2024-04-11 Thread Alexander Korotkov
On Fri, Apr 12, 2024 at 12:04 AM Andres Freund  wrote:
> On 2024-04-11 20:46:02 +0300, Alexander Korotkov wrote:
> > I hope this work is targeting pg18.
>
> I think anything of the scope discussed by Melanie would be very clearly
> targeting 18. For 17, I don't know yet whether we should revert the the
> ANALYZE streaming read user (041b96802ef), just do a bit of comment polishing,
> or some other small change.
>
> One oddity is that before 041b96802ef, the opportunities for making the
> interface cleaner were less apparent, because c6fc50cb4028 increased the
> coupling between analyze.c and the way the table storage works.

Thank you for pointing out c6fc50cb4028; I'd missed that.

> > Otherwise, do I get this right that this post feature-freeze works on
> > designing a new API?  Yes, 27bc1772fc masked the problem.  But it was
> > committed on Mar 30.
>
> Note that there were versions of the patch that were targeting the
> pre-27bc1772fc interface.

Sure, I checked this before writing.  It looks quite similar to the
result of applying my revert patch [1] to head.

Let me describe my view of the current situation.

1) If we just apply my revert patch and leave c6fc50cb4028 and
041b96802ef in the tree, then our table AM API gets narrowed.  As you
noted, the current API requires block numbers to be 1:1 with the
actual physical on-disk location [2].  It's no secret that I think the
current API is quite restrictive.  And we'd be making the ANALYZE
interface narrower than it has been since 737a292b5de.  Frankly
speaking, I don't think this is acceptable.

2) Pushing the read stream and prefetching down into the heap AM runs
into the difficulties described in [3] and [4].  That's quite a
significant piece of work to be done post-FF.

Given all of the above, is the in-tree state really that bad (if we
set aside the way 27bc1772fc and dd1f6b0c17 were committed)?

The in-tree state provides quite a general API for ANALYZE, supporting
even non-block storage.  There is a way for table AMs whose block
numbers are 1:1 with the actual physical on-disk location to reuse the
existing acquire_sample_rows().  It requires some cleanup of comments
and docs, but it does not require us to redesign the API post-FF.

Links.
1. 
https://www.postgresql.org/message-id/CAPpHfdvuT6DnguzaV-M1UQ2whYGDojaNU%3D-%3DiHc0A7qo9HBEJw%40mail.gmail.com
2. 
https://www.postgresql.org/message-id/20240410212117.mxsldz2w6htrl36v%40awork3.anarazel.de
3. 
https://www.postgresql.org/message-id/CAAKRu_ZxU6hucckrT1SOJxKfyN7q-K4KU1y62GhDwLBZWG%2BROg%40mail.gmail.com
4. 
https://www.postgresql.org/message-id/CAAKRu_YkphAPNbBR2jcLqnxGhDEWTKhYfLFY%3D0R_oG5LHBH7Gw%40mail.gmail.com

--
Regards,
Alexander Korotkov




Re: Table AM Interface Enhancements

2024-04-11 Thread Alexander Korotkov
Hi!

On Thu, Apr 11, 2024 at 7:19 PM Melanie Plageman
 wrote:
> On Wed, Apr 10, 2024 at 5:21 PM Andres Freund  wrote:
> > On 2024-04-10 16:50:44 -0400, Melanie Plageman wrote:
> > > This brings up a question about the prefetching. We never had to have
> > > this discussion for sequential scan streaming read because it didn't
> > > (and still doesn't) do prefetching. But, if we push the streaming read
> > > code down into the heap AM layer, it will be doing the prefetching.
> > > So, do we remove the prefetching from acquire_sample_rows() and expect
> > > other table AMs to implement it themselves or use the streaming read
> > > API?
> >
> > The prefetching added to acquire_sample_rows was quite narrowly tailored to
> > something heap-like - it pretty much required that block numbers to be 1:1
> > with the actual physical on-disk location for the specific AM.  So I think
> > it's pretty much required for this to be pushed down.
> >
> > Using a read stream is a few lines for something like this, so I'm not 
> > worried
> > about it. I guess we could have a default implementation for block based 
> > AMs,
> > similar what we have around table_block_parallelscan_*, but not sure it's
> > worth doing that, the complexity is much lower than in the
> > table_block_parallelscan_ case.
>
> This makes sense.
>
> I am working on pushing streaming ANALYZE into heap AM code, and I ran
> into a few roadblocks.
>
> If we want ANALYZE to make the ReadStream object in heap_beginscan()
> (like the read stream implementation of heap sequential and TID range
> scans do), I don't see any way around changing the scan_begin table AM
> callback to take a BufferAccessStrategy at the least (and perhaps also
> the BlockSamplerData).
>
> read_stream_begin_relation() doesn't just save the
> BufferAccessStrategy in the ReadStream, it uses it to set various
> other things in the ReadStream object. callback_private_data (which in
> ANALYZE's case is the BlockSamplerData) is simply saved in the
> ReadStream, so it could be set later, but that doesn't sound very
> clean to me.
>
> As such, it seems like a cleaner alternative would be to add a table
> AM callback for creating a read stream object that takes the
> parameters of read_stream_begin_relation(). But, perhaps it is a bit
> late for such additions.
>
> It also opens us up to the question of whether or not sequential scan
> should use such a callback instead of making the read stream object in
> heap_beginscan().
>
> I am happy to write a patch that does any of the above. But, I want to
> raise these questions, because perhaps I am simply missing an obvious
> alternative solution.

I understand that I'm the bad guy of this release; I'm not sure if my
opinion counts.

But what is going on here?  I hope this work is targeting pg18.
Otherwise, do I get this right that this post-feature-freeze work is
about designing a new API?  Yes, 27bc1772fc masked the problem.  But
it was committed on Mar 30, so that can't justify why the proper API
wasn't designed in time.  Are we judging different commits by the same
criteria?

IMHO, 041b96802e should be just reverted.

--
Regards,
Alexander Korotkov




Re: Table AM Interface Enhancements

2024-04-11 Thread Alexander Korotkov
On Thu, Apr 11, 2024 at 8:11 PM Jeff Davis  wrote:
> On Wed, 2024-04-10 at 15:19 +0300, Alexander Korotkov wrote:
> > 1) 9bd99f4c26 comprises the reworked patch after working with notes
> > from Jeff Davis.  I agree it would be better to wait for him to
> > express explicit agreement.  Before reverting this, I would prefer to
> > hear his opinion.
>
> On this particular feature, I had tried it in the past myself, and
> there were a number of minor frustrations and I left it unfinished. I
> quickly recognized that commit c95c25f9af was too simple to work.
>
> Commit 9bd99f4c26 looked substantially better, but I was surprised to
> see it committed so soon after the redesign. I thought a revert was
> likely outcome, but I put it on my list of things to review more deeply
> in the next couple weeks so I could give productive feedback.

Thank you for your feedback, Jeff.

> It would benefit from more discussion in v18, and I apologize for not
> getting involved earlier when the patch still could have made it into
> v17.

I believe you have nothing to apologize for.  It's definitely not your
fault that I committed this patch in the shape it was in.

--
Regards,
Alexander Korotkov




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-04-11 Thread Alexander Korotkov
Hi, Heikki!

Thank you for your interest in the subject.

On Thu, Apr 11, 2024 at 1:46 AM Heikki Linnakangas  wrote:
> On 07/04/2024 00:52, Alexander Korotkov wrote:
> > On Fri, Apr 5, 2024 at 9:15 PM Alvaro Herrera  
> > wrote:
> >> I'm still concerned that WaitLSNCleanup is only called in ProcKill.
> >> Does this mean that if a process throws an error while waiting, it'll
> >> not get cleaned up until it exits?  Maybe this is not a big deal, but it
> >> seems odd.
> >
> > I've added WaitLSNCleanup() to the AbortTransaction().  Just pushed
> > that together with the improvements upthread.
>
> Race condition:
>
> 1. backend: pg_wal_replay_wait('0/1234') is called. It calls WaitForLSN
> 2. backend: WaitForLSN calls addLSNWaiter('0/1234'). It adds the backend
> process to the LSN heap and returns
> 3. replay:  rm_redo record '0/1234'
> 4. backend: WaitForLSN enters for-loop, calls GetXLogReplayRecPtr()
> 5. backend: current replay LSN location is '0/1234', so we exit the loop
> 6. replay: calls WaitLSNSetLatches()
>
> In a nutshell, it's possible for the loop in WaitForLSN to exit without
> cleaning up the process from the heap. I was able to hit that by adding
> a delay after the addLSNWaiter() call:
>
> > TRAP: failed Assert("!procInfo->inHeap"), File: 
> > "../src/backend/commands/waitlsn.c", Line: 114, PID: 1936152
> > postgres: heikki postgres [local] 
> > CALL(ExceptionalCondition+0xab)[0x55da1f68787b]
> > postgres: heikki postgres [local] CALL(+0x331ec8)[0x55da1f204ec8]
> > postgres: heikki postgres [local] CALL(WaitForLSN+0x139)[0x55da1f2052cc]
> > postgres: heikki postgres [local] 
> > CALL(pg_wal_replay_wait+0x18b)[0x55da1f2056e5]
> > postgres: heikki postgres [local] 
> > CALL(ExecuteCallStmt+0x46e)[0x55da1f18031a]
> > postgres: heikki postgres [local] 
> > CALL(standard_ProcessUtility+0x8cf)[0x55da1f4b26c9]
>
> I think there's a similar race condition if the timeout is reached at
> the same time that the startup process wakes up the process.

Thank you for catching this.  I think WaitForLSN() just needs to call
deleteLSNWaiter() unconditionally after exiting the loop.

> >* At first, we check that pg_wal_replay_wait() is called in a 
> > non-atomic
> >* context.  That is, a procedure call isn't wrapped into a 
> > transaction,
> >* another procedure call, or a function call.
> >*
>
> It's pretty unfortunate to have all these restrictions. It would be nice
> to do:
>
> select pg_wal_replay_wait('0/1234'); select * from foo;

This works for me, except that it needs "call" not "select".

# call pg_wal_replay_wait('0/1234'); select * from foo;
CALL
 i
---
(0 rows)
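Conversely, given the atomic-context restriction quoted above, a call
wrapped in a transaction block is expected to fail.  This is an
illustrative sketch; the exact error wording is a guess:

```sql
BEGIN;
CALL pg_wal_replay_wait('0/1234');
-- ERROR (hypothetical wording):  pg_wal_replay_wait() must be only
-- called in non-atomic context
ROLLBACK;
```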

> in a single multi-query call, to avoid the round-trip to the client. You
> can avoid it with libpq or protocol level pipelining, too, but it's more
> complicated.
>
> >* Secondly, according to PlannedStmtRequiresSnapshot(), even in an 
> > atomic
> >* context, CallStmt is processed with a snapshot.  Thankfully, we 
> > can pop
> >* this snapshot, because PortalRunUtility() can tolerate this.
>
> This assumption that PortalRunUtility() can tolerate us popping the
> snapshot sounds very fishy. I haven't looked at what's going on there,
> but doesn't sound like a great assumption.

Here is what the comment in PortalRunUtility() says about this:

/*
 * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
 * under us, so don't complain if it's now empty.  Otherwise, our snapshot
 * should be the top one; pop it.  Note that this could be a different
 * snapshot from the one we made above; see EnsurePortalSnapshotExists.
 */

So, if VACUUM pops a snapshot when it needs to run without one, then
it's probably OK for other utilities too.  But I agree this decision
needs some consensus.

> If recovery ends while a process is waiting for an LSN to arrive, does
> it ever get woken up?

If the recovery target action is promote, the user will get an error.
If it is shutdown, the connection will be terminated.  If it is pause,
waiting will continue during the pause.  I'm not sure about the last
case.

> The docs could use some-copy-editing, but just to point out one issue:
>
> > There are also procedures to control the progress of recovery.
>
> That's copy-pasted from an earlier sentence at the table that lists
> functions like pg_promote(), pg_wal_replay_pause(), and
> pg_is_wal_replay_paused(). The pg_wal_replay_wait() doesn't control the
> progress of recovery like those functions do, it only causes the calling
> backend to wait.
>
> Overall, this feature doesn't feel quite ready for v17, and IMHO should
> be reverted. It's a nice feature, so I'd love to have it fixed and
> reviewed early in the v18 cycle.

Thank you for your review.  I've reverted this and will repost it
early in the v18 cycle.

--
Regards,
Alexander Korotkov




Re: Add SPLIT PARTITION/MERGE PARTITIONS commands

2024-04-11 Thread Alexander Korotkov
Hi, Dmitry!

On Thu, Apr 11, 2024 at 4:27 PM Dmitry Koval  wrote:
> 1.
> Alexander Lakhin sent a question about index name after MERGE (partition
> name is the same as one of the merged partitions):
>
> start of quote
> I'm also confused by an index name after MERGE:
> CREATE TABLE t (i int) PARTITION BY RANGE (i);
>
> CREATE TABLE tp_0_1 PARTITION OF t FOR VALUES FROM (0) TO (1);
> CREATE TABLE tp_1_2 PARTITION OF t FOR VALUES FROM (1) TO (2);
>
> CREATE INDEX tidx ON t(i);
> ALTER TABLE t MERGE PARTITIONS (tp_1_2, tp_0_1) INTO tp_1_2;
> \d+ t*
>
>   Table "public.tp_1_2"
>   Column |  Type   | Collation | Nullable | Default | Storage |
> Compression | Stats target | Description
> +-+---+--+-+-+-+--+-
>   i  | integer |   |  | | plain   |
> |  |
> Partition of: t FOR VALUES FROM (0) TO (2)
> Partition constraint: ((i IS NOT NULL) AND (i >= 0) AND (i < 2))
> Indexes:
>  "merge-16385-3A14B2-tmp_i_idx" btree (i)
>
> Is the name "merge-16385-3A14B2-tmp_i_idx" valid or it's something
> temporary?
> end of quote
>
> Fix for this case added to file
> v3-0001-Fix-for-SPLIT-MERGE-partitions-of-temporary-table.patch.
>
> 
>
> 2.
>  >It seems to me that v2-0001-Fix-for-SPLIT-MERGE-partitions-of-
>  >temporary-table.patch is not complete either.
>
> Added correction (and test), see
> v3-0001-Fix-for-SPLIT-MERGE-partitions-of-temporary-table.patch.

Thank you, I'll review this later today.

--
Regards,
Alexander Korotkov




Re: Table AM Interface Enhancements

2024-04-11 Thread Alexander Korotkov
Hi Andres,

On Wed, Apr 10, 2024 at 7:52 PM Andres Freund  wrote:
> On 2024-04-08 14:54:46 -0400, Robert Haas wrote:
> > Exactly how much is getting reverted here? I see these, all since March 
> > 23rd:
>
> IMO:
>
>
> > dd1f6b0c17 Provide a way block-level table AMs could re-use
> > acquire_sample_rows()
>
> Should be reverted.
>
>
> > 9bd99f4c26 Custom reloptions for table AM
>
> Hm. There are some oddities here:
>
> - It doesn't seem great that relcache.c now needs to know about the default
>   values for all kinds of reloptions.
>
> - why is there table_reloptions() and tableam_reloptions()?
>
> - Why does extractRelOptions() need a TableAmRoutine parameter, extracted by a
>   caller, instead of doing that work itself?
>
>
>
> > 97ce821e3e Fix the parameters order for
> > TableAmRoutine.relation_copy_for_cluster()
>
> Shouldn't be, this is a clear fix.
>
>
> > b1484a3f19 Let table AM insertion methods control index insertion
>
> I'm not sure. I'm not convinced this is right, nor the opposite. If the
> tableam takes control of index insertion, shouldn't nodeModifyTuple know this
> earlier, so it doesn't prepare a bunch of index insertion state?  Also,
> there's pretty much no motivating explanation in the commit.
>
>
> > 27bc1772fc Generalize relation analyze in table AM interface
>
> Should be reverted.
>
>
> > 87985cc925 Allow locking updated tuples in tuple_update() and tuple_delete()
>
> Strongly suspect this should be reverted. The last time this was committed it
> was far from ready. It's very easy to cause corruption due to subtle bugs in
> this area.
>
>
> > c35a3fb5e0 Allow table AM tuple_insert() method to return the different slot
>
> If the AM returns a different slot, who is responsible for cleaning it up? And
> how is creating a new slot for every insert not going to be a measurable
> overhead?
>
>
> > 02eb07ea89 Allow table AM to store complex data structures in rd_amcache
>
> I am doubtful this is right.  Is it really sufficient to have a callback for
> freeing? What happens when relcache entries are swapped as part of a rebuild?
> That works for "flat" caches, but I don't immediately see how it works for
> more complicated datastructures.  At least from the commit message it's hard
> to evaluate how this actually intended to be used.

Thank you for your feedback.  I've reverted all of the above.

--
Regards,
Alexander Korotkov




Re: Table AM Interface Enhancements

2024-04-10 Thread Alexander Korotkov
On Wed, Apr 10, 2024 at 4:19 PM Robert Haas  wrote:
>
> On Wed, Apr 10, 2024 at 8:20 AM Alexander Korotkov  
> wrote:
> > I agree with the facts.  But I have a different interpretation on
> > this.  The patch was committed as 11470f544e on March 23, 2023, then
> > reverted on April 3.  I've proposed the revised version, but Andres
> > complained that this is the new API design days before FF.
>
> Well, his first complaint that your committed patch was full of bugs:
>
> https://www.postgresql.org/message-id/20230323003003.plgaxjqahjgkuxrk%40awork3.anarazel.de
>
> When you commit a patch and another committer writes a post-commit
> review saying that your patch has so many serious problems that he
> gave up on reviewing before enumerating all of them, that's a really
> bad sign. That should be a strong signal to you to step back and take
> a close look at whether you really understand the area of the code
> that you're touching well enough to be doing whatever it is that
> you're doing. If I got a review like that, I would have reverted the
> patch instantly, given up for the release cycle, possibly given up on
> the patch permanently, and most definitely not tried again to commit
> unless I was absolutely certain that I'd learned a lot in the meantime
> *and* had the agreement of the committer who wrote that review (or
> maybe some other committer who was acknowledged as an expert in that
> area of the code).
>
> What you did instead is try to do a bunch of post-commit fixup in a
> desperate rush right before feature freeze, to which Andres
> understandably objected. But that was your second mistake, not your
> first one.
>
> > Then the
> > patch with this design was published in the thread for the year with
> > periodical rebases.  So, I think I expressed my intention with that
> > design before 2023 FF, nobody prevented me from expressing objections
> > or other feedback during the year.  Then I realized that 2024 FF is
> > approaching and decided to give this another try for pg18.
>
> This doesn't seem to match the facts as I understand them. It appears
> to me that there was no activity on the thread from April until
> November. The message in November was not written by you. Your first
> post to the thread after April of 2023 was on March 19, 2024. Five
> days later you said you wanted to commit. That doesn't look to me like
> you worked diligently on the patch set throughout the year and other
> people had reasonable notice that you planned to get the work done
> this cycle. It looks like you ignored the patch for 11 months and then
> committed it without any real further feedback from anyone. True,
> Pavel did post and say that he thought the patches were in good shape.
> But you could hardly take that as evidence that Andres was now content
> that the problems he'd raised earlier had been fixed, because (a)
> Pavel had also been involved beforehand and had not raised the
> concerns that Andres later raised and (b) Pavel wrote nothing in his
> email specifically about why he thought your changes or his had
> resolved those concerns. I certainly agree that Andres doesn't always
> give as much review feedback as I'd like to have from him in, and it's
> also true that he doesn't always give that feedback as quickly as I'd
> like to have it ... but you know what?
>
> It's not Andres's job to make sure my patches are not broken. It's my
> job. That applies to the patches I write, and the patches written by
> other people that I commit. If I commit something and it turns out
> that it is broken, that's my bad. If I commit something and it turns
> out that it does not have consensus, that is also my bad. It is not
> the fault of the other people for not helping me get my patches to a
> state where they are up to project standard. It is my fault, and my
> fault alone, for committing something that was not ready. Now that
> does not mean that it isn't frustrating when I can't get the help I
> need. It is extremely frustrating. But the solution is not to commit
> anyway and then blame the other people for not providing feedback.
>
> I mean, committing without explicit agreement from someone else is OK
> if you're pretty sure that you've got everything sorted out correctly.
> But I don't think that the paper trail here supports the narrative
> that you worked on this diligently throughout the year and had every
> reason to believe it would be acceptable to the community. If I'd
> looked at this thread, I would have concluded that you'd abandoned the
> project. I would have expected that, when you picked it up again,
> there would be a series of emails over a period of time carefully
> working through the various issues that had been raised, invi

Re: Table AM Interface Enhancements

2024-04-10 Thread Alexander Korotkov
On Mon, Apr 8, 2024 at 9:54 PM Robert Haas  wrote:
> On Mon, Apr 8, 2024 at 12:33 PM Alexander Korotkov  
> wrote:
> > Yes, it was my mistake. I got rushing trying to fit this to FF, even doing 
> > significant changes just before commit.
> > I'll revert this later today.

The patch to revert is attached.  Given that the revert touches the
work done in 041b96802e, I think it needs some feedback before
pushing.

> Alexander,
>
> Exactly how much is getting reverted here? I see these, all since March 23rd:
>
> dd1f6b0c17 Provide a way block-level table AMs could re-use
> acquire_sample_rows()
> 9bd99f4c26 Custom reloptions for table AM
> 97ce821e3e Fix the parameters order for
> TableAmRoutine.relation_copy_for_cluster()
> 867cc7b6dd Revert "Custom reloptions for table AM"
> b1484a3f19 Let table AM insertion methods control index insertion
> c95c25f9af Custom reloptions for table AM
> 27bc1772fc Generalize relation analyze in table AM interface
> 87985cc925 Allow locking updated tuples in tuple_update() and tuple_delete()
> c35a3fb5e0 Allow table AM tuple_insert() method to return the different slot
> 02eb07ea89 Allow table AM to store complex data structures in rd_amcache
>
> I'm not really feeling very good about all of this, because:
>
> - 87985cc925 was previously committed as 11470f544e on March 23, 2023,
> and almost immediately reverted. Now you tried again on March 26,
> 2024. I know there was a bunch of rework in the middle, but there are
> times in the year that things can be committed other than right before
> the feature freeze. Like, don't wait a whole year for the next attempt
> and then again do it right before the cutoff.

I agree with the facts.  But I have a different interpretation of
them.  The patch was committed as 11470f544e on March 23, 2023, then
reverted on April 3.  I proposed the revised version, but Andres
complained that it was a new API design just days before FF.  The
patch with this design was then published in the thread for a year
with periodic rebases.  So, I think I expressed my intention with that
design before the 2023 FF, and nothing prevented anyone from expressing
objections or other feedback during the year.  Then I realized that
the 2024 FF was approaching and decided to give this another try for pg18.

But I don't yet see what's wrong with this patch.  I waited a year
for feedback.  I waited 2 days after saying "I will push this if no
objections".  Given your feedback now, I understand that it would have
been better to make another attempt to commit this earlier.

I admit my mistake with dd1f6b0c17.  I got rushed trying to fix
things, which actually made them worse.  I apologize for this.  But if
I'm forced to revert 87985cc925 without even hearing any reasonable
criticism besides the imperfect timing, it feels like punishment for
my mistake with dd1f6b0c17.  Pretty unreasonable punishment, in my
view.

> - The Discussion links in the commit messages do not seem to stand for
> the proposition that these particular patches ought to be committed in
> this form. Some of them are just links to the messages where the patch
> was originally posted, which is probably not against policy or
> anything, but it'd be nicer to see links to versions of the patch with
> which people are, in nearby emails, agreeing. Even worse, some of
> these are links to emails where somebody said, "hey, some earlier
> commit does not look good." In particular,
> dd1f6b0c172a643a73d6b71259fa2d10378b39eb has a discussion link where
> Andres complains about 27bc1772fc814946918a5ac8ccb9b5c5ad0380aa, but
> it's not clear how that justifies the new commit.

I have to repeat that I admit my mistake with dd1f6b0c17, apologize
for it, and have drawn my own conclusions so as not to repeat it.  But
dd1f6b0c17 seems to be the only one of these commits that links to a
message containing complaints.  I went through the list of commits
above; it seems the others just link to the first message of the
thread.  Probably there is a lack of consensus for some of them.  But
I have never heard of a policy requiring links not just to the start
of the discussion but also to the exact messages expressing agreement.
And I haven't seen others doing that.

> - The commit message for 867cc7b6dd says "This reverts commit
> c95c25f9af4bc77f2f66a587735c50da08c12b37 due to multiple design issues
> spotted after commit." That's not a very good justification for then
> trying again 6 days later with 9bd99f4c26, and it's *definitely* not a
> good justification for there being no meaningful discussion links in
> the commit message for 9bd99f4c26. They're just the same links you had
> in the previous attempt, so it's pretty hard for anybody to understand
> what got fixed and whether all of the concerns were really addressed.
> Just looking over the commit, it's pretty hard to understand what is
> b

Re: Add SPLIT PARTITION/MERGE PARTITIONS commands

2024-04-09 Thread Alexander Korotkov
On Mon, Apr 8, 2024 at 11:43 PM Dmitry Koval  wrote:
> Attached fix for the problems found by Alexander Lakhin.
>
> About grammar errors.
> Unfortunately, I don't know English well.
> Therefore, I plan (in the coming days) to show the text to specialists
> who perform technical translation of documentation.

Thank you.  I've pushed this fix with minor corrections from me.

--
Regards,
Alexander Korotkov




Re: post-freeze damage control

2024-04-09 Thread Alexander Korotkov
On Tue, Apr 9, 2024 at 11:42 PM Tom Lane  wrote:
> Alexander Korotkov  writes:
> > On Tue, Apr 9, 2024 at 7:27 PM Tom Lane  wrote:
> >> Yeah, that's one of the reasons I'm dubious that the committed
> >> patch was ready.
>
> > While inventing this GUC, I was thinking more about avoiding
> > regressions rather than about unleashing the full power of this
> > optimization.  But now I see that that wasn't good enough.  And it was
> > definitely hasty to commit to this shape.  I apologize for this.
>
> > Tom, I think you are way more experienced in this codebase than me.
> > And, probably more importantly, more experienced in making decisions
> > for planner development.  If you see some way forward to polish this
> > post-commit, Andrei and I are ready to work hard on this with you.  If
> > you don't see (or don't think that's good), let's revert this.
>
> It wasn't ready to commit, and I think trying to fix it up post
> feature freeze isn't appropriate project management.  Let's revert
> it and work on it more in the v18 time frame.

Ok, let's do this.  I'd like to hear from you any directions for the
further development of this patch, if possible.

--
Regards,
Alexander Korotkov




Re: post-freeze damage control

2024-04-09 Thread Alexander Korotkov
On Tue, Apr 9, 2024 at 11:37 PM Tom Lane  wrote:
> Alexander Korotkov  writes:
> > On Tue, Apr 9, 2024 at 5:12 AM Tom Lane  wrote:
> >> * OrClauseGroupKey is not a Node type, so why does it have
> >> a NodeTag?  I wonder what value will appear in that field,
> >> and what will happen if the struct is passed to any code
> >> that expects real Nodes.
>
> > I used that to put both not-subject-of-transform nodes together with
> > hash entries into the same list.  This is used to save the order of
> > clauses.  I think this is an important property, and I have already
> > expressed it in [1].
>
> What exactly is the point of having a NodeTag in the struct though?
> If you don't need it to be a valid Node, that seems pointless and
> confusing.  We certainly have plenty of other lists that contain
> plain structs without tags, so I don't buy that the List
> infrastructure is making you do that.

This code mixes Exprs and hash entries in a single list.  The point of
having a NodeTag in the struct is the ability to distinguish them
later.
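For illustration, here is a minimal Python sketch of the idea (not the actual C code; the names and tag values are assumptions that only loosely mirror the patch): a tag field at the start of each entry lets a single list carry both kinds of members while preserving clause order.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical tag values; the real code uses PostgreSQL's NodeTag enum.
T_OPEXPR = 1          # an ordinary expression node, not subject to transform
T_ORCLAUSE_GROUP = 2  # a hash-table entry grouping similar OR arms

@dataclass
class OpExpr:
    tag: int
    clause: str

@dataclass
class OrClauseGroupEntry:
    tag: int
    column: str
    consts: List[int]

# One list preserves the original clause order while mixing both kinds.
mixed = [
    OpExpr(T_OPEXPR, "f(x) > 0"),
    OrClauseGroupEntry(T_ORCLAUSE_GROUP, "y", [1, 2, 3]),
]

def rebuild(entries):
    """Walk the mixed list, dispatching on the tag of each entry."""
    out = []
    for e in entries:
        if e.tag == T_ORCLAUSE_GROUP:      # distinguish entries via the tag
            out.append(f"{e.column} = ANY({e.consts})")
        else:
            out.append(e.clause)
    return out

print(rebuild(mixed))  # ['f(x) > 0', 'y = ANY([1, 2, 3])']
```

The tag makes the list self-describing, which is the property the NodeTag field provides here without a separate wrapper node per entry.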

--
Regards,
Alexander Korotkov




Re: post-freeze damage control

2024-04-09 Thread Alexander Korotkov
On Tue, Apr 9, 2024 at 7:27 PM Tom Lane  wrote:
> Peter Geoghegan  writes:
> > On Mon, Apr 8, 2024 at 10:12 PM Tom Lane  wrote:
> >> I don't know that I'd call it scary exactly, but I do think it
> >> was premature.  A week ago there was no consensus that it was
> >> ready to commit, but Alexander pushed it (or half of it, anyway)
> >> despite that.
>
> > Some of the most compelling cases for the transformation will involve
> > path keys. If the transformation enables the optimizer to build a
> > plain index scan (or index-only scan) with useful path keys, then that
> > might well lead to a far superior plan compared to what's possible
> > with BitmapOrs.
>
> I did not say it isn't a useful thing to have.  I said the patch
> did not appear ready to go in.
>
> > I understand that it'll still be possible to use OR expression
> > evaluation in such cases, without applying the transformation (via
> > filter quals), so in principle you don't need the transformation to
> > get an index scan that can (say) terminate the scan early due to the
> > presence of an "ORDER BY ... LIMIT n". But I suspect that that won't
> > work out much of the time, because the planner will believe (rightly
> > or wrongly) that the filter quals will incur too many heap page
> > accesses.
>
> That's probably related to the fact that we don't have a mechanism
> for evaluating non-indexed quals against columns that are retrievable
> from the index.  We really oughta work on getting that done.  But
> I've been keeping a side eye on the work to unify plain and index-only
> scans, because that's going to touch a lot of the same code so it
> doesn't seem profitable to pursue those goals in parallel.
>
> > Another problem (at least as I see it) with the new
> > or_to_any_transform_limit GUC is that its design seems to have nothing
> > to say about the importance of these sorts of cases. Most of these
> > cases will only have 2 or 3 constants, just because that's what's most
> > common in general.
>
> Yeah, that's one of the reasons I'm dubious that the committed
> patch was ready.

While inventing this GUC, I was thinking more about avoiding
regressions rather than about unleashing the full power of this
optimization.  But now I see that that wasn't good enough.  And it was
definitely hasty to commit to this shape.  I apologize for this.

Tom, I think you are way more experienced in this codebase than me.
And, probably more importantly, more experienced in making decisions
for planner development.  If you see some way forward to polish this
post-commit, Andrei and I are ready to work hard on this with you.  If
you don't see (or don't think that's good), let's revert this.

--
Regards,
Alexander Korotkov




Re: post-freeze damage control

2024-04-09 Thread Alexander Korotkov
On Tue, Apr 9, 2024 at 8:37 AM Andrei Lepikhov
 wrote:
> On 9/4/2024 09:12, Tom Lane wrote:
> > I have another one that I'm not terribly happy about:
> >
> >      Author: Alexander Korotkov 
> >  Branch: master [72bd38cc9] 2024-04-08 01:27:52 +0300
> >
> >  Transform OR clauses to ANY expression
> Because I'm primary author of the idea, let me answer.
> >
> > I don't know that I'd call it scary exactly, but I do think it
> > was premature.  A week ago there was no consensus that it was
> > ready to commit, but Alexander pushed it (or half of it, anyway)
> > despite that.  A few concrete concerns:
> >
> > * Yet another planner GUC.  Do we really need or want that?
> It is the most interesting question here. Looking around planner
> features designed but not applied for the same reason because they can
> produce suboptimal plans in corner cases, I think about inventing
> flag-type parameters and hiding some features that work better for
> different load types under such flagged parameters.

Yes, I spotted that this transformation could cause bitmap scan plan
regressions in [1] and [2].  Fixing that required treating ANY the
same as OR for bitmap scans.  Andrei implemented that in [3], but it
increases planning complexity and eliminates a significant part of the
advantages of the OR-to-ANY transformation.

Links.
1. 
https://www.postgresql.org/message-id/CAPpHfduJtO0s9E%3DSHUTzrCD88BH0eik0UNog1_q3XBF2wLmH6g%40mail.gmail.com
2. 
https://www.postgresql.org/message-id/CAPpHfdtSXxhdv3mLOLjEewGeXJ%2BFtfhjqodn1WWuq5JLsKx48g%40mail.gmail.com
3. 
https://www.postgresql.org/message-id/6d27d752-db0b-4cac-9843-6ba3dd7a1e94%40postgrespro.ru

--
Regards,
Alexander Korotkov




Re: post-freeze damage control

2024-04-09 Thread Alexander Korotkov
On Tue, Apr 9, 2024 at 5:12 AM Tom Lane  wrote:
> * OrClauseGroupKey is not a Node type, so why does it have
> a NodeTag?  I wonder what value will appear in that field,
> and what will happen if the struct is passed to any code
> that expects real Nodes.

I used that to put the not-subject-of-transform nodes together with
hash entries into the same list.  This is used to preserve the order
of clauses.  I think this is an important property, and I have already
expressed that in [1].  It could be achieved without adding a NodeTag
to the hash entries, but that would require a level of indirection.
The struct is not passed to code that expects real Nodes; it doesn't
go anywhere except lists.

Links.
1. 
https://www.postgresql.org/message-id/CAPpHfdutHt31sdt2rfU%3D4fsDMWxf6tvtnHARgCzLY2Tf21%2Bfgw%40mail.gmail.com

--
Regards,
Alexander Korotkov




Re: post-freeze damage control

2024-04-09 Thread Alexander Korotkov
On Tue, Apr 9, 2024 at 8:55 AM Tom Lane  wrote:
> Andrei Lepikhov  writes:
> > On 9/4/2024 09:12, Tom Lane wrote:
> >> I have another one that I'm not terribly happy about:
> >> Author: Alexander Korotkov 
> >> Branch: master [72bd38cc9] 2024-04-08 01:27:52 +0300
> >> Transform OR clauses to ANY expression
>
> >> * What the medical community would call off-label usage of
> >> query jumbling.  I'm not sure this is even correct as-used,
> >> and for sure it's using that code for something never intended.
> >> Nor is the added code adequately (as in, at all) documented.
>
> > I agree with documentation and disagree with critics on the expression
> > jumbling. It was introduced in the core. Why don't we allow it to be
> > used to speed up machinery with some hashing?
>
> I would back up from that a good deal: why do we need to hash here in
> the first place?  There's no evidence I'm aware of that it's needful
> from a performance standpoint.

I think the feature is aimed at dealing with large OR lists.  I've
seen a significant degradation on 1 or-clause-groups.  That might
seem like an awful lot, but actually it's not unachievable in
generated queries.

Links.
1. 
https://www.postgresql.org/message-id/CAPpHfduJtO0s9E%3DSHUTzrCD88BH0eik0UNog1_q3XBF2wLmH6g%40mail.gmail.com
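As a toy illustration of why hashing matters in this case (Python, not the actual jumbling code; the tuple representation of a clause is an assumption): grouping OR arms by a hashed key is roughly linear in the number of arms, whereas grouping by pairwise equality comparisons would be quadratic, which hurts exactly in the large-generated-OR-list case described above.

```python
from collections import defaultdict

# A large generated OR list: x = 0 OR x = 1 OR ... OR y = 0
clauses = [("x", "=", i) for i in range(10000)] + [("y", "=", 0)]

def group_by_hash(cls):
    """One dict (hash) lookup per clause: ~O(n) instead of O(n^2) comparisons."""
    groups = defaultdict(list)
    for col, op, const in cls:
        groups[(col, op)].append(const)  # (col, op) plays the "jumble key" role
    return groups

g = group_by_hash(clauses)
print(len(g[("x", "=")]), len(g[("y", "=")]))  # 10000 1
```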

--
Regards,
Alexander Korotkov




Re: removal of '{' from WORD_BREAKS

2024-04-09 Thread Alexander Korotkov
On Tue, Apr 9, 2024 at 6:18 PM Robert Haas  wrote:
> On Thu, Apr 4, 2024 at 10:38 PM Michael Paquier  wrote:
> > > It kind of looks like a libedit bug, but maybe we should dig more
> > > deeply.  I felt itchy about 927332b95e77 removing '{' from the
> > > WORD_BREAKS set, and wondered exactly how that would change readline's
> > > behavior.  But even if that somehow accounts for the extra backslash
> > > before '{', it's not clear how it could lead to '?' and '}' also
> > > getting backslashed.
> >
> > I don't have a clear idea, either.  I also feel uneasy about
> > 927332b95e77 and its change of WORD_BREAKS, but this has the smell
> > of a bug from an outdated libedit version.
>
> I too felt uneasy about that commit, for the same reason. However,
> there is a justification for the change in the commit message which is
> not obviously wrong, namely that ":{?name} is the only psql syntax
> using the '{' sign". And in fact, SQL basically doesn't use '{' for
> anything, either. We do see { showing up inside of quoted strings, for
> arrays or JSON, but I would guess that the word-break characters
> aren't going to affect behavior within a quoted string. So it still
> seems like it should be OK? Another thing that makes me think that my
> unease may be unfounded is that the matching character '}' isn't in
> WORD_BREAKS either, and I would have thought that if we needed one
> we'd need both.

FWIW, the default value of rl_basic_word_break_characters [1] has '{'
but doesn't have '}'.  The documentation says that this should "break
words for completion in Bash".  But I failed to find an explanation of
why this should be so for Bash.  As you correctly note, my idea was
that our SQL doesn't make heavy use of '{', unlike Bash.
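As a toy model of the mechanics (simplified; this is not readline's actual algorithm, and the break-character sets below are abbreviated assumptions): the word offered for completion is the text after the last word-break character, so whether '{' is in the set changes what the completer sees for psql's :{?name} syntax.

```python
def word_to_complete(line, breaks):
    """Return the trailing word, cutting at the last break character."""
    cut = max((line.rfind(ch) for ch in breaks), default=-1)
    return line[cut + 1:]

BREAKS_WITH_BRACE = " \t\n\"'><=;|&({"  # '{' breaks words (pre-927332b95e77)
BREAKS_NO_BRACE = " \t\n\"'><=;|&("     # '{' removed from the set

line = "SELECT :{?var"
print(word_to_complete(line, BREAKS_WITH_BRACE))  # ?var
print(word_to_complete(line, BREAKS_NO_BRACE))    # :{?var
```

With '{' removed from the set, the completer sees the whole ':{?var' token and can treat the psql variable syntax as one unit.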

> But does anyone else have a more specific reason for thinking that
> this might be a problem?

I don't particularly object to reverting this commit, but I think we
should get to the bottom of this first.  Otherwise there is no
guarantee that we won't run into the same problem again.

Links.
1. 
https://tiswww.case.edu/php/chet/readline/readline.html#index-rl_005fbasic_005fword_005fbreak_005fcharacters

--
Regards,
Alexander Korotkov




Re: Table AM Interface Enhancements

2024-04-08 Thread Alexander Korotkov
On Mon, Apr 8, 2024 at 9:54 PM Robert Haas  wrote:
> On Mon, Apr 8, 2024 at 12:33 PM Alexander Korotkov  
> wrote:
> > Yes, it was my mistake. I got rushed trying to fit this in before FF, even
> > making significant changes just before commit.
> > I'll revert this later today.

It appears to be a non-trivial revert, because 041b96802e already
revised the relation analyze code after 27bc1772fc.  That is, I would
need to "backport" 041b96802e.  Sorry, I'm too tired to do this today.
I'll come back to this tomorrow.

> Alexander,
>
> Exactly how much is getting reverted here? I see these, all since March 23rd:
>
> dd1f6b0c17 Provide a way block-level table AMs could re-use
> acquire_sample_rows()
> 9bd99f4c26 Custom reloptions for table AM
> 97ce821e3e Fix the parameters order for
> TableAmRoutine.relation_copy_for_cluster()
> 867cc7b6dd Revert "Custom reloptions for table AM"
> b1484a3f19 Let table AM insertion methods control index insertion
> c95c25f9af Custom reloptions for table AM
> 27bc1772fc Generalize relation analyze in table AM interface
> 87985cc925 Allow locking updated tuples in tuple_update() and tuple_delete()
> c35a3fb5e0 Allow table AM tuple_insert() method to return the different slot
> 02eb07ea89 Allow table AM to store complex data structures in rd_amcache

It would be discouraging to revert all of this.  Some items are very
simple, while others took a lot of work.  I'll come back tomorrow and
answer all your points.

--
Regards,
Alexander Korotkov




Re: Table AM Interface Enhancements

2024-04-08 Thread Alexander Korotkov
On Mon, Apr 8, 2024, 19:08 Andres Freund  wrote:

> On 2024-04-08 08:37:44 -0700, Andres Freund wrote:
> > On 2024-04-08 11:17:51 +0400, Pavel Borisov wrote:
> > > On Mon, 8 Apr 2024 at 03:25, Alexander Korotkov wrote:
> > > > I was under the impression there are not so many out-of-core table
> > > > AMs, which have non-dummy analysis implementations.  And even if there
> > > > are some, duplicating acquire_sample_rows() isn't a big deal.
> > > >
> > > > But given your feedback, I'd like to propose to keep both options
> > > > open.  Turn back the block-level API for analyze, but let table-AM
> > > > implement its own analyze function.  Then existing out-of-core AMs
> > > > wouldn't need to do anything (or probably just set the new API method
> > > > to NULL).
> > >
> > > I think that providing both new and old interface functions for
> > > block-based and non-block based custom AMs is an excellent compromise.
> >
> > I don't agree; that way lies an unmanageable API. To me the new API
> > doesn't look well polished either, so it's not a question of a smoother
> > transition or something like that.
> >
> > I don't think redesigning extension APIs at this stage of the release
> > cycle makes sense.
>
> Wait, you already pushed an API redesign? With a design that hasn't even
> seen the list from what I can tell? Without even mentioning that on the
> list? You got to be kidding me.
>

Yes, it was my mistake.  I got rushed trying to fit this in before FF,
even making significant changes just before commit.
I'll revert this later today.

--
Regards,
Alexander Korotkov


Re: Table AM Interface Enhancements

2024-04-08 Thread Alexander Korotkov
On Mon, Apr 8, 2024 at 10:18 AM Pavel Borisov  wrote:
> On Mon, 8 Apr 2024 at 03:25, Alexander Korotkov  wrote:
>>
>> Hi,
>>
>> On Mon, Apr 8, 2024 at 12:40 AM Andres Freund  wrote:
>> > On 2024-03-30 23:33:04 +0200, Alexander Korotkov wrote:
>> > > I've pushed 0001, 0002 and 0006.
>> >
>> > I briefly looked at 27bc1772fc81 and I don't think the state post this 
>> > commit
>> > makes sense. Before this commit another block based AM could implement 
>> > analyze
>> > without much code duplication. Now a large portion of analyze.c has to be
>> > copied, because they can't stop acquire_sample_rows() from calling
>> > heapam_scan_analyze_next_block().
>> >
>> > I'm quite certain this will break a few out-of-core AMs in a way that can't
>> > easily be fixed.
>>
>> I was under the impression there are not so many out-of-core table
>> AMs, which have non-dummy analysis implementations.  And even if there
>> are some, duplicating acquire_sample_rows() isn't a big deal.
>>
>> But given your feedback, I'd like to propose to keep both options
>> open.  Turn back the block-level API for analyze, but let table-AM
>> implement its own analyze function.  Then existing out-of-core AMs
>> wouldn't need to do anything (or probably just set the new API method
>> to NULL).
>
> I think that providing both new and old interface functions for block-based 
> and non-block based custom am is an excellent compromise.
>
> The patch v1-0001-Turn-back.. is mainly an undo of part of the 27bc1772fc81 
> that had turned off _analyze_next_tuple..analyze_next_block for external 
> callers. If some extensions are already adapted to the old interface 
> functions, they are free to still use it.

Please check this.  Instead of keeping two APIs, it generalizes
acquire_sample_rows().  The downside is the change of the
AcquireSampleRowsFunc signature, which would require some changes in
FDWs too.

--
Regards,
Alexander Korotkov


v2-0001-Generalize-acquire_sample_rows.patch
Description: Binary data


Re: Table AM Interface Enhancements

2024-04-07 Thread Alexander Korotkov
On Mon, Apr 8, 2024 at 2:25 AM Alexander Korotkov  wrote:
> On Mon, Apr 8, 2024 at 12:40 AM Andres Freund  wrote:
> > On 2024-03-30 23:33:04 +0200, Alexander Korotkov wrote:
> > > I've pushed 0001, 0002 and 0006.
> >
> > I briefly looked at 27bc1772fc81 and I don't think the state post this 
> > commit
> > makes sense. Before this commit another block based AM could implement 
> > analyze
> > without much code duplication. Now a large portion of analyze.c has to be
> > copied, because they can't stop acquire_sample_rows() from calling
> > heapam_scan_analyze_next_block().
> >
> > I'm quite certain this will break a few out-of-core AMs in a way that can't
> > easily be fixed.
>
> I was under the impression there are not so many out-of-core table
> AMs, which have non-dummy analysis implementations.  And even if there
> are some, duplicating acquire_sample_rows() isn't a big deal.
>
> But given your feedback, I'd like to propose to keep both options
> open.  Turn back the block-level API for analyze, but let table-AM
> implement its own analyze function.  Then existing out-of-core AMs
> wouldn't need to do anything (or probably just set the new API method
> to NULL).

The attached patch was to illustrate the approach.  It surely needs
some polishing.

--
Regards,
Alexander Korotkov




Re: Table AM Interface Enhancements

2024-04-07 Thread Alexander Korotkov
Hi,

On Mon, Apr 8, 2024 at 12:40 AM Andres Freund  wrote:
> On 2024-03-30 23:33:04 +0200, Alexander Korotkov wrote:
> > I've pushed 0001, 0002 and 0006.
>
> I briefly looked at 27bc1772fc81 and I don't think the state post this commit
> makes sense. Before this commit another block based AM could implement analyze
> without much code duplication. Now a large portion of analyze.c has to be
> copied, because they can't stop acquire_sample_rows() from calling
> heapam_scan_analyze_next_block().
>
> I'm quite certain this will break a few out-of-core AMs in a way that can't
> easily be fixed.

I was under the impression there are not so many out-of-core table
AMs, which have non-dummy analysis implementations.  And even if there
are some, duplicating acquire_sample_rows() isn't a big deal.

But given your feedback, I'd like to propose to keep both options
open.  Turn back the block-level API for analyze, but let table-AM
implement its own analyze function.  Then existing out-of-core AMs
wouldn't need to do anything (or probably just set the new API method
to NULL).
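The compromise described above can be sketched as follows (illustrative Python, not the actual table AM C API; the dict-based "AM" and the function names are assumptions): an AM may supply its own analyze routine, and an unset hook (NULL in C) falls back to the generic block-based sampler.

```python
def block_based_sample(rel, nrows):
    # Stand-in for the generic acquire_sample_rows() over heap-style blocks.
    return rel["rows"][:nrows]

def analyze(rel, am, nrows):
    # Use the AM's own analyze method if set, like a non-NULL C callback.
    custom = am.get("relation_analyze")
    sampler = custom if custom is not None else block_based_sample
    return sampler(rel, nrows)

heap_am = {}  # block-based AM: leaves the hook unset, gets the default
custom_am = {"relation_analyze": lambda rel, n: rel["rows"][-n:]}

rel = {"rows": list(range(100))}
print(analyze(rel, heap_am, 3))    # [0, 1, 2]
print(analyze(rel, custom_am, 3))  # [97, 98, 99]
```

With this shape, existing block-based out-of-core AMs need no changes, while non-block-based AMs override only the sampling step.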

> And even for non-block based AMs, the new interface basically requires
> reimplementing all of analyze.c.
A non-block-based AM just needs to provide an alternative
implementation of what acquire_sample_rows() does.  This seems like a
reasonable effort to me, and surely not reimplementing all of
analyze.c.

--
Regards,
Alexander Korotkov


v1-0001-Turn-back-the-block-level-API-for-relation-analyz.patch
Description: Binary data


Re: POC, WIP: OR-clause support for indexes

2024-04-07 Thread Alexander Korotkov
Hi!

On Mon, Apr 1, 2024 at 9:38 AM Andrei Lepikhov
 wrote:
> On 28/3/2024 16:54, Alexander Korotkov wrote:
> > The current patch has a boolean guc enable_or_transformation.
> > However, when we have just a few ORs to be transformated, then we
> > should get less performance gain from the transformation and higher
> > chances to lose a good bitmap scan plan from that.  When there is a
> > huge list of ORs to be transformed, then the performance gain is
> > greater and it is less likely we could lose a good bitmap scan plan.
> >
> > What do you think about introducing a GUC threshold value: the minimum
> > size of list to do OR-to-ANY transformation?
> > min_list_or_transformation or something.
> I labelled it or_transformation_limit (see in attachment). Feel free to
> rename it.
> It's important to note that the limiting GUC doesn't operate
> symmetrically for forward, OR -> SAOP, and backward SAOP -> OR
> operations. In the forward case, it functions as you've proposed.
> However, in the backward case, we only check whether the feature is
> enabled or not. This is due to our existing limitation,
> MAX_SAOP_ARRAY_SIZE, and the fact that we can't match the length of the
> original OR list with the sizes of the resulting SAOPs. For instance, a
> lengthy OR list with 100 elements can be transformed into 3 SAOPs, each
> with a size of around 30 elements.
> One aspect that requires attention is the potential inefficiency of our
> OR -> ANY transformation when we have a number of elements less than
> MAX_SAOP_ARRAY_SIZE. This is because we perform a reverse transformation
> ANY -> OR at the stage of generating bitmap scans. If the BitmapScan
> path dominates, we may have done unnecessary work. Is this an occurrence
> that we should address?
> But the concern above may just be a point of improvement later: We can
> add one more strategy to the optimizer: testing each array element as an
> OR clause; we can also provide a BitmapOr path, where SAOP is covered
> with a minimal number of partial indexes (likewise, previous version).

I've revised the patch.  I did some beautification and improved the
documentation, commit messages, etc.

I've pushed the 0001 patch without 0002.  I think 0001 is good by
itself, given that there is the or_to_any_transform_limit GUC option.
The more similar OR clauses there are, the more likely grouping them
into a SAOP will be a win.  But I've changed the default value to 5.
This makes the change less invasive, affecting only queries with
obvious repeating patterns.  That also reduced the changes to the
regression tests' expected outputs.
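As a simplified sketch of the committed behavior (Python, not the planner code; only constant-equality arms on a plain column are modeled, and the grouping key is an assumption): OR arms over the same column are collapsed into an ANY expression only when the group reaches the or_to_any_transform_limit threshold.

```python
from collections import defaultdict

def or_to_any(arms, limit=5):
    """arms: list of (column, constant) from `col = const OR col = const ...`."""
    groups = defaultdict(list)
    for col, const in arms:
        groups[col].append(const)
    out = []
    for col, consts in groups.items():
        if len(consts) >= limit:           # threshold: default 5
            arr = ",".join(map(str, consts))
            out.append(f"{col} = ANY('{{{arr}}}')")
        else:                              # too few arms: keep plain ORs
            out.extend(f"{col} = {c}" for c in consts)
    return out

print(or_to_any([("x", i) for i in (1, 2, 3, 4, 5)] + [("y", 7)]))
# ["x = ANY('{1,2,3,4,5}')", 'y = 7']
```

With the threshold at 5, a lone `y = 7` arm is left alone while the obviously repeating `x` pattern is collapsed, which is the "less invasive" behavior described above.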

Regarding 0002, it seems questionable since it could cause a planning
slowdown for SAOPs with large arrays.  Also, it might reduce the win
from the transformation made by 0001.  So, I think we should skip it
for now.

--
Regards,
Alexander Korotkov




Re: Add SPLIT PARTITION/MERGE PARTITIONS commands

2024-04-07 Thread Alexander Korotkov
Hi, Alexander!

On Sun, Apr 7, 2024 at 10:00 PM Alexander Lakhin  wrote:
> 07.04.2024 01:22, Alexander Korotkov wrote:
> > I've pushed 0001 and 0002.  I didn't push 0003 for the following reasons.
>
> Please try the following (erroneous) query:
> CREATE TABLE t1(i int, t text) PARTITION BY LIST (t);
> CREATE TABLE t1pa PARTITION OF t1 FOR VALUES IN ('A');
>
> CREATE TABLE t2 (i int, t text) PARTITION BY RANGE (t);
> ALTER TABLE t2 SPLIT PARTITION t1pa INTO
>(PARTITION t2a FOR VALUES FROM ('A') TO ('B'),
> PARTITION t2b FOR VALUES FROM ('B') TO ('C'));
>
> that triggers an assertion failure:
> TRAP: failed Assert("datums != NIL"), File: "partbounds.c", Line: 3434, PID: 
> 1841459
>
> or a segfault (in a non-assert build):
> Program terminated with signal SIGSEGV, Segmentation fault.
>
> #0  pg_detoast_datum_packed (datum=0x0) at fmgr.c:1866
> 1866if (VARATT_IS_COMPRESSED(datum) || VARATT_IS_EXTERNAL(datum))
> (gdb) bt
> #0  pg_detoast_datum_packed (datum=0x0) at fmgr.c:1866
> #1  0x55f38c5d5e3f in bttextcmp (...) at varlena.c:1834
> #2  0x55f38c6030dd in FunctionCall2Coll (...) at fmgr.c:1161
> #3  0x55f38c417c83 in partition_rbound_cmp (...) at partbounds.c:3525
> #4  check_partition_bounds_for_split_range (...) at partbounds.c:5221
> #5  check_partitions_for_split (...) at partbounds.c:5688
> #6  0x55f38c256c49 in transformPartitionCmdForSplit (...) at 
> parse_utilcmd.c:3451
> #7  transformAlterTableStmt (...) at parse_utilcmd.c:3810
> #8  0x55f38c2bdf9c in ATParseTransformCmd (...) at tablecmds.c:5650

Thank you for spotting this.  This seems like a missing check.  I'm
going to take a closer look at it tomorrow.

--
Regards,
Alexander Korotkov




Re: Table AM Interface Enhancements

2024-04-06 Thread Alexander Korotkov
Hi, Pavel!

On Fri, Apr 5, 2024 at 6:58 PM Pavel Borisov  wrote:
> On Tue, 2 Apr 2024 at 19:17, Jeff Davis  wrote:
>>
>> On Tue, 2024-04-02 at 11:49 +0300, Alexander Korotkov wrote:
>> > I don't like the idea that every custom table AM reltoptions should
>> > begin with StdRdOptions.  I would rather introduce the new data
>> > structure with table options, which need to be accessed outside of
>> > table AM.  Then reloptions will be a black box only directly used in
>> > table AM, while table AM has a freedom on what to store in reloptions
>> > and how to calculate externally-visible options.  What do you think?
>>
>> Hi Alexander!
>>
>> I agree with all of that. It will take some refactoring to get there,
>> though.
>>
>> One idea is to store StdRdOptions like normal, but if an unrecognized
>> option is found, ask the table AM if it understands the option. In that
>> case I think we'd just use a different field in pg_class so that it can
>> use whatever format it wants to represent its options.
>>
>> Regards,
>> Jeff Davis
>
> I tried to rework a patch regarding table am according to the input from 
> Alexander and Jeff.
>
> It splits table reloptions into two categories:
> - common for all tables (stored in a fixed size structure and could be 
> accessed from outside)
> - table-am specific (variable size, parsed and accessed by access method only)

Thank you for your work.  Please check the revised patch.

It makes CommonRdOptions a separate data structure, not directly
involved in parsing the reloptions.  Instead, the table AM can fill it
based on its reloptions, or calculate it some other way.  The patch
comes with a test module, which provides a heap-based table AM.  This
table AM has an "enable_parallel" reloption, which is used as the
basis for setting the value of CommonRdOptions.parallel_workers.
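The arrangement can be sketched like this (illustrative Python; CommonRdOptions and the "enable_parallel" reloption come from the message above, everything else is an assumption): the AM parses its reloptions privately and then fills a small fixed-layout structure that core code reads, so core never peeks into AM-specific option storage.

```python
from dataclasses import dataclass

@dataclass
class CommonRdOptions:
    # The only field core code reads here; -1 means "let the planner decide".
    parallel_workers: int = -1

def fill_common_options(am_reloptions: dict) -> CommonRdOptions:
    """Hypothetical AM: maps its 'enable_parallel' boolean reloption
    onto the externally visible parallel_workers setting."""
    common = CommonRdOptions()
    if not am_reloptions.get("enable_parallel", True):
        common.parallel_workers = 0  # forbid parallel scans entirely
    return common

print(fill_common_options({"enable_parallel": False}))
# CommonRdOptions(parallel_workers=0)
print(fill_common_options({}))
# CommonRdOptions(parallel_workers=-1)
```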

--
Regards,
Alexander Korotkov


v10-0001-Custom-reloptions-for-table-AM.patch
Description: Binary data


Re: Add SPLIT PARTITION/MERGE PARTITIONS commands

2024-04-06 Thread Alexander Korotkov
Hi, Dmitry!

On Fri, Apr 5, 2024 at 4:00 PM Dmitry Koval  wrote:
> > I've revised the patchset.
>
> Thanks for the corrections (especially ddl.sgml).
> Could you also look at a small optimization for the MERGE PARTITIONS
> command (in a separate file
> v31-0003-Additional-patch-for-ALTER-TABLE-.-MERGE-PARTITI.patch, I wrote
> about it in an email 2024-03-31 00:56:50)?
>
> Files v31-0001-*.patch, v31-0002-*.patch are the same as
> v30-0001-*.patch, v30-0002-*.patch (after rebasing because patch stopped
> applying due to changes in upstream).

I've pushed 0001 and 0002.  I didn't push 0003 for the following reasons.
1) This doesn't keep functionality equivalent to 0001.  With 0003, the
merged partition will inherit indexes, constraints, and so on from one
of the merging partitions.
2) This is not necessarily an optimization.  Without 0003, indexes on
the merged partition are created after moving the rows in
attachPartitionTable().  With 0003, we merge data into an existing
partition, which keeps its indexes.  That might cause a significant
performance loss because mass inserts into indexes may be much slower
than building the indexes from scratch.
I think both aspects need to be carefully considered.  Even if we
accept them, this needs to be documented.  I think now it's too late
for both of these.  So, this should wait for v18.

--
Regards,
Alexander Korotkov




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-04-06 Thread Alexander Korotkov
Hi, Alvaro!

Thank you for your care on this matter.

On Fri, Apr 5, 2024 at 9:15 PM Alvaro Herrera  wrote:
> BTW I noticed that
> https://coverage.postgresql.org/src/backend/commands/waitlsn.c.gcov.html
> says that lsn_cmp is not covered by the tests.  This probably indicates
> that the tests are a little too light, but I'm not sure how much extra
> effort we want to spend.

I'm aware of this.  Ivan promised to send a patch to improve the test.
If he doesn't, I'll take care of it.

> I'm still concerned that WaitLSNCleanup is only called in ProcKill.
> Does this mean that if a process throws an error while waiting, it'll
> not get cleaned up until it exits?  Maybe this is not a big deal, but it
> seems odd.

I've added WaitLSNCleanup() to AbortTransaction().  Just pushed
that together with the improvements upthread.

--
Regards,
Alexander Korotkov




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-04-03 Thread Alexander Korotkov
On Wed, Apr 3, 2024 at 10:04 PM Pavel Borisov  wrote:
> On Wed, 3 Apr 2024 at 22:18, Alexander Korotkov  wrote:
>>
>> On Wed, Apr 3, 2024 at 7:55 PM Alvaro Herrera  
>> wrote:
>> >
>> > On 2024-Apr-03, Alexander Korotkov wrote:
>> >
>> > > Regarding the shmem data structure for LSN waiters.  I didn't pick
>> > > LWLock or ConditionVariable, because I needed the ability to wake up
>> > > only those waiters whose LSN is already replayed.  In my experience
>> > > waking up a process is way slower than scanning a short flat array.
>> >
>> > I agree, but I think that's unrelated to what I was saying, which is
>> > just the patch I attach here.
>>
>> Oh, sorry for the confusion.  I've re-read your message.  Indeed you
>> meant this very clearly!
>>
>> I'm good with the patch.  Attached revision contains a bit of a commit 
>> message.
>>
>> > > However, I agree that when the number of waiters is very high, a flat
>> > > array may become a problem.  It seems that the pairing heap is not
>> > > hard to use for shmem structures.  The only memory allocation call in
>> > > pairingheap.c is in pairingheap_allocate().  So, it's only needed to
>> > > be able to initialize the pairing heap in-place, and it will be fine
>> > > for shmem.
>> >
>> > Ok.
>> >
>> > With the code as it stands today, everything in WaitLSNState apart from
>> > the pairing heap is accessed without any locking.  I think this is at
>> > least partly OK because each backend only accesses its own entry; but it
>> > deserves a comment.  Or maybe something more, because WaitLSNSetLatches
>> > does modify the entry for other backends.  (Admittedly, this could only
>> > happens for backends that are already sleeping, and it only happens
>> > with the lock acquired, so it's probably okay.  But clearly it deserves
>> > a comment.)
>>
>> Please check the 0002 patch attached.  I found it easier to move the two
>> assignments we previously moved out of the lock back into the lock, and
>> then claim that WaitLSNState.procInfos is also protected by WaitLSNLock.
>
> Could you re-attach 0002. Seems it failed to attach to the previous message.

I actually forgot both!

--
Regards,
Alexander Korotkov


v2-0001-Use-an-LWLock-instead-of-a-spinlock-in-waitlsn.c.patch
Description: Binary data


v2-0002-Clarify-what-is-protected-by-WaitLSNLock.patch
Description: Binary data


Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-04-03 Thread Alexander Korotkov
On Wed, Apr 3, 2024 at 7:55 PM Alvaro Herrera  wrote:
>
> On 2024-Apr-03, Alexander Korotkov wrote:
>
> > Regarding the shmem data structure for LSN waiters.  I didn't pick
> > LWLock or ConditionVariable, because I needed the ability to wake up
> > only those waiters whose LSN is already replayed.  In my experience
> > waking up a process is way slower than scanning a short flat array.
>
> I agree, but I think that's unrelated to what I was saying, which is
> just the patch I attach here.

Oh, sorry for the confusion.  I've re-read your message.  Indeed you
meant this very clearly!

I'm good with the patch.  Attached revision contains a bit of a commit message.

> > However, I agree that when the number of waiters is very high, a flat
> > array may become a problem.  It seems that the pairing heap is not
> > hard to use for shmem structures.  The only memory allocation call in
> > pairingheap.c is in pairingheap_allocate().  So, it's only needed to
> > be able to initialize the pairing heap in-place, and it will be fine
> > for shmem.
>
> Ok.
>
> With the code as it stands today, everything in WaitLSNState apart from
> the pairing heap is accessed without any locking.  I think this is at
> least partly OK because each backend only accesses its own entry; but it
> deserves a comment.  Or maybe something more, because WaitLSNSetLatches
> does modify the entry for other backends.  (Admittedly, this could only
> happens for backends that are already sleeping, and it only happens
> with the lock acquired, so it's probably okay.  But clearly it deserves
> a comment.)

Please check the 0002 patch attached.  I found it easier to move the two
assignments we previously moved out of the lock back into the lock, and
then claim that WaitLSNState.procInfos is also protected by WaitLSNLock.

> Don't we need to WaitLSNCleanup() during error recovery or something?

Yes, there is WaitLSNCleanup().  It's currently only called from one
place, given that waiting for LSN can't be invoked from background
workers or inside a transaction.

--
Regards,
Alexander Korotkov




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-04-03 Thread Alexander Korotkov
Hi, Alvaro!

Thank you for your feedback.

On Wed, Apr 3, 2024 at 9:58 AM Alvaro Herrera  wrote:
> Hello, I noticed that commit 06c418e163e9 uses waitLSN->mutex (a
> spinlock) to protect the contents of waitLSN -- and it's used to walk an
> arbitrary long list of processes waiting ... and also, an arbitrary
> number of processes could be calling this code.  I think using a
> spinlock for this is unwise, as it'll cause busy-waiting whenever
> there's contention.  Wouldn't it be better to use an LWLock for this?
> Then the processes would sleep until the lock is freed.
>
> While nosing about the code, other things struck me:
>
> I think there should be more comments about WaitLSNProcInfo and
> WaitLSNState in waitlsn.h.
>
> In addLSNWaiter it'd be better to assign 'cur' before acquiring the
> lock.
>
> Is a plain array really the most efficient data structure for this,
> considering that you have to reorder each time you add an element?
> Maybe it is, but wouldn't it make sense to use memmove() when adding one
> element rather than iterating over all the remaining elements to the end
> of the queue?
>
> I think the include list in waitlsn.c could be tightened a bit:

I've just pushed a commit, which shortens the include list and fixes the
order of the 'cur' assignment and the spinlock acquisition in addLSNWaiter().

Regarding the shmem data structure for LSN waiters.  I didn't pick
LWLock or ConditionVariable, because I needed the ability to wake up
only those waiters whose LSN is already replayed.  In my experience
waking up a process is way slower than scanning a short flat array.

However, I agree that when the number of waiters is very high, a flat
array may become a problem.  It seems that the pairing heap is not
hard to use for shmem structures.  The only memory allocation call in
pairingheap.c is in pairingheap_allocate().  So, it's only needed to
be able to initialize the pairing heap in-place, and it will be fine
for shmem.

I'll come back with switching to the pairing heap shortly.

--
Regards,
Alexander Korotkov




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-04-02 Thread Alexander Korotkov
On Tue, Apr 2, 2024 at 2:47 PM Andy Fan  wrote:
> Bharath Rupireddy  writes:
>
> > On Tue, Apr 2, 2024 at 3:41 PM Kartyshov Ivan
> >  wrote:
> >>
> >> 8 years, we tried to add this feature, and now we suggest the best way
> >> for this feature is to commit the minimal version first.
> >
> > Just curious, do you or anyone else have an immediate use for this
> > function? If yes, how are they achieving read-after-write-consistency
> > on streaming standbys in their application right now without a
> > function like this?
>
> The link [1] may be helpful and I think the reason there is reasonable
> to me.
>
> Actually we also discussed how to make sure of the "read your writes
> consistency" internally, and the solution here looks good to me.
>
> Glad to know that this patch will be committed very soon.
>
> [1]
> https://www.postgresql.org/message-id/CAPpHfdtuiL1x4APTs7u1fCmxkVp2-ZruXcdCfprDMdnOzvdC%2BA%40mail.gmail.com

Thank you for your feedback.

I can also confirm that a lot of users would be very happy to have
"read your writes" consistency and are ready to do something at the
application level to achieve it.  However, they typically don't know
what exactly they need.

So, blogging about pg_wal_replay_wait() and spreading words about it
at conferences would be highly appreciated.

--
Regards,
Alexander Korotkov




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-04-02 Thread Alexander Korotkov
On Tue, Apr 2, 2024 at 1:11 PM Kartyshov Ivan
 wrote:
> On 2024-04-02 11:14, Andy Fan wrote:
> > If so, when users call pg_wal_replay_wait, they can get informed when
> > the WAL is replayed to the target_lsn, but they can't know how long
> > it waited unless they check it on the application side.  I think such
> > information will be useful for monitoring purposes sometimes.
> >
> > select pg_wal_replay_wait(lsn, 1000);  may just take 1ms in fact, in
> > this case, I want this function return 1.
>
> Hi Andy, to get timing we can use \timing in psql.
> Here is an example.
> postgres=# \timing
> Timing is on.
> postgres=# select 1;
>   ?column?
> --
>  1
> (1 row)
>
> Time: 0.536 ms
>
>
> >void
> And returning VOID is the best option, rather than returning TRUE|FALSE
> or the timing.  It leaves the logic of the procedure very simple: we get
> an error if the LSN is not reached.
>
> 8 years, we tried to add this feature, and now we suggest the best way
> for this feature is to commit the minimal version first.
>
> Let's discuss further improvements in future versions.

+1,
It seems there is no precedent yet of a built-in PostgreSQL function
returning its own duration.  And I don't think we need to introduce such
a precedent, at least not now.  Monitoring resource usage seems like a
responsibility best left to the application.

I'd also like to note that pg_wal_replay_wait() comes with a dedicated
wait event.  So, one could monitor the average duration of these waits
using sampling of wait events.

--
Regards,
Alexander Korotkov




Re: Table AM Interface Enhancements

2024-04-02 Thread Alexander Korotkov
Hi, Jeff!

On Tue, Apr 2, 2024 at 8:19 AM Jeff Davis  wrote:
> On Sat, 2024-03-30 at 23:33 +0200, Alexander Korotkov wrote:
> > I've pushed 0001, 0002 and 0006.
>
> Sorry to jump in to this discussion so late. I had worked on something
> like the custom reloptions (0002) in the past, and there were some
> complications that don't seem to be addressed in commit c95c25f9af.
>
> * At minimum I think it needs some direction (comments, docs, tests)
> that show how it's supposed to be used.
>
> * The bytea returned by the reloptions() method is not in a trivial
> format. It's a StdRelOptions struct with string values stored after the
> end of the struct. To build the bytea internally, there's some
> infrastructure like allocateRelOptStruct() and fillRelOptions(), and
> it's not very easy to extend those to support a few custom options.
>
> * If we ever decide to add a string option to StdRdOptions, I think the
> design breaks, because the code that looks for those string values
> wouldn't know how to skip over the custom options. Perhaps we can just
> promise to never do that, but we should make it explicit somehow.
>
> * Most existing heap reloptions (other than fillfactor) are used by
> other parts of the system (like autovacuum) so should be considered
> valid for any AM. Most AMs will just want to add a handful of their own
> options on top, so it would be good to demonstrate how this should be
> done.
>
> * There are still places that are inappropriately calling
> heap_reloptions directly. For instance, in ProcessUtilitySlow(), it
> seems to assume that a toast table is a heap?

Thank you for the detailed explanation.  This piece definitely needs
more work.  I've just reverted the c95c25f9af.

I don't like the idea that every custom table AM's reloptions should
begin with StdRdOptions.  I would rather introduce a new data structure
holding the table options that need to be accessed outside of the table
AM.  Then reloptions would be a black box used directly only by the
table AM, while the table AM has the freedom to decide what to store in
reloptions and how to calculate the externally-visible options.  What do
you think?

--
Regards,
Alexander Korotkov




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-04-02 Thread Alexander Korotkov
Hi, Andy!

On Tue, Apr 2, 2024 at 6:29 AM Andy Fan  wrote:
> >  Did you forget to attach the new patch?
> >
> > Yes, here it is.
> >
> > [4. text/x-diff; 
> > v17-0001-Implement-pg_wal_replay_wait-stored-procedure.patch]...
>
> +
> +pg_wal_replay_wait (
> +  target_lsn pg_lsn,
> +  timeout bigint 
> DEFAULT 0)
> +void
> +   
>
> Should we return the milliseconds of waiting time?  I think this
> information may be useful for customers if they want to know how long
> it waited, for monitoring purposes.

Please check it more carefully.  In v17 the timeout is in integer milliseconds.

--
Regards,
Alexander Korotkov




Re: Table AM Interface Enhancements

2024-04-01 Thread Alexander Korotkov
On Mon, Apr 1, 2024 at 7:36 PM Tom Lane  wrote:
> Coverity complained about what you did in RelationParseRelOptions
> in c95c25f9a:
>
> *** CID 1595992:  Null pointer dereferences  (FORWARD_NULL)
> /srv/coverity/git/pgsql-git/postgresql/src/backend/utils/cache/relcache.c: 
> 499 in RelationParseRelOptions()
> 493
> 494 /*
> 495  * Fetch reloptions from tuple; have to use a hardwired 
> descriptor because
> 496  * we might not have any other for pg_class yet (consider 
> executing this
> 497  * code for pg_class itself)
> 498  */
> >>> CID 1595992:  Null pointer dereferences  (FORWARD_NULL)
> >>> Passing null pointer "tableam" to "extractRelOptions", which 
> >>> dereferences it.
> 499 options = extractRelOptions(tuple, GetPgClassDescriptor(),
> 500 
> tableam, amoptsfn);
> 501
>
> I see that extractRelOptions only uses the tableam argument for some
> relkinds, and RelationParseRelOptions does set it up for those
> relkinds --- but Coverity's complaint isn't without merit, because
> those two switch statements are looking at *different copies of the
> relkind*, which in theory could be different.  This all seems quite
> messy and poorly factored.  Can't we do better?  Why do we need to
> involve two copies of allegedly the same pg_class tuple, anyhow?

I wasn't registered at Coverity yet.  Now I've registered and am
waiting for approval to access the PostgreSQL analysis data.

I wonder why Coverity complains about tableam, but not amoptsfn.
Their usage patterns are very similar.

It appears that relation->rd_rel isn't a full copy of the pg_class
tuple (see AllocateRelationDesc).  RelationParseRelOptions() is going to
update relation->rd_options, and thus needs the full pg_class tuple to
fetch the options out of it.  However, it is really unnecessary to
access both tuples at the same time.  We can use the full tuple, not
relation->rd_rel, in both cases.  See the attached patch.

--
Regards,
Alexander Korotkov


fix-RelationParseRelOptions-Coverity-complain.patch
Description: Binary data


Re: Table AM Interface Enhancements

2024-04-01 Thread Alexander Korotkov
On Mon, Apr 1, 2024 at 7:36 PM Tom Lane  wrote:
>
> Coverity complained about what you did in RelationParseRelOptions
> in c95c25f9a:
>
> *** CID 1595992:  Null pointer dereferences  (FORWARD_NULL)
> /srv/coverity/git/pgsql-git/postgresql/src/backend/utils/cache/relcache.c: 
> 499 in RelationParseRelOptions()
> 493
> 494 /*
> 495  * Fetch reloptions from tuple; have to use a hardwired 
> descriptor because
> 496  * we might not have any other for pg_class yet (consider 
> executing this
> 497  * code for pg_class itself)
> 498  */
> >>> CID 1595992:  Null pointer dereferences  (FORWARD_NULL)
> >>> Passing null pointer "tableam" to "extractRelOptions", which 
> >>> dereferences it.
> 499 options = extractRelOptions(tuple, GetPgClassDescriptor(),
> 500 
> tableam, amoptsfn);
> 501
>
> I see that extractRelOptions only uses the tableam argument for some
> relkinds, and RelationParseRelOptions does set it up for those
> relkinds --- but Coverity's complaint isn't without merit, because
> those two switch statements are looking at *different copies of the
> relkind*, which in theory could be different.  This all seems quite
> messy and poorly factored.  Can't we do better?  Why do we need to
> involve two copies of allegedly the same pg_class tuple, anyhow?

Thank you for reporting this, Tom.
I'm planning to investigate this later today.

--
Regards,
Alexander Korotkov




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-04-01 Thread Alexander Korotkov
On Mon, Apr 1, 2024 at 5:25 AM Bharath Rupireddy <
bharath.rupireddyforpostg...@gmail.com> wrote:

> On Mon, Apr 1, 2024 at 5:54 AM Alexander Korotkov 
> wrote:
> >
> > > 9. To me the following query blocks even though I didn't mention
> timeout.
> > > CALL pg_wal_replay_wait('0/fff');
> >
> > If your primary server is freshly initialized, you need to make quite
> > a lot of data modifications to reach this LSN.
>
> Right, but why pg_wal_replay_wait  blocks without a timeout? It must
> return an error saying it can't reach the target LSN, no?
>

How can the replica know this?  It doesn't look feasible to distinguish
this situation from one where the connection between primary and replica
has become slow.  I'd keep this simple.  We have pg_sleep_for(), which
waits for the specified time, whatever it is.  And we have
pg_wal_replay_wait(), which waits until the replay LSN reaches the
target, whatever it is.  It's up to the user to specify the correct value.


> Did you forget to attach the new patch?
>

Yes, here it is.

--
Regards,
Alexander Korotkov


v17-0001-Implement-pg_wal_replay_wait-stored-procedure.patch
Description: Binary data


Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-03-31 Thread Alexander Korotkov
Hi Bharath,

Thank you for your feedback.

On Sun, Mar 31, 2024 at 8:44 AM Bharath Rupireddy
 wrote:
> On Sun, Mar 31, 2024 at 7:41 AM Alexander Korotkov  
> wrote:
> Thanks for the patch. I have a few comments on the v16 patch.
>
> 1. Is there any specific reason for pg_wal_replay_wait() being a
> procedure rather than a function? I haven't read the full thread, but
> I vaguely noticed the discussion on the new wait mechanism holding up
> a snapshot or some other resource. Is that the reason to use a stored
> procedure over a function? If yes, can we specify it somewhere in the
> commit message and just before the procedure definition in
> system_functions.sql?

Surely, there is a reason.  A function has to be executed within a
snapshot, which can prevent WAL records from being replayed.  See [1]
for a particular test scenario.  In a procedure we can enforce a
non-atomic context and release the snapshot.

I've mentioned that in the commit message and in the procedure code.
I don't think system_functions.sql is the place for this type of
comment.  We only use system_functions.sql to push the default values.

> 2. Is the pg_wal_replay_wait first procedure that postgres provides
> out of the box?

Yes, it appears to be the first.  I see nothing wrong with that.

> 3. Defining a procedure for the first time in system_functions.sql
> which is supposed to be for functions seems a bit unusual to me.

From the perspective of DDL and the system catalog, a procedure is just
another kind of function (prokind == 'p').  So, I don't feel wrong about that.

> 4.
> +
> +endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
> +
>
> +if (timeout > 0)
> +{
> +delay_ms = (endtime - GetCurrentTimestamp()) / 1000;
> +latch_events |= WL_TIMEOUT;
> +if (delay_ms <= 0)
> +break;
> +}
>
> Why is endtime calculated even for timeout <= 0 only to just skip it
> later? Can't we just do a fastpath exit if timeout = 0 and targetLSN <

OK, fixed.

> 5.
>  Parameter
> +timeout is the time in milliseconds to wait
> +for the target_lsn
> +replay. When timeout value equals to zero no
> +timeout is applied.
>
> It turns out to be "When timeout value equals to zero no timeout is
> applied." I guess, we can just say something like the following which
> I picked up from pg_terminate_backend timeout parameter description.
>
>
> If timeout is not specified or zero, this
> function returns if  the WAL upto
> target_lsn  is replayed.
> If the timeout is specified (in
> milliseconds) and greater than zero, the function waits until the
> server actually replays the WAL upto target_lsn  or
> until the given time has passed. On timeout, an error is emitted.
>

Applied as you suggested with some edits from me.

> 6.
> +ereport(ERROR,
> +(errcode(ERRCODE_QUERY_CANCELED),
> + errmsg("canceling waiting for LSN due to timeout")));
>
> We can be a bit more informative here and say targetLSN and currentLSN
> something like - "timed out while waiting for target LSN %X/%X to be
> replayed; current LSN %X/%X"?

Done this way.  Adjusted other ereport()'s as well.

> 7.
> +if (context->atomic)
> +ereport(ERROR,
> +(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("pg_wal_replay_wait() must be only called in
> non-atomic context")));
> +
>
> Can we say what a "non-atomic context '' is in a user understandable
> way like explicit txn or whatever that might  be? "non-atomic context'
> might not sound great to the end -user.

Added errdetail() to this ereport().

> 8.
> +the movie table and get the lsn 
> after
> +changes just made.  This example uses
> pg_current_wal_insert_lsn
> +to get the lsn given that
> synchronous_commit
> +could be set to off.
>
> Can we just mention that run pg_current_wal_insert_lsn on the primary?

The mention is added.

> 9. To me the following query blocks even though I didn't mention timeout.
> CALL pg_wal_replay_wait('0/fff');

If your primary server is freshly initialized, you need to make quite a
lot of data modifications to reach this LSN.

> 10. Can't we do some input validation on the timeout parameter and
> emit an error for negative values just like pg_terminate_backend?
> CALL pg_wal_replay_wait('0/ff', -100);

Reasonable, added.

> 11.
> +
> +if (timeout > 0)
> +{
> +delay_ms = (endtime - G

Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-03-30 Thread Alexander Korotkov
Hi!

On Sat, Mar 30, 2024 at 6:14 PM Kartyshov Ivan
 wrote:
>
> Thank you Alexander for working on patch, may be we should change some
> names:
> 1) test 043_wait_lsn.pl -> to 043_waitlsn.pl like waitlsn.c and
> waitlsn.h

I renamed it to 043_wal_replay_wait.pl to match the name of the SQL procedure.

> In waitlsn.c and waitlsn.h variables:
> 2) targret_lsn -> trgLSN like curLSN

I prefer this to match the SQL procedure parameter name.

> 3) lsn -> trgLSN like curLSN

Done.

Also, I implemented termination of WAL replay waits on standby
promotion (with a test).

In the test, I changed recovery_min_apply_delay to 1s in order to make
the test pass faster.

--
Regards,
Alexander Korotkov


v16-0001-Implement-pg_wal_replay_wait-stored-procedure.patch
Description: Binary data


Re: Table AM Interface Enhancements

2024-03-30 Thread Alexander Korotkov
Hi, Pavel!

I've pushed 0001, 0002 and 0006.

On Fri, Mar 29, 2024 at 5:23 PM Pavel Borisov  wrote:
>>
>> I think for better code look this could be removed:
>> >vlock:
>>  >   CHECK_FOR_INTERRUPTS();
>> together with CHECK_FOR_INTERRUPTS(); in heapam_tuple_insert_with_arbiter 
>> placed in the beginning of while loop.
>
> To clarify things, this I wrote only about CHECK_FOR_INTERRUPTS(); 
> rearrangement.

Thank you for your review of this patch.  But I still think there is a
problem: this patch moves part of the executor into the table AM, which
then directly uses executor data structures and functions.  This works,
but it's not committable since it breaks encapsulation.

I think the way forward might be to introduce a new API, which would
isolate executor details from the table AM.  We may introduce a new data
structure, InsertWithArbiterContext, which would contain an EState and a
set of callbacks to avoid the table AM calling the executor directly.
That should be our goal for pg18.  Now, it is too close to feature
freeze to design a new API.

--
Regards,
Alexander Korotkov




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-03-30 Thread Alexander Korotkov
On Fri, Mar 29, 2024 at 6:50 PM Euler Taveira  wrote:
> On Fri, Mar 29, 2024, at 9:44 AM, Alexander Korotkov wrote:
>
> This generally makes sense, but I'm not sure about this.  The
> milliseconds timeout was used initially but received critics in [1].
>
>
> Alexander, I see why you changed the patch.
>
> Peter suggested to use an interval but you proposed another data type:
> float. The advantage of the interval data type is that you don't need to
> carefully think about the unit, however, if you use the integer data
> type you have to propose one. (If that's the case, milliseconds is a
> good granularity for this feature.) I don't have a strong preference
> between integer and interval data types but I don't like the float for
> this case. The 2 main reasons are (a) that we treat time units (hours,
> minutes, seconds, ...) as integers so it seems natural for a human being
> to use a unit time as integer and (b) depending on the number of digits
> after the decimal separator you still don't have an integer in the
> internal unit, hence, you have to round it to integer.
>
> We already have functions that use integer (such as pg_terminate_backend)
> and interval (such as pg_sleep_for) and if i searched correctly it will
> be the first timeout argument as float.

Thank you for the detailed explanation.  Float seconds are used in
pg_sleep(), similar to the interval in pg_sleep_for().  However, that's
a delay, not exactly a timeout.  Given the precedent of a milliseconds
timeout in pg_terminate_backend(), and your and Pavel's points, I've
switched back to an integer milliseconds timeout.

Some fixes spotted off-list by Alexander Lakhin:
1) We don't need an explicit check for the postmaster being alive as
long as we pass WL_EXIT_ON_PM_DEATH to WaitLatch().
2) When testing for an unreachable LSN, we need to select the LSN well
in advance so that autovacuum can't affect it.

I'm going to push this if no objections.

--
Regards,
Alexander Korotkov


v15-0001-Implement-pg_wal_replay_wait-stored-procedure.patch
Description: Binary data


Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-03-29 Thread Alexander Korotkov
Hi, Euler!

On Fri, Mar 29, 2024 at 1:38 AM Euler Taveira  wrote:
> On Thu, Mar 28, 2024, at 9:39 AM, Alexander Korotkov wrote:
>
> Fixed along with other issues spotted by Alexander Lakhin.
>
>
> [I didn't read the whole thread. I'm sorry if I missed something ...]
>
> You renamed the function in a previous version but let me suggest another one:
> pg_wal_replay_wait. It uses the same pattern as the other recovery control
> functions [1]. I think "for" doesn't add much for the function name and "lsn" 
> is
> used in functions that return an LSN (that's not the case here).
>
> postgres=# \df pg_wal_replay*
> List of functions
> -[ RECORD 1 ]---+-
> Schema  | pg_catalog
> Name| pg_wal_replay_pause
> Result data type| void
> Argument data types |
> Type| func
> -[ RECORD 2 ]---+-
> Schema  | pg_catalog
> Name| pg_wal_replay_resume
> Result data type| void
> Argument data types |
> Type| func

Makes sense to me.  I tried to make the new procedure name consistent
with the functions acquiring various WAL positions.  But you're right,
it's better to be consistent with the other functions controlling WAL
replay.

> Regarding the arguments, I think the timeout should be bigint. There is at 
> least
> another function that implements a timeout that uses bigint.
>
> postgres=# \df pg_terminate_backend
> List of functions
> -[ RECORD 1 ]---+--
> Schema  | pg_catalog
> Name| pg_terminate_backend
> Result data type| boolean
> Argument data types | pid integer, timeout bigint DEFAULT 0
> Type| func
>
> I also suggests that the timeout unit should be milliseconds, hence, using
> bigint is perfectly fine for the timeout argument.
>
> +   
> +Throws an ERROR if the target lsn was not replayed
> +on standby within given timeout.  Parameter 
> timeout
> +is the time in seconds to wait for the 
> target_lsn
> +replay. When timeout value equals to zero no
> +timeout is applied.
> +   

This generally makes sense, but I'm not sure about this part.  A
milliseconds timeout was used initially but received criticism in [1].

Links.
1. 
https://www.postgresql.org/message-id/b45ff979-9d12-4828-a22a-e4cb327e115c%40eisentraut.org

--
Regards,
Alexander Korotkov


v14-0001-Implement-pg_wal_replay_wait-stored-procedure.patch
Description: Binary data


Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-03-28 Thread Alexander Korotkov
On Fri, Mar 22, 2024 at 3:45 PM Kartyshov Ivan
 wrote:
> On 2024-03-20 12:11, Alexander Korotkov wrote:
> > On Wed, Mar 20, 2024 at 12:34 AM Kartyshov Ivan
> >  wrote:
> >> > 4.2 With an unreasonably high future LSN, BEGIN command waits
> >> > unboundedly, shouldn't we check if the specified LSN is more than
> >> > pg_last_wal_receive_lsn() error out?
> >
> > I think limiting wait lsn by current received lsn would destroy the
> > whole value of this feature.  The value is to wait till given LSN is
> > replayed, whether it's already received or not.
>
> Ok sounds reasonable, I`ll rollback the changes.
>
> > But I don't see a problem here. On the replica, it's out of our
> > control to check which lsn is good and which is not.  We can't check
> > whether the lsn, which is in future for the replica, is already issued
> > by primary.
> >
> > For the case of wrong lsn, which could cause potentially infinite
> > wait, there is the timeout and the manual query cancel.
>
> Fully agree with this take.
>
> >> > 4.3 With an unreasonably high wait time, BEGIN command waits
> >> > unboundedly, shouldn't we restrict the wait time to some max
> > value,
> >> > say a day or so?
> >> > SELECT pg_last_wal_receive_lsn() + 1 AS future_receive_lsn \gset
> >> > BEGIN AFTER :'future_receive_lsn' WITHIN 10;
> >>
> >> Good idea, I put it 1 day. But this limit we should to discuss.
> >
> > Do you think that specifying timeout in milliseconds is suitable?  I
> > would prefer to switch to seconds (with ability to specify fraction of
> > second).  This was expressed before by Alexander Lakhin.
>
> It sounds like an interesting idea. Please review the result.
>
> >> > https://github.com/macdice/redo-bench or similar tools?
> >
> > Ivan, could you do this?
>
> Yes, test redo-bench/crash-recovery.sh
> This patch on master
> 91.327, 1.973
> 105.907, 3.338
> 98.412, 4.579
> 95.818, 4.19
>
> REL_13-STABLE
> 116.645, 3.005
> 113.212, 2.568
> 117.644, 3.183
> 111.411, 2.782
>
> master
> 124.712, 2.047
> 117.012, 1.736
> 116.328, 2.035
> 115.662, 1.797
>
> Strange behavior, the patched version is faster than REL_13-STABLE and
> master.

I've run this test on my machine with v13 of the patch.

patched
53.663, 0.466
53.884, 0.402
54.102, 0.441

master
55.216, 0.441
54.52, 0.464
51.479, 0.438

It seems that the difference is less than the variance.

--
Regards,
Alexander Korotkov




Re: POC, WIP: OR-clause support for indexes

2024-03-28 Thread Alexander Korotkov
On Tue, Mar 19, 2024 at 7:17 AM Andrei Lepikhov
 wrote:
> On 14/3/2024 16:31, Alexander Korotkov wrote:
> > On Wed, Mar 13, 2024 at 2:16 PM Andrei Lepikhov
> > As you can see, this case is not related to partial indexes.  There is
> > just no index selective for the whole query.  However, splitting the scan
> > by the OR qual lets us use a combination of two selective indexes.
> I have rewritten the 0002-* patch according to your concern. A candidate
> and some thoughts are attached.
> As I see, we have a problem here: expanding each array and trying to
> apply an element to each index can result in a lengthy planning stage.
> Also, an index scan with the SAOP may potentially be more effective than
> with the list of OR clauses.
> Originally, the transformation's purpose was to reduce a query's
> complexity and the number of optimization ways to speed up planning and
> (sometimes) execution. Here, we reduce planning complexity only in the
> case of an array size larger than MAX_SAOP_ARRAY_SIZE.
> Maybe we can fall back to the previous version of the second patch,
> keeping in mind that someone who wants to get maximum profit from the
> BitmapOr scan of multiple indexes can just disable this optimization,
> enabling deep search of the most optimal scanning way?
> As a compromise solution, I propose adding one more option to the
> previous version: if an element doesn't fit any partial index, try to
> cover it with a plain index.
> In this case, we still do not guarantee the most optimal fit of elements
> to the set of indexes, but we speed up planning. Does that make sense?

Thank you for your research, Andrei.  Things are now clearer regarding
the advantages and disadvantages of this transformation.

The current patch has a boolean GUC, enable_or_transformation.
However, when there are just a few ORs to be transformed, the
performance gain from the transformation is smaller and the chances of
losing a good bitmap scan plan are higher.  When there is a huge list
of ORs to be transformed, the performance gain is greater and it is
less likely we could lose a good bitmap scan plan.

What do you think about introducing a GUC threshold value: the minimum
length of the OR list required to do the OR-to-ANY transformation?
min_list_or_transformation or something similar.
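To illustrate the trade-off (the table, data, and GUC name below are hypothetical; the transformation itself is the one discussed in this thread):

```sql
-- A long OR list over one column can be rewritten into a single SAOP,
-- which is cheaper to plan and can use a single index scan:
SELECT * FROM t WHERE a = 1 OR a = 2 OR a = 3 OR a = 4;   -- before
SELECT * FROM t WHERE a = ANY (ARRAY[1, 2, 3, 4]);        -- after

-- With only a couple of ORs over different columns, keeping them as-is
-- preserves the chance of a BitmapOr plan over two selective indexes:
SELECT * FROM t WHERE a = 1 OR b = 2;

-- A threshold GUC (name is just a proposal) would apply the rewrite
-- only when the OR list is long enough:
-- SET min_list_or_transformation = 5;
```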

--
Regards,
Alexander Korotkov




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-03-28 Thread Alexander Korotkov
On Thu, Mar 28, 2024 at 9:37 AM Thomas Munro  wrote:
>
> > v12
>
> Hi all,
>
> I didn't review the patch but one thing jumped out: I don't think it's
> OK to hold a spinlock while (1) looping over an array of backends and
> (2) making system calls (SetLatch()).

Good catch, thank you.

Fixed along with other issues spotted by Alexander Lakhin.

--
Regards,
Alexander Korotkov


v13-0001-Implement-pg_wait_for_wal_replay_lsn-stored-proc.patch
Description: Binary data


Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-03-28 Thread Alexander Korotkov
On Thu, Mar 28, 2024 at 2:24 AM Alexander Korotkov  wrote:
> On Tue, Mar 26, 2024 at 4:06 PM Kartyshov Ivan
>  wrote:
> > Thank you for your interest in the patch.
> > I understand your questions, but I fully support Alexander Korotkov's idea
> > to commit the minimal required functionality and then keep working on
> > other improvements.
>
> I did further improvements in the patch.
>
> Notably, I decided to rename the procedure to
> pg_wait_for_wal_replay_lsn().  This makes the name look consistent
> with other WAL-related functions.  Also it clearly states that we're
> waiting for lsn to be replayed (not received, written or flushed).
>
> Also, I made improvements to the docs and the commit message, and some
> minor code fixes.
>
> I'm continuing to look at this patch.

Sorry, I forgot the attachment.

--
Regards,
Alexander Korotkov


v12-0001-Implement-pg_wait_for_wal_replay_lsn-stored-proc.patch
Description: Binary data


Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-03-27 Thread Alexander Korotkov
On Tue, Mar 26, 2024 at 4:06 PM Kartyshov Ivan
 wrote:
> Thank you for your interest in the patch.
> I understand your questions, but I fully support Alexander Korotkov's idea
> to commit the minimal required functionality and then keep working on
> other improvements.

I did further improvements in the patch.

Notably, I decided to rename the procedure to
pg_wait_for_wal_replay_lsn().  This makes the name look consistent
with other WAL-related functions.  Also it clearly states that we're
waiting for lsn to be replayed (not received, written or flushed).

Also, I made improvements to the docs and the commit message, and some
minor code fixes.

I'm continuing to look at this patch.

--
Regards,
Alexander Korotkov




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-03-26 Thread Alexander Korotkov
On Sun, Mar 24, 2024 at 4:39 AM Bharath Rupireddy
 wrote:
> I share the same concern as yours and had proposed something upthread
> [1]. The idea is something like how each query takes a snapshot at the
> beginning of txn/query (depending on isolation level), the same way
> the standby can wait for the primary's current LSN as of the moment
> (at the time of taking snapshot). And, primary keeps sending its
> current LSN as part of regular WAL to standbys so that the standbys
> don't have to make connections to the primary to know its current
> LSN every time. Perhaps this may not even fully guarantee (considered
> to be achieving) the read-after-write consistency on standbys unless
> there's a way for the application to tell the wait LSN.

Oh, no.  Please, check [1].  The idea is to wait for a particular
transaction to become visible.  The one who made a change on the primary
brings the lsn value from there to the replica.  For instance, an
application makes a change on the primary and is then willing to run
some report on the replica.  The report should be guaranteed to contain
the change just made.  So, the application queries the LSN from the
primary after making a write transaction, then calls pg_wait_lsn() on
the replica before running the report.

This is quite simple server functionality, which could be used at
application-level, ORM-level or pooler-level.  And it unlocks the way
forward for in-protocol implementation as proposed by Peter
Eisentraut.

Links.
1. 
https://www.postgresql.org/message-id/CAPpHfdtny81end69PzEdRsROKnsybsj%3DOs8DUM-6HeKGKnCuQQ%40mail.gmail.com

--
Regards,
Alexander Korotkov




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-03-26 Thread Alexander Korotkov
On Fri, Mar 22, 2024 at 12:58 AM Peter Eisentraut  wrote:
> On 17.03.24 15:09, Alexander Korotkov wrote:
> > My current attempt was to commit minimal implementation as less
> > invasive as possible.  A new clause for BEGIN doesn't require
> > additional keywords and doesn't introduce additional statements.  But
> > yes, this is still a new qual.  And, yes, Amit you're right that even
> > if I had committed that, there was still a high risk of further
> > debates and revert.
>
> I had written in [0] about my questions related to using this with
> connection poolers.  I don't think this was addressed at all.  I haven't
> seen any discussion about how to make this kind of facility usable in a
> full system.  You have to manually query and send LSNs; that seems
> pretty cumbersome.  Sure, this is part of something that could be
> useful, but how would an actual user with actual application code get to
> use this?

The current usage pattern of this functionality is the following.

1. Do the write transaction on the primary
2. Query pg_current_wal_insert_lsn() on the primary
3. Call pg_wait_lsn() on the replica with the value obtained in the previous step
4. Do the read transaction on the replica
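In SQL, the steps above could look like this (the table and LSN values are illustrative; pg_wait_lsn() is the procedure proposed in this thread):

```sql
-- On the primary: write, then capture the current insert LSN.
INSERT INTO orders VALUES (42, 'new order');
SELECT pg_current_wal_insert_lsn();   -- e.g. '0/3002AE8'

-- On the replica: wait until that LSN is replayed, then read.
CALL pg_wait_lsn('0/3002AE8', 60000); -- timeout in ms per the current patch
SELECT * FROM orders WHERE id = 42;   -- guaranteed to see the new row
```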

This usage pattern could be implemented either at the application
level or at the pooler level.  At the application level, it would
require a somewhat advanced level of database-aware programming, but
this is still a valid usage.  Regarding poolers, if a pooler manages
to automatically distinguish reading and writing queries, dealing
with LSNs wouldn't be too complex for it.

Having this functionality at the protocol level would be ideal, but let's
do this step-by-step.  The built-in procedure isn't very invasive, and
it could give us some adoption and open the way forward.

--
Regards,
Alexander Korotkov




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-03-26 Thread Alexander Korotkov
On Fri, Mar 22, 2024 at 12:50 AM Peter Eisentraut  wrote:
> On 19.03.24 18:38, Kartyshov Ivan wrote:
> >   CALL pg_wait_lsn('0/3002AE8', 1);
> >   BEGIN;
> >   SELECT * FROM tbl; // read fresh insertions
> >   COMMIT;
>
> I'm not endorsing this or any other approach, but I think the timeout
> parameter should be of type interval, not an integer with a unit that is
> hidden in the documentation.

I'm not sure a timeout needs to deal with the complexity of our interval
datatype.  At the same time, an integer number of milliseconds looks
a bit weird.  Could a float8 number of seconds be an option?
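For comparison, the three representations under discussion would look roughly like this (the pg_wait_lsn() signature is the one from the current patch; the other two calls are hypothetical):

```sql
CALL pg_wait_lsn('0/3002AE8', 1500);                      -- integer milliseconds (current patch)
-- CALL pg_wait_lsn('0/3002AE8', 1.5);                    -- float8 seconds (this proposal)
-- CALL pg_wait_lsn('0/3002AE8', interval '1.5 second');  -- interval (Peter's suggestion)
```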

--
Regards,
Alexander Korotkov




Re: Add SPLIT PARTITION/MERGE PARTITIONS commands

2024-03-25 Thread Alexander Korotkov
On Tue, Sep 20, 2022 at 3:21 PM Robert Haas  wrote:
> On Mon, Sep 19, 2022 at 4:42 PM Dmitry Koval  wrote:
> > Thanks for comments and advice!
> > I thought about this problem and discussed about it with colleagues.
> > Unfortunately, I don't know of a good general solution.
>
> Yeah, me neither.
>
> > But for specific situation like this (certain partition is not changing)
> > we can add CONCURRENTLY modifier.
> > Our DDL query can be like
> >
> > ALTER TABLE...SPLIT PARTITION [CONCURRENTLY];
> >
> > With CONCURRENTLY modifier we can lock partitioned table in
> > ShareUpdateExclusiveLock mode and split partition - in
> > AccessExclusiveLock mode. So we don't lock partitioned table in
> > AccessExclusiveLock mode and can modify other partitions during SPLIT
> > operation (except split partition).
> > If smb try to modify split partition, he will receive error "relation
> > does not exist" at end of operation (because split partition will be drop).
>
> I think that a built-in DDL command can't really assume that the user
> won't modify anything. You'd have to take a ShareLock.
>
> But you might be able to have a CONCURRENTLY variant of the command
> that does the same kind of multi-transaction thing as, e.g., CREATE
> INDEX CONCURRENTLY. You would probably have to be quite careful about
> race conditions (e.g. you commit the first transaction and before you
> start the second one, someone drops or detaches the partition you were
> planning to merge or split). Might take some thought, but feels
> possibly doable. I've never been excited enough about this kind of
> thing to want to put a lot of energy into engineering it, because
> doing it "manually" feels so much nicer to me, and doubly so given
> that we now have ATTACH CONCURRENTLY and DETACH CONCURRENTLY, but it
> does seem like a thing some people would probably use and value.

+1
Currently, people are using external tools to implement this kind of
task.  However, having this functionality in core would be great.
Implementing concurrent merge/split seems to be quite a difficult task,
which needs careful design.  It might be too hard to deliver it
together with the basic syntax.  So, I think having the basic syntax
in-core is a good step forward.  But I think we need a clear notice in
the documentation about the concurrency behavior to avoid wrong user
expectations.
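For reference, a sketch of the basic non-concurrent syntax discussed here (per the patch in this thread; partition names and bounds are illustrative, and the operation takes ACCESS EXCLUSIVE locks, which is exactly what the documentation notice should cover):

```sql
-- Merge two range partitions into one:
ALTER TABLE t MERGE PARTITIONS (tp_00, tp_10) INTO tp_00_20;

-- Split a partition into two:
ALTER TABLE t SPLIT PARTITION tp_00_20 INTO
  (PARTITION tp_00 FOR VALUES FROM (0) TO (10),
   PARTITION tp_10 FOR VALUES FROM (10) TO (20));
```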

--
Regards,
Alexander Korotkov




Re: Add Index-level REINDEX with multiple jobs

2024-03-25 Thread Alexander Korotkov
On Mon, Mar 25, 2024 at 4:47 AM Richard Guo  wrote:
> On Mon, Mar 25, 2024 at 10:07 AM Tom Lane  wrote:
>>
>> Alexander Korotkov  writes:
>> > Here goes the revised patch.  I'm going to push this if there are no 
>> > objections.
>>
>> Quite a lot of the buildfarm is complaining about this:
>>
>> reindexdb.c: In function 'reindex_one_database':
>> reindexdb.c:434:54: error: 'indices_tables_cell' may be used uninitialized 
>> in this function [-Werror=maybe-uninitialized]
>>   434 | strcmp(prev_index_table_name, indices_tables_cell->val) == 0)
>>   |   ~~~^
>>
>> I noticed it first on mamba, which is set up with -Werror, but a
>> scrape of the buildfarm logs shows many other animals reporting this
>> as a warning.
>
>
> I noticed the similar warning on cfbot:
> https://cirrus-ci.com/task/6298504306360320?logs=gcc_warning#L448
>
> reindexdb.c: In function ‘reindex_one_database’:
> reindexdb.c:437:24: error: ‘indices_tables_cell’ may be used uninitialized in 
> this function [-Werror=maybe-uninitialized]
>   437 |indices_tables_cell = indices_tables_cell->next;
>   |^~~
>
> Although it's complaining on line 437 not 434, I think they are the same
> issue.
>
>>
>> I think they are right.  Even granting that the
>> compiler realizes that "parallel && process_type == REINDEX_INDEX" is
>> enough to reach the one place where indices_tables_cell is
>> initialized, that's not really enough, because that place is
>>
>> if (indices_tables_list)
>> indices_tables_cell = indices_tables_list->head;
>>
>> So I believe this code will crash if get_parallel_object_list returns
>> an empty list.  Initializing indices_tables_cell to NULL in its
>> declaration would stop the compiler warning, but if I'm right it
>> will do nothing to prevent that crash.  This needs a bit more effort.
>
>
> Agreed.  And the comment of get_parallel_object_list() says that it may
> indeed return NULL.
>
> BTW, on line 373, it checks 'process_list' and bails out if this list is
> NULL.  But it seems to me that 'process_list' cannot be NULL in this
> case, because it's initialized to be 'user_list' and we have asserted
> that user_list is not NULL on line 360.  I wonder if we should check
> indices_tables_list instead of process_list on line 373.

Thank you.  I'm going to deal with this in the next few hours.
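A minimal standalone sketch of the fix Tom describes (simplified stand-in types, not the actual reindexdb.c code): initialize the cell pointer to NULL to silence the warning, and guard the empty-list case explicitly so it cannot crash.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-ins for reindexdb's string-list machinery. */
typedef struct Cell
{
	const char *val;
	struct Cell *next;
} Cell;

typedef struct List
{
	Cell *head;
} List;

/*
 * Return the first value of a possibly-NULL or empty list, or a placeholder.
 * "cell" is initialized to NULL (addressing -Wmaybe-uninitialized), and the
 * explicit NULL checks ensure an empty result from a list-building function
 * does not lead to a NULL-pointer dereference.
 */
static const char *
first_value(const List *list)
{
	Cell *cell = NULL;

	if (list != NULL)
		cell = list->head;

	return (cell != NULL) ? cell->val : "(empty)";
}
```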

--
Regards,
Alexander Korotkov




Re: POC: Lock updated tuples in tuple_update() and tuple_delete()

2024-03-23 Thread Alexander Korotkov
On Tue, Mar 19, 2024 at 5:20 PM Alexander Korotkov  wrote:
> On Tue, Nov 28, 2023 at 11:00 AM Pavel Borisov  wrote:
> > > You're designing new APIs, days before the feature freeze.
> > On Wed, 5 Apr 2023 at 06:54, Michael Paquier  wrote:
> > >
> > > On Tue, Apr 04, 2023 at 01:25:46AM +0300, Alexander Korotkov wrote:
> > > > Pavel, thank you for your review, revisions and rebase.
> > > > We'll reconsider this once v17 is branched.
> >
> > I've looked through patches v16 once more and think they're good
> > enough, and previous issues are all addressed. I see that there is
> > nothing that blocks it from being committed except the last iteration
> > was days before v16 feature freeze.
> >
> > Recently in another thread [1] Alexander posted a new version of
> > patches v16 (as 0001 and 0002). In 0001 only indentation, comments, and
> > commit messages changed from v16 in this thread. In 0002 new test
> > eval-plan-qual-2 was integrated into the existing eval-plan-qual test.
> > For maintaining the most recent versions in this thread I'm attaching
> > them under v17. I suppose that we can commit these patches to v17 if
> > there are no objections or additional reviews.
> >
> > [1] 
> > https://www.postgresql.org/message-id/flat/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com
>
> The new revision of patches is attached.
>
> It has updated commit messages, new comments, and some variables were
> renamed to be more consistent with surroundings.
>
> I also think that all the design issues spoken before are resolved.
> It would be nice to hear from Andres about this.
>
> I'll continue rechecking these patches myself.

I've re-read this thread.  It still seems to me that the issues raised
before are addressed now.  Fingers crossed, I'm going to push this if
there are no objections.

--
Regards,
Alexander Korotkov




Re: Add Index-level REINDEX with multiple jobs

2024-03-22 Thread Alexander Korotkov
On Wed, Mar 20, 2024 at 7:19 PM Alexander Korotkov  wrote:
> On Mon, Mar 11, 2024 at 3:44 PM Maxim Orlov  wrote:
> > On Tue, 6 Feb 2024 at 09:22, Michael Paquier  wrote:
> >> The problem may be actually trickier than that, no?  Could there be
> >> other factors to take into account for their classification, like
> >> their sizes (typically, we'd want to process the biggest one first, I
> >> guess)?
> >
> >
> > Sorry for a late reply.  Thanks for the explanation.  This sounds
> > reasonable to me.
> > Svetlana had addressed this in the patch v2.
>
> I think this patch is a nice improvement.  But it doesn't seem to be
> implemented in the right way.  There is no guarantee that
> get_parallel_object_list() will return tables in the same order as
> indexes.  Especially when there is "ORDER BY idx.relpages".  Also,
> sort_indices_by_tables() has quadratic complexity (probably OK since
> input list shouldn't be too lengthy) and a bit awkward.
>
> I've revised the patchset.  Now appropriate ordering is made in SQL
> query.  The original list of indexes is modified to match the list of
> tables.  The tables are ordered by the size of their greatest index;
> within each table, indexes are ordered by size.
>
> I'm going to further revise this patch, mostly comments and the commit 
> message.

Here goes the revised patch.  I'm going to push this if there are no objections.

--
Regards,
Alexander Korotkov


v4-0001-Add-the-index-level-REINDEX-with-multiple-jobs.patch
Description: Binary data


Re: Bug in nbtree optimization to skip > operator comparisons (or < comparisons in backwards scans)

2024-03-22 Thread Alexander Korotkov
On Fri, Mar 22, 2024 at 11:58 AM Maxim Orlov  wrote:
> I've noticed this patch and had a quick look at it.  As far as I understand,
> this bug does not lead to incorrect matching, resulting only in a
> degradation in speed.
> Anyway, I consider this patch useful and hope it will be committed soon.

Pushed.
Thanks to Maxim and Pavel.

--
Regards,
Alexander Korotkov




Re: Add Index-level REINDEX with multiple jobs

2024-03-20 Thread Alexander Korotkov
On Mon, Mar 11, 2024 at 3:44 PM Maxim Orlov  wrote:
> On Tue, 6 Feb 2024 at 09:22, Michael Paquier  wrote:
>> The problem may be actually trickier than that, no?  Could there be
>> other factors to take into account for their classification, like
>> their sizes (typically, we'd want to process the biggest one first, I
>> guess)?
>
>
> Sorry for a late reply.  Thanks for the explanation.  This sounds
> reasonable to me.
> Svetlana had addressed this in the patch v2.

I think this patch is a nice improvement.  But it doesn't seem to be
implemented in the right way.  There is no guarantee that
get_parallel_object_list() will return tables in the same order as
indexes.  Especially when there is "ORDER BY idx.relpages".  Also,
sort_indices_by_tables() has quadratic complexity (probably OK since
input list shouldn't be too lengthy) and a bit awkward.

I've revised the patchset.  Now appropriate ordering is made in SQL
query.  The original list of indexes is modified to match the list of
tables.  The tables are ordered by the size of their greatest index;
within each table, indexes are ordered by size.

I'm going to further revise this patch, mostly comments and the commit message.

--
Regards,
Alexander Korotkov


v3-0001-Add-Index-level-REINDEX-with-multiple-jobs.patch
Description: Binary data


Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-03-20 Thread Alexander Korotkov
On Wed, Mar 20, 2024 at 12:34 AM Kartyshov Ivan 
wrote:
> > 4.2 With an unreasonably high future LSN, BEGIN command waits
> > unboundedly, shouldn't we check if the specified LSN is more than
> > pg_last_wal_receive_lsn() error out?

I think limiting the wait LSN by the current received LSN would destroy the
whole value of this feature.  The value is to wait till the given LSN is
replayed, whether it's already received or not.

> > BEGIN AFTER '0/FFFFFFFF';
> > SELECT pg_last_wal_receive_lsn() + 1 AS future_receive_lsn \gset
> > BEGIN AFTER :'future_receive_lsn';
>
> This case will give ERROR cause '0/FFFFFFFF' + 1 is invalid pg_lsn

FWIW,

# SELECT '0/FFFFFFFF'::pg_lsn + 1;
 ?column?
----------
 1/0
(1 row)

But I don't see a problem here. On the replica, it's out of our control to
check which lsn is good and which is not.  We can't check whether an lsn,
which is in the future for the replica, has already been issued by the primary.

For the case of wrong lsn, which could cause potentially infinite wait,
there is the timeout and the manual query cancel.

> > 4.3 With an unreasonably high wait time, BEGIN command waits
> > unboundedly, shouldn't we restrict the wait time to some max value,
> > say a day or so?
> > SELECT pg_last_wal_receive_lsn() + 1 AS future_receive_lsn \gset
> > BEGIN AFTER :'future_receive_lsn' WITHIN 10;
>
> Good idea, I set it to 1 day. But we should discuss this limit.

Do you think that specifying the timeout in milliseconds is suitable?  I would
prefer to switch to seconds (with the ability to specify fractions of a second).
This was expressed before by Alexander Lakhin.

> > 6.
> > +/* Set all latches in shared memory to signal that new LSN has been
> > replayed */
> > +void
> > +WaitLSNSetLatches(XLogRecPtr curLSN)
> > +{
> >
> > I see this patch is waking up all the waiters in the recovery path
> > after applying every WAL record, which IMO is a hot path. Is the
> > impact of this change on recovery measured, perhaps using
> > https://github.com/macdice/redo-bench or similar tools?

Ivan, could you do this?

> > 7. In continuation to comment #6, why not use Conditional Variables
> > instead of proc latches to sleep and wait for all the waiters in
> > WaitLSNSetLatches?
>
> Waiters are stored in the array sorted by LSN. This helps us wake
> only the PIDs whose LSN has been replayed, saving us from scanning the
> whole array. So it's not such a hot path.

+1
This saves us from ConditionVariableBroadcast() every time we replay the
WAL record.

> Add some fixes
>
> 1) make waiting timeout more simple (as pg_terminate_backend())
> 2) removed the 1 minute wait because INTERRUPTS don’t arrive for a
> long time, changed it to 0.5 seconds

I don't see this change in the patch.  Normally, if a process gets a signal,
that causes WaitLatch() to exit immediately.  It also exits immediately on
query cancel.  IIRC, this 1-minute timeout is needed to handle some extreme
cases when an interrupt is missing.  Other places have it equal to 1
minute.  I don't see why we should have it different here.

> 3) add more tests
> 4) added and expanded sections in the documentation

I don't see this in the patch.  I see only a short description
in func.sgml, which is definitely not sufficient.  We need at least
everything we had in the docs before, adjusted to the current
procedure-based approach.

> 5) add default variant of timeout
> pg_wait_lsn(trg_lsn pg_lsn, delay int8 DEFAULT 0)
> example: pg_wait_lsn('0/31B1B60') equal pg_wait_lsn('0/31B1B60', 0)

Does zero here mean no timeout?  I think this should be documented.  Also,
I would prefer to have a timeout by default.  Probably one minute would be
a good default.

> 6) now big timeout will be restricted to 1 day (86400000 ms)
> CALL pg_wait_lsn('0/34FB5A1',100);
> WARNING:  Timeout for pg_wait_lsn() restricted to 1 day

I don't think we need to mention individuals who made proposals in the
source code comments.  Otherwise, our source code would become a crazy
mess of names.  Also, if this is a restriction, it has to be an error.
And it should be a proper full ereport().

--
Regards,
Alexander Korotkov


Re: POC: Lock updated tuples in tuple_update() and tuple_delete()

2024-03-19 Thread Alexander Korotkov
Hi, Pavel!

On Tue, Nov 28, 2023 at 11:00 AM Pavel Borisov  wrote:
> > You're designing new APIs, days before the feature freeze.
> On Wed, 5 Apr 2023 at 06:54, Michael Paquier  wrote:
> >
> > On Tue, Apr 04, 2023 at 01:25:46AM +0300, Alexander Korotkov wrote:
> > > Pavel, thank you for your review, revisions and rebase.
> > > We'll reconsider this once v17 is branched.
>
> I've looked through patches v16 once more and think they're good
> enough, and previous issues are all addressed. I see that there is
> nothing that blocks it from being committed except the last iteration
> was days before v16 feature freeze.
>
> Recently in another thread [1] Alexander posted a new version of
> patches v16 (as 0001 and 0002). In 0001 only indentation, comments, and
> commit messages changed from v16 in this thread. In 0002 new test
> eval-plan-qual-2 was integrated into the existing eval-plan-qual test.
> For maintaining the most recent versions in this thread I'm attaching
> them under v17. I suppose that we can commit these patches to v17 if
> there are no objections or additional reviews.
>
> [1] 
> https://www.postgresql.org/message-id/flat/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com

The new revision of patches is attached.

It has updated commit messages, new comments, and some variables were
renamed to be more consistent with surroundings.

I also think that all the design issues spoken before are resolved.
It would be nice to hear from Andres about this.

I'll continue rechecking these patches myself.

--
Regards,
Alexander Korotkov


v18-0001-Allow-locking-updated-tuples-in-tuple_update-and.patch
Description: Binary data


v18-0002-Add-EvalPlanQual-delete-returning-isolation-test.patch
Description: Binary data


Re: Read data from Postgres table pages

2024-03-19 Thread Alexander Korotkov
On Tue, Mar 19, 2024 at 4:48 PM Sushrut Shivaswamy
 wrote:
>
> If we query the DB directly, is it possible to know which new rows have been 
> added since the last query?
> Is there a change pump that can be latched onto?

Please, check this.
https://www.postgresql.org/docs/current/logicaldecoding.html
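A minimal sketch of how logical decoding answers the "which rows are new since the last query" question (the table and slot names are illustrative; it requires wal_level = logical):

```sql
SELECT pg_create_logical_replication_slot('changes', 'test_decoding');

INSERT INTO t VALUES (1);

-- Each call returns only the changes since the previous call:
SELECT data FROM pg_logical_slot_get_changes('changes', NULL, NULL);
-- BEGIN / table public.t: INSERT: ... / COMMIT

SELECT pg_drop_replication_slot('changes');
```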

> I’m assuming the page data structs are encapsulated in specific headers which 
> can be used to list / read pages.
> Why would Postgres need to be stopped to read the data? The read / query path 
> in Postgres would also be reading these pages when the instance is running?

I think this would be a good point to start studying.
https://www.interdb.jp/
The information there should be more than enough to forget this idea forever :)

--
Regards,
Alexander Korotkov




Re: Read data from Postgres table pages

2024-03-19 Thread Alexander Korotkov
On Tue, Mar 19, 2024 at 4:35 PM Sushrut Shivaswamy
 wrote:
> The binary I'm trying to create should automatically be able to read data 
> from a postgres instance without users having to
> run commands for backup / pg_dump etc.
> Having access to the appropriate source headers would allow me to read the 
> data.

Please, avoid the top-posting.
https://en.wikipedia.org/wiki/Posting_style#Top-posting

If you're looking to have a separate binary, why can't your binary
just *connect* to the postgres database and query the data?  This is
what pg_dump does; you can do the same directly.  pg_dump doesn't
access the raw data.

Trying to read raw postgres data from a separate binary looks flat
wrong for your purposes.  First, you would have to replicate pretty
much all of the postgres internals inside it.  Second, you can read
consistent data only when postgres is stopped or hasn't made any
modifications since the last checkpoint.

--
Regards,
Alexander Korotkov




Re: Read data from Postgres table pages

2024-03-19 Thread Alexander Korotkov
Hi

On Tue, Mar 19, 2024 at 4:23 PM Sushrut Shivaswamy
 wrote:
> I'm trying to build a postgres export tool that reads data from table pages 
> and exports it to an S3 bucket. I'd like to avoid manual commands like 
> pg_dump, I need access to the raw data.
>
> Can you please point me to the postgres source header / cc files that 
> encapsulate this functionality?
>  - List all pages for a table
> - Read a given page for a table
>
> Any pointers to the relevant source code would be appreciated.

Why do you need to work on the source code level?
Please, check this about having a binary copy of the database on the
filesystem level.
https://www.postgresql.org/docs/current/backup-file.html

--
Regards,
Alexander Korotkov




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-03-19 Thread Alexander Korotkov
On Tue, Mar 19, 2024 at 1:51 PM Amit Kapila  wrote:
> On Mon, Mar 18, 2024 at 3:24 PM Alexander Korotkov  
> wrote:
> >
> > On Mon, Mar 18, 2024 at 5:17 AM Amit Kapila  wrote:
> > > > 1. First, check that it was called with non-atomic context (that is,
> > > > it's not called within a transaction). Trigger error if called with
> > > > atomic context.
> > > > 2. Release a snapshot to be able to wait without risk of WAL replay
> > > > stuck.  Procedure is still called within the snapshot.  It's a bit of
> > > > a hack to release a snapshot, but Vacuum statements already do so.
> > > >
> > >
> > > Can you please provide a bit more details with some example what is
> > > the existing problem with functions and how using procedures will
> > > resolve it? How will this this address the implicit transaction case
> > > or do we have any other workaround for those cases?
> >
> > Please check [1] and [2] for the explanation of the problem with functions.
> >
> > Also, please find a draft patch implementing the procedure.  The issue with 
> > the snapshot is addressed with the following lines.
> >
> > We first ensure we're in a non-atomic context, then pop an active snapshot 
> > (tricky, but ExecuteVacuum() does the same).  Then we should have no active 
> > snapshot and it's safe to wait for lsn replay.
> >
> > if (context->atomic)
> > ereport(ERROR,
> > (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> >  errmsg("pg_wait_lsn() must be only called in non-atomic 
> > context")));
> >
> > if (ActiveSnapshotSet())
> > PopActiveSnapshot();
> > Assert(!ActiveSnapshotSet());
> >
> > The function call could be added either before the BEGIN statement or 
> > before the implicit transaction.
> >
> > CALL pg_wait_lsn('my_lsn', my_timeout); BEGIN;
> > CALL pg_wait_lsn('my_lsn', my_timeout); SELECT ...;
> >
>
> I haven't thought in detail about whether there are any other problems
> with this idea but sounds like it should solve the problems you shared
> with a function call approach. BTW, if the application has to anyway
> know the LSN till where replica needs to wait, why can't they simply
> monitor the pg_last_wal_replay_lsn() value?

Amit, thank you for your feedback.

Yes, the application can monitor the pg_last_wal_replay_lsn() value;
that's our current state-of-the-art solution.  But it's rather
inconvenient, it adds extra latency and network traffic, and it can't
be wrapped into a server-side function in a procedural language for the
same reasons we can't implement it as a built-in function.
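The polling workaround looks roughly like this on the client side (the LSN is illustrative); this is what the proposed procedure collapses into a single call:

```sql
-- Repeated on the replica, one network round-trip per iteration,
-- with a client-side sleep between attempts:
SELECT pg_last_wal_replay_lsn() >= '0/3002AE8'::pg_lsn AS caught_up;

-- Versus the proposed single blocking call:
-- CALL pg_wait_lsn('0/3002AE8', 60000);
```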

--
Regards,
Alexander Korotkov




Re: Add pg_basetype() function to obtain a DOMAIN base type

2024-03-18 Thread Alexander Korotkov
On Mon, Mar 18, 2024 at 2:01 AM jian he  wrote:
>
> looking at it again.
> I found out we can just simply do
> `
> Datum
> pg_basetype(PG_FUNCTION_ARGS)
> {
> Oid oid;
>
> oid =  get_fn_expr_argtype(fcinfo->flinfo, 0);
> PG_RETURN_OID(getBaseType(oid));
> }
> `
>
> if the type is not a domain, it works the same as pg_typeof().
> if the type is a domain, pg_typeof() returns it as is, while
> pg_basetype() returns the base type.
> so they only diverge when the argument is of a domain type.
>
> the doc:
> pg_basetype ( "any" )
> regtype
>
>
>Returns the OID of the base type of a domain. If the argument
> is not a domain type,
>returns the OID of the data type of the argument, just like pg_typeof().
>If there's a chain of domain dependencies, it will recurse
> until it finds the base type.
>
>
>
> also, I think this way, we only do one syscache lookup.

Looks good to me.  But should it be named pg_basetypeof()?
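The "recurse until finding the base type" behavior boils down to
following the chain of domain dependencies; a minimal sketch (the dict
stands in for pg_type typbasetype lookups — and since getBaseType()
already loops internally, the C function only needs one call):

```python
def base_type(typid, domain_basetype):
    """Follow a chain of domain dependencies down to the base type.

    domain_basetype maps a domain's OID to the OID of the type it is
    defined over (a stand-in for a pg_type typbasetype lookup); any OID
    missing from the map is already a base type.
    """
    while typid in domain_basetype:
        typid = domain_basetype[typid]
    return typid
```

For a non-domain type the loop body never runs, which mirrors the
proposed function behaving like pg_typeof() in that case.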

--
Regards,
Alexander Korotkov




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-03-18 Thread Alexander Korotkov
On Mon, Mar 18, 2024 at 5:17 AM Amit Kapila  wrote:
> On Sun, Mar 17, 2024 at 7:40 PM Alexander Korotkov 
wrote:
> > On Sat, Mar 16, 2024 at 5:05 PM Bharath Rupireddy
> >  wrote:
> > > On Sat, Mar 16, 2024 at 4:26 PM Amit Kapila 
wrote:
> > > > In general, it seems this patch has been stuck for a long time on
the
> > > > decision to choose an appropriate UI (syntax), and we thought of
> > > > moving it further so that the other parts of the patch can be
> > > > reviewed/discussed. So, I feel before pushing this we should see
> > > > comments from a few (at least two) other senior members who earlier
> > > > shared their opinion on the syntax. I know we don't have much time
> > > > left but OTOH pushing such a change (where we didn't have a
consensus
> > > > on syntax) without much discussion at this point of time could lead
to
> > > > discussions after commit.
> > >
> > > +1 to gain consensus first on the syntax changes. With this, we might
> > > be violating the SQL standard for explicit txn commands (I stand for
> > > correction about the SQL standard though).
> >
> > Thank you for your feedback.  Generally, I agree it's correct to get
> > consensus on syntax first.  And yes, this patch has been here since
> > 2016.  We didn't get consensus for syntax for 8 years.  Frankly
> > speaking, I don't see a reason why this shouldn't take another 8
> > years.  At the same time the ability to wait on standby given LSN is
> > replayed seems like pretty basic and simple functionality.  Thus, it's
> > quite frustrating it already took that long and still unclear when/how
> > this could be finished.
> >
> > My current attempt was to commit minimal implementation as less
> > invasive as possible.  A new clause for BEGIN doesn't require
> > additional keywords and doesn't introduce additional statements.  But
> > yes, this is still a new qual.  And, yes, Amit you're right that even
> > if I had committed that, there was still a high risk of further
> > debates and revert.
> >
> > Given my skepticism about agreement over syntax, I'd like to check
> > another time if we could go without new syntax at all.  There was an
> > attempt to implement waiting for lsn as a function.  But function
> > holds a snapshot, which could prevent WAL records from being replayed.
> > Releasing a snapshot could break the parent query.  But now we have
> > procedures, which need a dedicated statement for the call and can even
> > control transactions.  Could we implement a waitlsn in a procedure
> > that:
> >
> > 1. First, checks that it was called in a non-atomic context (that
> > is, not within an explicit transaction block), and raises an error
> > otherwise.
> > 2. Releases the snapshot to be able to wait without risk of getting
> > WAL replay stuck.  The procedure is still called within a snapshot.
> > It's a bit of a hack to release it, but VACUUM statements already do
> > so.
>
> Can you please provide a bit more details with some example what is
> the existing problem with functions and how using procedures will
> resolve it? How will this address the implicit transaction case
> or do we have any other workaround for those cases?

Please check [1] and [2] for the explanation of the problem with functions.

Also, please find a draft patch implementing the procedure.  The issue with
the snapshot is addressed with the following lines.

We first ensure we're in a non-atomic context, then pop an active snapshot
(tricky, but ExecuteVacuum() does the same).  Then we should have no active
snapshot and it's safe to wait for lsn replay.

    if (context->atomic)
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("pg_wait_lsn() must be only called in non-atomic context")));

    if (ActiveSnapshotSet())
        PopActiveSnapshot();
    Assert(!ActiveSnapshotSet());

The function call could be added either before the BEGIN statement or
before the implicit transaction.

CALL pg_wait_lsn('my_lsn', my_timeout); BEGIN;
CALL pg_wait_lsn('my_lsn', my_timeout); SELECT ...;

Links
1.
https://www.postgresql.org/message-id/CAPpHfduBSN8j5j5Ynn5x%3DThD%3D8ypNd53D608VXGweBsPzxPvqA%40mail.gmail.com
2.
https://www.postgresql.org/message-id/CAPpHfdtiGgn0iS1KbW2HTam-1%2BoK%2BvhXZDAcnX9hKaA7Oe%3DF-A%40mail.gmail.com

--
Regards,
Alexander Korotkov


v11-0001-Implement-AFTER-clause-for-BEGIN-command.patch
Description: Binary data


Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-03-17 Thread Alexander Korotkov
Hi Amit,
Hi Bharath,

On Sat, Mar 16, 2024 at 5:05 PM Bharath Rupireddy
 wrote:
> On Sat, Mar 16, 2024 at 4:26 PM Amit Kapila  wrote:
> > In general, it seems this patch has been stuck for a long time on the
> > decision to choose an appropriate UI (syntax), and we thought of
> > moving it further so that the other parts of the patch can be
> > reviewed/discussed. So, I feel before pushing this we should see
> > comments from a few (at least two) other senior members who earlier
> > shared their opinion on the syntax. I know we don't have much time
> > left but OTOH pushing such a change (where we didn't have a consensus
> > on syntax) without much discussion at this point of time could lead to
> > discussions after commit.
>
> +1 to gain consensus first on the syntax changes. With this, we might
> be violating the SQL standard for explicit txn commands (I stand for
> correction about the SQL standard though).

Thank you for your feedback.  Generally, I agree it's correct to get
consensus on syntax first.  And yes, this patch has been here since
2016.  We didn't get consensus for syntax for 8 years.  Frankly
speaking, I don't see a reason why this shouldn't take another 8
years.  At the same time, the ability to wait on a standby until a
given LSN is replayed seems like pretty basic and simple
functionality.  Thus, it's quite frustrating that it has already taken
that long and it's still unclear when/how this could be finished.

My current attempt was to commit a minimal implementation, as little
invasive as possible.  A new clause for BEGIN doesn't require
additional keywords and doesn't introduce additional statements.  But
yes, this is still a new clause.  And, yes, Amit, you're right that
even if I had committed that, there was still a high risk of further
debates and a revert.

Given my skepticism about agreement over syntax, I'd like to check
another time if we could go without new syntax at all.  There was an
attempt to implement waiting for lsn as a function.  But function
holds a snapshot, which could prevent WAL records from being replayed.
Releasing a snapshot could break the parent query.  But now we have
procedures, which need a dedicated statement for the call and can even
control transactions.  Could we implement a waitlsn in a procedure
that:

1. First, checks that it was called in a non-atomic context (that is,
not within an explicit transaction block), and raises an error
otherwise.
2. Releases the snapshot to be able to wait without risk of getting
WAL replay stuck.  The procedure is still called within a snapshot.
It's a bit of a hack to release it, but VACUUM statements already do so.

Amit, Bharath, what do you think about this approach?  Is this a way to go?

--
Regards,
Alexander Korotkov




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-03-15 Thread Alexander Korotkov
On Fri, Mar 15, 2024 at 10:32 PM Kartyshov Ivan
 wrote:
>
> On 2024-03-15 22:59, Kartyshov Ivan wrote:
> > On 2024-03-11 13:44, Alexander Korotkov wrote:
> >> I picked the second option and left only the AFTER clause for the
> >> BEGIN statement.  I think this should be enough for the beginning.
> >
> > Thank you for your rework on your patch, here I made some fixes:
> > 0) autocomplete
> > 1) less jumps
> > 2) more description and add cases in doc

Thank you!

> > I think, it will be useful to have stand-alone statement.
> > Why you would like to see only AFTER clause for the BEGIN statement?

Yes, stand-alone statements might also be useful.  But I think that
the best way for this feature to get into core would be to commit
the minimal version first.  The BEGIN clause has minimal invasiveness
for the syntax and, I believe, covers the most typical use cases.  Once
we figure out it's OK and have positive feedback from users, we can add
more enhancements incrementally.

> Rebase and update patch.

Cool, I was just about to ask you to do this.

--
Regards,
Alexander Korotkov




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-03-15 Thread Alexander Korotkov
On Fri, Mar 15, 2024 at 4:20 PM Alexander Korotkov 
wrote:

> On Mon, Mar 11, 2024 at 12:44 PM Alexander Korotkov
>  wrote:
> > I've decided to put my hands on this patch.
> >
> > On Thu, Mar 7, 2024 at 2:25 PM Amit Kapila 
> wrote:
> > > +1 for the second one not only because it avoids new words in grammar
> > > but also sounds to convey the meaning. I think you can explain in docs
> > > how this feature can be used basically how will one get the correct
> > > LSN value to specify.
> >
> > I picked the second option and left only the AFTER clause for the
> > BEGIN statement.  I think this should be enough for the beginning.
> >
> > > As suggested previously also pick one of the approaches (I would
> > > advocate the second one) and keep an option for the second one by
> > > mentioning it in the commit message. I hope to see more
> > > reviews/discussions or usage like how will users get the LSN value to
> > > be specified on the core logic of the feature at this stage. IF
> > > possible, state, how real-world applications could leverage this
> > > feature.
> >
> > I've added a paragraph to the docs about the usage.  After you made
> > some changes on primary, you run pg_current_wal_insert_lsn().  Then
> > connect to replica and run 'BEGIN AFTER lsn' with the just obtained
> > LSN.  Now you're guaranteed to see the changes made to the primary.
> >
> > Also, I've significantly reworked other aspects of the patch.  The
> > most significant changes are:
> > 1) Waiters are now stored in the array sorted by LSN.  This saves us
> > from scanning the whole per-backend array.
> > 2) Waiters are removed from the array immediately once their LSNs are
> > replayed.  Otherwise, the WAL replayer will keep scanning the shared
> > memory array till waiters wake up.
> > 3) To clean up after errors, we now call WaitLSNCleanup() on backend
> > shmem exit.  I think this is preferable over the previous approach to
> > remove from the queue before ProcessInterrupts().
> > 4) There is now a condition to recheck if the LSN is replayed after
> > adding to the shared memory array.  This should protect against race
> > conditions.
> > 5) I've renamed too generic names for functions and files.
>
> I went through this patch another time, and made some minor
> adjustments.  Now it looks good, I'm going to push it if no
> objections.
>

The revised patch version with cosmetic fixes proposed by Alexander Lakhin.

--
Regards,
Alexander Korotkov


v10-0001-Implement-AFTER-clause-for-BEGIN-command.patch
Description: Binary data


Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-03-15 Thread Alexander Korotkov
On Mon, Mar 11, 2024 at 12:44 PM Alexander Korotkov
 wrote:
> I've decided to put my hands on this patch.
>
> On Thu, Mar 7, 2024 at 2:25 PM Amit Kapila  wrote:
> > +1 for the second one not only because it avoids new words in grammar
> > but also sounds to convey the meaning. I think you can explain in docs
> > how this feature can be used basically how will one get the correct
> > LSN value to specify.
>
> I picked the second option and left only the AFTER clause for the
> BEGIN statement.  I think this should be enough for the beginning.
>
> > As suggested previously also pick one of the approaches (I would
> > advocate the second one) and keep an option for the second one by
> > mentioning it in the commit message. I hope to see more
> > reviews/discussions or usage like how will users get the LSN value to
> > be specified on the core logic of the feature at this stage. IF
> > possible, state, how real-world applications could leverage this
> > feature.
>
> I've added a paragraph to the docs about the usage.  After you made
> some changes on primary, you run pg_current_wal_insert_lsn().  Then
> connect to replica and run 'BEGIN AFTER lsn' with the just obtained
> LSN.  Now you're guaranteed to see the changes made to the primary.
>
> Also, I've significantly reworked other aspects of the patch.  The
> most significant changes are:
> 1) Waiters are now stored in the array sorted by LSN.  This saves us
> from scanning the whole per-backend array.
> 2) Waiters are removed from the array immediately once their LSNs are
> replayed.  Otherwise, the WAL replayer will keep scanning the shared
> memory array till waiters wake up.
> 3) To clean up after errors, we now call WaitLSNCleanup() on backend
> shmem exit.  I think this is preferable over the previous approach to
> remove from the queue before ProcessInterrupts().
> 4) There is now a condition to recheck if the LSN is replayed after
> adding to the shared memory array.  This should protect against race
> conditions.
> 5) I've renamed too generic names for functions and files.
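A rough model of the waiter bookkeeping in points 1, 2, and 4 above
(names are illustrative, not the actual patch code; real backends sleep
on latches rather than being returned from a function):

```python
import bisect

class WaitLSNQueue:
    """Waiters kept sorted by target LSN: the replayer only inspects the
    queue head instead of scanning every backend slot, and satisfied
    waiters are removed immediately."""

    def __init__(self):
        self._waiters = []  # (target_lsn, backend_id), sorted by LSN

    def add_waiter(self, target_lsn, backend_id, current_replay_lsn):
        # Recheck against the current replay position: if the LSN is
        # already replayed, don't enqueue at all.  This avoids the race
        # where the wakeup happens before the waiter is registered.
        if current_replay_lsn >= target_lsn:
            return False  # no need to wait
        bisect.insort(self._waiters, (target_lsn, backend_id))
        return True

    def replay_advanced_to(self, replay_lsn):
        """Called by the replayer; returns backends to wake up."""
        woken = []
        while self._waiters and self._waiters[0][0] <= replay_lsn:
            woken.append(self._waiters.pop(0)[1])
        return woken
```

Keeping the array sorted means the replayer's scan stops at the first
unsatisfied waiter, rather than walking every backend on each record.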

I went through this patch another time, and made some minor
adjustments.  Now it looks good, I'm going to push it if no
objections.

--
Regards,
Alexander Korotkov


v9-0001-Implement-AFTER-clause-for-BEGIN-command.patch
Description: Binary data


Re: psql: fix variable existence tab completion

2024-03-14 Thread Alexander Korotkov
On Sun, Mar 3, 2024 at 5:37 PM Erik Wienhold  wrote:
> On 2024-03-03 03:00 +0100, Steve Chavez wrote:
> > psql has the :{?name} syntax for testing a psql variable existence.
> >
> > But currently doing \echo :{?VERB doesn't trigger tab completion.
> >
> > This patch fixes it. I've also included a TAP test.
>
> Thanks.  The code looks good, all tests pass, and the tab completion
> works as expected when testing manually.

A nice improvement.  I've checked why we have '{' in WORD_BREAKS at
all, and whether we're going to break anything by removing it.  It
seems that '{' has been here from the very beginning, and it comes from
the default value of rl_basic_word_break_characters [1].  It seems that
:{?name} is the only usage of the '{' sign in psql.  So, removing it
from WORD_BREAKS shouldn't break anything.
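To illustrate, the same change can be shown with Python's readline
bindings (a sketch; the default break set below is the one documented
for rl_basic_word_break_characters, and the completer itself is
omitted):

```python
import readline

# The default GNU Readline word-break characters include '{'; psql's
# WORD_BREAKS was derived from that set.  Dropping '{' makes completion
# treat ':{?name' as a single word, so '\echo :{?VERB<TAB>' can be
# completed instead of breaking the word at the brace.
DEFAULT_BREAKS = " \t\n\"\\'`@$><=;|&{("
psql_like_breaks = DEFAULT_BREAKS.replace("{", "")
readline.set_completer_delims(psql_like_breaks)
```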

I'm going to push this patch if no objections.

Links.
1. 
https://tiswww.case.edu/php/chet/readline/readline.html#index-rl_005fbasic_005fword_005fbreak_005fcharacters

--
Regards,
Alexander Korotkov




Re: POC, WIP: OR-clause support for indexes

2024-03-14 Thread Alexander Korotkov
On Thu, Mar 14, 2024 at 12:11 PM Andrei Lepikhov
 wrote:
>
> On 14/3/2024 16:31, Alexander Korotkov wrote:
> > On Wed, Mar 13, 2024 at 2:16 PM Andrei Lepikhov
> > mailto:a.lepik...@postgrespro.ru>> wrote:
> >  > On 13/3/2024 18:05, Alexander Korotkov wrote:
> >  > > On Wed, Mar 13, 2024 at 7:52 AM Andrei Lepikhov
> >  > > Given all of the above, I think moving transformation to the
> >  > > canonicalize_qual() would be the right way to go.
> >  > Ok, I will try to move the code.
> >  > I have no idea about the timings so far. I recall the last time I got
> >  > bogged down in tons of duplicated code. I hope with an almost-ready
> >  > sketch, it will be easier.
> >
> > Thank you!  I'll be looking forward to the updated patch.
> Okay, I moved the 0001-* patch to the prepqual.c module. See it in the
> attachment. I treat it as a transient patch.
> It has positive outcomes as well as negative ones.
> The most damaging result you can see in the partition_prune test:
> partition pruning, in some cases, moved to the executor initialization
> stage. I guess, we should avoid it somehow in the next version.

Thank you, Andrei.  Looks like a very undesirable side effect.  Do you
have any idea why it happens?  Partition pruning should work correctly
for both transformed and non-transformed quals, so why does the
transformation hurt it?

--
Regards,
Alexander Korotkov




Re: POC, WIP: OR-clause support for indexes

2024-03-14 Thread Alexander Korotkov
On Wed, Mar 13, 2024 at 2:16 PM Andrei Lepikhov 
wrote:
> On 13/3/2024 18:05, Alexander Korotkov wrote:
> > On Wed, Mar 13, 2024 at 7:52 AM Andrei Lepikhov
> > Given all of the above, I think moving transformation to the
> > canonicalize_qual() would be the right way to go.
> Ok, I will try to move the code.
> I have no idea about the timings so far. I recall the last time I got
> bogged down in tons of duplicated code. I hope with an almost-ready
> sketch, it will be easier.

Thank you!  I'll be looking forward to the updated patch.

I also have notes about the bitmap patch.

/*
 * Building index paths over a SAOP clause differs from the logic of OR
 * clauses.  Here we iterate across all the array elements and split them
 * into SAOPs corresponding to different indexes.  We must match each
 * element to an index.
 */

This covers the case I posted before.  But in order to fix all possible
cases, we probably need to handle the SAOP clause in the same way as OR
clauses.  Consider also this case.

Setup
create table t (a int not null, b int not null, c int not null);
insert into t (select 1, 1, i from generate_series(1,1) i);
insert into t (select i, 2, 2 from generate_series(1,1) i);
create index t_a_b_idx on t (a, b);
create statistics t_a_b_stat (mcv) on a, b from t;
create statistics t_b_c_stat (mcv) on b, c from t;
vacuum analyze t;

Plan with enable_or_transformation = on:
# explain select * from t where a = 1 and (b = 1 or b = 2) and c = 2;
  QUERY PLAN
--
 Bitmap Heap Scan on t  (cost=156.55..440.56 rows=5001 width=12)
   Recheck Cond: (a = 1)
   Filter: ((b = ANY ('{1,2}'::integer[])) AND (c = 2))
   ->  Bitmap Index Scan on t_a_b_idx  (cost=0.00..155.29 rows=10001 width=0)
 Index Cond: (a = 1)
(5 rows)

Plan with enable_or_transformation = off:
# explain select * from t where a = 1 and (b = 1 or b = 2) and c = 2;
  QUERY PLAN
--
 Bitmap Heap Scan on t  (cost=11.10..18.32 rows=5001 width=12)
   Recheck Cond: (((b = 1) AND (c = 2)) OR ((a = 1) AND (b = 2)))
   Filter: ((a = 1) AND (c = 2))
   ->  BitmapOr  (cost=11.10..11.10 rows=2 width=0)
 ->  Bitmap Index Scan on t_b_c_idx  (cost=0.00..4.30 rows=1 width=0)
   Index Cond: ((b = 1) AND (c = 2))
 ->  Bitmap Index Scan on t_a_b_idx  (cost=0.00..4.30 rows=1 width=0)
   Index Cond: ((a = 1) AND (b = 2))
(8 rows)

As you can see, this case is not related to partial indexes; there is
just no single index selective for the whole query.  However, splitting
the scan by the OR qual lets us use a combination of two selective
indexes.

--
Regards,
Alexander Korotkov


Re: May be BUG. Periodic burst growth of the checkpoint_req counter on replica.

2024-03-13 Thread Alexander Korotkov
On Mon, Mar 11, 2024 at 11:48 AM Alexander Korotkov
 wrote:
>
> On Mon, Mar 11, 2024 at 5:43 AM Anton A. Melnikov
>  wrote:
> > On 11.03.2024 03:39, Alexander Korotkov wrote:
> > > Now that we distinguish stats for checkpoints and
> > > restartpoints, we need to update the docs.  Please, check the patch
> > > attached.
> >
> > Maybe bring the pg_stat_get_checkpointer_buffers_written() description into
> > consistency with these changes?
> > Like that:
> >
> > --- a/src/include/catalog/pg_proc.dat
> > +++ b/src/include/catalog/pg_proc.dat
> > @@ -5740 +5740 @@
> > -  descr => 'statistics: number of buffers written by the checkpointer',
> > +  descr => 'statistics: number of buffers written during checkpoints and 
> > restartpoints',
>
> This makes sense.  I've included this into the revised patch.

Pushed.

--
Regards,
Alexander Korotkov




Re: POC, WIP: OR-clause support for indexes

2024-03-13 Thread Alexander Korotkov
On Wed, Mar 13, 2024 at 7:52 AM Andrei Lepikhov
 wrote:
> On 12/3/2024 22:20, Alexander Korotkov wrote:
> > On Mon, Mar 11, 2024 at 2:43 PM Andrei Lepikhov
> >> I think you are right. It is probably a better place than any other to
> >> remove duplicates in an array. I just think we should sort and remove
> >> duplicates from entry->consts in one pass. Thus, this optimisation
> >> should be applied to sortable constants.
> >
> > Ok.
> New version of the patch set implemented all we have agreed on for now.
> We can return MAX_SAOP_ARRAY_SIZE constraint and Alena's approach to
> duplicates deletion for non-sortable cases at the end.
> >
> >> Hmm, we already tried to do it at that point. I vaguely recall some
> >> issues caused by this approach. Anyway, it should be done as quickly as
> >> possible to increase the effect of the optimization.
> >
> > I think there were provided quite strong reasons why this shouldn't be
> > implemented at the parse analysis stage [1], [2], [3].  The
> > canonicalize_qual() looks quite appropriate place for that since it
> > does similar transformations.
> Ok. Let's discuss these reasons. In Robert's opinion [1,3], we should do
> the transformation based on the cost model. But in the canonicalize_qual
> routine, we still make the transformation blindly. Moreover, the second
> patch reduces the weight of this reason, doesn't it? Maybe we shouldn't
> think about that as about optimisation but some 'general form of
> expression'?
> Peter [2] worries about the possible transformation outcomes at this
> stage. But remember, we already transform clauses like ROW() IN (...) to
> a series of ORs here, so it is not an issue. Am I wrong?
> Why did we discard the attempt with canonicalize_qual on the previous
> iteration? - The stage of parsing is much more native for building SAOP
> quals. We can reuse make_scalar_array_op and other stuff, for example.
> During the optimisation stage, the only list partitioning machinery
> creates SAOP based on a list of constants. So, in theory, it is possible
> to implement. But do we really need to make the code more complex?

As we currently do the OR-to-ANY transformation at the parse stage, the
system catalog (including views, inheritance clauses, partial and
expression indexes, and others) would have a form depending on
enable_or_transformation at the moment of DDL execution.  I think this
is rather wrong.  enable_or_transformation should be a run-time
optimization that affects the resulting query plan; its result
shouldn't be persistent.

Regarding the ROW() IN (...) precedent.

1. AFAICS, this is not exactly an optimization.  This transformation
allows us to perform type matching individually for every value.
Therefore it allows executing some queries which would otherwise end
up with an error.
2. I don't think this is a sample of good design.  This is rather a
hack, which is here for historical reasons, but we don't want to
replicate this experience.

Given all of the above, I think moving transformation to the
canonicalize_qual() would be the right way to go.

--
Regards,
Alexander Korotkov




Re: Transaction timeout

2024-03-13 Thread Alexander Korotkov
On Wed, Mar 13, 2024 at 7:56 AM Andrey M. Borodin  wrote:
> > On 13 Mar 2024, at 05:23, Alexander Korotkov  wrote:
> >
> > On Tue, Mar 12, 2024 at 10:28 AM Andrey M. Borodin  
> > wrote:
> >>> On 11 Mar 2024, at 16:18, Alexander Korotkov  wrote:
> >>>
> >>> I think if checking psql stderr is problematic, checking just logs is
> >>> fine.  Could we wait for the relevant log messages one by one with
> >>> $node->wait_for_log() just like 040_standby_failover_slots_sync.pl do?
> >>
> >> PFA version with $node->wait_for_log()
> >
> > I've slightly revised the patch.  I'm going to push it if no objections.
>
> One small note: log_offset was updated, but was not used.

Thank you. This is the updated version.

--
Regards,
Alexander Korotkov


v7-0001-Add-TAP-tests-for-timeouts.patch
Description: Binary data


Re: Transaction timeout

2024-03-12 Thread Alexander Korotkov
On Tue, Mar 12, 2024 at 10:28 AM Andrey M. Borodin  wrote:
> > On 11 Mar 2024, at 16:18, Alexander Korotkov  wrote:
> >
> > I think if checking psql stderr is problematic, checking just logs is
> > fine.  Could we wait for the relevant log messages one by one with
> > $node->wait_for_log() just like 040_standby_failover_slots_sync.pl do?
>
> PFA version with $node->wait_for_log()

I've slightly revised the patch.  I'm going to push it if no objections.

--
Regards,
Alexander Korotkov


v5-0001-Add-TAP-tests-for-timeouts.patch
Description: Binary data


Re: POC, WIP: OR-clause support for indexes

2024-03-12 Thread Alexander Korotkov
On Mon, Mar 11, 2024 at 2:43 PM Andrei Lepikhov
 wrote:
> On 11/3/2024 18:31, Alexander Korotkov wrote:
> >> I'm not convinced about this limit. The initial reason was to combine
> >> long lists of ORs into the array because such a transformation made at
> >> an early stage increases efficiency.
> >> I understand the necessity of this limit in the array decomposition
> >> routine but not in the creation one.
> >
> > The comment near MAX_SAOP_ARRAY_SIZE says that this limit is because
> > N^2 algorithms could be applied to arrays.  Are you sure that's not
> > true for our case?
> When you operate an array, indeed. But when we transform ORs to an
> array, not. Just check all the places in the optimizer and even the
> executor where we would pass along the list of ORs. This is why I think
> we should use this optimization even more intensively for huge numbers
> of ORs in an attempt to speed up the overall query.

Ok.

> >>> 3) Better save the original order of clauses by putting hash entries and
> >>> untransformable clauses to the same list.  A lot of differences in
> >>> regression tests output have gone.
> >> I agree that reducing the number of changes in regression tests looks
> >> better. But to achieve this, you introduced a hack that increases the
> >> complexity of the code. Is it worth it? Maybe it would be better to make
> >> one-time changes in tests instead of getting this burden on board. Or
> >> have you meant something more introducing the node type?
> >
> > For me the reason is not just a regression test.  The current code
> > keeps the original order of quals as much as possible.  The OR
> > transformation code reorders quals even in cases when it doesn't
> > eventually apply any optimization.  I don't think that's acceptable.
> > However, less hackery ways for this is welcome for sure.
> Why is it unacceptable? Can the user implement some order-dependent
> logic with clauses, and will it be correct?
> Otherwise, it is a matter of taste, and generally, this decision is up
> to you.

I think it is an important property that the user sees the quals in
the plan in the same order as they were in the query.  And if some
transformations are applied, then the order is preserved as much as
possible.  I don't think we should sacrifice this property without
strong reasons.  A bit of code complexity is definitely not such a
reason for me.

> >>> We don't make array values unique.  That might make query execution
> >>> performance somewhat worse, and also makes selectivity estimation
> >>> worse.  I suggest Andrei and/or Alena should implement making array
> >>> values unique.
> >> The fix Alena has made looks correct. But I urge you to think twice:
> >> The optimizer doesn't care about duplicates, so why do we do it?
> >> What's more, this optimization is intended to speed up queries with long
> >> OR lists. Using the list_append_unique() comparator on such lists could
> >> impact performance. I suggest sticking to the common rule and leaving
> >> the responsibility on the user's shoulders.
> >
> > I don't see why the optimizer doesn't care about duplicates for OR
> > lists.  As I showed before in an example, it successfully removes the
> > duplicate.  So, currently OR transformation clearly introduces a
> > regression in terms of selectivity estimation.  I think we should
> > evade that.
> I think you are right. It is probably a better place than any other to
> remove duplicates in an array. I just think we should sort and remove
> duplicates from entry->consts in one pass. Thus, this optimisation
> should be applied to sortable constants.

Ok.
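The "sort and remove duplicates in one pass" idea for sortable
constants could be sketched as follows (illustrative names; the real
code would compare array constants with the type's sort operator rather
than a Python key function):

```python
def sorted_unique_consts(consts, sort_key=None):
    """Sort and de-duplicate array constants in one O(n log n) sort plus
    a single linear sweep, instead of repeated list_append_unique()
    scans which are O(n^2) on long OR lists.  If the type has no sort
    operator (sort_key is None), return the constants unchanged,
    matching the proposed restriction to sortable constants.
    """
    if sort_key is None:
        return list(consts)
    out = []
    for c in sorted(consts, key=sort_key):
        if not out or sort_key(out[-1]) != sort_key(c):
            out.append(c)
    return out
```

Doing this once, right before building the array, keeps the cost linear
in the (already sorted) list rather than quadratic in the OR length.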

> >> At least, we should do this optimization later, in one pass, with
> >> sorting elements before building the array. But what if we don't have a
> >> sort operator for the type?
> >
> > It was probably discussed before, but can we do our work later?  There
> > is a canonicalize_qual() which calls find_duplicate_ors().  This is
> > the place where currently duplicate OR clauses are removed.  Could our
> > OR-to-ANY transformation be just another call from
> > canonicalize_qual()?
> Hmm, we already tried to do it at that point. I vaguely recall some
> issues caused by this approach. Anyway, it should be done as quickly as
> possible to increase the effect of the optimization.

I think there were provided quite strong reasons why this shouldn't be
implemented at the parse analysis stage [1], [2], [3].  The
canonicalize_qual() looks quite appropriate place for that since it
does similar transformations.

Links.
1. 
https://www.postgresql.org/message-id/CA%2BTgmoZCgP6FrBQEusn4yaWm02XU8OPeoEMk91q7PRBgwaAkFw%40mail.gmail.com
2. 
https://www.postgresql.org/message-id/CAH2-Wzm2%3Dnf_JhiM3A2yetxRs8Nd2NuN3JqH%3Dfm_YWYd1oYoPg%40mail.gmail.com
3. 
https://www.postgresql.org/message-id/CA%2BTgmoaOiwMXBBTYknczepoZzKTp-Zgk5ss1%2BCuVQE-eFTqBmA%40mail.gmail.com

--
Regards,
Alexander Korotkov




Re: collect_corrupt_items_vacuum.patch

2024-03-12 Thread Alexander Korotkov
On Sun, Jan 14, 2024 at 4:35 AM Alexander Korotkov  wrote:
> On Sat, Jan 13, 2024 at 7:33 PM Dmitry Koval  wrote:
> > Thank you, there is one small point left (in the comment): can you
> > replace "guarantteed to be to be newer" to "guaranteed to be newer",
> > file src/backend/storage/ipc/procarray.c?
>
> Fixed.  Thank you for catching this.

I made the following improvements to the patch.
1. I found a way to implement the patch with fewer changes to the core
code.  The GetRunningTransactionData() function allows getting the
least running xid; all I needed was to add database-aware values.
2. I added a TAP test reproducing the original issue.

I'm going to push this if no objections.
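The database-aware "least running xid" computation is conceptually
simple; a sketch (the list of (xid, dboid) pairs stands in for the
procarray contents, and xid wraparound is ignored for brevity — the
real code compares with TransactionIdPrecedes()):

```python
def least_running_xids(running_xacts):
    """Compute the oldest (least) running xid overall and per database.

    running_xacts: iterable of (xid, dboid) pairs, one per running
    transaction, mimicking what GetRunningTransactionData() exposes.
    """
    overall = None
    per_db = {}
    for xid, dboid in running_xacts:
        if overall is None or xid < overall:
            overall = xid
        if dboid not in per_db or xid < per_db[dboid]:
            per_db[dboid] = xid
    return overall, per_db
```

With per-database values available, a check can use the least running
xid of the relevant database instead of the cluster-wide minimum, which
avoids false corruption reports from unrelated databases' transactions.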

--
Regards,
Alexander Korotkov


v6-0001-Fix-false-reports-in-pg_visibility.patch
Description: Binary data


Re: POC, WIP: OR-clause support for indexes

2024-03-11 Thread Alexander Korotkov
Hi Andrei,

Thank you for your response.

On Mon, Mar 11, 2024 at 7:13 AM Andrei Lepikhov
 wrote:
> On 7/3/2024 21:51, Alexander Korotkov wrote:
> > Hi!
> >
> > On Tue, Mar 5, 2024 at 9:59 AM Andrei Lepikhov
> > mailto:a.lepik...@postgrespro.ru>> wrote:
> >  > On 5/3/2024 12:30, Andrei Lepikhov wrote:
> >  > > On 4/3/2024 09:26, jian he wrote:
> >  > ... and the new version of the patchset is attached.
> >
> > I made some revisions for the patchset.
> Great!
> > 1) Use hash_combine() to combine hash values.
> Looks better
> > 2) Upper limit the number of array elements by MAX_SAOP_ARRAY_SIZE.
>
> I'm not convinced about this limit. The initial reason was to combine
> long lists of ORs into the array because such a transformation made at
> an early stage increases efficiency.
> I understand the necessity of this limit in the array decomposition
> routine but not in the creation one.

The comment near MAX_SAOP_ARRAY_SIZE says that this limit is because
N^2 algorithms could be applied to arrays.  Are you sure that's not
true for our case?

> > 3) Better save the original order of clauses by putting hash entries and
> > untransformable clauses to the same list.  A lot of differences in
> > regression tests output have gone.
> I agree that reducing the number of changes in regression tests looks
> better. But to achieve this, you introduced a hack that increases the
> complexity of the code. Is it worth it? Maybe it would be better to make
> one-time changes in tests instead of getting this burden on board. Or
> have you meant something more introducing the node type?

For me, the reason is not just the regression tests.  The current code
keeps the original order of quals as much as possible.  The OR
transformation code reorders quals even in cases when it doesn't
eventually apply any optimization.  I don't think that's acceptable.
However, less hacky ways to achieve this are certainly welcome.

> > We don't make array values unique.  That might make query execution
> > performance somewhat worse, and also makes selectivity estimation
> > worse.  I suggest Andrei and/or Alena should implement making array
> > values unique.
> The fix Alena has made looks correct. But I urge you to think twice:
> The optimizer doesn't care about duplicates, so why do we do it?
> What's more, this optimization is intended to speed up queries with long
> OR lists. Using the list_append_unique() comparator on such lists could
> impact performance. I suggest sticking to the common rule and leaving
> the responsibility on the user's shoulders.

I don't see why the optimizer wouldn't care about duplicates in OR
lists.  As I showed before in an example, it successfully removes the
duplicate.  So, currently the OR transformation clearly introduces a
regression in terms of selectivity estimation.  I think we should
avoid that.
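To illustrate the concern, here is a sketch (table and values are hypothetical) of how a duplicate currently disappears during qual canonicalization, whereas an OR-to-ANY rewrite that doesn't deduplicate would carry it into the array:

```sql
-- Without the transformation, find_duplicate_ors() collapses the
-- duplicate, so the planner estimates selectivity for a single qual:
EXPLAIN SELECT * FROM tbl WHERE a = 1 OR a = 1;

-- With an OR-to-ANY rewrite that keeps duplicates, the clause becomes
-- the equivalent of:
EXPLAIN SELECT * FROM tbl WHERE a = ANY (ARRAY[1, 1]);
-- Here the duplicate array element is counted twice when the OR-style
-- selectivities are combined, inflating the row estimate.
```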

> At least, we should do this optimization later, in one pass, with
> sorting elements before building the array. But what if we don't have a
> sort operator for the type?

It was probably discussed before, but can we do our work later?  There
is a canonicalize_qual() which calls find_duplicate_ors().  This is
the place where currently duplicate OR clauses are removed.  Could our
OR-to-ANY transformation be just another call from
canonicalize_qual()?

--
Regards,
Alexander Korotkov




Re: Transaction timeout

2024-03-11 Thread Alexander Korotkov
On Mon, Mar 11, 2024 at 12:53 PM Andrey M. Borodin  wrote:
> > On 7 Mar 2024, at 00:55, Alexander Korotkov  wrote:
> >
> > On Wed, Mar 6, 2024 at 10:22 AM Andrey M. Borodin  
> > wrote:
> >>> On 25 Feb 2024, at 21:50, Alexander Korotkov  wrote:
> >>>
> >>> Thank you for the patches.  I've pushed the 0001 patch to avoid
> >>> further failures on buildfarm.  Let 0004 wait till injections points
> >>> by Mechael are committed.
> >>
> >> Thanks!
> >>
> >> All prerequisites are committed. I propose something in a line with this 
> >> patch.
> >
> > Thank you.  I took a look at the patch.  Should we also check the
> > relevant message after the timeout is fired?  We could check it in
> > psql stderr or log for that.
>
> PFA version which checks log output.
> But I could not come up with a proper use of BackgroundPsql->query_until() to 
> check outputs. And there are multiple possible errors.
>
> We can copy test from src/bin/psql/t/001_basic.pl:
>
> # test behavior and output on server crash
> my ($ret, $out, $err) = $node->psql('postgres',
> "SELECT 'before' AS running;\n"
> . "SELECT pg_terminate_backend(pg_backend_pid());\n"
> . "SELECT 'AFTER' AS not_running;\n");
>
> is($ret, 2, 'server crash: psql exit code');
> like($out, qr/before/, 'server crash: output before crash');
> ok($out !~ qr/AFTER/, 'server crash: no output after crash');
> is( $err,
> 'psql::2: FATAL: terminating connection due to administrator command
> psql::2: server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> psql::2: error: connection to server was lost',
> 'server crash: error message');
>
> But I do not see much value in this.
> What do you think?

I think if checking psql stderr is problematic, checking just the logs
is fine.  Could we wait for the relevant log messages one by one with
$node->wait_for_log(), just like 040_standby_failover_slots_sync.pl does?

--
Regards,
Alexander Korotkov




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-03-11 Thread Alexander Korotkov
Hi!

I've decided to put my hands on this patch.

On Thu, Mar 7, 2024 at 2:25 PM Amit Kapila  wrote:
> +1 for the second one not only because it avoids new words in grammar
> but also sounds to convey the meaning. I think you can explain in docs
> how this feature can be used basically how will one get the correct
> LSN value to specify.

I picked the second option and left only the AFTER clause for the
BEGIN statement.  I think this should be enough for the beginning.

> As suggested previously also pick one of the approaches (I would
> advocate the second one) and keep an option for the second one by
> mentioning it in the commit message. I hope to see more
> reviews/discussions or usage like how will users get the LSN value to
> be specified on the core logic of the feature at this stage. IF
> possible, state, how real-world applications could leverage this
> feature.

I've added a paragraph to the docs about the usage.  After you make
some changes on the primary, you run pg_current_wal_insert_lsn().  Then
you connect to the replica and run 'BEGIN AFTER lsn' with the LSN just
obtained.  Now you're guaranteed to see the changes made on the primary.
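Concretely, the documented flow could look like this (table name and LSN value are hypothetical; the AFTER syntax is the one proposed by this patch):

```sql
-- On the primary: make a change, then capture the insert LSN.
INSERT INTO accounts (id, balance) VALUES (1, 100);
SELECT pg_current_wal_insert_lsn();   -- suppose it returns '0/3000060'

-- On the replica: start a transaction that waits until the LSN
-- obtained above has been replayed before proceeding.
BEGIN AFTER '0/3000060';
SELECT balance FROM accounts WHERE id = 1;  -- the insert is visible
COMMIT;
```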

Also, I've significantly reworked other aspects of the patch.  The
most significant changes are:
1) Waiters are now stored in an array sorted by LSN.  This saves us
from scanning the whole per-backend array.
2) Waiters are removed from the array immediately once their LSNs are
replayed.  Otherwise, the WAL replayer would keep scanning the shared
memory array till the waiters wake up.
3) To clean up after errors, we now call WaitLSNCleanup() on backend
shmem exit.  I think this is preferable over the previous approach of
removing from the queue before ProcessInterrupts().
4) There is now a condition to recheck whether the LSN is replayed
after adding to the shared memory array.  This should save us from
race conditions.
5) I've renamed overly generic function and file names.

--
Regards,
Alexander Korotkov


v8-0001-Implement-AFTER-clause-for-BEGIN-command.patch
Description: Binary data


Re: May be BUG. Periodic burst growth of the checkpoint_req counter on replica.

2024-03-11 Thread Alexander Korotkov
On Mon, Mar 11, 2024 at 5:43 AM Anton A. Melnikov
 wrote:
> On 11.03.2024 03:39, Alexander Korotkov wrote:
> > Now that we distinguish stats for checkpoints and
> > restartpoints, we need to update the docs.  Please, check the patch
> > attached.
>
> Maybe bring the pg_stat_get_checkpointer_buffers_written() description into 
> consistency with these changes?
> Like that:
>
> --- a/src/include/catalog/pg_proc.dat
> +++ b/src/include/catalog/pg_proc.dat
> @@ -5740 +5740 @@
> -  descr => 'statistics: number of buffers written by the checkpointer',
> +  descr => 'statistics: number of buffers written during checkpoints and 
> restartpoints',

This makes sense.  I've included this into the revised patch.

> And after i took a fresh look at this patch i noticed a simple way to extract
> write_time and sync_time counters for restartpoints too.
>
> What do you think, is there a sense to do this?

I'm not sure we need this.  The ways we trigger checkpoints and
restartpoints are different.  This is why we needed separate
statistics.  But the process of writing buffers is the same.

--
Regards,
Alexander Korotkov


rename_pg_stat_get_checkpointer_fields_v2.patch
Description: Binary data


Re: Stack overflow issue

2024-03-10 Thread Alexander Korotkov
On Thu, Mar 7, 2024 at 11:07 AM Alexander Korotkov  wrote:
> On Thu, Mar 7, 2024 at 9:53 AM Egor Chindyaskin  wrote:
> >
> > > 6 march 2024 г., at 19:17, Alexander Korotkov  
> > > wrote:
> > >
> > > The revised set of remaining patches is attached.
> > >
> > > 0001 Turn tail recursion into iteration in CommitTransactionCommand()
> > > I did minor revision of comments and code blocks order to improve the
> > > readability.
> > >
> > > 0002 Avoid stack overflow in ShowTransactionStateRec()
> > > I didn't notice any issues, leave this piece as is.
> > >
> > > 0003 Avoid recursion in MemoryContext functions
> > > I've renamed MemoryContextTraverse() => MemoryContextTraverseNext(),
> > > which I think is a bit more intuitive.  Also I fixed
> > > MemoryContextMemConsumed(), which was still trying to use the removed
> > > argument "print" of MemoryContextStatsInternal() function.
> > >
> > > Generally, I think this patchset fixes important stack overflow holes.
> > > It is quite straightforward, clear and the code has a good shape.  I'm
> > > going to push this if no objections.
> >
> > I have tested the scripts from message [1]. After applying these patches 
> > and Tom Lane’s patch from message [2], all of the above mentioned functions 
> > no longer caused the server to crash. I also tried increasing the values in 
> > the presented scripts, which also did not lead to server crashes. Thank you!
> > Also, I would like to clarify something. Will fixes from message [3] and 
> > others be backported to all other branches, not just the master branch? As 
> > far as I remember, Tom Lane made corrections to all branches. For example 
> > [4].
> >
> > Links:
> > 1. 
> > https://www.postgresql.org/message-id/343ff14f-3060-4f88-9cc6-efdb390185df%40mail.ru
> > 2. https://www.postgresql.org/message-id/386032.1709765547%40sss.pgh.pa.us
> > 3. 
> > https://www.postgresql.org/message-id/CAPpHfduZqAjF%2B7rDRP-RGNHjOXy7nvFROQ0MGS436f8FPY5DpQ%40mail.gmail.com
> > 4. https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=e07ebd4b
>
> Thank you for your feedback!
>
> Initially I didn't intend to backpatch any of these.  But on second
> thought with the references you provided, I think we should backpatch
> simple check_stack_depth() checks from d57b7cc333 to all supported
> branches, but apply refactoring of memory contexts and transaction
> commit/abort just to master.  Opinions?

I've just backpatched check_stack_depth() checks to all supported branches.

--
Regards,
Alexander Korotkov



