Re: Document efficient self-joins / UPDATE LIMIT techniques.

2024-04-07 Thread Tom Lane
Corey Huinker  writes:
>> I've also used the technique quite a lot, but only using the PK,
>> didn't know about the ctid trick, so many thanks for documenting it.

> tid-scans only became a thing a few versions ago (12?). Prior to that, PK
> was the only way to go.

I think we had TID scans for awhile before it was possible to use
them in joins, although I don't recall the details of that.
Anyway, pushed after some additional wordsmithing.

regards, tom lane




Re: Document efficient self-joins / UPDATE LIMIT techniques.

2024-02-15 Thread Corey Huinker
>
> > As for whether it's commonplace, when I was a consultant I had a number
> > of customers that I had who bemoaned how large updates caused big
> > replica lag, basically punishing access to records they did care about
> > in order to properly archive or backfill records they don't care about.
> > I used the technique a lot, putting the update/delete in a loop, and
> > often running multiple copies of the same script at times when I/O
> > contention was low, but if load levels rose it was trivial to just kill
> > a few of the scripts until things calmed down.
>
> I've also used the technique quite a lot, but only using the PK,
> didn't know about the ctid trick, so many thanks for documenting it.


tid-scans only became a thing a few versions ago (12?). Prior to that, PK
was the only way to go.


Re: Document efficient self-joins / UPDATE LIMIT techniques.

2024-02-14 Thread Joel Jacobson
On Tue, Feb 13, 2024, at 23:56, Corey Huinker wrote:
> This patch came out of a discussion at the last PgCon with the person 
> who made the "fringe feature" quote, who seemed quite supportive of 
> documenting the technique. The comment may have been in regards to 
> actually implementing a LIMIT clause on UPDATE and DELETE, which isn't 
> in the SQL standard and would be difficult to implement as the two 
> statements have no concept of ordering. Documenting the workaround 
> would alleviate some interest in implementing a nonstandard feature.

Thanks for sharing the background story.

> As for whether it's commonplace, when I was a consultant I had a number 
> of customers that I had who bemoaned how large updates caused big 
> replica lag, basically punishing access to records they did care about 
> in order to properly archive or backfill records they don't care about. 
> I used the technique a lot, putting the update/delete in a loop, and 
> often running multiple copies of the same script at times when I/O 
> contention was low, but if load levels rose it was trivial to just kill 
> a few of the scripts until things calmed down.

I've also used the technique quite a lot, but only using the PK,
didn't know about the ctid trick, so many thanks for documenting it.

/Joel




Re: Document efficient self-joins / UPDATE LIMIT techniques.

2024-02-13 Thread Corey Huinker
On Tue, Feb 13, 2024 at 11:51 AM Joel Jacobson  wrote:

> On Tue, Feb 13, 2024, at 10:28, Laurenz Albe wrote:
> > On Mon, 2024-02-12 at 12:24 -0500, Corey Huinker wrote:
> >> > Do you plan to add it to the commitfest?  If yes, I'd set it "ready
> for committer".
> >>
> >> Commitfest entry reanimated.
> >
> > Truly... you created a revenant in the already closed commitfest.
> >
> > I closed that again and added a new entry in the open commitfest.
> >
> > Yours,
> > Laurenz Albe
>
> This thread reminded me of the old discussion "LIMIT for UPDATE and
> DELETE" from 2014 [1].
>
> Back in 2014, it was considered a "fringe feature" by some. It is thought
> to be more commonplace today?
>
> /Joel
>
> [1]
> https://www.postgresql.org/message-id/flat/CADB9FDf-Vh6RnKAMZ4Rrg_YP9p3THdPbji8qe4qkxRuiOwm%3Dmg%40mail.gmail.com


This patch came out of a discussion at the last PgCon with the person who
made the "fringe feature" quote, who seemed quite supportive of documenting
the technique. The comment may have been in regards to actually
implementing a LIMIT clause on UPDATE and DELETE, which isn't in the SQL
standard and would be difficult to implement as the two statements have no
concept of ordering. Documenting the workaround would alleviate some
interest in implementing a nonstandard feature.

As for whether it's commonplace, when I was a consultant I had a number of
customers that I had who bemoaned how large updates caused big replica lag,
basically punishing access to records they did care about in order to
properly archive or backfill records they don't care about. I used the
technique a lot, putting the update/delete in a loop, and often running
multiple copies of the same script at times when I/O contention was low,
but if load levels rose it was trivial to just kill a few of the scripts
until things calmed down.


Re: Document efficient self-joins / UPDATE LIMIT techniques.

2024-02-13 Thread Joel Jacobson
On Tue, Feb 13, 2024, at 10:28, Laurenz Albe wrote:
> On Mon, 2024-02-12 at 12:24 -0500, Corey Huinker wrote:
>> > Do you plan to add it to the commitfest?  If yes, I'd set it "ready for 
>> > committer".
>> 
>> Commitfest entry reanimated. 
>
> Truly... you created a revenant in the already closed commitfest.
>
> I closed that again and added a new entry in the open commitfest.
>
> Yours,
> Laurenz Albe

This thread reminded me of the old discussion "LIMIT for UPDATE and DELETE" 
from 2014 [1].

Back in 2014, it was considered a "fringe feature" by some. It is thought to be 
more commonplace today?

/Joel

[1] 
https://www.postgresql.org/message-id/flat/CADB9FDf-Vh6RnKAMZ4Rrg_YP9p3THdPbji8qe4qkxRuiOwm%3Dmg%40mail.gmail.com




Re: Document efficient self-joins / UPDATE LIMIT techniques.

2024-02-13 Thread Laurenz Albe
On Mon, 2024-02-12 at 12:24 -0500, Corey Huinker wrote:
> > Do you plan to add it to the commitfest?  If yes, I'd set it "ready for 
> > committer".
> 
> Commitfest entry reanimated. 

Truly... you created a revenant in the already closed commitfest.

I closed that again and added a new entry in the open commitfest.

Yours,
Laurenz Albe




Re: Document efficient self-joins / UPDATE LIMIT techniques.

2024-02-12 Thread Corey Huinker
>
> Do you plan to add it to the commitfest?  If yes, I'd set it "ready for
> committer".
>
> Commitfest entry reanimated.


Re: Document efficient self-joins / UPDATE LIMIT techniques.

2024-02-12 Thread Laurenz Albe
On Mon, 2024-02-12 at 11:45 -0500, Corey Huinker wrote:
> 
> > - I am not sure if it is necessary to have the  at all.
> >   I'd say that it is just a trivial variation of the UPDATE example.
> >   On the other hand, a beginner might find the example useful.
> >   Not sure.
> 
> I think a beginner would find it useful. The join syntax for DELETE is 
> different from
> UPDATE in a way that has never made sense to me, and a person with only the 
> UPDATE
> example might try just replacing UPDATE WITH DELETE and eliminating the SET 
> clause,
> and frustration would follow. We have an opportunity to show the equivalent 
> join in
> both cases, let's use it.

I think we can leave the decision to the committer.

> > I think the "in" before between is unnecessary and had better be removed, 
> > but
> > I'll defer to the native speaker.
> 
> The "in" is more common when spoken. Removed.

The "in" is appropriate for intransitive use:
"I've been here and I've been there and I've been in between."
But: "I have been between here and there."

Do you plan to add it to the commitfest?  If yes, I'd set it "ready for 
committer".

Yours,
Laurenz Albe




Re: Document efficient self-joins / UPDATE LIMIT techniques.

2024-02-12 Thread Corey Huinker
>
>
> - About the style: there is usually an empty line between an ending 
>   and the next starting .  It does not matter for correctness, but I
>   think it makes the source easier to read.
>

Done. I've seen them with spaces and without, and have no preference.


>
> - I would rather have only "here" as link text rather than "in greater
> details
>   here".  Even better would be something that gives the reader a clue where
>   the link will take her, like
>   the documentation of
> UPDATE.
>

Done.

>
> - I am not sure if it is necessary to have the  at all.
>   I'd say that it is just a trivial variation of the UPDATE example.
>   On the other hand, a beginner might find the example useful.
>   Not sure.
>

I think a beginner would find it useful. The join syntax for DELETE is
different from UPDATE in a way that has never made sense to me, and a
person with only the UPDATE example might try just replacing UPDATE WITH
DELETE and eliminating the SET clause, and frustration would follow. We
have an opportunity to show the equivalent join in both cases, let's use it.



> I think the "in" before between is unnecessary and had better be removed,
> but
> I'll defer to the native speaker.
>

The "in" is more common when spoken. Removed.
From a6b57bf3a88c5df614b5dede99af3e99fe8e8089 Mon Sep 17 00:00:00 2001
From: Corey Huinker 
Date: Mon, 12 Feb 2024 11:32:49 -0500
Subject: [PATCH v3] Documentation: Show alternatives to LIMIT on UPDATE and
 DELETE

Show examples of how to simulate UPDATE or DELETE with a LIMIT clause.

These examples also serve to show the existence and utility of ctid self-joins.
---
 doc/src/sgml/ref/delete.sgml | 18 ++
 doc/src/sgml/ref/update.sgml | 37 
 2 files changed, 55 insertions(+)

diff --git a/doc/src/sgml/ref/delete.sgml b/doc/src/sgml/ref/delete.sgml
index 1b81b4e7d7..1544a28e18 100644
--- a/doc/src/sgml/ref/delete.sgml
+++ b/doc/src/sgml/ref/delete.sgml
@@ -234,6 +234,24 @@ DELETE FROM films
In some cases the join style is easier to write or faster to
execute than the sub-select style.
   
+  
+   While there is no LIMIT clause for
+   DELETE, it is possible to get a similar effect
+   using the method for UPDATE operations described
+   the documentation of UPDATE.
+
+WITH delete_batch AS (
+  SELECT l.ctid
+  FROM user_logs AS l
+  WHERE l.status = 'archived'
+  ORDER BY l.creation_date
+  LIMIT 1
+  FOR UPDATE
+)
+DELETE FROM user_logs AS ul
+USING delete_branch AS del
+WHERE ul.ctid = del.ctid;
+
  
 
  
diff --git a/doc/src/sgml/ref/update.sgml b/doc/src/sgml/ref/update.sgml
index 2ab24b0523..ed3dd029c7 100644
--- a/doc/src/sgml/ref/update.sgml
+++ b/doc/src/sgml/ref/update.sgml
@@ -442,6 +442,43 @@ COMMIT;
 
 UPDATE films SET kind = 'Dramatic' WHERE CURRENT OF c_films;
 
+  
+   Updates affecting many rows can have negative effects on system performance,
+   such as table bloat, increased replica lag, increased lock contention,
+   and possible failure of the operation due to a deadlock. In such situations
+   it can make sense to perform the operation in smaller batches. Performing a
+   VACUUM operation on the table between batches can help
+   reduce table bloat. The
+   SQL standard does
+   not define a LIMIT clause for UPDATE
+   operations, but it is possible get a similar effect through the use of a
+   Common Table Expression and an
+   efficient self-join via the system column
+   ctid:
+
+WITH exceeded_max_retries AS (
+  SELECT w.ctid
+  FROM work_item AS w
+  WHERE w.status = 'active'
+  AND w.num_retries > 10
+  ORDER BY w.retry_timestamp
+  FOR UPDATE
+  LIMIT 5000
+)
+UPDATE work_item
+SET status = 'failed'
+FROM exceeded_max_retries AS emr
+WHERE work_item.ctid = emr.ctid
+
+If lock contention is a concern, then SKIP LOCKED can
+be added to the CTE. However, one final
+UPDATE without SKIP LOCKED or
+LIMIT will be needed to ensure that no matching rows
+were overlooked. The use of an ORDER BY clause allows
+the command to prioritize which rows will be locked and updated. This can
+also reduce contention with other update operations if they use the same
+ordering.
+  
  
 
  
-- 
2.43.0



Re: Document efficient self-joins / UPDATE LIMIT techniques.

2024-02-07 Thread Laurenz Albe
On Sat, 2024-02-03 at 15:27 -0500, Corey Huinker wrote:
> 
> Here's another attempt, applying Laurenz's feedback:

I like this patch much better.

Some comments:

> --- a/doc/src/sgml/ref/delete.sgml
> +++ b/doc/src/sgml/ref/delete.sgml
> @@ -234,6 +234,24 @@ DELETE FROM films
> In some cases the join style is easier to write or faster to
> execute than the sub-select style.
>
> +  
> +   While there is no LIMIT clause for
> +   DELETE, it is possible to get a similar effect
> +   using the method for UPDATE operations described
> +   in greater detail here.
> +
> +WITH delete_batch AS (
> +  SELECT l.ctid
> +  FROM user_logs AS l
> +  WHERE l.status = 'archived'
> +  ORDER BY l.creation_date
> +  LIMIT 1
> +  FOR UPDATE
> +)
> +DELETE FROM user_logs AS ul
> +USING delete_branch AS del
> +WHERE ul.ctid = del.ctid;
> +
>   
>  
>   

- About the style: there is usually an empty line between an ending 
  and the next starting .  It does not matter for correctness, but I
  think it makes the source easier to read.

- I would rather have only "here" as link text rather than "in greater details
  here".  Even better would be something that gives the reader a clue where
  the link will take her, like
  the documentation of 
UPDATE.

- I am not sure if it is necessary to have the  at all.
  I'd say that it is just a trivial variation of the UPDATE example.
  On the other hand, a beginner might find the example useful.
  Not sure.

If I had my way, I'd just keep the first paragraph, something like

  
   While there is no LIMIT clause for
   DELETE, it is possible to get a similar effect
   using a self-join with a common table expression as described in the
   UPDATE examples.
  


> diff --git a/doc/src/sgml/ref/update.sgml b/doc/src/sgml/ref/update.sgml
> index 2ab24b0523..49e0dc29de 100644
> --- a/doc/src/sgml/ref/update.sgml
> +++ b/doc/src/sgml/ref/update.sgml
> @@ -434,7 +434,6 @@ UPDATE wines SET stock = stock + 24 WHERE winename = 
> 'Chateau Lafite 2003';
>  COMMIT;
>  
>
> -
>
> Change the kind column of the table
> films in the row on which the cursor

Please don't.


I'm mostly fine with the UPDATE example.

> +   it can make sense to perform the operation in smaller batches. Performing 
> a
> +   VACUUM operation on the table in between batches can 
> help
> +   reduce table bloat. The

I think the "in" before between is unnecessary and had better be removed, but
I'll defer to the native speaker.

Yours,
Laurenz Albe




Re: Document efficient self-joins / UPDATE LIMIT techniques.

2024-02-03 Thread Corey Huinker
>
> I have changed the status of commitfest entry to "Returned with
> Feedback" as Laurenz's comments have not yet been resolved. Please
> handle the comments and update the commitfest entry accordingly.
>
>
Here's another attempt, applying Laurenz's feedback:

I removed all changes to the SELECT documentation. That might seem strange
given that the heavy lifting happens in the SELECT, but I'm working from
the assumption that people's greatest need for a ctid self-join will be
because they are trying to find the LIMIT keyword on UPDATE/DELETE and
coming up empty.

Because the join syntax is subtly different between UPDATE and DELETE, I've
kept code examples in both, but the detailed explanation is in UPDATE under
the anchor "update-limit" and the DELETE example links to it.
From 298c812838491408e6910f7535067ea147abe5fc Mon Sep 17 00:00:00 2001
From: Corey Huinker 
Date: Sat, 3 Feb 2024 14:38:50 -0500
Subject: [PATCH v2] Documentation: Show alternatives to LIMIT on UPDATE and
 DELETE

Show examples of how to simulate UPDATE or DELETE with a LIMIT clause.

These examples also serve to show the existence and utility of ctid self-joins.
---
 doc/src/sgml/ref/delete.sgml | 18 +
 doc/src/sgml/ref/update.sgml | 38 +++-
 2 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/ref/delete.sgml b/doc/src/sgml/ref/delete.sgml
index 1b81b4e7d7..21aae30e91 100644
--- a/doc/src/sgml/ref/delete.sgml
+++ b/doc/src/sgml/ref/delete.sgml
@@ -234,6 +234,24 @@ DELETE FROM films
In some cases the join style is easier to write or faster to
execute than the sub-select style.
   
+  
+   While there is no LIMIT clause for
+   DELETE, it is possible to get a similar effect
+   using the method for UPDATE operations described
+   in greater detail here.
+
+WITH delete_batch AS (
+  SELECT l.ctid
+  FROM user_logs AS l
+  WHERE l.status = 'archived'
+  ORDER BY l.creation_date
+  LIMIT 1
+  FOR UPDATE
+)
+DELETE FROM user_logs AS ul
+USING delete_branch AS del
+WHERE ul.ctid = del.ctid;
+
  
 
  
diff --git a/doc/src/sgml/ref/update.sgml b/doc/src/sgml/ref/update.sgml
index 2ab24b0523..49e0dc29de 100644
--- a/doc/src/sgml/ref/update.sgml
+++ b/doc/src/sgml/ref/update.sgml
@@ -434,7 +434,6 @@ UPDATE wines SET stock = stock + 24 WHERE winename = 'Chateau Lafite 2003';
 COMMIT;
 
   
-
   
Change the kind column of the table
films in the row on which the cursor
@@ -442,6 +441,43 @@ COMMIT;
 
 UPDATE films SET kind = 'Dramatic' WHERE CURRENT OF c_films;
 
+  
+   Updates affecting many rows can have negative effects on system performance,
+   such as table bloat, increased replica lag, increased lock contention,
+   and possible failure of the operation due to a deadlock. In such situations
+   it can make sense to perform the operation in smaller batches. Performing a
+   VACUUM operation on the table in between batches can help
+   reduce table bloat. The
+   SQL standard does
+   not define a LIMIT clause for UPDATE
+   operations, but it is possible get a similar effect through the use of a
+   Common Table Expression and an
+   efficient self-join via the system column
+   ctid:
+
+WITH exceeded_max_retries AS (
+  SELECT w.ctid
+  FROM work_item AS w
+  WHERE w.status = 'active'
+  AND w.num_retries > 10
+  ORDER BY w.retry_timestamp
+  FOR UPDATE
+  LIMIT 5000
+)
+UPDATE work_item
+SET status = 'failed'
+FROM exceeded_max_retries AS emr
+WHERE work_item.ctid = emr.ctid
+
+If lock contention is a concern, then SKIP LOCKED can
+be added to the CTE. However, one final
+UPDATE without SKIP LOCKED or
+LIMIT will be needed to ensure that no matching rows
+were overlooked. The use of an ORDER BY clause allows
+the command to prioritize which rows will be locked and updated. This can
+also reduce contention with other update operations if they use the same
+ordering.
+  
  
 
  
-- 
2.43.0



Re: Document efficient self-joins / UPDATE LIMIT techniques.

2024-01-14 Thread vignesh C
On Tue, 31 Oct 2023 at 23:42, Corey Huinker  wrote:
>>
>>
>> I think the SQL statements should end with semicolons.  Our SQL examples
>> are usually written like that.
>
>
> ok
>
>
>>
>>
>> Our general style with CTEs seems to be (according to
>> https://www.postgresql.org/docs/current/queries-with.html):
>>
>>  WITH quaxi AS (
>>  SELECT ...
>>  )
>>  SELECT ...;
>
>
> done
>
>>
>>
>> About the DELETE example:
>> -
>>
>> The text suggests that a single, big DELETE operation can consume
>> too many resources.  That may be true, but the sum of your DELETEs
>> will consume even more resources.
>>
>> In my experience, the bigger problem with bulk deletes like that is
>> that you can run into deadlocks easily, so maybe that would be a
>> better rationale to give.  You could say that with this technique,
>> you can force the lock to be taken in a certain order, which will
>> avoid the possibility of deadlock with other such DELETEs.
>
>
> I've changed the wording to address your concerns:
>
>While doing this will actually increase the total amount of work 
> performed, it can break the work into chunks that have a more acceptable 
> impact on other workloads.
>
>
>>
>>
>> About the SELECT example:
>> -
>>
>> That example belongs to UPDATE, I'd say, because that is the main
>> operation.
>
>
> I'm iffy on that suggestion. A big part of putting it in SELECT was the fact 
> that it shows usage of SKIP LOCKED and FOR UPDATE.
>
>>
>>
>> The reason you give (avoid excessive locking) is good.
>> Perhaps you could mention that updating in batches also avoids
>> excessive bload (if you VACUUM between the batches).
>
>
> I went with:
>
>This technique has the additional benefit that it can reduce the overal 
> bloat of the updated table if the table can be vacuumed in between batch 
> updates.
>
>>
>>
>> About the UPDATE example:
>> -
>>
>> I think that could go, because it is pretty similar to the previous
>> one.  You even use ctid in both examples.
>
>
> It is similar, but the idea here is to aid in discovery. A user might miss 
> the technique for update if it's only documented in delete, and even if they 
> did see it there, they might not realize that it works for both UPDATE and 
> DELETE. We could make reference links from one to the other, but that seems 
> like extra work for the reader.

I have changed the status of commitfest entry to "Returned with
Feedback" as Laurenz's comments have not yet been resolved. Please
handle the comments and update the commitfest entry accordingly.

Regards,
Vignesh




Re: Document efficient self-joins / UPDATE LIMIT techniques.

2023-11-02 Thread Laurenz Albe
On Tue, 2023-10-31 at 14:12 -0400, Corey Huinker wrote:
> 
> 
> > About the SELECT example:
> > -
> > 
> > That example belongs to UPDATE, I'd say, because that is the main
> > operation.
> 
> I'm iffy on that suggestion. A big part of putting it in SELECT was the fact
> that it shows usage of SKIP LOCKED and FOR UPDATE.

I can accept that.

> 
> > About the UPDATE example:
> > -
> > 
> > I think that could go, because it is pretty similar to the previous
> > one.  You even use ctid in both examples.
> 
> It is similar, but the idea here is to aid in discovery. A user might miss the
> technique for update if it's only documented in delete, and even if they did 
> see
> it there, they might not realize that it works for both UPDATE and DELETE.
> We could make reference links from one to the other, but that seems like extra
> work for the reader.

I am talking about the similarity between the SELECT and the UPDATE example.
I don't agree with bloating the documentation with redundant examples just
to save a user a click.

I like the idea of a link. Perhaps:

  If you need to perform a large UPDATE in batches to avoid excessive bloat,
  deadlocks or to reduce the load on the server, look at the example in .

Other observations:

  @@ -234,6 +234,35 @@ DELETE FROM films
  In some cases the join style is easier to write or faster to
  execute than the sub-select style.
 
  +  
  +   In situations where a single operation would consume too many resources,
  +   either causing the operation to fail or negatively impacting other 
workloads,
  +   it may be desirable to break up a large DELETE into
  +   multiple separate commands. While doing this will actually increase the
  +   total amount of work performed, it can break the work into chunks that 
have
  +   a more acceptable impact on other workloads.  The
  +   SQL standard does
  +   not define a LIMIT clause for DELETE
  +   operations, but it is possible get the equivalent functionality through 
the
  +   USING clause to a
  +   Common Table Expression which 
identifies
  +   a subset of rows to be deleted, locks those rows, and returns their system
  +   column ctid values:

I don't think that reducing the load on the server is such a great use case
that we should recommend it as "best practice" in the documentation (because,
as your patch now mentions, it doesn't reduce the overall load).

I also don't think we need a verbal description of what the following query 
does.

How about something like:

"If you have to delete lots of rows, it can make sense to perform the operation
 in several smaller batches to reduce the risk of deadlocks.  The
 SQL standard does
 not define a LIMIT clause for DELETE,
 but it is possible to achieve a similar effect with a self-join on
 the system column ctid:"

  +
  +WITH delete_batch AS (
  +  SELECT l.ctid
  +  FROM user_logs AS l
  +  WHERE l.status = 'archived'
  +  ORDER BY l.creation_date
  +  LIMIT 1
  +  FOR UPDATE
  +)
  +DELETE FROM user_logs AS ul
  +USING delete_branch AS del
  +WHERE ul.ctid = del.ctid;
  +
  +  This allows for flexible search criteria within the CTE and an efficient 
self-join.
  +  

The last sentence is redundant, I'd say.

But you could add:

"An added benefit is that by using an ORDER BY clause in
 the subquery, you can determine the order in which the rows will be locked
 and deleted, which will prevent deadlocks with other statements that lock
 the rows in the same order."

But if you do that, you had better use "ORDER BY id" or something else that
looks more like a unique column.

--- a/doc/src/sgml/ref/select.sgml
+++ b/doc/src/sgml/ref/select.sgml
@@ -1679,6 +1679,30 @@ SELECT * FROM (SELECT * FROM mytable FOR UPDATE) ss 
WHERE col1 = 5;
 condition is not textually within the sub-query.


+   
+In cases where a DML operation involving many rows

I think we should avoid using DML.  Beginner might not know it, and it is
not an index term.  My suggestion is "data modification statement/operation".

+must be performed, and that table experiences numerous other simultaneous
+DML operations, a FOR UPDATE clause
+used in conjunction with SKIP LOCKED can be useful for
+performing partial DML operations:
+
+
+WITH mods AS (
+SELECT ctid FROM mytable
+WHERE status = 'active' AND retries > 10
+ORDER BY id FOR UPDATE SKIP LOCKED
+)
+UPDATE mytable SET status = 'failed'
+FROM mods WHERE mytable.ctid = mods.ctid;
+
+
+This allows the DML operation to be performed in parts, 
avoiding locking,
+until such time as the set of rows that remain to be modified is small 
enough

"until such time as" does not sound English to me.  "Until the number of rows 
that remain"
would be better, in my opinion.

+that the locking will not affect overall performance, at which point the 
same

"that the locking" --> "that locking them"

+statement can be issued without the SKIP LOCKED clause 
to 

Re: Document efficient self-joins / UPDATE LIMIT techniques.

2023-10-31 Thread Corey Huinker
>
>
> I think the SQL statements should end with semicolons.  Our SQL examples
> are usually written like that.
>

ok



>
> Our general style with CTEs seems to be (according to
> https://www.postgresql.org/docs/current/queries-with.html):
>
>  WITH quaxi AS (
>  SELECT ...
>  )
>  SELECT ...;
>

done


>
> About the DELETE example:
> -
>
> The text suggests that a single, big DELETE operation can consume
> too many resources.  That may be true, but the sum of your DELETEs
> will consume even more resources.
>
> In my experience, the bigger problem with bulk deletes like that is
> that you can run into deadlocks easily, so maybe that would be a
> better rationale to give.  You could say that with this technique,
> you can force the lock to be taken in a certain order, which will
> avoid the possibility of deadlock with other such DELETEs.
>

I've changed the wording to address your concerns:

   While doing this will actually increase the total amount of work
performed, it can break the work into chunks that have a more acceptable
impact on other workloads.



>
> About the SELECT example:
> -
>
> That example belongs to UPDATE, I'd say, because that is the main
> operation.
>

I'm iffy on that suggestion. A big part of putting it in SELECT was the
fact that it shows usage of SKIP LOCKED and FOR UPDATE.


>
> The reason you give (avoid excessive locking) is good.
> Perhaps you could mention that updating in batches also avoids
> excessive bload (if you VACUUM between the batches).
>

I went with:

   This technique has the additional benefit that it can reduce the overal
bloat of the updated table if the table can be vacuumed in between batch
updates.


>
> About the UPDATE example:
> -
>
> I think that could go, because it is pretty similar to the previous
> one.  You even use ctid in both examples.
>

It is similar, but the idea here is to aid in discovery. A user might miss
the technique for update if it's only documented in delete, and even if
they did see it there, they might not realize that it works for both UPDATE
and DELETE. We could make reference links from one to the other, but that
seems like extra work for the reader.
From c6179c3cf1395884d4a42b5ad983542a3fc4887c Mon Sep 17 00:00:00 2001
From: Corey Huinker 
Date: Tue, 31 Oct 2023 03:52:41 -0400
Subject: [PATCH v2] Currently we do not show any examples of using ctid
 anywhere, nor do we address the often-requested but problematic use case of
 having a LIMIT clause on UPDATE and DELETE statements. These examples are a
 subtle way of addressing both those concerns.

---
 doc/src/sgml/ref/delete.sgml | 29 +
 doc/src/sgml/ref/select.sgml | 24 
 doc/src/sgml/ref/update.sgml | 23 +++
 3 files changed, 76 insertions(+)

diff --git a/doc/src/sgml/ref/delete.sgml b/doc/src/sgml/ref/delete.sgml
index 1b81b4e7d7..4e08c6c85e 100644
--- a/doc/src/sgml/ref/delete.sgml
+++ b/doc/src/sgml/ref/delete.sgml
@@ -234,6 +234,35 @@ DELETE FROM films
In some cases the join style is easier to write or faster to
execute than the sub-select style.
   
+  
+   In situations where a single operation would consume too many resources,
+   either causing the operation to fail or negatively impacting other workloads,
+   it may be desirable to break up a large DELETE into
+   multiple separate commands. While doing this will actually increase the
+   total amount of work performed, it can break the work into chunks that have
+   a more acceptable impact on other workloads.  The
+   SQL standard does
+   not define a LIMIT clause for DELETE
+   operations, but it is possible get the equivalent functionality through the
+   USING clause to a
+   Common Table Expression which identifies
+   a subset of rows to be deleted, locks those rows, and returns their system
+   column ctid values:
+
+WITH delete_batch AS (
+  SELECT l.ctid
+  FROM user_logs AS l
+  WHERE l.status = 'archived'
+  ORDER BY l.creation_date
+  LIMIT 1
+  FOR UPDATE
+)
+DELETE FROM user_logs AS ul
+USING delete_branch AS del
+WHERE ul.ctid = del.ctid;
+
+  This allows for flexible search criteria within the CTE and an efficient self-join.
+  
  
 
  
diff --git a/doc/src/sgml/ref/select.sgml b/doc/src/sgml/ref/select.sgml
index 42d78913cf..10e10ea249 100644
--- a/doc/src/sgml/ref/select.sgml
+++ b/doc/src/sgml/ref/select.sgml
@@ -1679,6 +1679,30 @@ SELECT * FROM (SELECT * FROM mytable FOR UPDATE) ss WHERE col1 = 5;
 condition is not textually within the sub-query.

 
+   
+In cases where a DML operation involving many rows
+must be performed, and that table experiences numerous other simultaneous
+DML operations, a FOR UPDATE clause
+used in conjunction with SKIP LOCKED can be useful for
+performing partial DML operations:
+
+
+WITH mods AS (
+SELECT ctid FROM mytable
+WHERE status = 'active' AND retries > 10
+

Re: Document efficient self-joins / UPDATE LIMIT techniques.

2023-10-04 Thread Laurenz Albe
On Wed, 2023-06-28 at 14:20 -0400, Corey Huinker wrote:
> This patch adds a few examples to demonstrate the following:
> 
> * The existence of the ctid column on every table
> * The utility of ctds in self joins
> * A practical usage of SKIP LOCKED

I had a look at your patch, and I am in favor of the general idea.

Style considerations:
-

I think the SQL statements should end with semicolons.  Our SQL examples
are usually written like that.

Our general style with CTEs seems to be (according to
https://www.postgresql.org/docs/current/queries-with.html):

 WITH quaxi AS (
 SELECT ...
 )
 SELECT ...;

About the DELETE example:
-

The text suggests that a single, big DELETE operation can consume
too many resources.  That may be true, but the sum of your DELETEs
will consume even more resources.

In my experience, the bigger problem with bulk deletes like that is
that you can run into deadlocks easily, so maybe that would be a
better rationale to give.  You could say that with this technique,
you can force the lock to be taken in a certain order, which will
avoid the possibility of deadlock with other such DELETEs.

About the SELECT example:
-

That example belongs to UPDATE, I'd say, because that is the main
operation.

The reason you give (avoid excessive locking) is good.
Perhaps you could mention that updating in batches also avoids
excessive bload (if you VACUUM between the batches).

About the UPDATE example:
-

I think that could go, because it is pretty similar to the previous
one.  You even use ctid in both examples.

Status set to "waiting for author".

Yours,
Laurenz Albe




Re: Document efficient self-joins / UPDATE LIMIT techniques.

2023-09-25 Thread jian he
Hi.
-
In cases where a DML operation involving many rows must be performed,
and that table experiences numerous other simultaneous DML operations,
a FOR UPDATE clause used in conjunction with SKIP LOCKED can be useful
for performing partial DML operations:

WITH mods AS (SELECT ctid FROM mytable
  WHERE status = 'active' AND retries > 10
  ORDER BY id FOR UPDATE SKIP LOCKED)
UPDATE mytable SET status = 'failed'
FROM mods WHERE mytable.ctid = mods.ctid

This allows the DML operation to be performed in parts, avoiding
locking, until such time as the set of rows that remain to be modified
is small enough that the locking will not affect overall performance,
at which point the same statement can be issued without the SKIP
LOCKED clause to ensure that no rows were overlooked.
--
mods found out the ctids to be updated, update mytable actually do the update.
I didn't get "This allows the DML operation to be performed in parts".

omit "at which point", the last sentence still makes sense. so I
didn't get "at which point"?

I am not native english speaker.




Re: Document efficient self-joins / UPDATE LIMIT techniques.

2023-08-31 Thread Corey Huinker
On Wed, Jun 28, 2023 at 2:20 PM Corey Huinker 
wrote:

> This patch adds a few examples to demonstrate the following:
>

Bumping so CF app can see thread.

>


Document efficient self-joins / UPDATE LIMIT techniques.

2023-06-28 Thread Corey Huinker
This patch adds a few examples to demonstrate the following:

* The existence of the ctid column on every table
* The utility of ctds in self joins
* A practical usage of SKIP LOCKED

The reasoning for this is a bit long, but if you're interested, keep
reading.

In the past, there has been a desire to see a LIMIT clause of some sort on
UPDATE and DELETE statements. The reason for this usually stems from having
a large archive or backfill operation that if done in one single
transaction would overwhelm normal operations, either by the transaction
failing outright, locking too many rows, flooding the WAL causing replica
lag, or starving other processes of limited I/O.

The reasons for not adding a LIMIT clause are pretty straightforward: it
isn't in the SQL Standard, and UPDATE/DELETE operations are unordered
operations, so updating 1000 rows randomly isn't a great idea. The people
wanting the LIMIT clause were undeterred by this, because they know that
they intend to keep issuing updates until they run out of rows to update.

Given these limitations, I would write something like this:

WITH doomed AS (
SELECT t.id
FROM my_table AS t
WHERE t.expiration_date < :'some_archive_date'
FOR UPDATE SKIP LOCKED
LIMIT 1000 )
DELETE FROM my_table
WHERE id IN (SELECT id FROM doomed );

This wouldn't interfere with any other updates, so I felt good about it
running when the system was not-too-busy. I'd then write a script to run
that in a loop, with sleeps to allow the replicas a chance to catch their
breath. Then, when the rowcount finally dipped below 1000, I'd issue the
final

DELETE FROM my_table WHERE expiration_date < :'some_archive_date';

And this was ok, because at that point I have good reason to believe that
there are at most 1000 rows lingering out there, so waiting on locks for
those was no big deal.

But a query like this involves one scan along one index (or worse, a seq
scan) followed by another scan, either index or seq. Either way, we're
taking up a lot of cache with rows we don't even care about.

Then in v12, the query planner got hip to bitmap tidscans, allowing for
this optimization:

WITH doomed AS (
SELECT t.ctid AS tid
FROM my_table AS t
WHERE t.expiration_date < :'some_archive_date'
FOR UPDATE SKIP LOCKED
LIMIT 1000 )
DELETE FROM my_table
USING doomed WHERE my_table.ctid = doomed.tid;

And this works pretty well, especially if you set up a partial index to
meet the quals in the CTE. But we don't document this anywhere, and until
UPDATE and DELETE get a LIMIT clause, we probably should document this
workaround.
From 209fd8abe50603e85ca0cc07aecd72b87889e757 Mon Sep 17 00:00:00 2001
From: coreyhuinker 
Date: Tue, 13 Jun 2023 13:00:40 -0400
Subject: [PATCH v1] Add examples that highlight the usage of the system column
 ctid in self-joins.

Currently we do not show any examples of using ctid anywhere, nor do we
address the often-requested but problematic use case of having a LIMIT
clause on UPDATE and DELETE statements. These examples are a subtle way
of addressing both those concerns.
---
 doc/src/sgml/ref/delete.sgml | 24 
 doc/src/sgml/ref/select.sgml | 21 +
 doc/src/sgml/ref/update.sgml | 23 +++
 3 files changed, 68 insertions(+)

diff --git a/doc/src/sgml/ref/delete.sgml b/doc/src/sgml/ref/delete.sgml
index 1b81b4e7d7..cca9138843 100644
--- a/doc/src/sgml/ref/delete.sgml
+++ b/doc/src/sgml/ref/delete.sgml
@@ -234,6 +234,30 @@ DELETE FROM films
In some cases the join style is easier to write or faster to
execute than the sub-select style.
   
+  
+   In situations where a single operation would consume too many resources, it
+   may be desirable to break up a large DELETE into multiple
+   separate commands. The SQL
+   standard does not define a LIMIT clause for
+   DELETE operations, but it is possible get the equivalent
+   functionality through the USING clause to a
+   Common Table Expression which identifies
+   a subset of rows to be deleted, locks those rows, and returns their system
+   column ctid values:
+
+WITH delete_batch AS (
+  SELECT l.ctid
+  FROM user_logs AS l
+  WHERE l.status = 'archived'
+  ORDER BY l.creation_date
+  LIMIT 1
+  FOR UPDATE )
+DELETE FROM user_logs AS ul
+USING delete_branch AS del
+WHERE ul.ctid = del.ctid
+
+  This allows for flexible search criteria within the CTE and an efficient self-join.
+  
  
 
  
diff --git a/doc/src/sgml/ref/select.sgml b/doc/src/sgml/ref/select.sgml
index 0ee0cc7e64..9d7c3d5c41 100644
--- a/doc/src/sgml/ref/select.sgml
+++ b/doc/src/sgml/ref/select.sgml
@@ -1676,6 +1676,27 @@ SELECT * FROM (SELECT * FROM mytable FOR UPDATE) ss WHERE col1 = 5;
 condition is not textually within the sub-query.

 
+   
+In cases where a DML operation involving many rows
+must be performed, and that table experiences numerous other simultaneous
+DML operations, a FOR UPDATE clause
+used in conjunction with SKIP