Re: Disabling Heap-Only Tuples

2024-03-15 Thread Matthias van de Meent
On Wed, 13 Mar 2024 at 14:27, Laurenz Albe  wrote:
>
> On Thu, 2023-09-21 at 16:18 -0700, Andres Freund wrote:
> > I think a minimal working approach could be to have the configuration be 
> > based
> > on the relation size vs space known to the FSM. If the target block of an
> > update is higher than ((relation_size - fsm_free_space) *
> > new_reloption_or_guc), try finding the target block via the FSM, even if
> > there's space on the page.
>
> That sounds like a good way forward.
>
> The patch is in state "needs review", but it has already received review.
> I'll change it to "waiting for author".

Then I'll withdraw this patch, as I don't currently have (nor expect to
have anytime soon) the bandwidth or expertise to rewrite this patch to
include a system that calculates the free space available in a
relation.

I've added a TODO item in the UPDATE section with a backlink to this
thread so the discussion isn't lost.

-Matthias




Re: Disabling Heap-Only Tuples

2024-03-13 Thread Laurenz Albe
On Thu, 2023-09-21 at 16:18 -0700, Andres Freund wrote:
> I think a minimal working approach could be to have the configuration be based
> on the relation size vs space known to the FSM. If the target block of an
> update is higher than ((relation_size - fsm_free_space) *
> new_reloption_or_guc), try finding the target block via the FSM, even if
> there's space on the page.

That sounds like a good way forward.

The patch is in state "needs review", but it has already received review.
I'll change it to "waiting for author".

Yours,
Laurenz Albe




Re: Disabling Heap-Only Tuples

2023-09-21 Thread Andres Freund
Hi,

On 2023-09-19 20:20:06 +0200, Matthias van de Meent wrote:
> Mostly agreed, but I think there's a pitfall here. You seem to assume
> we have a perfect oracle that knows the optimal data size, but we
> already know that our estimates can be significantly off. I don't
> quite trust the statistics enough to do any calculations based on the
> number of tuples in the relation. That also ignores the fact that we
> don't actually have any good information about the average size of the
> tuples in the table. So with current statistics, any automated "this
> is how large the table should be" decisions would result in an
> automated footgun, instead of the current patch's approach, where the user
> has to decide to configure it to an explicit value.

The proposed patch already relies on the FSM being reasonably up to date, no? If
the FSM doesn't know about free space, the patch won't be able to place tuples
earlier in the relation. And if the FSM wrongly thinks there's lots of free
space, it'll make updates very expensive.

We obviously don't want to scan the whole FSM on an ongoing basis, but
visiting the top-level FSM pages and/or having vacuum/analyze update some
statistic based on a more thorough analysis of the FSM doesn't seem insane.


A related issue is that an accurate tuple size and accurate number of tuples
isn't really sufficient - if tuples are wider, there can be plenty of space on
pages without updates being able to reuse that space. And the width of tuples
doesn't have to be evenly distributed, so a simple approach of calculating how
many tuples of the average width fit in a page and then using that to come up
with the overall number of required pages isn't necessarily accurate either.


> But about that: I'm not sure what the "footgun" is that you've
> mentioned recently?
> The issue with excessive bloat (when the local_update_limit is set too
> small and fillfactor is low) was fixed in the latest patch nearly
> three weeks ago, so the only remaining issue with misconfiguration is
> slower updates.

There seem to be plenty of footguns. Just to name a few:

- The user has to determine a good value for local_update_limit, without
  really any good way of doing so.

- A "too low" local_update_limit will often succeed in finding some space in
  earlier pages, without that providing useful progress on compaction -
  e.g. because tuples on the earlier page will subsequently be updated and
  there's then no space anymore, leading to index bloat.

- Configuring local_update_limit as a fixed size will be fragile when the data
  actually grows, leading to lots of pointless out-of-page updates.


I think a minimal working approach could be to have the configuration be based
on the relation size vs space known to the FSM. If the target block of an
update is higher than ((relation_size - fsm_free_space) *
new_reloption_or_guc), try finding the target block via the FSM, even if
there's space on the page.
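
For illustration, that kind of threshold could be estimated from SQL with
pg_freespacemap (a rough sketch only, not what the server-side check would look
like; 'mytable' is a placeholder and 0.8 stands in for the not-yet-named
reloption/GUC):

    CREATE EXTENSION IF NOT EXISTS pg_freespacemap;

    SELECT pg_relation_size('mytable')
             / current_setting('block_size')::bigint            AS relation_blocks,
           sum(avail) / current_setting('block_size')::numeric  AS fsm_free_blocks,
           (pg_relation_size('mytable')
              / current_setting('block_size')::bigint
            - sum(avail) / current_setting('block_size')::numeric) * 0.8
                                                                AS threshold_block
    FROM pg_freespace('mytable');

    -- an update whose target block is above threshold_block would consult the
    -- FSM even if the current page still has room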


> Sure, that's not great, but in my opinion not a
> "footgun": performance returns immediately after resetting
> local_update_limit, and no space was lost.

I think there are plenty of ways to get pointless out-of-page updates, and
therefore index bloat, with local_update_limit as-proposed (see earlier in the
email). Once you have such pointless out-of-page updates, disabling
local_update_limit won't bring performance back immediately (space usage due
to index bloat and lookup performance issues due to the additional index
entries).


> Updating the reloption after relation truncation implies having the
> same lock as relation truncation, i.e. AEL (if the vacuum docs are to
> be believed).

Aside: We really need to get rid of the AEL for relation truncation - it's
quite painful for hot standby workloads...

Thomas has been talking about a patch (and perhaps even posted it) that adds
infrastructure providing a "shared smgrrelation". Once we have that I think we
could lower the required lock level for truncation, by storing both the
filesystem size and the "valid" size. There are a few potential models:

- Vacuum truncation could lower the valid size in-memory, end its transaction,
  wait for concurrent accesses to the relation to finish, check if/where
  the relation has been extended since, acquire the extension lock and
  truncate down to the "valid" size.

  The danger with that is that the necessary waiting can be long, threatening
  to starve autovacuum of workers.

- Instead of making a single vacuum wait, we could have one vacuum update the
  valid size of the relation and also store an xid horizon. Later vacuums can
  truncate the physical size down to the valid size if there are no snapshot
  conflicts with said xid anymore.


If we had such a shared smgrrel, we could also make relation extension a lot
more efficient, because we would not need to pin all pages that a relation
extension "covers" - the reason that we need to pin the to-be-extended-pages
is to prevent concurrent scans from reading "new" blocks while 

Re: Disabling Heap-Only Tuples

2023-09-21 Thread Andres Freund
Hi,

On 2023-09-19 14:50:13 -0400, Robert Haas wrote:
> On Tue, Sep 19, 2023 at 12:56 PM Andres Freund  wrote:
> > Yea, a setting like what's discussed here seems, uh, not particularly useful
> > for achieving the goal of compacting tables.  I don't think guiding this
> > through SQL makes a lot of sense. For decent compaction you'd want to scan 
> > the
> > table backwards, and move rows from the end to earlier, but stop once
> > everything is filled up. You can somewhat do that from SQL, but it's going 
> > to
> > be awkward and slow.  I doubt you even want to use the normal UPDATE WAL
> > logging.
> >
> > I think having explicit compaction support in VACUUM or somewhere similar
> > would make sense, but I don't think the proposed GUC is a useful stepping
> > stone.
> 
> I think there's a difference between wanting to compact instantly and
> wanting to compact over time. I think that this kind of thing is
> reasonably well-suited to the latter, if we can engineer away the
> cases where it backfires.
> 
> But I know people will try to use it for instant compaction too, and
> there it's worth remembering why we removed old-style VACUUM FULL. The
> main problem is that it was mind-bogglingly slow.

I think some of the slowness was implementation related, rather than
fundamental. But more importantly, storage was something entirely different
back then than it is now.


> The other really bad problem is that it caused massive index bloat. I think
> any system that's based on moving around my tuples right now to make my
> table smaller right now is likely to have similar issues.

I think the problem of exploding WAL usage exists both for compaction being
done in VACUUM (or a dedicated command) and being done by backends. I think to
make using a facility like this realistic, you really need some form of rate
limiting, regardless of when compaction is performed. Even leaving WAL volume
aside, naively doing on-update compaction will cause lots of additional
contention on early FSM pages.


> In the case where you're trying to compact gradually, I think there
> are potentially serious issues with index bloat, but only potentially.
> It seems like there are reasonable cases where it's fine.

> Specifically, if you have relatively few indexes per table, relatively
> few long-running transactions, and all tuples get updated on a
> semi-regular basis, I'm thinking that you're more likely to win than
> lose.

Maybe - but are you going to have a significant bloat issue in that case?
Sure, if the updates update most of the table, you're going to - but then
on-update compaction won't really be needed either, since you're going to run
out of space on pages on a regular basis.

Greetings,

Andres Freund




Re: Disabling Heap-Only Tuples

2023-09-20 Thread Stephen Frost
Greetings,

* Laurenz Albe (laurenz.a...@cybertec.at) wrote:
> On Tue, 2023-09-19 at 12:52 -0400, Robert Haas wrote:
> > On Tue, Sep 19, 2023 at 12:30 PM Alvaro Herrera  
> > wrote:
> > > I was thinking something vaguely like "a table size that's roughly what
> > > an optimal autovacuuming schedule would leave the table at" assuming 0.2
> > > vacuum_scale_factor.  You would determine the absolute minimum size for
> > > the table given the current live tuples in the table, then add 20% to
> > > account for a steady state of dead tuples and vacuumed space.  So it's
> > > not 1.2x of the "current" table size at the time the local_update_limit
> > > feature is installed, but 1.2x of the optimal table size.
> > 
> > Right, that would be great. And honestly if that's something we can
> > figure out, then why does the parameter even need to be an integer
> > instead of a Boolean? If the system knows the optimal table size, then
> > the user can just say "try to compact this table" and need not say to
> > what size. The 1.2 multiplier is probably situation dependent and
> > maybe the multiplier should indeed be a configuration parameter, but
> > we would be way better off if the absolute size didn't need to be.
> 
> I don't have high hopes for a reliable way to automatically determine
> the target table size.  There are these queries floating around to estimate
> table bloat, which are used by various monitoring systems.  I find that they
> get it right a lot of the time, but sometimes they get it wrong.  Perhaps
> we can do better than that, but I vastly prefer a setting that I can control
> (even at the danger that I can misconfigure it) over an automatism that I
> cannot control and that sometimes gets it wrong.

Not completely against a setting - but would certainly prefer that this
be done in a more automated way, if possible.

To that end, my thought would be some kind of regular review of the FSM,
or maybe actual review by walking through the table (as VACUUM already
does...) to get an idea of where there's space and where there's used up
areas and then use that to inform various operations (either VACUUM
itself or perhaps UPDATEs from SQL).  We could also try to 'start
simple' and look for cases where we can say "well, that's definitely not
good" and address those initially.

Consider (imagine as a histogram; X is used space, . is empty):

 1: XXX
 2: XXX
 3: XXX
 4: XXX
 5: X
 6: X
 7: .
 8: .
 9: .
10: .
11: .
12: .
13: .
14: .
15: .
16: .
17: .
18: .
19: .
20: X

Well, obviously there's tons of free space in the middle and if we could
just move those few tuples/pages/whatever that are near the end to
earlier in the table then we'd be able to truncate off and shrink a
lot of the table.
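
Today the only way to get there by hand is something like a no-op UPDATE of
the rows at the tail and hoping they land earlier (a rough sketch; 'mytable',
some_col and the block number are made up, tid ordering comparisons need a
recent release, and note the update may well stay HOT on the same page -
which is exactly what the proposed option is meant to prevent):

    UPDATE mytable
       SET some_col = some_col       -- no-op change, just creates a new tuple version
     WHERE ctid >= '(19,1)'::tid;    -- rows in the last block (ctid blocks are 0-based)

    VACUUM mytable;                  -- prune the old versions and try to truncate the tail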

> I like Alvaro's idea to automatically reset "local_update_limit" when the
> table has shrunk enough.  Why not perform that task during vacuum truncation?
> If vacuum truncation has taken place, check if the table size is no bigger
> than "local_update_limit" * (1 + "autovacuum_vacuum_scale_factor"), and if
> it is no bigger, reset "local_update_limit".  That way, we would not have
> to worry about a lock, because vacuum truncation already has the table locked.

Agreed on this too.  Essentially, once we've done some truncation, we
should 'reset'.

I've no doubt that there's some better algorithm for this, but I keep
coming back to something as simple as: if the entire second half of the
table will fit into the entire first half then the table is twice as
large as it needs to be and perhaps that triggers a preference for
placing tuples in the first half of the table.  As for what handles
this - maybe have both UPDATE and VACUUM able to, but prefer for UPDATE
to do so and only have VACUUM kick in once the tuples at the end of the
relation are older than some xid-based threshold (perhaps all of the
tuples on a given page have to be old enough?).

While it feels a bit 'late' in terms of when to start taking this
action, we could possibly start with 'all frozen' as an indicator of
'old enough'?  Then, between the FSM and the VM, VACUUM could decide
that pages at the end of the table should be moved to be earlier and go
about making that happen.  I'm a bit concerned about the risk of some
kind of deadlock or similar happening between VACUUM and user processes
if we're trying to do this with multiple tuples at a time but hopefully
we could come up with a way to avoid that.  This process naturally would
have to involve updating indexes and the VM and FSM as the tuples get
moved.

In terms of what this would look like, my thinking is that VACUUM would
scan the table and the FSM and perhaps the VM and then say "ok, this
table is bigger than it needs to be, let's try to fix that" and then set
a flag on the table, which a user could also explicitly set to give them
control over this process happening sooner or not happening at all, and
that would indicate to UPDATE to prefer earlier pages over the current
page or HOT updates, 

Re: Disabling Heap-Only Tuples

2023-09-19 Thread Laurenz Albe
On Tue, 2023-09-19 at 12:52 -0400, Robert Haas wrote:
> On Tue, Sep 19, 2023 at 12:30 PM Alvaro Herrera  
> wrote:
> > I was thinking something vaguely like "a table size that's roughly what
> > an optimal autovacuuming schedule would leave the table at" assuming 0.2
> > vacuum_scale_factor.  You would determine the absolute minimum size for
> > the table given the current live tuples in the table, then add 20% to
> > account for a steady state of dead tuples and vacuumed space.  So it's
> > not 1.2x of the "current" table size at the time the local_update_limit
> > feature is installed, but 1.2x of the optimal table size.
> 
> Right, that would be great. And honestly if that's something we can
> figure out, then why does the parameter even need to be an integer
> instead of a Boolean? If the system knows the optimal table size, then
> the user can just say "try to compact this table" and need not say to
> what size. The 1.2 multiplier is probably situation dependent and
> maybe the multiplier should indeed be a configuration parameter, but
> we would be way better off if the absolute size didn't need to be.

I don't have high hopes for a reliable way to automatically determine
the target table size.  There are these queries floating around to estimate
table bloat, which are used by various monitoring systems.  I find that they
get it right a lot of the time, but sometimes they get it wrong.  Perhaps
we can do better than that, but I vastly prefer a setting that I can control
(even at the danger that I can misconfigure it) over an automatism that I
cannot control and that sometimes gets it wrong.

I like Alvaro's idea to automatically reset "local_update_limit" when the
table has shrunk enough.  Why not perform that task during vacuum truncation?
If vacuum truncation has taken place, check if the table size is no bigger
than "local_update_limit" * (1 + "autovacuum_vacuum_scale_factor"), and if
it is no bigger, reset "local_update_limit".  That way, we would not have
to worry about a lock, because vacuum truncation already has the table locked.
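
Expressed as a query, just to make the condition concrete (assuming the
reloption keeps the name "local_update_limit" and holds a block number, with
1000 as an example value and 'mytable' as a placeholder), the check would be
roughly:

    SELECT pg_relation_size('mytable') / current_setting('block_size')::bigint
             <= 1000 * (1 + current_setting('autovacuum_vacuum_scale_factor')::numeric)
           AS may_reset_local_update_limit;

    -- if true, vacuum truncation would effectively do:
    -- ALTER TABLE mytable RESET (local_update_limit);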

Yours,
Laurenz Albe




Re: Disabling Heap-Only Tuples

2023-09-19 Thread Laurenz Albe
On Tue, 2023-09-19 at 14:50 -0400, Robert Haas wrote:
> But I know people will try to use it for instant compaction too, and
> there it's worth remembering why we removed old-style VACUUM FULL. The
> main problem is that it was mind-bogglingly slow. The other really bad
> problem is that it caused massive index bloat. I think any system
> that's based on moving around my tuples right now to make my table
> smaller right now is likely to have similar issues.

I had the same feeling that this is sort of bringing back old-style
VACUUM (FULL).  But I don't think that index bloat is a show stopper
these days, when we have REINDEX CONCURRENTLY, so I am not worried.

Yours,
Laurenz Albe




Re: Disabling Heap-Only Tuples

2023-09-19 Thread Robert Haas
On Tue, Sep 19, 2023 at 2:20 PM Matthias van de Meent
 wrote:
> Mostly agreed, but I think there's a pitfall here. You seem to assume
> we have a perfect oracle that knows the optimal data size, but we
> already know that our estimates can be significantly off. I don't
> quite trust the statistics enough to do any calculations based on the
> number of tuples in the relation. That also ignores the fact that we
> don't actually have any good information about the average size of the
> tuples in the table. So with current statistics, any automated "this
> is how large the table should be" decisions would result in an
> automated footgun, instead of the current patch's approach, where the user
> has to decide to configure it to an explicit value.

I'm not assuming that there's an oracle here. I'm hoping that there's
some way that we can construct one. If we can't, then I think we're
asking the user to figure out a value that we don't have any idea how
to compute ourselves. And I think that kind of thing is usually a bad
idea. It's reasonable to ask the user for input when they know
something relevant that we can't know, like how large they think their
database will get, or what hardware they're using. But it's not
reasonable to essentially hope that the user is smarter than we are.
That's leaving our job half-undone and forcing the user into coping
with the result. And note that the value we need here is largely about
the present, not the future. The question is "how small can the table
be practically made right now?". And there is no reason at all to
suppose that the user is better-placed to answer that question than
the database itself.

> But about that: I'm not sure what the "footgun" is that you've
> mentioned recently?
> The issue with excessive bloat (when the local_update_limit is set too
> small and fillfactor is low) was fixed in the latest patch nearly
> three weeks ago, so the only remaining issue with misconfiguration is
> slower updates. Sure, that's not great, but in my opinion not a
> "footgun": performance returns immediately after resetting
> local_update_limit, and no space was lost.

That does seem like a very good change, but I'm not convinced that it
solves the whole problem. I would agree with your argument if the only
downside of enabling the feature were searching the FSM, failing to
find a suitable free page, and falling back to a HOT update. Such a
thing might be slow, but it won't cause any bloat, and as you say, if
the feature doesn't do what you want, don't use it. But I think the
feature can still cause bloat.

If we're using this feature on a reasonably heavily-updated table,
then sometimes when we check whether any low-numbered pages have free
space, it will turn out that one of them does. This will happen even
if local_update_limit is set far too low, because the table is
heavily-updated, and sometimes that means tuples are moving around,
leaving holes. So when there is a hole, i.e. just by luck we happen to
find some space on a low-numbered page, we'll suffer the cost of a
non-HOT update to move that tuple to an earlier page of the relation.
However, there's a good chance that the next time we update that
tuple, the page will have become completely full, because everybody's
furiously trying to jam as many tuples as possible into those
low-numbered pages, so now the tuple will have to bounce to some
higher-numbered page.

So I think what will happen if the local update limit is set too low,
and the table is actually being updated a lot, is that we'll just
uselessly turn a bunch of would-be HOT updates on high-numbered pages into
non-HOT updates, which will fill up low-numbered pages, turning even
potentially HOT updates on those pages non-HOT as well. Doing a bunch of updates
that could have been HOT as non-HOT can for sure cause index bloat. It
could maybe also cause table bloat, because if we'd done the updates
as HOT, we would have been able to recover the line pointers via
HOT-pruning, but since we turned them into non-HOT updates, we have to
wait for vacuum, which is comparatively much less frequent.

I'm not quite sure how bad this residual problem is. It's certainly a
lot better if a failed attempt to move a tuple earlier can turn into a
normal HOT update instead of a non-HOT update. But I don't think it
completely eliminates the problem of useless tuple movement either.

As Andres points out, I think rightly, we should really be thinking
about ways to guide this behavior other than a page number. As you
point out, there's no guarantee that we can know the right page
number. If we can, cool. But there are other approaches too. He
mentions looking at how full the FSM is, which seems like an
interesting idea although surely we don't want every backend
repeatedly iterating over the FSM to recompute statistics. I wonder if
there are other good ideas we haven't thought of yet. Certainly, if
you found that you were frequently being forced to move tuples to
higher-numbered pages for lack of space 

Re: Disabling Heap-Only Tuples

2023-09-19 Thread Andres Freund
Hi,

On 2023-09-19 19:33:22 +0200, Matthias van de Meent wrote:
> On Tue, 19 Sept 2023 at 18:56, Andres Freund  wrote:
> >
> > Hi,
> >
> > On 2023-09-19 18:30:44 +0200, Alvaro Herrera wrote:
> > > This makes me think that maybe the logic needs to be a little more
> > > complex to avoid the problem you describe: if an UPDATE is prevented
> > > from being HOT because of this setting, but then it goes and consults
> > > FSM and it gives the update a higher block number than the tuple's
> > > current block (or it fails to give a block number at all so it is forced
> > > to extend the relation), then the update should give up on that strategy
> > > and use a HOT update after all.  (I have not read the actual patch;
> > > maybe it already does this?  It sounds kinda obvious.)
> >
> > Yea, a setting like what's discussed here seems, uh, not particularly useful
> > for achieving the goal of compacting tables.  I don't think guiding this
> > through SQL makes a lot of sense. For decent compaction you'd want to scan 
> > the
> > table backwards, and move rows from the end to earlier, but stop once
> > everything is filled up. You can somewhat do that from SQL, but it's going 
> > to
> > be awkward and slow.  I doubt you even want to use the normal UPDATE WAL
> > logging.
>
> We can't move tuples around (or, not that I know of) without using a
> transaction ID to control the visibility of the two locations of that
> tuple.

Correct, otherwise you'd end up with broken visibility in scans (seeing the
same tuple twice or never).


> Doing table compaction would thus likely require using transactions to move
> these tuples around.

Yes - but I don't think that has to be a problem. I'd expect something like
this to use multiple transactions internally. Possibly optimizing xid usage by
checking if other transactions are currently waiting on the xid and committing
if that's the case. Processing a single page should be quite fast, so the
maximum delay on other sessions is quite small.


> Using a single backend and bulk operations, it'll still lock each tuple that
> is being moved, and that can be noticed by user DML queries. I'd rather make
> the user's queries move the data around than this long-duration, locking
> background operation.

I doubt that works well enough in practice. It's very common to have tuples
that aren't updated after some point. So you then end up with needing tooling
that triggers UPDATEs for tuples at the end of the relation.


> > I think having explicit compaction support in VACUUM or somewhere similar
> > would make sense, but I don't think the proposed GUC is a useful stepping
> > stone.
>
> The point of this GUC is that the compaction can happen organically in
> the user's UPDATE workflow, so that there is no long locking operation
> going on (as you would see with VACUUM FULL / CLUSTER / pg_repack).

It certainly shouldn't use an AEL. I think we could even get away without an
SUE (it's basically just UPDATEs after all), but whether it's worth doing that
I'm not sure.


> > > > But without any kind of auto-tuning, in my opinion, it's a fairly poor
> > > > feature. Sure, some people will get use out of it, if they're
> > > > sufficiently knowledgeable and sufficiently determined. But I think
> > > > for most people in most situations, it will be a struggle.
> >
> > Indeed. I think it'd often just explode table and index sizes, because HOT
> > pruning won't be able to make usable space in pages anymore (due to dead
> > items).
>
> You seem to misunderstand the latest patch. It explicitly only blocks
> local updates if the update can then move the new tuple to an earlier
> page. If that is not possible, then it'll insert locally (assuming
> that is still possible) and HOT can then still apply.

I indeed apparently had looked at the wrong patch. But I still don't think
this is a useful way of controlling this.  I guess it could be a small part of
something larger, but you are going to need something that actively updates
tuples at the end of the table, otherwise it's very unlikely in practice that
you'll ever be able to shrink the table.


Leaving aside what process "moves" tuples, I doubt that controlling "moving"
via the table size is useful. Controlling via the amount of free space in the FSM
would make more sense. If there's no known free space in the FSM, this
approach can't compact. Using the table size to control also means that the
value needs to be updated with the growth of the table. Whereas controlling
moving via a percentage of free space in the FSM would allow the same setting
to be used even for a growing (or shrinking) table.
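
Just to make that concrete, the kind of figure I have in mind is easy to
compute today with pg_freespacemap (a sketch; 'mytable' is a placeholder, and
the real control would of course live in the server, not in a query):

    SELECT round(100.0 * coalesce(sum(avail), 0)
                 / nullif(pg_relation_size('mytable'), 0), 1) AS pct_free_per_fsm
    FROM pg_freespace('mytable');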

Greetings,

Andres Freund




Re: Disabling Heap-Only Tuples

2023-09-19 Thread Robert Haas
On Tue, Sep 19, 2023 at 12:56 PM Andres Freund  wrote:
> Yea, a setting like what's discussed here seems, uh, not particularly useful
> for achieving the goal of compacting tables.  I don't think guiding this
> through SQL makes a lot of sense. For decent compaction you'd want to scan the
> table backwards, and move rows from the end to earlier, but stop once
> everything is filled up. You can somewhat do that from SQL, but it's going to
> be awkward and slow.  I doubt you even want to use the normal UPDATE WAL
> logging.
>
> I think having explicit compaction support in VACUUM or somewhere similar
> would make sense, but I don't think the proposed GUC is a useful stepping
> stone.

I think there's a difference between wanting to compact instantly and
wanting to compact over time. I think that this kind of thing is
reasonably well-suited to the latter, if we can engineer away the
cases where it backfires.

But I know people will try to use it for instant compaction too, and
there it's worth remembering why we removed old-style VACUUM FULL. The
main problem is that it was mind-bogglingly slow. The other really bad
problem is that it caused massive index bloat. I think any system
that's based on moving around my tuples right now to make my table
smaller right now is likely to have similar issues.

In the case where you're trying to compact gradually, I think there
are potentially serious issues with index bloat, but only potentially.
It seems like there are reasonable cases where it's fine.
Specifically, if you have relatively few indexes per table, relatively
few long-running transactions, and all tuples get updated on a
semi-regular basis, I'm thinking that you're more likely to win than
lose.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Disabling Heap-Only Tuples

2023-09-19 Thread Matthias van de Meent
On Tue, 19 Sept 2023 at 18:52, Robert Haas  wrote:
>
> On Tue, Sep 19, 2023 at 12:30 PM Alvaro Herrera  
> wrote:
> > I was thinking something vaguely like "a table size that's roughly what
> > an optimal autovacuuming schedule would leave the table at" assuming 0.2
> > vacuum_scale_factor.  You would determine the absolute minimum size for
> > the table given the current live tuples in the table, then add 20% to
> > account for a steady state of dead tuples and vacuumed space.  So it's
> > not 1.2x of the "current" table size at the time the local_update_limit
> > feature is installed, but 1.2x of the optimal table size.
>
> Right, that would be great. And honestly if that's something we can
> figure out, then why does the parameter even need to be an integer
> instead of a Boolean? If the system knows the optimal table size, then
> the user can just say "try to compact this table" and need not say to
> what size. The 1.2 multiplier is probably situation dependent and
> maybe the multiplier should indeed be a configuration parameter, but
> we would be way better off if the absolute size didn't need to be.

Mostly agreed, but I think there's a pitfall here. You seem to assume
we have a perfect oracle that knows the optimal data size, but we
already know that our estimates can be significantly off. I don't
quite trust the statistics enough to do any calculations based on the
number of tuples in the relation. That also ignores the fact that we
don't actually have any good information about the average size of the
tuples in the table. So with current statistics, any automated "this
is how large the table should be" decisions would result in an
automated footgun, instead of the current patch's approach, where the user
has to decide to configure it to an explicit value.

But about that: I'm not sure what the "footgun" is that you've
mentioned recently?
The issue with excessive bloat (when the local_update_limit is set too
small and fillfactor is low) was fixed in the latest patch nearly
three weeks ago, so the only remaining issue with misconfiguration is
slower updates. Sure, that's not great, but in my opinion not a
"footgun": performance returns immediately after resetting
local_update_limit, and no space was lost.
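
To spell that workflow out (assuming the patch's reloption keeps the name
local_update_limit; 'mytable' and the block number are placeholders):

    ALTER TABLE mytable SET (local_update_limit = 1000);  -- prefer placing new versions below block 1000
    -- ... regular UPDATE traffic gradually drains the tail, vacuum truncates ...
    ALTER TABLE mytable RESET (local_update_limit);        -- normal placement resumes immediately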

> > This makes me think that maybe the logic needs to be a little more
> > complex to avoid the problem you describe: if an UPDATE is prevented
> > from being HOT because of this setting, but then it goes and consults
> > FSM and it gives the update a higher block number than the tuple's
> > current block (or it fails to give a block number at all so it is forced
> > to extend the relation), then the update should give up on that strategy
> > and use a HOT update after all.  (I have not read the actual patch;
> > maybe it already does this?  It sounds kinda obvious.)
>
> +1 to all of that. Anything we can do to reduce the chance of the
> parameter doing the opposite of what it's intended to do is, IMHO,
> really, really valuable. If you're in the situation where you really
> need something like this, you're probably having a pretty bad day
> already.

Yes, it does that with the latest patch, from not quite 3 weeks ago.

> Just to be more clear about my position, I don't think that having
> some kind of a feature along these lines is a bad idea.

Thanks for clarifying.

> I do think
> that this is one of those cases where the perfect is the enemy of the
> good, and we can fall into the trap of saying that since we can't do
> the perfect thing let's not do anything at all. At the same time, just
> because we need to do something doesn't mean we should do exactly the
> first thing that anybody thought up, or that we shouldn't try as hard
> as we can to mitigate the downsides. If we add something like this I
> bet it will get a lot of use. Even a minor improvement to the design
> that removes one pitfall of many could turn out to help a lot of
> people.

100% agreed.

> > Having to set AEL is not nice for sure, but wouldn't
> > ShareUpdateExclusiveLock be sufficient?  We have a bunch of reloptions
> > for which that is sufficient.
>
> Hmm, yeah, I think you're right.

Updating the reloption after relation truncation implies having the
same lock as relation truncation, i.e. AEL (if the vacuum docs are to
be believed). So the AEL is not required for updating the storage
option (that would require SUEL), but for the block truncation
operation.

Kind regards,

Matthias van de Meent
Neon (http://neon.tech)




Re: Disabling Heap-Only Tuples

2023-09-19 Thread Matthias van de Meent
On Tue, 19 Sept 2023 at 18:56, Andres Freund  wrote:
>
> Hi,
>
> On 2023-09-19 18:30:44 +0200, Alvaro Herrera wrote:
> > This makes me think that maybe the logic needs to be a little more
> > complex to avoid the problem you describe: if an UPDATE is prevented
> > from being HOT because of this setting, but then it goes and consults
> > FSM and it gives the update a higher block number than the tuple's
> > current block (or it fails to give a block number at all so it is forced
> > to extend the relation), then the update should give up on that strategy
> > and use a HOT update after all.  (I have not read the actual patch;
> > maybe it already does this?  It sounds kinda obvious.)
>
> Yea, a setting like what's discussed here seems, uh, not particularly useful
> for achieving the goal of compacting tables.  I don't think guiding this
> through SQL makes a lot of sense. For decent compaction you'd want to scan the
> table backwards, and move rows from the end to earlier, but stop once
> everything is filled up. You can somewhat do that from SQL, but it's going to
> be awkward and slow.  I doubt you even want to use the normal UPDATE WAL
> logging.

We can't move tuples around (or, not that I know of) without using a
transaction ID to control the visibility of the two locations of that
tuple. Doing table compaction would thus likely require using
transactions to move these tuples around. Using a single backend and
bulk operations, it'll still lock each tuple that is being moved, and
that can be noticed by user DML queries. I'd rather make the user's
queries move the data around than this long-duration, locking
background operation.

> I think having explicit compaction support in VACUUM or somewhere similar
> would make sense, but I don't think the proposed GUC is a useful stepping
> stone.

The point of this GUC is that the compaction can happen organically in
the user's UPDATE workflow, so that there is no long locking operation
going on (as you would see with VACUUM FULL / CLUSTER / pg_repack).

> > > But without any kind of auto-tuning, in my opinion, it's a fairly poor
> > > feature. Sure, some people will get use out of it, if they're
> > > sufficiently knowledgeable and sufficiently determined. But I think
> > > for most people in most situations, it will be a struggle.
>
> Indeed. I think it'd often just explode table and index sizes, because HOT
> pruning won't be able to make usable space in pages anymore (due to dead
> items).

You seem to misunderstand the latest patch. It explicitly only blocks
local updates if the update can then move the new tuple to an earlier
page. If that is not possible, then it'll insert locally (assuming
that is still possible) and HOT can then still apply.

And yes, moving tuples to earlier pages will indeed increase index
bloat, because it does create dead tuples where previously we could've
applied HOT. But we do have VACUUM and REINDEX CONCURRENTLY to clean
that up without serious long-duration stop-the-world actions, while
the other builtin cleanup methods don't.
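
To be explicit, that cleanup stays online, i.e. no long exclusive locks
('mytable' is a placeholder):

    VACUUM mytable;                       -- reclaim the dead tuples left by the moves
    REINDEX TABLE CONCURRENTLY mytable;   -- rebuild the bloated indexes without blocking writes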

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Disabling Heap-Only Tuples

2023-09-19 Thread Andres Freund
Hi,

On 2023-09-19 18:30:44 +0200, Alvaro Herrera wrote:
> This makes me think that maybe the logic needs to be a little more
> complex to avoid the problem you describe: if an UPDATE is prevented
> from being HOT because of this setting, but then it goes and consults
> FSM and it gives the update a higher block number than the tuple's
> current block (or it fails to give a block number at all so it is forced
> to extend the relation), then the update should give up on that strategy
> and use a HOT update after all.  (I have not read the actual patch;
> maybe it already does this?  It sounds kinda obvious.)

Yea, a setting like what's discussed here seems, uh, not particularly useful
for achieving the goal of compacting tables.  I don't think guiding this
through SQL makes a lot of sense. For decent compaction you'd want to scan the
table backwards, and move rows from the end to earlier, but stop once
everything is filled up. You can somewhat do that from SQL, but it's going to
be awkward and slow.  I doubt you even want to use the normal UPDATE WAL
logging.

I think having explicit compaction support in VACUUM or somewhere similar
would make sense, but I don't think the proposed GUC is a useful stepping
stone.


> > But without any kind of auto-tuning, in my opinion, it's a fairly poor
> > feature. Sure, some people will get use out of it, if they're
> > sufficiently knowledgeable and sufficiently determined. But I think
> > for most people in most situations, it will be a struggle.

Indeed. I think it'd often just explode table and index sizes, because HOT
pruning won't be able to make usable space in pages anymore (due to dead
items).

Greetings,

Andres Freund




Re: Disabling Heap-Only Tuples

2023-09-19 Thread Robert Haas
On Tue, Sep 19, 2023 at 12:30 PM Alvaro Herrera  wrote:
> I was thinking something vaguely like "a table size that's roughly what
> an optimal autovacuuming schedule would leave the table at" assuming 0.2
> vacuum_scale_factor.  You would determine the absolute minimum size for
> the table given the current live tuples in the table, then add 20% to
> account for a steady state of dead tuples and vacuumed space.  So it's
> not 1.2x of the "current" table size at the time the local_update_limit
> feature is installed, but 1.2x of the optimal table size.

Right, that would be great. And honestly if that's something we can
figure out, then why does the parameter even need to be an integer
instead of a Boolean? If the system knows the optimal table size, then
the user can just say "try to compact this table" and need not say to
what size. The 1.2 multiplier is probably situation dependent and
maybe the multiplier should indeed be a configuration parameter, but
we would be way better off if the absolute size didn't need to be.

> This makes me think that maybe the logic needs to be a little more
> complex to avoid the problem you describe: if an UPDATE is prevented
> from being HOT because of this setting, but then it goes and consults
> FSM and it gives the update a higher block number than the tuple's
> current block (or it fails to give a block number at all so it is forced
> to extend the relation), then the update should give up on that strategy
> and use a HOT update after all.  (I have not read the actual patch;
> maybe it already does this?  It sounds kinda obvious.)

+1 to all of that. Anything we can do to reduce the chance of the
parameter doing the opposite of what it's intended to do is, IMHO,
really, really valuable. If you're in the situation where you really
need something like this, you're probably having a pretty bad day
already.

Just to be more clear about my position, I don't think that having
some kind of a feature along these lines is a bad idea. I do think
that this is one of those cases where the perfect is the enemy of the
good, and we can fall into the trap of saying that since we can't do
the perfect thing let's not do anything at all. At the same time, just
because we need to do something doesn't mean we should do exactly the
first thing that anybody thought up, or that we shouldn't try as hard
as we can to mitigate the downsides. If we add something like this I
bet it will get a lot of use. Even a minor improvement to the design
that removes one pitfall of many could turn out to help a lot of
people. If we could get to the point where most people have a positive
user experience without too much effort, this could turn out to be one
of the most impactful features in years.

> Having to set AEL is not nice for sure, but wouldn't
> ShareUpdateExclusiveLock be sufficient?  We have a bunch of reloptions
> for which that is sufficient.

Hmm, yeah, I think you're right.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Disabling Heap-Only Tuples

2023-09-19 Thread Alvaro Herrera
On 2023-Sep-19, Robert Haas wrote:

> On Tue, Sep 19, 2023 at 6:26 AM Alvaro Herrera  
> wrote:
> > Second, I think we should make it auto-reset.  That is, have the user
> > set some value; later, when some condition triggers (say, the table size
> > is 1.2x the limit value you configured), then the local_update_limit is
> > automatically removed from the table options.  From that point onwards,
> > the table is operated normally.
> 
> That's an interesting idea. It would require taking AEL on the table.
> And also, what do you mean by 1.2x the limit value? Is that supposed
> to be a >= condition or a <= condition? It can't really be a >=
> condition, but you wouldn't set it in the first place unless the table
> were significantly bigger than it could be. But if it's a <= condition
> it doesn't really protect you from hosing yourself. You just have to
> insert a bit more data before enough of the bloat gets removed, and
> now the table just bloats infinitely and probably rather quickly. The
> correct value of the setting depends on the amount of real data
> (non-bloat) in the table, not the actual table size.

I was thinking something vaguely like "a table size that's roughly what
an optimal autovacuuming schedule would leave the table at" assuming 0.2
vacuum_scale_factor.  You would determine the absolute minimum size for
the table given the current live tuples in the table, then add 20% to
account for a steady state of dead tuples and vacuumed space.  So it's
not 1.2x of the "current" table size at the time the local_update_limit
feature is installed, but 1.2x of the optimal table size.

This makes me think that maybe the logic needs to be a little more
complex to avoid the problem you describe: if an UPDATE is prevented
from being HOT because of this setting, but then it goes and consults
FSM and it gives the update a higher block number than the tuple's
current block (or it fails to give a block number at all so it is forced
to extend the relation), then the update should give up on that strategy
and use a HOT update after all.  (I have not read the actual patch;
maybe it already does this?  It sounds kinda obvious.)


Having to set AEL is not nice for sure, but wouldn't
ShareUpdateExclusiveLock be sufficient?  We have a bunch of reloptions
for which that is sufficient.


> But without any kind of auto-tuning, in my opinion, it's a fairly poor
> feature. Sure, some people will get use out of it, if they're
> sufficiently knowledgeable and sufficiently determined. But I think
> for most people in most situations, it will be a struggle.
> 
> -- 
> Robert Haas
> EDB: http://www.enterprisedb.com


-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Tiene valor aquel que admite que es un cobarde" (Fernandel)




Re: Disabling Heap-Only Tuples

2023-09-19 Thread Robert Haas
On Tue, Sep 19, 2023 at 6:26 AM Alvaro Herrera  wrote:
> Second, I think we should make it auto-reset.  That is, have the user
> set some value; later, when some condition triggers (say, the table size
> is 1.2x the limit value you configured), then the local_update_limit is
> automatically removed from the table options.  From that point onwards,
> the table is operated normally.

That's an interesting idea. It would require taking AEL on the table.
And also, what do you mean by 1.2x the limit value? Is that supposed
to be a >= condition or a <= condition? It can't really be a >=
condition, but you wouldn't set it in the first place unless the table
were significantly bigger than it could be. But if it's a <= condition
it doesn't really protect you from hosing yourself. You just have to
insert a bit more data before enough of the bloat gets removed, and
now the table just bloats infinitely and probably rather quickly. The
correct value of the setting depends on the amount of real data
(non-bloat) in the table, not the actual table size.

> The point here is that third-party tools such as pg_repack or pg_squeeze
> exist, which work in a way we don't like, yet we offer no alternative.
> This proposal is a mechanism that essentially replaces those tools with
> a simple in-core feature, without having to include the tool itself in
> core.

I agree that it would be nice to have something in core that can be
used to help with this problem, but this feature isn't the same thing
as pg_repack or pg_squeeze, either. In some ways, it's better, because
it can shrink the table without rewriting it, which is very desirable.
But in other ways, it's worse, and the fact that it seems like it can
backfire spectacularly if you set the wrong value seems like one big
way that it is a lot worse. If there is a way that we can make this a
mode that you activate for a table, and the system calculates and
updates the threshold, I think that would actually be a pretty good
feature. It would be tricky to use it to recover from acute
emergencies, because it doesn't actually do anything until updates
happen, but you could use it for that in a pinch. And even without
that it would be useful if you have a table that is sometimes very
large and sometimes very small and you want to get the space back from
the OS when it is in the small phase of its lifecycle.

But without any kind of auto-tuning, in my opinion, it's a fairly poor
feature. Sure, some people will get use out of it, if they're
sufficiently knowledgeable and sufficiently determined. But I think
for most people in most situations, it will be a struggle.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Disabling Heap-Only Tuples

2023-09-19 Thread Alvaro Herrera
On 2023-Sep-18, Robert Haas wrote:

> On Tue, Sep 5, 2023 at 11:15 PM Laurenz Albe  wrote:
> > I don't think that is a good comparison.  While most people probably
> > never need to touch "local_update_limit", "work_mem" is something everybody
> > has to consider.
> >
> > And it is not so hard to tune: the setting would be the desired table
> > size, and you could use pgstattuple to find a good value.
> 
> What I suspect would happen, though, is that you'd end up tuning the
> value over and over. You'd set it to some value and after some number
> of vacuums maybe you'd realize that you could save even more disk
> space if you reduced it a bit further or maybe your data set would
> grow a bit and you'd have to increase it a little (or a lot). And if
> you didn't keep adjusting it then maybe something quite bad would
> happen to your database.

As I understand it, the setting being proposed is useful as an emergency
for removing excessive bloat -- a substitute for VACUUM FULL when you
don't want to lock the table for long.  Trying to use it as a permanent
gadget is going to be misguided.  So my first thought is that we should
tell people to use it that way: if you're not in the irrecoverable-space
situation, just do not use this.  Then we don't have to worry about
people misusing it the way you imagine.

Second, I think we should make it auto-reset.  That is, have the user
set some value; later, when some condition triggers (say, the table size
is 1.2x the limit value you configured), then the local_update_limit is
automatically removed from the table options.  From that point onwards,
the table is operated normally.

This removes the other concern that makes the system behave
suboptimally because some DBA in the past decade left this set for no
good reason: if you run into an emergency, then you activate the
emergency escape hatch, and it will close on its own as soon as the
emergency is over.

This also dissuades people from using it for these other things you
describe.  It just won't work.


The point here is that third-party tools such as pg_repack or pg_squeeze
exist, which work in a way we don't like, yet we offer no alternative.
This proposal is a mechanism that essentially replaces those tools with
a simple in-core feature, without having to include the tool itself in
core.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
Thou shalt check the array bounds of all strings (indeed, all arrays), for
surely where thou typest "foo" someone someday shall type
"supercalifragilisticexpialidocious" (5th Commandment for C programmers)




Re: Disabling Heap-Only Tuples

2023-09-18 Thread Laurenz Albe
On Mon, 2023-09-18 at 12:22 -0400, Robert Haas wrote:
> On Tue, Sep 5, 2023 at 11:15 PM Laurenz Albe  wrote:
> > I don't think that is a good comparison.  While most people probably
> > never need to touch "local_update_limit", "work_mem" is something everybody
> > has to consider.
> > 
> > And it is not so hard to tune: the setting would be the desired table
> > size, and you could use pgstattuple to find a good value.
> 
> What I suspect would happen, though, is that you'd end up tuning the
> value over and over. You'd set it to some value and after some number
> of vacuums maybe you'd realize that you could save even more disk
> space if you reduced it a bit further or maybe your data set would
> grow a bit and you'd have to increase it a little (or a lot). And if
> you didn't keep adjusting it then maybe something quite bad would
> happen to your database.

There is that risk, yes.

> work_mem isn't quite the same [...] But what is the same here and in the case
> of work_mem is that you can suddenly get hosed if the situation
> changes substantially and you don't respond by updating the parameter
> setting. In the case of work_mem, again in my experience, it's quite
> common for people to suddenly find themselves in a lot of trouble if
> they have a load spike, because now they're running a lot more copies
> of the same query and the machine runs out of memory.

So the common ground is "both parameters are not so easy to get right,
and if you get them wrong, it's a problem".  For me the big difference is
that while you pretty much have to tune "work_mem", you can normally just ignore
"local_update_limit".

> The equivalent
> problem here would be if the table suddenly gets a lot bigger due to a
> load spike or some change in the way the application is used. Then
> suddenly, a setting that was previously serving to keep the table
> pleasantly small and un-bloated on disk is instead causing tons of
> updates that would have been HOT to become non-HOT, which could very
> easily result in both the table and its indexes bloating quite
> rapidly. I really don't like the idea of an anti-bloat feature that,
> when set to the wrong value, becomes a bloat-amplification feature. I
> don't know how to describe that other than "fragile and dangerous."

Yes, you can hurt yourself that way.  But that applies to many other
settings as well.  You can tank your performance with a bad value for
"commit_delay", "hot_standby_feedback" can bloat your primary, and
so on.  Still we consider these useful parameters.

> Imagine a hypothetical feature that knew how small the table could
> reasonably be kept, say by magic, and did non-HOT updates instead of
> HOT updates whenever doing so would allow moving a tuple from a page
> beyond that magical boundary to an earlier page. Such a feature would
> not have the downsides that this one does -- if there were
> opportunities to make the table smaller, the system would take
> advantage of them automatically, and if the table grew, the system
> would automatically become more relaxed to stay out of trouble. Such a
> feature is clearly more work to design and implement than what is
> proposed here, but it would also work a lot better in practice.

That sounds a bit like we should not have "shared_buffers" unless we
have a magical tool built in that gets the value right automatically.
Yes, the better is the enemy of the good.  You can kill everything with
a line of reasoning like that.

> In
> fact, I daresay that if we accept the feature as proposed, somebody's
> going to go out and write a tool to calculate what the threshold ought
> to be and automatically adjust it as things change. Users of the tool
> will then divide into two camps:
> 
> - People who try to tune it manually and get burned if anything
> changes on their system.
> - People who use that out-of-core tool.
> 
> So the out-of-core tool that does this tuning becomes a stealth
> dependency for any user who is facing this problem. Gosh, don't we
> have enough of those already? Connection pooling being perhaps the
> most obvious example, but far from the only one.

I cannot follow you there.  What I envision is that "local_update_limit"
is not set permanently on a table.  You set it when you realize your table
got bloated.  Then you wait until the bloat goes away or you launch a
couple of UPDATEs that eventually shrink the table.  Then you reset
"local_update_limit" again.
It's a more difficult, but less invasive alternative to VACUUM (FULL).

If a setting is hard to understand and hard to get right, we could invest
in good documentation that explains the use cases and pitfalls.
Wouldn't that go a long way towards defusing this perceived footgun?
I am aware that a frightening number of users don't read documentation,
but I find it hard to believe that anyone would twiddle a non-obvious
knob like "local_update_limit" without first trying to figure out what
it actually does.

Yours,
Laurenz Albe




Re: Disabling Heap-Only Tuples

2023-09-18 Thread Robert Haas
On Tue, Sep 5, 2023 at 11:15 PM Laurenz Albe  wrote:
> I don't think that is a good comparison.  While most people probably
> never need to touch "local_update_limit", "work_mem" is something everybody
> has to consider.
>
> And it is not so hard to tune: the setting would be the desired table
> size, and you could use pgstattuple to find a good value.

What I suspect would happen, though, is that you'd end up tuning the
value over and over. You'd set it to some value and after some number
of vacuums maybe you'd realize that you could save even more disk
space if you reduced it a bit further or maybe your data set would
grow a bit and you'd have to increase it a little (or a lot). And if
you didn't keep adjusting it then maybe something quite bad would
happen to your database.

work_mem isn't quite the same in the sense that most people don't need
to keep on iteratively tuning work_mem, at least not in my experience.
You figure out a value that works OK in practice and then leave it
alone. The problem is mostly discovering what that initial value ought
to be, which is often hard. But what is the same here and in the case
of work_mem is that you can suddenly get hosed if the situation
changes substantially and you don't respond by updating the parameter
setting. In the case of work_mem, again in my experience, it's quite
common for people to suddenly find themselves in a lot of trouble if
they have a load spike, because now they're running a lot more copies
of the same query and the machine runs out of memory. The equivalent
problem here would be if the table suddenly gets a lot bigger due to a
load spike or some change in the way the application is used. Then
suddenly, a setting that was previously serving to keep the table
pleasantly small and un-bloated on disk is instead causing tons of
updates that would have been HOT to become non-HOT, which could very
easily result in both the table and its indexes bloating quite
rapidly. I really don't like the idea of an anti-bloat feature that,
when set to the wrong value, becomes a bloat-amplification feature. I
don't know how to describe that other than "fragile and dangerous."

Imagine a hypothetical feature that knew how small the table could
reasonably be kept, say by magic, and did non-HOT updates instead of
HOT updates whenever doing so would allow moving a tuple from a page
beyond that magical boundary to an earlier page. Such a feature would
not have the downsides that this one does -- if there were
opportunities to make the table smaller, the system would take
advantage of them automatically, and if the table grew, the system
would automatically become more relaxed to stay out of trouble. Such a
feature is clearly more work to design and implement than what is
proposed here, but it would also work a lot better in practice. In
fact, I daresay that if we accept the feature as proposed, somebody's
going to go out and write a tool to calculate what the threshold ought
to be and automatically adjust it as things change. Users of the tool
will then divide into two camps:

- People who try to tune it manually and get burned if anything
changes on their system.
- People who use that out-of-core tool.

So the out-of-core tool that does this tuning becomes a stealth
dependency for any user who is facing this problem. Gosh, don't we
have enough of those already? Connection pooling being perhaps the
most obvious example, but far from the only one.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Disabling Heap-Only Tuples

2023-09-05 Thread Laurenz Albe
On Wed, 2023-08-30 at 09:31 -0400, Robert Haas wrote:
> On Wed, Aug 30, 2023 at 9:01 AM Matthias van de Meent
>  wrote:
> > I've reworked the patch a bit to remove the "excessive bloat with low
> > fillfactors when local space is available" issue that this parameter
> > could cause - local updates are now done if the selected page we would
> > be inserting into is after the old tuple's page and the old tuple's
> > page still (or: now) has space available.
> > 
> > Does that alleviate your concerns?
> 
> That seems like a good chance, but my core concern is around people
> having to micromanage local_update_limit, and probably either not
> knowing how to do it properly, or not being able or willing to keep
> updating it as things change.
> 
> In a way, this parameter is a lot like work_mem, which is notoriously
> very difficult to tune.

I don't think that is a good comparison.  While most people probably
never need to touch "local_update_limit", "work_mem" is something everybody
has to consider.

And it is not so hard to tune: the setting would be the desired table
size, and you could use pgstattuple to find a good value.
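
For example, a rough starting point could be computed with pgstattuple
(this needs the pgstattuple extension installed; the 20% headroom is an
arbitrary choice, not a recommendation):

    SELECT pg_size_pretty(pg_relation_size('bloated_tab')) AS current_size,
           ceil(tuple_len * 1.2 / (1024 * 1024))           AS suggested_limit_mb
    FROM pgstattuple('bloated_tab');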

I don't know what other use cases come to mind, but I see it as a tool to
shrink a table after it has grown big holes, perhaps after a mass delete.
Today, you can only VACUUM (FULL) or play with the likes of pg_squeeze and
pg_repack.

I think this is useful.

To alleviate your concerns, perhaps it would help to describe the use case
and ideas for a good setting in the documentation.

Yours,
Laurenz Albe




Re: Disabling Heap-Only Tuples

2023-08-30 Thread Matthias van de Meent
On Wed, 30 Aug 2023 at 15:31, Robert Haas  wrote:
>
> On Wed, Aug 30, 2023 at 9:01 AM Matthias van de Meent
>  wrote:
> > I've reworked the patch a bit to remove the "excessive bloat with low
> > fillfactors when local space is available" issue that this parameter
> > could cause - local updates are now done if the selected page we would
> > be inserting into is after the old tuple's page and the old tuple's
> > page still (or: now) has space available.
> >
> > Does that alleviate your concerns?
>
> That seems like a good chance, but my core concern is around people
> having to micromanage local_update_limit, and probably either not
> knowing how to do it properly, or not being able or willing to keep
> updating it as things change.

Assuming you do want to provide a way to users to solve the issue of
"there is a lot of free space in the table, but I don't want to take
an access exclusive lock or wait for new inserts to fix the issue",
how would you suggest we do that then?

Alternative approaches that I can think of are:

- A %-based parameter.
  This does scale with the table, but doesn't stop being a performance
hog once you've reached the optimal table size, and thus also needs to
be disabled.

- Measure the parameter from the end of the table, instead of from the
front; i.e. "try to empty the last X=50 MBs of the table".
  Scales with the table, but same issue as above - once the table has
an optimal size, it doesn't stop.

- Install one more dynamic system to move the tuples to a better page,
one the users don't directly control (yet to be designed).
  I don't know if or when this will be implemented and what benefits
it will have, but we don't have access to a lot of state in
table_tuple_update or heap_update, so any data needs special lookup.

- Let users keep using VACUUM FULL and CLUSTER instead.
  I don't think this is a reasonable solution.

> In a way, this parameter is a lot like work_mem, which is notoriously
> very difficult to tune. If you set it too high, you run out of memory.
> If you set it too low, you get bad plans. You can switch from having
> one of those problems to having the other very quickly as load changes,
> and sometimes you can have both at the same time. If an omniscient
> oracle could set work_mem properly for every query based not only on
> what the query does but the state of the system at that moment, it
> would still be a very crude parameter, and since omniscient oracles
> are rare in practice, problems are reasonably common. I think that if
> we add this parameter, it's going to end up in the same category. A
> lot of people will ignore it, and they'll be OK, but 30% of the people
> who do try to use it will shoot themselves in the foot, or something
> like that.

The "shoot yourself in the foot" in this case is limited to "your
UPDATE statement's performance is potentially Y times worse due to
forced FSM lookups for every update at the end of the table". I'll
admit that this is not great, but I'd say it is also not the end of
the world, and still much better than the performance differences that
you can see when the plan changes due to an updated work_mem.

I'd love to have more contextual information available on the table's
free space distribution so that this decision could be made by the
system, but that info just isn't available right now. We don't really
have infrastructure in place that would handle such information
either, and table_tuple_update does not get to use reuse state across
tuples, so any use of information will add cost for every update. With
this patch, the FSM cost is gated behind the storage parameter, and
thus only limited, but I don't think we can store much more than
storage parameters in the Relation data.

VACUUM / ANALYZE could probably create and store sketches about the
free space distribution in the relation, but that would widen the
scope significantly, and I have only limited bandwidth available for
this.
So, while I do plan to implement any small changes or fixes required
to get this in, a major change in direction for this patch won't put
it anywhere high on my active items list.


Kind regards,

Matthias van de Meent




Re: Disabling Heap-Only Tuples

2023-08-30 Thread Robert Haas
On Wed, Aug 30, 2023 at 9:31 AM Robert Haas  wrote:
> That seems like a good chance, but

*change

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Disabling Heap-Only Tuples

2023-08-30 Thread Robert Haas
On Wed, Aug 30, 2023 at 9:01 AM Matthias van de Meent
 wrote:
> I've reworked the patch a bit to remove the "excessive bloat with low
> fillfactors when local space is available" issue that this parameter
> could cause - local updates are now done if the selected page we would
> be inserting into is after the old tuple's page and the old tuple's
> page still (or: now) has space available.
>
> Does that alleviate your concerns?

That seems like a good chance, but my core concern is around people
having to micromanage local_update_limit, and probably either not
knowing how to do it properly, or not being able or willing to keep
updating it as things change.

In a way, this parameter is a lot like work_mem, which is notoriously
very difficult to tune. If you set it too high, you run out of memory.
If you set it too low, you get bad plans. You can switch from having
one of those problems to having the other very quickly as load changes,
and sometimes you can have both at the same time. If an omniscient
oracle could set work_mem properly for every query based not only on
what the query does but the state of the system at that moment, it
would still be a very crude parameter, and since omniscient oracles
are rare in practice, problems are reasonably common. I think that if
we add this parameter, it's going to end up in the same category. A
lot of people will ignore it, and they'll be OK, but 30% of the people
who do try to use it will shoot themselves in the foot, or something
like that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Disabling Heap-Only Tuples

2023-08-30 Thread Matthias van de Meent
On Mon, 28 Aug 2023 at 17:57, Robert Haas  wrote:
>
> On Mon, Aug 28, 2023 at 11:50 AM Matthias van de Meent
>  wrote:
> > Agreed on all points. But isn't that true for most tools on bloat
> > prevention and/or detection? E.g. fillfactor, autovacuum_*, ...
>
> Not nearly to the same extent, IMHO. A lot of those parameters can be
> left alone forever and you lose nothing. That's not so here.

I've reworked the patch a bit to remove the "excessive bloat with low
fillfactors when local space is available" issue that this parameter
could cause - local updates are now done if the selected page we would
be inserting into is after the old tuple's page and the old tuple's
page still (or: now) has space available.

Does that alleviate your concerns?

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)


v2-0001-Add-heap-reloption-local_update_limit.patch
Description: Binary data


Re: Disabling Heap-Only Tuples

2023-08-28 Thread Robert Haas
On Mon, Aug 28, 2023 at 11:50 AM Matthias van de Meent
 wrote:
> Agreed on all points. But isn't that true for most tools on bloat
> prevention and/or detection? E.g. fillfactor, autovacuum_*, ...

Not nearly to the same extent, IMHO. A lot of those parameters can be
left alone forever and you lose nothing. That's not so here.

> I'd prefer that too, but by lack of other work in this area this seems
> like it fills a niche that would otherwise require extremely expensive
> locking over a long time for CLUSTER, superuser+pg_repack, or manual
> scripts that update tuples until they're located on a different page
> (begin; update tuple WHERE ctid > '(12,0)' returning ctid; ...;
> commit;). I agree this is very minimal and can definitely be used as a
> footgun, but with the description that it can be a footgun I don't
> think it's (much) worse than the current situation - a user should
> only reach for this once they've realized they actually have an issue.

Well, I sort of expected that counter-argument, but I'm not sure that I buy it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Disabling Heap-Only Tuples

2023-08-28 Thread Matthias van de Meent
On Mon, 28 Aug 2023 at 17:14, Robert Haas  wrote:
>
> On Mon, Aug 28, 2023 at 10:52 AM Matthias van de Meent
>  wrote:
> > In this new patch, I've updated a few comments to get mostly within
> > line length limits; the name of the storage parameter is now
> > "local_update_limit", as per discussion on naming.
> > I've also added local_update_limit to psql's autocomplete file, and
> > added documentation on how the parameter behaves - including warnings
> > - in create_table.sgml.
>
> I feel like this is the sort of setting that experts will sometimes be
> able to use to improve the situation, and non-experts will have great
> difficulty using. It relies on the user to know what size limit will
> work out well, which probably involves knowing how much real data is
> in the table, and how that's going to change over time, and probably
> also some things about how PostgreSQL does space management
> internally. I don't know that I'd be able to guide a non-expert user
> in how to make effective use of this as a tool.

Agreed on all points. But isn't that true for most tools on bloat
prevention and/or detection? E.g. fillfactor, autovacuum_*, ...

> I don't know exactly what to propose, but I would definitely like it
> if we could come up with something with which a casual user would be
> less likely to shoot themselves in the foot and more likely to derive
> a benefit.

I'd prefer that too, but by lack of other work in this area this seems
like it fills a niche that would otherwise require extremely expensive
locking over a long time for CLUSTER, superuser+pg_repack, or manual
scripts that update tuples until they're located on a different page
(begin; update tuple WHERE ctid > '(12,0)' returning ctid; ...;
commit;). I agree this is very minimal and can definitely be used as a
footgun, but with the description that it can be a footgun I don't
think it's (much) worse than the current situation - a user should
only reach for this once they've realized they actually have an issue.
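
Spelled out, that manual workaround looks roughly like this (table and
column names are placeholders, and block 12 is just the example cutoff
from the parenthesized snippet above):

    BEGIN;
    UPDATE bigtable SET id = id WHERE ctid > '(12,0)';
    COMMIT;
    VACUUM bigtable;
    -- repeat until updates stop landing past block 12 and vacuum can
    -- truncate the tail

Each round only makes progress once the affected pages have filled up,
which is exactly what makes this approach so painful.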

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Disabling Heap-Only Tuples

2023-08-28 Thread Robert Haas
On Mon, Aug 28, 2023 at 10:52 AM Matthias van de Meent
 wrote:
> In this new patch, I've updated a few comments to get mostly within
> line length limits; the name of the storage parameter is now
> "local_update_limit", as per discussion on naming.
> I've also added local_update_limit to psql's autocomplete file, and
> added documentation on how the parameter behaves - including warnings
> - in create_table.sgml.

I feel like this is the sort of setting that experts will sometimes be
able to use to improve the situation, and non-experts will have great
difficulty using. It relies on the user to know what size limit will
work out well, which probably involves knowing how much real data is
in the table, and how that's going to change over time, and probably
also some things about how PostgreSQL does space management
internally. I don't know that I'd be able to guide a non-expert user
in how to make effective use of this as a tool.

I don't know exactly what to propose, but I would definitely like it
if we could come up with something with which a casual user would be
less likely to shoot themselves in the foot and more likely to derive
a benefit.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Disabling Heap-Only Tuples

2023-08-28 Thread Matthias van de Meent
On Wed, 19 Jul 2023 at 14:58, Laurenz Albe  wrote:
>
> On Thu, 2023-07-06 at 22:18 +0200, Matthias van de Meent wrote:
> > On Wed, 5 Jul 2023 at 19:55, Thom Brown  wrote:
> > >
> > > On Wed, 5 Jul 2023 at 18:05, Matthias van de Meent
> > >  wrote:
> > > > So what were you thinking of? A session GUC? A table option?
> > >
> > > Both.
> >
> > Here's a small patch implementing a new table option max_local_update
> > (name very much bikesheddable). Value is -1 (default, disabled) or the
> > size of the table in MiB that you still want to allow to update on the
> > same page. I didn't yet go for a GUC as I think that has too little
> > control on the impact on the system.
> >
> > I decided that max_local_update would be in MB because there is no
> > reloption value that can contain MaxBlockNumber and -1/disabled; and 1
> > MiB seems like enough granularity for essentially all use cases.
> >
> > The added regression tests show how this feature works, that the new
> > feature works, and validate that lock levels are acceptable
> > (ShareUpdateExclusiveLock, same as for updating fillfactor).
>
> I have looked at your patch, and I must say that I like it.  Having
> a size limit is better than my original idea of just "on" or "off".
> Essentially, it is "try to shrink the table if it grows above a limit".
>
> The patch builds fine and passes all regression tests.
>
> Documentation is missing.

Yes, the first patch was a working proof-of-concept. Here's a new one
with documentation.

> I agree that the name "max_local_update" could be improved.
> Perhaps "avoid_hot_above_size_mb".
>
> --- a/src/include/utils/rel.h
> +++ b/src/include/utils/rel.h
> @@ -342,6 +342,7 @@ typedef struct StdRdOptions
>      int         parallel_workers;   /* max number of parallel workers */
>      StdRdOptIndexCleanup vacuum_index_cleanup;  /* controls index vacuuming */
>      bool        vacuum_truncate;    /* enables vacuum to truncate a relation */
> +    int         max_local_update;   /* Updates to pages after this block must go through the VM */
>  } StdRdOptions;
>
>  #define HEAP_MIN_FILLFACTOR     10
>
> In the comment, it should be FSM, not VM, right?

Good catch.

In this new patch, I've updated a few comments to get mostly within
line length limits; the name of the storage parameter is now
"local_update_limit", as per discussion on naming.
I've also added local_update_limit to psql's autocomplete file, and
added documentation on how the parameter behaves - including warnings
- in create_table.sgml.

Kind regards,

Matthias van de Meent


v1-0001-Add-heap-reloption-local_update_limit.patch
Description: Binary data


Re: Disabling Heap-Only Tuples

2023-08-28 Thread Laurenz Albe
On Thu, 2023-08-24 at 18:23 +0200, Matthias van de Meent wrote:
> On Wed, 19 Jul 2023 at 15:13, Thom Brown  wrote:
> > 
> > On Wed, 19 Jul 2023, 13:58 Laurenz Albe,  wrote:
> > > I agree that the name "max_local_update" could be improved.
> > > Perhaps "avoid_hot_above_size_mb".
> > 
> > Or "hot_table_size_threshold" or "hot_update_limit"?
> 
> Although I like these names, they don't quite cover the use of the
> parameter for me, as updated tuples prefer to be inserted on the same
> page as the old tuple regardless of whether HOT applies.
> 
> How about 'local_update_limit'?

I agree with your concern.  I cannot think of a better name than yours.

Yours,
Laurenz Albe




Re: Disabling Heap-Only Tuples

2023-08-24 Thread Matthias van de Meent
On Wed, 19 Jul 2023 at 15:13, Thom Brown  wrote:
>
> On Wed, 19 Jul 2023, 13:58 Laurenz Albe,  wrote:
>> I agree that the name "max_local_update" could be improved.
>> Perhaps "avoid_hot_above_size_mb".
>
> Or "hot_table_size_threshold" or "hot_update_limit"?

Although I like these names, they don't quite cover the use of the
parameter for me, as updated tuples prefer to be inserted on the same
page as the old tuple regardless of whether HOT applies.

Example: a bloated table test(
   id int primary key,
   num_updates int,
   unique (id, num_updates)
)
would be assumed to remain bloated if I'd set a parameter named
something_hot_something, as all updates would be non-hot and thus
should not be influenced by the GUC/parameter.

How about 'local_update_limit'?

Kind regards,

Matthias van de Meent




Re: Disabling Heap-Only Tuples

2023-08-24 Thread Matthias van de Meent
On Fri, 7 Jul 2023 at 12:18, Tomas Vondra  wrote:
>
> On 7/7/23 11:55, Matthias van de Meent wrote:
>> On Fri, 7 Jul 2023 at 06:53, Dilip Kumar  wrote:
>>>
>>>
>>> So IIUC, this parameter we can control that instead of putting the new
>>> version of the tuple on the same page, it should choose using
>>> RelationGetBufferForTuple(), and that can reduce the fragmentation
>>> because now if there is space then most of the updated tuple will be
>>> inserted in same pages.  But this still can not truncate the pages
>>> from the heap right? because we can not guarantee that the new page
>>> selected by RelationGetBufferForTuple() is not from the end of the
>>> heap, and until we free the pages from the end of the heap, the vacuum
>>> can not truncate any page.  Is my understanding correct?
>>
>> Yes. If you don't have pages with (enough) free space for the updated
>> tuples in your table, or if the FSM doesn't accurately reflect the
>> actual state of free space in your table, this won't help (which is
>> also the reason why I run vacuum in the tests). It also won't help if
>> you don't update the tuples physically located at the end of your
>> table, but in the targeted workload this would introduce a bias where
>> new tuple versions are moved to the front of the table.
>>
>> Something to note is that this may result in very bad bloat when this
>> is combined with a low fillfactor: All blocks past max_local_update
>> will be unable to use space reserved by fillfactor because FSM lookups
>> always take fillfactor into account, and all updates (which ignore
>> fillfactor when local) would go through the FSM instead, thus reducing
>> the space available on each block to exactly the fillfactor. So, this
>> might need some extra code to make sure we don't accidentally blow up
>> the table's size with UPDATEs when max_local_update is combined with
>> low fillfactors. I'm not sure where that would fit best.
>>
>
> I know the thread started as "let's disable HOT" and this essentially
> just proposes to do that using a table option. But I wonder if that's
> far too simple to be reliable, because hoping RelationGetBufferForTuple
> happens to do the right thing does not seem great.
>
> I wonder if we should invent some definition of "strategy" that would
> tell RelationGetBufferForTuple what it should aim for ...
>
> I'm imagining either a table option with a couple possible values
> (default, non-hot, first-page, ...) or maybe something even more
> elaborate (perhaps even a callback?).

I mostly agree, but the point is that first we have to get the update
away from the page. Once we've done that, we can start getting smart
about placement in RelationGetBufferForTuple, but unless we decide to
not put the tuple on the old tuple's page no code from
RelationGetBufferForTuple is executed.

We could change the update code to always go through
RelationGetBufferForTuple to determine the target buffer, and make
that function consider page-local updates (instead of heap_update, who
does that now), but I think that'd need significant extra work in
other callsites of RelationGetBufferForTuple as well as that function
itself.

> Now, it's not my intention to hijack this thread, but this discussion
> reminds me of one of the ideas from my "BRIN improvements" talk, about
> maybe using BRIN indexes for routing. UPDATEs may be a major issue for
> BRIN, making them gradually worse over time. If we could "tell"
> RelationGetBufferForTuple() which buffers are more suitable (by looking
> at an index, histogram or some approximate mapping), that might help.

Improved tuple routing sounds like a great idea, and I've thought
about it as well. I'm not sure whether BRIN (as-is) is the best
candidate though, considering its O(N) scan complexity - 100GB-scale
tables can reasonably have BRIN indexes of MBs, and running a scan on
that is not likely to have good performance.
If BRIN had hierarchical summaries (e.g. if we had range summaries for
data stored in every nonnegative power of 16 of page ranges) then we
could reduce that to something more reasonable, but that's not
currently implemented and so I don't think that's quite relevant yet.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech/)




Re: Disabling Heap-Only Tuples

2023-07-19 Thread Thom Brown
On Wed, 19 Jul 2023, 13:58 Laurenz Albe,  wrote:

> On Thu, 2023-07-06 at 22:18 +0200, Matthias van de Meent wrote:
> > On Wed, 5 Jul 2023 at 19:55, Thom Brown  wrote:
> > >
> > > On Wed, 5 Jul 2023 at 18:05, Matthias van de Meent
> > >  wrote:
> > > > So what were you thinking of? A session GUC? A table option?
> > >
> > > Both.
> >
> > Here's a small patch implementing a new table option max_local_update
> > (name very much bikesheddable). Value is -1 (default, disabled) or the
> > size of the table in MiB that you still want to allow to update on the
> > same page. I didn't yet go for a GUC as I think that has too little
> > control on the impact on the system.
> >
> > I decided that max_local_update would be in MB because there is no
> > reloption value that can contain MaxBlockNumber and -1/disabled; and 1
> > MiB seems like enough granularity for essentially all use cases.
> >
> > The added regression tests show how this feature works, that the new
> > feature works, and validate that lock levels are acceptable
> > (ShareUpdateExclusiveLock, same as for updating fillfactor).
>
> I have looked at your patch, and I must say that I like it.  Having
> a size limit is better than my original idea of just "on" or "off".
> Essentially, it is "try to shrink the table if it grows above a limit".
>
> The patch builds fine and passes all regression tests.
>
> Documentation is missing.
>
> I agree that the name "max_local_update" could be improved.
> Perhaps "avoid_hot_above_size_mb".
>

Or "hot_table_size_threshold" or "hot_update_limit"?

Thom


Re: Disabling Heap-Only Tuples

2023-07-19 Thread Laurenz Albe
On Thu, 2023-07-06 at 22:18 +0200, Matthias van de Meent wrote:
> On Wed, 5 Jul 2023 at 19:55, Thom Brown  wrote:
> > 
> > On Wed, 5 Jul 2023 at 18:05, Matthias van de Meent
> >  wrote:
> > > So what were you thinking of? A session GUC? A table option?
> > 
> > Both.
> 
> Here's a small patch implementing a new table option max_local_update
> (name very much bikesheddable). Value is -1 (default, disabled) or the
> size of the table in MiB that you still want to allow to update on the
> same page. I didn't yet go for a GUC as I think that has too little
> control on the impact on the system.
> 
> I decided that max_local_update would be in MB because there is no
> reloption value that can contain MaxBlockNumber and -1/disabled; and 1
> MiB seems like enough granularity for essentially all use cases.
> 
> The added regression tests show how this feature works, that the new
> feature works, and validate that lock levels are acceptable
> (ShareUpdateExclusiveLock, same as for updating fillfactor).

I have looked at your patch, and I must say that I like it.  Having
a size limit is better than my original idea of just "on" or "off".
Essentially, it is "try to shrink the table if it grows above a limit".

The patch builds fine and passes all regression tests.

Documentation is missing.

I agree that the name "max_local_update" could be improved.
Perhaps "avoid_hot_above_size_mb".

--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -342,6 +342,7 @@ typedef struct StdRdOptions
     int         parallel_workers;   /* max number of parallel workers */
     StdRdOptIndexCleanup vacuum_index_cleanup;  /* controls index vacuuming */
     bool        vacuum_truncate;    /* enables vacuum to truncate a relation */
+    int         max_local_update;   /* Updates to pages after this block must go through the VM */
 } StdRdOptions;
 
 #define HEAP_MIN_FILLFACTOR     10

In the comment, it should be FSM, not VM, right?

Other than that, I see nothing wrong.

Yours,
Laurenz Albe




Re: Disabling Heap-Only Tuples

2023-07-07 Thread Ants Aasma
On Fri, 7 Jul 2023 at 13:18, Tomas Vondra  wrote:
> On 7/7/23 11:55, Matthias van de Meent wrote:
> > On Fri, 7 Jul 2023 at 06:53, Dilip Kumar  wrote:
> >>
> >> On Fri, Jul 7, 2023 at 1:48 AM Matthias van de Meent
> >>  wrote:
> >>>
> >>> On Wed, 5 Jul 2023 at 19:55, Thom Brown  wrote:
> 
>  On Wed, 5 Jul 2023 at 18:05, Matthias van de Meent
>   wrote:
> > So what were you thinking of? A session GUC? A table option?
> 
>  Both.
> >>>
> >>> Here's a small patch implementing a new table option max_local_update
> >>> (name very much bikesheddable). Value is -1 (default, disabled) or the
> >>> size of the table in MiB that you still want to allow to update on the
> >>> same page. I didn't yet go for a GUC as I think that has too little
> >>> control on the impact on the system.
> >>
> >> So IIUC, this parameter we can control that instead of putting the new
> >> version of the tuple on the same page, it should choose using
> >> RelationGetBufferForTuple(), and that can reduce the fragmentation
> >> because now if there is space then most of the updated tuple will be
> >> inserted in same pages.  But this still can not truncate the pages
> >> from the heap right? because we can not guarantee that the new page
> >> selected by RelationGetBufferForTuple() is not from the end of the
> >> heap, and until we free the pages from the end of the heap, the vacuum
> >> can not truncate any page.  Is my understanding correct?
> >
> > Yes. If you don't have pages with (enough) free space for the updated
> > tuples in your table, or if the FSM doesn't accurately reflect the
> > actual state of free space in your table, this won't help (which is
> > also the reason why I run vacuum in the tests). It also won't help if
> > you don't update the tuples physically located at the end of your
> > table, but in the targeted workload this would introduce a bias where
> > new tuple versions are moved to the front of the table.
> >
> > Something to note is that this may result in very bad bloat when this
> > is combined with a low fillfactor: All blocks past max_local_update
> > will be unable to use space reserved by fillfactor because FSM lookups
> > always take fillfactor into account, and all updates (which ignore
> > fillfactor when local) would go through the FSM instead, thus reducing
> > the space available on each block to exactly the fillfactor. So, this
> > might need some extra code to make sure we don't accidentally blow up
> > the table's size with UPDATEs when max_local_update is combined with
> > low fillfactors. I'm not sure where that would fit best.
> >
>
> I know the thread started as "let's disable HOT" and this essentially
> just proposes to do that using a table option. But I wonder if that's
> far too simple to be reliable, because hoping RelationGetBufferForTuple
> happens to do the right thing does not seem great.
>
> I wonder if we should invent some definition of "strategy" that would
> tell RelationGetBufferForTuple what it should aim for ...
>
> I'm imagining either a table option with a couple possible values
> (default, non-hot, first-page, ...) or maybe something even more
> elaborate (perhaps even a callback?).
>
> Now, it's not my intention to hijack this thread, but this discussion
> reminds me of one of the ideas from my "BRIN improvements" talk, about
> maybe using BRIN indexes for routing. UPDATEs may be a major issue for
> BRIN, making them gradually worse over time. If we could "tell"
> RelationGetBufferForTuple() which buffers are more suitable (by looking
> at an index, histogram or some approximate mapping), that might help.

Just as another point in support of strategy based/extensible tuple
placement, I would at some point try out placing INSERT ON CONFLICT
tuples on the same page as the preceding key in the index. Use case is
in tables with (series, timestamp) primary key to get locality of
access range scanning for a single series. Placement will always be a
tradeoff that is dependent on hardware and workload, and the effect
can be pretty large. For the mentioned use case, if placement can
maintain some semblance of clustering, there will be a 10-100x
reduction in buffers accessed for a relatively minor increase in
bloat.
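
For concreteness, the kind of schema I have in mind looks something like
this (names invented for illustration):

    CREATE TABLE metrics (
        series_id bigint,
        ts        timestamptz,
        value     float8,
        PRIMARY KEY (series_id, ts)
    );
    -- the hope: keep each series_id's rows clustered on a few pages so a
    -- range scan over one series touches far fewer buffers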

--
Ants Aasma
Senior Database Engineer
www.cybertec-postgresql.com




Re: Disabling Heap-Only Tuples

2023-07-07 Thread Thom Brown
On Thu, 6 Jul 2023 at 21:18, Matthias van de Meent
 wrote:
>
> On Wed, 5 Jul 2023 at 19:55, Thom Brown  wrote:
> >
> > On Wed, 5 Jul 2023 at 18:05, Matthias van de Meent
> >  wrote:
> > > So what were you thinking of? A session GUC? A table option?
> >
> > Both.
>
> Here's a small patch implementing a new table option max_local_update
> (name very much bikesheddable). Value is -1 (default, disabled) or the
> size of the table in MiB that you still want to allow to update on the
> same page. I didn't yet go for a GUC as I think that has too little
> control on the impact on the system.
>
> I decided that max_local_update would be in MB because there is no
> reloption value that can contain MaxBlockNumber and -1/disabled; and 1
> MiB seems like enough granularity for essentially all use cases.
>
> The added regression tests show how this feature works, that the new
> feature works, and validate that lock levels are acceptable
> (ShareUpdateExclusiveLock, same as for updating fillfactor).

Wow, thanks for working on this.

I've given it a test, and it does what I would expect it to do.

I'm aware of the concerns about the potential for the relocation to
land in an undesirable location, so perhaps that needs addressing.
But this is already considerably better than the current need to
update a row until it gets pushed off its current page.  Ideally there
would be tooling built around this where the user wouldn't need to
figure out how much of the table to UPDATE, or deal with VACUUMing
concerns.

But here's my quick test:

CREATE OR REPLACE FUNCTION compact_table(table_name IN TEXT)
RETURNS VOID AS $$
DECLARE
    current_row RECORD;
    old_ctid TID;
    new_ctid TID;
    keys TEXT;
    update_query TEXT;
    row_counter INTEGER := 0;
BEGIN
    SELECT string_agg(a.attname || ' = ' || a.attname, ', ')
    INTO keys
    FROM pg_index i
    JOIN pg_attribute a ON a.attnum = ANY(i.indkey)
    WHERE i.indrelid = table_name::regclass
      AND a.attrelid = table_name::regclass
      AND i.indisprimary;

    IF keys IS NULL THEN
        RAISE EXCEPTION 'Table % does not have a primary key.', table_name;
    END IF;

    FOR current_row IN
        EXECUTE FORMAT('SELECT ctid, * FROM %I ORDER BY ctid DESC', table_name)
    LOOP
        old_ctid := current_row.ctid;

        update_query := FORMAT('UPDATE %I SET %s WHERE ctid = $1 RETURNING ctid',
                               table_name, keys);
        EXECUTE update_query USING old_ctid INTO new_ctid;

        row_counter := row_counter + 1;

        IF row_counter % 1000 = 0 THEN
            RAISE NOTICE '% rows relocated.', row_counter;
        END IF;

        IF new_ctid <= old_ctid THEN
            CONTINUE;
        ELSE
            RAISE NOTICE 'All non-contiguous rows relocated.';
            EXIT;
        END IF;
    END LOOP;
END; $$
LANGUAGE plpgsql;


postgres=# CREATE TABLE bigtable (id int, content text);
CREATE TABLE
postgres=# INSERT INTO bigtable SELECT x, 'This is just a way to fill
up space.' FROM generate_series(1,10000000) a(x);
INSERT 0 10000000
postgres=# DELETE FROM bigtable WHERE id % 7 = 0;
DELETE 1428571
postgres=# VACUUM bigtable;
VACUUM
postgres=# ALTER TABLE bigtable SET (max_local_update = 0);
ALTER TABLE
postgres=# ALTER TABLE bigtable ADD PRIMARY KEY (id);
ALTER TABLE
postgres=# \dt+ bigtable
                                      List of relations
 Schema |   Name   | Type  | Owner | Persistence | Access method |  Size  | Description
--------+----------+-------+-------+-------------+---------------+--------+-------------
 public | bigtable | table | thom  | permanent   | heap          | 730 MB |
(1 row)

postgres=# SELECT * FROM pgstattuple('bigtable');
 table_len | tuple_count | tuple_len | tuple_percent | dead_tuple_count | dead_tuple_len | dead_tuple_percent | free_space | free_percent
-----------+-------------+-----------+---------------+------------------+----------------+--------------------+------------+--------------
 765607936 |     8571429 | 557142885 |         72.77 |                0 |              0 |                  0 |  105901628 |        13.83
(1 row)

postgres=# SELECT compact_table('bigtable');
NOTICE:  1000 rows relocated.
NOTICE:  2000 rows relocated.
NOTICE:  3000 rows relocated.
NOTICE:  4000 rows relocated.
...
NOTICE:  1221000 rows relocated.
NOTICE:  1222000 rows relocated.
NOTICE:  1223000 rows relocated.
NOTICE:  1224000 rows relocated.
NOTICE:  All non-contiguous rows relocated.
 compact_table
---

(1 row)

postgres=# VACUUM bigtable;
VACUUM
postgres=# \dt+ bigtable;
                                      List of relations
 Schema |   Name   | Type  | Owner | Persistence | Access method |  Size  | Description
--------+----------+-------+-------+-------------+---------------+--------+-------------
 public | bigtable | table | thom  | permanent   | heap          | 626 MB |
(1 row)

postgres=# SELECT * FROM pgstattuple('bigtable');
 table_len | tuple_count | tuple_len | tuple_percent |
dead_tuple_count | dead_tuple_len | 

Re: Disabling Heap-Only Tuples

2023-07-07 Thread Laurenz Albe
On Fri, 2023-07-07 at 16:27 +0530, Dilip Kumar wrote:
> On Fri, Jul 7, 2023 at 3:48 PM Tomas Vondra  
> wrote:
> > I'm imagining either a table option with a couple possible values
> > (default, non-hot, first-page, ...) or maybe something even more
> > elaborate (perhaps even a callback?).
> > 
> > Now, it's not my intention to hijack this thread, but this discussion
> > reminds me of one of the ideas from my "BRIN improvements" talk, about
> > maybe using BRIN indexes for routing. UPDATEs may be a major issue for
> > BRIN, making them gradually worse over time. If we could "tell"
> > RelationGetBufferForTuple() which buffers are more suitable (by looking
> > at an index, histogram or some approximate mapping), that might help.
> 
> IMHO that seems like the right direction for this feature to be
> useful.

Right, I agree.  A GUC/storage parameter like "update_strategy"
that is an enum (try-hot | first-page | ...).
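
Purely as an illustration of the idea (nothing like this exists, and the
names are invented):

    ALTER TABLE bigtable SET (update_strategy = 'first-page');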

To preserve BRIN indexes or CLUSTERed tables, there could be an additional
"insert_strategy", but that would somehow have to be tied to a certain
index.  I think that is out of scope for this effort.

Yours,
Laurenz Albe




Re: Disabling Heap-Only Tuples

2023-07-07 Thread Dilip Kumar
On Fri, Jul 7, 2023 at 3:48 PM Tomas Vondra
 wrote:
>

> On 7/7/23 11:55, Matthias van de Meent wrote:
> > On Fri, 7 Jul 2023 at 06:53, Dilip Kumar  wrote:
> >>
> >> On Fri, Jul 7, 2023 at 1:48 AM Matthias van de Meent
> >>  wrote:
> >>>
> >>> On Wed, 5 Jul 2023 at 19:55, Thom Brown  wrote:
> 
>  On Wed, 5 Jul 2023 at 18:05, Matthias van de Meent
>   wrote:
> > So what were you thinking of? A session GUC? A table option?
> 
>  Both.
> >>>
> >>> Here's a small patch implementing a new table option max_local_update
> >>> (name very much bikesheddable). Value is -1 (default, disabled) or the
> >>> size of the table in MiB that you still want to allow to update on the
> >>> same page. I didn't yet go for a GUC as I think that has too little
> >>> control on the impact on the system.
> >>
> >> So IIUC, this parameter we can control that instead of putting the new
> >> version of the tuple on the same page, it should choose using
> >> RelationGetBufferForTuple(), and that can reduce the fragmentation
> >> because now if there is space then most of the updated tuple will be
> >> inserted in same pages.  But this still can not truncate the pages
> >> from the heap right? because we can not guarantee that the new page
> >> selected by RelationGetBufferForTuple() is not from the end of the
> >> heap, and until we free the pages from the end of the heap, the vacuum
> >> can not truncate any page.  Is my understanding correct?
> >
> > Yes. If you don't have pages with (enough) free space for the updated
> > tuples in your table, or if the FSM doesn't accurately reflect the
> > actual state of free space in your table, this won't help (which is
> > also the reason why I run vacuum in the tests). It also won't help if
> > you don't update the tuples physically located at the end of your
> > table, but in the targeted workload this would introduce a bias where
> > new tuple versions are moved to the front of the table.
> >
> > Something to note is that this may result in very bad bloat when this
> > is combined with a low fillfactor: All blocks past max_local_update
> > will be unable to use space reserved by fillfactor because FSM lookups
> > always take fillfactor into account, and all updates (which ignore
> > fillfactor when local) would go through the FSM instead, thus reducing
> > the space available on each block to exactly the fillfactor. So, this
> > might need some extra code to make sure we don't accidentally blow up
> > the table's size with UPDATEs when max_local_update is combined with
> > low fillfactors. I'm not sure where that would fit best.
> >
>
> I know the thread started as "let's disable HOT" and this essentially
> just proposes to do that using a table option. But I wonder if that's
> far too simple to be reliable, because hoping RelationGetBufferForTuple
> happens to do the right thing does not seem great.
>
> I wonder if we should invent some definition of "strategy" that would
> tell RelationGetBufferForTuple what it should aim for ...
>
> I'm imagining either a table option with a couple possible values
> (default, non-hot, first-page, ...) or maybe something even more
> elaborate (perhaps even a callback?).
>
> Now, it's not my intention to hijack this thread, but this discussion
> reminds me of one of the ideas from my "BRIN improvements" talk, about
> maybe using BRIN indexes for routing. UPDATEs may be a major issue for
> BRIN, making them gradually worse over time. If we could "tell"
> RelationGetBufferForTuple() which buffers are more suitable (by looking
> at an index, histogram or some approximate mapping), that might help.

IMHO that seems like the right direction for this feature to be
useful.  Otherwise just forcing it to select a page using
RelationGetBufferForTuple() without any input or direction to this
function can behave pretty randomly.  In fact, there should be some
way to say insert a new tuple in a smaller block number first
(provided they have free space) and with that, we might get an
opportunity to truncate some heap pages by vacuum.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Disabling Heap-Only Tuples

2023-07-07 Thread Tomas Vondra



On 7/7/23 11:55, Matthias van de Meent wrote:
> On Fri, 7 Jul 2023 at 06:53, Dilip Kumar  wrote:
>>
>> On Fri, Jul 7, 2023 at 1:48 AM Matthias van de Meent
>>  wrote:
>>>
>>> On Wed, 5 Jul 2023 at 19:55, Thom Brown  wrote:

 On Wed, 5 Jul 2023 at 18:05, Matthias van de Meent
  wrote:
> So what were you thinking of? A session GUC? A table option?

 Both.
>>>
>>> Here's a small patch implementing a new table option max_local_update
>>> (name very much bikesheddable). Value is -1 (default, disabled) or the
>>> size of the table in MiB that you still want to allow to update on the
>>> same page. I didn't yet go for a GUC as I think that has too little
>>> control on the impact on the system.
>>
>> So IIUC, this parameter we can control that instead of putting the new
>> version of the tuple on the same page, it should choose using
>> RelationGetBufferForTuple(), and that can reduce the fragmentation
>> because now if there is space then most of the updated tuple will be
>> inserted in same pages.  But this still can not truncate the pages
>> from the heap right? because we can not guarantee that the new page
>> selected by RelationGetBufferForTuple() is not from the end of the
>> heap, and until we free the pages from the end of the heap, the vacuum
>> can not truncate any page.  Is my understanding correct?
> 
> Yes. If you don't have pages with (enough) free space for the updated
> tuples in your table, or if the FSM doesn't accurately reflect the
> actual state of free space in your table, this won't help (which is
> also the reason why I run vacuum in the tests). It also won't help if
> you don't update the tuples physically located at the end of your
> table, but in the targeted workload this would introduce a bias where
> new tuple versions are moved to the front of the table.
> 
> Something to note is that this may result in very bad bloat when this
> is combined with a low fillfactor: All blocks past max_local_update
> will be unable to use space reserved by fillfactor because FSM lookups
> always take fillfactor into account, and all updates (which ignore
> fillfactor when local) would go through the FSM instead, thus reducing
> the space available on each block to exactly the fillfactor. So, this
> might need some extra code to make sure we don't accidentally blow up
> the table's size with UPDATEs when max_local_update is combined with
> low fillfactors. I'm not sure where that would fit best.
> 

I know the thread started as "let's disable HOT" and this essentially
just proposes to do that using a table option. But I wonder if that's
far too simple to be reliable, because hoping RelationGetBufferForTuple
happens to do the right thing does not seem great.

I wonder if we should invent some definition of "strategy" that would
tell RelationGetBufferForTuple what it should aim for ...

I'm imagining either a table option with a couple possible values
(default, non-hot, first-page, ...) or maybe something even more
elaborate (perhaps even a callback?).

Now, it's not my intention to hijack this thread, but this discussion
reminds me of one of the ideas from my "BRIN improvements" talk, about
maybe using BRIN indexes for routing. UPDATEs may be a major issue for
BRIN, making them gradually worse over time. If we could "tell"
RelationGetBufferForTuple() which buffers are more suitable (by looking
at an index, histogram or some approximate mapping), that might help.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Disabling Heap-Only Tuples

2023-07-07 Thread Matthias van de Meent
On Fri, 7 Jul 2023 at 06:53, Dilip Kumar  wrote:
>
> On Fri, Jul 7, 2023 at 1:48 AM Matthias van de Meent
>  wrote:
> >
> > On Wed, 5 Jul 2023 at 19:55, Thom Brown  wrote:
> > >
> > > On Wed, 5 Jul 2023 at 18:05, Matthias van de Meent
> > >  wrote:
> > > > So what were you thinking of? A session GUC? A table option?
> > >
> > > Both.
> >
> > Here's a small patch implementing a new table option max_local_update
> > (name very much bikesheddable). Value is -1 (default, disabled) or the
> > size of the table in MiB that you still want to allow to update on the
> > same page. I didn't yet go for a GUC as I think that has too little
> > control on the impact on the system.
>
> So IIUC, this parameter we can control that instead of putting the new
> version of the tuple on the same page, it should choose using
> RelationGetBufferForTuple(), and that can reduce the fragmentation
> because now if there is space then most of the updated tuple will be
> inserted in same pages.  But this still can not truncate the pages
> from the heap right? because we can not guarantee that the new page
> selected by RelationGetBufferForTuple() is not from the end of the
> heap, and until we free the pages from the end of the heap, the vacuum
> can not truncate any page.  Is my understanding correct?

Yes. If you don't have pages with (enough) free space for the updated
tuples in your table, or if the FSM doesn't accurately reflect the
actual state of free space in your table, this won't help (which is
also the reason why I run vacuum in the tests). It also won't help if
you don't update the tuples physically located at the end of your
table, but in the targeted workload this would introduce a bias where
new tuple versions are moved to the front of the table.

Something to note is that this may result in very bad bloat when this
is combined with a low fillfactor: All blocks past max_local_update
will be unable to use space reserved by fillfactor because FSM lookups
always take fillfactor into account, and all updates (which ignore
fillfactor when local) would go through the FSM instead, thus reducing
the space available on each block to exactly the fillfactor. So, this
might need some extra code to make sure we don't accidentally blow up
the table's size with UPDATEs when max_local_update is combined with
low fillfactors. I'm not sure where that would fit best.
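
To give a rough sense of the scale (illustrative numbers only): with
fillfactor = 20 on 8 kB pages, an FSM lookup treats a page as full once
roughly 8192 * 0.20, i.e. about 1.6 kB, is used, so tuples forced through
the FSM past max_local_update would spread over about five times as many
pages as the same data packed to 100%.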

Kind regards,

Matthias van de Meent
Neon (https://neon.tech/)




Re: Disabling Heap-Only Tuples

2023-07-06 Thread Dilip Kumar
On Fri, Jul 7, 2023 at 1:48 AM Matthias van de Meent
 wrote:
>
> On Wed, 5 Jul 2023 at 19:55, Thom Brown  wrote:
> >
> > On Wed, 5 Jul 2023 at 18:05, Matthias van de Meent
> >  wrote:
> > > So what were you thinking of? A session GUC? A table option?
> >
> > Both.
>
> Here's a small patch implementing a new table option max_local_update
> (name very much bikesheddable). Value is -1 (default, disabled) or the
> size of the table in MiB that you still want to allow to update on the
> same page. I didn't yet go for a GUC as I think that has too little
> control on the impact on the system.

So IIUC, this parameter we can control that instead of putting the new
version of the tuple on the same page, it should choose using
RelationGetBufferForTuple(), and that can reduce the fragmentation
because now if there is space then most of the updated tuple will be
inserted in same pages.  But this still can not truncate the pages
from the heap right? because we can not guarantee that the new page
selected by RelationGetBufferForTuple() is not from the end of the
heap, and until we free the pages from the end of the heap, the vacuum
can not truncate any page.  Is my understanding correct?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Disabling Heap-Only Tuples

2023-07-06 Thread Matthias van de Meent
On Wed, 5 Jul 2023 at 19:55, Thom Brown  wrote:
>
> On Wed, 5 Jul 2023 at 18:05, Matthias van de Meent
>  wrote:
> > So what were you thinking of? A session GUC? A table option?
>
> Both.

Here's a small patch implementing a new table option max_local_update
(name very much bikesheddable). Value is -1 (default, disabled) or the
size of the table in MiB that you still want to allow to update on the
same page. I didn't yet go for a GUC as I think that has too little
control on the impact on the system.

I decided that max_local_update would be in MB because there is no
reloption value that can contain MaxBlockNumber and -1/disabled; and 1
MiB seems like enough granularity for essentially all use cases.

The added regression tests show how this feature works, that the new
feature works, and validate that lock levels are acceptable
(ShareUpdateExclusiveLock, same as for updating fillfactor).
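
To illustrate the intended use (the values are arbitrary):

    ALTER TABLE bigtable SET (max_local_update = 100);
    -- same-page updates are now only attempted within the first 100 MiB;
    -- updates to tuples beyond that go through the FSM

    ALTER TABLE bigtable RESET (max_local_update);
    -- back to the default of -1, i.e. the limit is disabled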


Kind regards,

Matthias van de Meent
Neon (https://neon.tech/)


v1-0001-Implement-a-reloption-that-forces-updated-tuples-.patch
Description: Binary data


Re: Disabling Heap-Only Tuples

2023-07-05 Thread Laurenz Albe
On Wed, 2023-07-05 at 12:02 +0100, Thom Brown wrote:
> On Wed, 5 Jul 2023 at 11:57, Matthias van de Meent 
>  wrote:
> > On Wed, 5 Jul 2023 at 12:45, Thom Brown  wrote:
> > > Heap-Only Tuple (HOT) updates are a significant performance
> > > enhancement, as they prevent unnecessary page writes. However, HOT
> > > comes with a caveat: it means that if we have lots of available space
> > > earlier on in the relation, it can only be used for new tuples or in
> > > cases where there's insufficient space on a page for an UPDATE to use
> > > HOT.
> > > 
> > > Considering these trade-offs, I'd like to propose an option to allow
> > > superusers to disable HOT on tables. The intent is to trade some
> > > performance benefits for the ability to reduce the size of a table
> > > without the typical locking associated with it.
> > 
> > Interesting use case, but I think that disabling HOT would be missing
> > the forest for the trees. I think that a feature that disables
> > block-local updates for pages > some offset would be a better solution
> > to your issue: Normal updates also prefer the new tuple to be stored
> > in the same pages as the old tuple if at all possible, so disabling
> > HOT wouldn't solve the issue of tuples residing in the tail of your
> > table - at least not while there is still empty space in those pages.
> 
> Hmm... I see your point.  It's when an UPDATE isn't going to land on
> the same page that it relocates to the earlier available page.  So I
> guess I'm after whatever mechanism would allow that to happen reliably
> and predictably.
> 
> So $subject should really be "Allow forcing UPDATEs off the same page".

I've been thinking about the same thing - an option that changes the update
strategy to always use the lowest block with enough free space.

That would allow consolidating bloated tables with no downtime.

Yours,
Laurenz Albe




Re: Disabling Heap-Only Tuples

2023-07-05 Thread Thom Brown
On Wed, 5 Jul 2023 at 18:05, Matthias van de Meent
 wrote:
>
> On Wed, 5 Jul 2023 at 14:39, Thom Brown  wrote:
> >
> > On Wed, 5 Jul 2023 at 13:12, Matthias van de Meent
> >  wrote:
> > >
> > > On Wed, 5 Jul 2023 at 13:03, Thom Brown  wrote:
> > > >
> > > > On Wed, 5 Jul 2023 at 11:57, Matthias van de Meent
> > > >  wrote:
> > > > >
> > > > > On Wed, 5 Jul 2023 at 12:45, Thom Brown  wrote:
> > > > > > Heap-Only Tuple (HOT) updates are a significant performance
> > > > > > enhancement, as they prevent unnecessary page writes. However, HOT
> > > > > > comes with a caveat: it means that if we have lots of available 
> > > > > > space
> > > > > > earlier on in the relation, it can only be used for new tuples or in
> > > > > > cases where there's insufficient space on a page for an UPDATE to 
> > > > > > use
> > > > > > HOT.
> > > > > >
> > > > > > This mechanism limits our options for condensing tables, forcing us 
> > > > > > to
> > > > > > resort to methods like running VACUUM FULL/CLUSTER or using external
> > > > > > tools like pg_repack. These either require exclusive locks (which 
> > > > > > will
> > > > > > be a deal-breaker on large tables on a production system), or 
> > > > > > there's
> > > > > > risks involved. Of course we can always flood pages with new 
> > > > > > versions
> > > > > > of a row until it's forced onto an early page, but that shouldn't be
> > > > > > necessary.
> > > > > >
> > > > > > Considering these trade-offs, I'd like to propose an option to allow
> > > > > > superusers to disable HOT on tables. The intent is to trade some
> > > > > > performance benefits for the ability to reduce the size of a table
> > > > > > without the typical locking associated with it.
> > > > >
> > > > > Interesting use case, but I think that disabling HOT would be missing
> > > > > the forest for the trees. I think that a feature that disables
> > > > > block-local updates for pages > some offset would be a better solution
> > > > > to your issue: Normal updates also prefer the new tuple to be stored
> > > > > in the same pages as the old tuple if at all possible, so disabling
> > > > > HOT wouldn't solve the issue of tuples residing in the tail of your
> > > > > table - at least not while there is still empty space in those pages.
> > > >
> > > > Hmm... I see your point.  It's when an UPDATE isn't going to land on
> > > > the same page that it relocates to the earlier available page.  So I
> > > > guess I'm after whatever mechanism would allow that to happen reliably
> > > > and predictably.
> > > >
> > > > So $subject should really be "Allow forcing UPDATEs off the same page".
> > >
> > > You'd probably want to do that only for a certain range of the table -
> > > for a table with 1GB of data and 3GB of bloat there is no good reason
> > > to force page-crossing updates in the first 1GB of the table - all
> > > tuples of the table will eventually reside there, so why would you
> > > take a performance penalty and move the tuples from inside that range
> > > to inside that same range?
> >
> > I'm thinking more of a case of:
> >
> > 
> >
> > UPDATE bigtable
> > SET primary key = primary key
> > WHERE ctid IN (
> > SELECT ctid
> > FROM bigtable
> > ORDER BY ctid DESC
> > LIMIT 10);
>
> So what were you thinking of? A session GUC? A table option?

Both.

> The benefit of a table option is that it is retained across sessions
> and thus allows tables that get enough updates to eventually get to a
> cleaner state. The main downside of such a table option is that it
> requires a temporary table-level lock to update the parameter.

Yes, but the maintenance window to make such a change would be extremely brief.

> The benefit of a session GUC is that you can set it without impacting
> other sessions, but the downside is that you need to do the
> maintenance in that session, and risk that cascading updates to other
> tables (e.g. through AFTER UPDATE triggers) are also impacted by this
> non-local update GUC.
>
> > > Something else to note: Indexes would suffer some (large?) amount of
> > > bloat in this process, as you would be updating a lot of tuples
> > > without the HOT optimization, thus increasing the work to be done by
> > > VACUUM.
> > > This may result in more bloat in indexes than what you get back from
> > > shrinking the table.
> >
> > This could be the case, but I guess indexes are expendable to an
> > extent, unlike tables.
>
> I don't think that's accurate - index rebuilds are quite expensive.
> But, that's besides the point of this thread.
>
> Somewhat related: did you consider using pg_repack instead of this
> potential feature?

pg_repack isn't exactly innocuous, and can potentially leave the
database in an irrecoverable state.  Plus, if disk space is an issue, it
doesn't help.

Thom




Re: Disabling Heap-Only Tuples

2023-07-05 Thread Matthias van de Meent
On Wed, 5 Jul 2023 at 14:39, Thom Brown  wrote:
>
> On Wed, 5 Jul 2023 at 13:12, Matthias van de Meent
>  wrote:
> >
> > On Wed, 5 Jul 2023 at 13:03, Thom Brown  wrote:
> > >
> > > On Wed, 5 Jul 2023 at 11:57, Matthias van de Meent
> > >  wrote:
> > > >
> > > > On Wed, 5 Jul 2023 at 12:45, Thom Brown  wrote:
> > > > > Heap-Only Tuple (HOT) updates are a significant performance
> > > > > enhancement, as they prevent unnecessary page writes. However, HOT
> > > > > comes with a caveat: it means that if we have lots of available space
> > > > > earlier on in the relation, it can only be used for new tuples or in
> > > > > cases where there's insufficient space on a page for an UPDATE to use
> > > > > HOT.
> > > > >
> > > > > This mechanism limits our options for condensing tables, forcing us to
> > > > > resort to methods like running VACUUM FULL/CLUSTER or using external
> > > > > tools like pg_repack. These either require exclusive locks (which will
> > > > > be a deal-breaker on large tables on a production system), or there's
> > > > > risks involved. Of course we can always flood pages with new versions
> > > > > of a row until it's forced onto an early page, but that shouldn't be
> > > > > necessary.
> > > > >
> > > > > Considering these trade-offs, I'd like to propose an option to allow
> > > > > superusers to disable HOT on tables. The intent is to trade some
> > > > > performance benefits for the ability to reduce the size of a table
> > > > > without the typical locking associated with it.
> > > >
> > > > Interesting use case, but I think that disabling HOT would be missing
> > > > the forest for the trees. I think that a feature that disables
> > > > block-local updates for pages > some offset would be a better solution
> > > > to your issue: Normal updates also prefer the new tuple to be stored
> > > > in the same pages as the old tuple if at all possible, so disabling
> > > > HOT wouldn't solve the issue of tuples residing in the tail of your
> > > > table - at least not while there is still empty space in those pages.
> > >
> > > Hmm... I see your point.  It's when an UPDATE isn't going to land on
> > > the same page that it relocates to the earlier available page.  So I
> > > guess I'm after whatever mechanism would allow that to happen reliably
> > > and predictably.
> > >
> > > So $subject should really be "Allow forcing UPDATEs off the same page".
> >
> > You'd probably want to do that only for a certain range of the table -
> > for a table with 1GB of data and 3GB of bloat there is no good reason
> > to force page-crossing updates in the first 1GB of the table - all
> > tuples of the table will eventually reside there, so why would you
> > take a performance penalty and move the tuples from inside that range
> > to inside that same range?
>
> I'm thinking more of a case of:
>
> 
>
> UPDATE bigtable
> SET primary key = primary key
> WHERE ctid IN (
> SELECT ctid
> FROM bigtable
> ORDER BY ctid DESC
> LIMIT 10);

So what were you thinking of? A session GUC? A table option?

The benefit of a table option is that it is retained across sessions
and thus allows tables that get enough updates to eventually get to a
cleaner state. The main downside of such a table option is that it
requires a temporary table-level lock to update the parameter.

The benefit of a session GUC is that you can set it without impacting
other sessions, but the downside is that you need to do the
maintenance in that session, and risk that cascading updates to other
tables (e.g. through AFTER UPDATE triggers) are also impacted by this
non-local update GUC.

> > Something else to note: Indexes would suffer some (large?) amount of
> > bloat in this process, as you would be updating a lot of tuples
> > without the HOT optimization, thus increasing the work to be done by
> > VACUUM.
> > This may result in more bloat in indexes than what you get back from
> > shrinking the table.
>
> This could be the case, but I guess indexes are expendable to an
> extent, unlike tables.

I don't think that's accurate - index rebuilds are quite expensive.
But that's beside the point of this thread.

Somewhat related: did you consider using pg_repack instead of this
potential feature?

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Disabling Heap-Only Tuples

2023-07-05 Thread Thom Brown
On Wed, 5 Jul 2023 at 13:12, Matthias van de Meent
 wrote:
>
> On Wed, 5 Jul 2023 at 13:03, Thom Brown  wrote:
> >
> > On Wed, 5 Jul 2023 at 11:57, Matthias van de Meent
> >  wrote:
> > >
> > > On Wed, 5 Jul 2023 at 12:45, Thom Brown  wrote:
> > > > Heap-Only Tuple (HOT) updates are a significant performance
> > > > enhancement, as they prevent unnecessary page writes. However, HOT
> > > > comes with a caveat: it means that if we have lots of available space
> > > > earlier on in the relation, it can only be used for new tuples or in
> > > > cases where there's insufficient space on a page for an UPDATE to use
> > > > HOT.
> > > >
> > > > This mechanism limits our options for condensing tables, forcing us to
> > > > resort to methods like running VACUUM FULL/CLUSTER or using external
> > > > tools like pg_repack. These either require exclusive locks (which will
> > > > be a deal-breaker on large tables on a production system), or there's
> > > > risks involved. Of course we can always flood pages with new versions
> > > > of a row until it's forced onto an early page, but that shouldn't be
> > > > necessary.
> > > >
> > > > Considering these trade-offs, I'd like to propose an option to allow
> > > > superusers to disable HOT on tables. The intent is to trade some
> > > > performance benefits for the ability to reduce the size of a table
> > > > without the typical locking associated with it.
> > >
> > > Interesting use case, but I think that disabling HOT would be missing
> > > the forest for the trees. I think that a feature that disables
> > > block-local updates for pages > some offset would be a better solution
> > > to your issue: Normal updates also prefer the new tuple to be stored
> > > in the same pages as the old tuple if at all possible, so disabling
> > > HOT wouldn't solve the issue of tuples residing in the tail of your
> > > table - at least not while there is still empty space in those pages.
> >
> > Hmm... I see your point.  It's when an UPDATE isn't going to land on
> > the same page that it relocates to the earlier available page.  So I
> > guess I'm after whatever mechanism would allow that to happen reliably
> > and predictably.
> >
> > So $subject should really be "Allow forcing UPDATEs off the same page".
>
> You'd probably want to do that only for a certain range of the table -
> for a table with 1GB of data and 3GB of bloat there is no good reason
> to force page-crossing updates in the first 1GB of the table - all
> tuples of the table will eventually reside there, so why would you
> take a performance penalty and move the tuples from inside that range
> to inside that same range?

I'm thinking more of a case of:



UPDATE bigtable
SET primary key = primary key
WHERE ctid IN (
SELECT ctid
FROM bigtable
ORDER BY ctid DESC
LIMIT 10);
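
To make that concrete, a runnable form of the above, assuming the table
has a primary-key column named id (the column name is purely
illustrative):

    UPDATE bigtable
    SET id = id
    WHERE ctid IN (
        SELECT ctid
        FROM bigtable
        ORDER BY ctid DESC
        LIMIT 10);
    -- repeat in batches, then VACUUM so the emptied pages at the end
    -- of the table can be truncated

The catch, of course, is that today the new row versions tend to stay
on the same page whenever there is room, which is exactly the behaviour
I'd like to be able to switch off.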

> Something else to note: Indexes would suffer some (large?) amount of
> bloat in this process, as you would be updating a lot of tuples
> without the HOT optimization, thus increasing the work to be done by
> VACUUM.
> This may result in more bloat in indexes than what you get back from
> shrinking the table.

This could be the case, but I guess indexes are expendable to an
extent, unlike tables.

Thom




Re: Disabling Heap-Only Tuples

2023-07-05 Thread Matthias van de Meent
On Wed, 5 Jul 2023 at 13:03, Thom Brown  wrote:
>
> On Wed, 5 Jul 2023 at 11:57, Matthias van de Meent
>  wrote:
> >
> > On Wed, 5 Jul 2023 at 12:45, Thom Brown  wrote:
> > > Heap-Only Tuple (HOT) updates are a significant performance
> > > enhancement, as they prevent unnecessary page writes. However, HOT
> > > comes with a caveat: it means that if we have lots of available space
> > > earlier on in the relation, it can only be used for new tuples or in
> > > cases where there's insufficient space on a page for an UPDATE to use
> > > HOT.
> > >
> > > This mechanism limits our options for condensing tables, forcing us to
> > > resort to methods like running VACUUM FULL/CLUSTER or using external
> > > tools like pg_repack. These either require exclusive locks (which will
> > > be a deal-breaker on large tables on a production system), or there's
> > > risks involved. Of course we can always flood pages with new versions
> > > of a row until it's forced onto an early page, but that shouldn't be
> > > necessary.
> > >
> > > Considering these trade-offs, I'd like to propose an option to allow
> > > superusers to disable HOT on tables. The intent is to trade some
> > > performance benefits for the ability to reduce the size of a table
> > > without the typical locking associated with it.
> >
> > Interesting use case, but I think that disabling HOT would be missing
> > the forest for the trees. I think that a feature that disables
> > block-local updates for pages > some offset would be a better solution
> > to your issue: Normal updates also prefer the new tuple to be stored
> > in the same pages as the old tuple if at all possible, so disabling
> > HOT wouldn't solve the issue of tuples residing in the tail of your
> > table - at least not while there is still empty space in those pages.
>
> Hmm... I see your point.  It's when an UPDATE isn't going to land on
> the same page that it relocates to the earlier available page.  So I
> guess I'm after whatever mechanism would allow that to happen reliably
> and predictably.
>
> So $subject should really be "Allow forcing UPDATEs off the same page".

You'd probably want to do that only for a certain range of the table -
for a table with 1GB of data and 3GB of bloat there is no good reason
to force page-crossing updates in the first 1GB of the table - all
tuples of the table will eventually reside there, so why would you
take a performance penalty and move the tuples from inside that range
to inside that same range?
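
For example, pg_freespacemap already shows how much reusable space the
FSM knows about in any given part of the table; something like this
(using the first quarter of the table as an arbitrary cut-off) could
help decide where such a limit should sit:

    CREATE EXTENSION IF NOT EXISTS pg_freespacemap;

    SELECT sum(avail) AS free_bytes_in_first_quarter
    FROM pg_freespace('bigtable')
    WHERE blkno < (pg_relation_size('bigtable')
                   / current_setting('block_size')::bigint) / 4;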

Something else to note: Indexes would suffer some (large?) amount of
bloat in this process, as you would be updating a lot of tuples
without the HOT optimization, thus increasing the work to be done by
VACUUM.
This may result in more bloat in indexes than what you get back from
shrinking the table.
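
If you do try it, pgstattuple's pgstatindex() is one way to measure how
much the indexes suffered afterwards (the index name here is just an
example):

    CREATE EXTENSION IF NOT EXISTS pgstattuple;

    SELECT index_size, avg_leaf_density, leaf_fragmentation
    FROM pgstatindex('bigtable_pkey');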

Kind regards,

Matthias van de Meent
Neon (https://neon.tech/)




Re: Disabling Heap-Only Tuples

2023-07-05 Thread Thom Brown
On Wed, 5 Jul 2023 at 11:57, Matthias van de Meent
 wrote:
>
> On Wed, 5 Jul 2023 at 12:45, Thom Brown  wrote:
> > Heap-Only Tuple (HOT) updates are a significant performance
> > enhancement, as they prevent unnecessary page writes. However, HOT
> > comes with a caveat: it means that if we have lots of available space
> > earlier on in the relation, it can only be used for new tuples or in
> > cases where there's insufficient space on a page for an UPDATE to use
> > HOT.
> >
> > This mechanism limits our options for condensing tables, forcing us to
> > resort to methods like running VACUUM FULL/CLUSTER or using external
> > tools like pg_repack. These either require exclusive locks (which will
> > be a deal-breaker on large tables on a production system), or there's
> > risks involved. Of course we can always flood pages with new versions
> > of a row until it's forced onto an early page, but that shouldn't be
> > necessary.
> >
> > Considering these trade-offs, I'd like to propose an option to allow
> > superusers to disable HOT on tables. The intent is to trade some
> > performance benefits for the ability to reduce the size of a table
> > without the typical locking associated with it.
>
> Interesting use case, but I think that disabling HOT would be missing
> the forest for the trees. I think that a feature that disables
> block-local updates for pages > some offset would be a better solution
> to your issue: Normal updates also prefer the new tuple to be stored
> in the same pages as the old tuple if at all possible, so disabling
> HOT wouldn't solve the issue of tuples residing in the tail of your
> table - at least not while there is still empty space in those pages.

Hmm... I see your point.  It's only when an UPDATE isn't going to land
on the same page that the new tuple relocates to an earlier available
page.  So I guess I'm after whatever mechanism would allow that to
happen reliably and predictably.

So $subject should really be "Allow forcing UPDATEs off the same page".

Thom




Re: Disabling Heap-Only Tuples

2023-07-05 Thread Matthias van de Meent
On Wed, 5 Jul 2023 at 12:45, Thom Brown  wrote:
> Heap-Only Tuple (HOT) updates are a significant performance
> enhancement, as they prevent unnecessary page writes. However, HOT
> comes with a caveat: it means that if we have lots of available space
> earlier on in the relation, it can only be used for new tuples or in
> cases where there's insufficient space on a page for an UPDATE to use
> HOT.
>
> This mechanism limits our options for condensing tables, forcing us to
> resort to methods like running VACUUM FULL/CLUSTER or using external
> tools like pg_repack. These either require exclusive locks (which will
> be a deal-breaker on large tables on a production system), or there's
> risks involved. Of course we can always flood pages with new versions
> of a row until it's forced onto an early page, but that shouldn't be
> necessary.
>
> Considering these trade-offs, I'd like to propose an option to allow
> superusers to disable HOT on tables. The intent is to trade some
> performance benefits for the ability to reduce the size of a table
> without the typical locking associated with it.

Interesting use case, but I think that disabling HOT would be missing
the forest for the trees. I think that a feature that disables
block-local updates for pages > some offset would be a better solution
to your issue: Normal updates also prefer the new tuple to be stored
in the same pages as the old tuple if at all possible, so disabling
HOT wouldn't solve the issue of tuples residing in the tail of your
table - at least not while there is still empty space in those pages.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech/)




Disabling Heap-Only Tuples

2023-07-05 Thread Thom Brown
Hi,

Heap-Only Tuple (HOT) updates are a significant performance
enhancement, as they prevent unnecessary page writes. However, HOT
comes with a caveat: if we have lots of available space earlier on in
the relation, that space can only be used for new tuples, or in cases
where there's insufficient space on a page for an UPDATE to use HOT.

This mechanism limits our options for condensing tables, forcing us to
resort to methods like running VACUUM FULL/CLUSTER or using external
tools like pg_repack. These either require exclusive locks (which will
be a deal-breaker on large tables on a production system), or there are
risks involved. Of course we can always flood pages with new versions
of a row until it's forced onto an early page, but that shouldn't be
necessary.

Considering these trade-offs, I'd like to propose an option to allow
superusers to disable HOT on tables. The intent is to trade some
performance benefits for the ability to reduce the size of a table
without the typical locking associated with it.

This feature could be used to shrink tables in one of two ways:
temporarily disabling HOT until DML operations have compacted the data
into a smaller area, or performing a mass update on later rows to
relocate them to an earlier location, probably in stages. Of course,
this would need to be used in conjunction with a VACUUM operation.
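
Sketching the first of those workflows -- the storage parameter name is
invented here, as no such option exists:

    ALTER TABLE bigtable SET (disable_hot = on);
    -- let ordinary UPDATE traffic (or staged mass updates of the
    -- highest-ctid rows) migrate rows into free space earlier in the
    -- table
    VACUUM bigtable;  -- can then truncate the now-empty pages at the end
    ALTER TABLE bigtable SET (disable_hot = off);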

Admittedly this isn't ideal, and it would be better if we had an
operation that could do this (e.g. VACUUM COMPACT ), or an
option that causes some operations to avoid HOT when it detects an
amount of free space over a threshold, but in lieu of those, I thought
this would at least allow users to help themselves when running into
disk space issues.

Thoughts?

Thom