Re: Improve eviction algorithm in ReorderBuffer

2024-04-11 Thread Masahiko Sawada
On Thu, Apr 11, 2024 at 10:46 AM Hayato Kuroda (Fujitsu) wrote:
>
> Dear Heikki,
>
> I also prototyped the idea, and it has almost the same shape.
> I've attached it just in case, but you may not need to look at it.
>
> A few comments based on the experiment.

Thank you for reviewing the patch. I didn't include the following
suggestions, since for now I wanted to fix just the binaryheap part
while keeping the other parts as-is. If we need these changes, we can
do them as separate follow-up commits.

>
> ```
> +   /* txn_heap is ordered by transaction size */
> +   buffer->txn_heap = pairingheap_allocate(ReorderBufferTXNSizeCompare, NULL);
> ```
>
> I think the pairing heap should be in the same MemoryContext as the buffer.
> Can we add MemoryContextSwitchTo()?

pairingheap_allocate() allocates only a tiny amount of memory for the
pairingheap struct, and its memory usage doesn't grow when more data
is added. And since it's allocated in the logical decoding context,
its lifetime is also fine. So I'm not sure it's worth moving it into
rb->context just for better memory accountability.

>
> ```
> +   /* Update the max-heap */
> +   if (oldsize != 0)
> +   pairingheap_remove(rb->txn_heap, &txn->txn_node);
> +   pairingheap_add(rb->txn_heap, &txn->txn_node);
> ...
> +   /* Update the max-heap */
> +   pairingheap_remove(rb->txn_heap, &txn->txn_node);
> +   if (txn->size != 0)
> +   pairingheap_add(rb->txn_heap, &txn->txn_node);
> ```
>
> Since the number of stored transactions does not affect the insert
> operation, we may be able to add the node while creating the
> ReorderBufferTXN and remove it while cleaning the transaction up.
> This can reduce branches in ReorderBufferChangeMemoryUpdate().

I think that would also mean we need to remove the entry during
cleanup even when the transaction doesn't have any changes, which is
an O(log n) operation. I feel the current approach of not storing
transactions with size 0 in the heap is better, and I'm not sure that
reducing these branches really contributes to a performance
improvement.


Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Improve eviction algorithm in ReorderBuffer

2024-04-11 Thread Heikki Linnakangas

On 11/04/2024 11:20, Masahiko Sawada wrote:

On Thu, Apr 11, 2024 at 11:52 AM Masahiko Sawada  wrote:


We can see 2% ~ 3% performance regressions compared to the current
HEAD, but they are much smaller than I expected. Given that we can
make the code simpler, I think we can go in this direction.


Pushed the patch and reverted binaryheap changes.


Thank you!

--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Improve eviction algorithm in ReorderBuffer

2024-04-11 Thread Masahiko Sawada
On Thu, Apr 11, 2024 at 11:52 AM Masahiko Sawada  wrote:
>
> We can see 2% ~ 3% performance regressions compared to the current
> HEAD, but they are much smaller than I expected. Given that we can
> make the code simpler, I think we can go in this direction.

Pushed the patch and reverted binaryheap changes.


Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Improve eviction algorithm in ReorderBuffer

2024-04-10 Thread Masahiko Sawada
On Thu, Apr 11, 2024 at 10:32 AM Masahiko Sawada  wrote:
>
> Hi,
>
> Sorry for the late reply, I took two days off.
>
> On Thu, Apr 11, 2024 at 6:20 AM Heikki Linnakangas  wrote:
> >
> > On 10/04/2024 08:31, Amit Kapila wrote:
> > > On Wed, Apr 10, 2024 at 11:00 AM Heikki Linnakangas  
> > > wrote:
> > >>
> > >> On 10/04/2024 07:45, Michael Paquier wrote:
> > >>> On Tue, Apr 09, 2024 at 09:16:53PM -0700, Jeff Davis wrote:
> >  On Wed, 2024-04-10 at 12:13 +0900, Michael Paquier wrote:
> > > Wouldn't the best way forward be to revert
> > > 5bec1d6bc5e3 and revisit the whole in v18?
> > 
> >  Also consider commits b840508644 and bcb14f4abc.
> > >>>
> > >>> Indeed.  These are also linked.
> > >>
> > >> I don't feel the urge to revert this:
> > >>
> > >> - It's not broken as such, we're just discussing better ways to
> > >> implement it. We could also do nothing, and revisit this in v18. The
> > >> only must-fix issue is some compiler warnings IIUC.
> > >>
> > >> - It's a pretty localized change in reorderbuffer.c, so it's not in the
> > >> way of other patches or reverts. Nothing else depends on the binaryheap
> > >> changes yet either.
> > >>
> > >> - It seems straightforward to repeat the performance tests with whatever
> > >> alternative implementations we want to consider.
> > >>
> > >> My #1 choice would be to write a patch to switch the pairing heap,
> > >> performance test that, and revert the binary heap changes.
> > >>
> > >
> > > +1.
> >
> > To move this forward, here's a patch to switch to a pairing heap. In my
> > very quick testing, with the performance test cases posted earlier in
> > this thread [1] [2], I'm seeing no meaningful performance difference
> > between this and what's in master currently.
> >
> > Sawada-san, what do you think of this? To be sure, if you could also
> > repeat the performance tests you performed earlier, that'd be great. If
> > you agree with this direction, and you're happy with this patch, feel
> > free to take it from here and commit this, and also revert commits
> > b840508644 and bcb14f4abc.
> >
>
> Thank you for the patch!
>
> I agree with the direction that we replace binaryheap + index with the
> existing pairing heap and revert the changes for binaryheap. Regarding
> the patch, I'm not sure we can remove the MAX_HEAP_TXN_COUNT_THRESHOLD
> logic because otherwise we need to remove and add the txn node (i.e.
> O(log n)) for every memory update. I'm concerned it could cause some
> performance degradation in a case where there are not many
> transactions being decoded.
>
> I'll do performance tests, and share the results soon.
>

Here are some performance test results.

* test case 1 (many subtransactions)

test script:

create or replace function testfn (cnt int) returns void as $$
begin
  for i in 1..cnt loop
begin
  insert into test values (i);
exception when division_by_zero then
  raise notice 'caught error';
  return;
end;
  end loop;
end;
$$
language plpgsql;
select pg_create_logical_replication_slot('s', 'test_decoding');
select testfn(100);
set logical_decoding_work_mem to '4MB';
select from pg_logical_slot_peek_changes('s', null, null);

HEAD:

43128.266 ms
40116.313 ms
38790.666 ms

Patched:

43626.316 ms
44456.234 ms
39899.753 ms


* test case 2 (single big insertion)

test script:

create table test (c int);
select pg_create_logical_replication_slot('s', 'test_decoding');
insert into test select generate_series(1, 1000);
set logical_decoding_work_mem to '10GB'; -- avoid data spill
select from pg_logical_slot_peek_changes('s', null, null);

HEAD:

7996.476 ms
8034.022 ms
8005.583 ms

Patched:

8153.500 ms
8121.588 ms
8121.538 ms


* test case 3 (many small transactions)

test script:

pgbench -i -s 300
psql -c "select pg_create_logical_replication_slot('s', 'test_decoding')";
pgbench -t 10 -c 32
psql -c "set logical_decoding_work_mem to '10GB'; select count(*) from
pg_logical_slot_peek_changes('s', null, null)"

HEAD:

22586.343 ms
22507.905 ms
22504.133 ms

Patched:

23365.142 ms
23110.651 ms
23102.170 ms

We can see 2% ~ 3% performance regressions compared to the current
HEAD, but they are much smaller than I expected. Given that we can
make the code simpler, I think we can go in this direction.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




RE: Improve eviction algorithm in ReorderBuffer

2024-04-10 Thread Hayato Kuroda (Fujitsu)
Dear Heikki,

I also prototyped the idea, and it has almost the same shape.
I've attached it just in case, but you may not need to look at it.

A few comments based on the experiment.

```
+   /* txn_heap is ordered by transaction size */
+   buffer->txn_heap = pairingheap_allocate(ReorderBufferTXNSizeCompare, NULL);
```

I think the pairing heap should be in the same MemoryContext as the buffer.
Can we add MemoryContextSwitchTo()?
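
For reference, a hypothetical sketch of that suggestion (this is not what
was committed; the `buffer` naming follows the quoted patch, and the
surrounding allocation code is assumed):

```
	MemoryContext oldcontext;

	/*
	 * Allocate the heap in the ReorderBuffer's own memory context so it
	 * is accounted for and released together with the buffer.
	 */
	oldcontext = MemoryContextSwitchTo(buffer->context);
	buffer->txn_heap = pairingheap_allocate(ReorderBufferTXNSizeCompare, NULL);
	MemoryContextSwitchTo(oldcontext);
```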

```
+   /* Update the max-heap */
+   if (oldsize != 0)
+   pairingheap_remove(rb->txn_heap, >txn_node);
+   pairingheap_add(rb->txn_heap, >txn_node);
...
+   /* Update the max-heap */
+   pairingheap_remove(rb->txn_heap, >txn_node);
+   if (txn->size != 0)
+   pairingheap_add(rb->txn_heap, >txn_node);
```

Since the number of stored transactions does not affect the insert
operation, we may be able to add the node while creating the
ReorderBufferTXN and remove it while cleaning the transaction up.
This can reduce branches in ReorderBufferChangeMemoryUpdate().

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



Re: Improve eviction algorithm in ReorderBuffer

2024-04-10 Thread Masahiko Sawada
Hi,

Sorry for the late reply, I took two days off.

On Thu, Apr 11, 2024 at 6:20 AM Heikki Linnakangas  wrote:
>
> On 10/04/2024 08:31, Amit Kapila wrote:
> > On Wed, Apr 10, 2024 at 11:00 AM Heikki Linnakangas  wrote:
> >>
> >> On 10/04/2024 07:45, Michael Paquier wrote:
> >>> On Tue, Apr 09, 2024 at 09:16:53PM -0700, Jeff Davis wrote:
>  On Wed, 2024-04-10 at 12:13 +0900, Michael Paquier wrote:
> > Wouldn't the best way forward be to revert
> > 5bec1d6bc5e3 and revisit the whole in v18?
> 
>  Also consider commits b840508644 and bcb14f4abc.
> >>>
> >>> Indeed.  These are also linked.
> >>
> >> I don't feel the urge to revert this:
> >>
> >> - It's not broken as such, we're just discussing better ways to
> >> implement it. We could also do nothing, and revisit this in v18. The
> >> only must-fix issue is some compiler warnings IIUC.
> >>
> >> - It's a pretty localized change in reorderbuffer.c, so it's not in the
> >> way of other patches or reverts. Nothing else depends on the binaryheap
> >> changes yet either.
> >>
> >> - It seems straightforward to repeat the performance tests with whatever
> >> alternative implementations we want to consider.
> >>
> >> My #1 choice would be to write a patch to switch the pairing heap,
> >> performance test that, and revert the binary heap changes.
> >>
> >
> > +1.
>
> To move this forward, here's a patch to switch to a pairing heap. In my
> very quick testing, with the performance test cases posted earlier in
> this thread [1] [2], I'm seeing no meaningful performance difference
> between this and what's in master currently.
>
> Sawada-san, what do you think of this? To be sure, if you could also
> repeat the performance tests you performed earlier, that'd be great. If
> you agree with this direction, and you're happy with this patch, feel
> free to take it from here and commit this, and also revert commits
> b840508644 and bcb14f4abc.
>

Thank you for the patch!

I agree with the direction that we replace binaryheap + index with the
existing pairing heap and revert the changes for binaryheap. Regarding
the patch, I'm not sure we can remove the MAX_HEAP_TXN_COUNT_THRESHOLD
logic because otherwise we need to remove and add the txn node (i.e.
O(log n)) for every memory update. I'm concerned it could cause some
performance degradation in a case where there are not many
transactions being decoded.

I'll do performance tests, and share the results soon.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Improve eviction algorithm in ReorderBuffer

2024-04-10 Thread Heikki Linnakangas

On 11/04/2024 01:37, Michael Paquier wrote:

On Thu, Apr 11, 2024 at 12:20:55AM +0300, Heikki Linnakangas wrote:

To move this forward, here's a patch to switch to a pairing heap. In my very
quick testing, with the performance test cases posted earlier in this thread
[1] [2], I'm seeing no meaningful performance difference between this and
what's in master currently.


Reading through the patch, that's a nice cleanup.  It cuts quite some
code.

+++ b/src/include/replication/reorderbuffer.h
@@ -12,6 +12,7 @@
  #include "access/htup_details.h"
  #include "lib/binaryheap.h"
  #include "lib/ilist.h"
+#include "lib/pairingheap.h"

I'm slightly annoyed by the extra amount of information that gets
added to reorderbuffer.h for stuff that's only local to
reorderbuffer.c, but that's not something new in this area, so..


We can actually remove the "lib/binaryheap.h" in this patch; I missed 
that. There are no other uses of binaryheap in the file.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Improve eviction algorithm in ReorderBuffer

2024-04-10 Thread Michael Paquier
On Thu, Apr 11, 2024 at 12:20:55AM +0300, Heikki Linnakangas wrote:
> To move this forward, here's a patch to switch to a pairing heap. In my very
> quick testing, with the performance test cases posted earlier in this thread
> [1] [2], I'm seeing no meaningful performance difference between this and
> what's in master currently.

Reading through the patch, that's a nice cleanup.  It cuts quite some
code.

+++ b/src/include/replication/reorderbuffer.h
@@ -12,6 +12,7 @@
 #include "access/htup_details.h"
 #include "lib/binaryheap.h"
 #include "lib/ilist.h"
+#include "lib/pairingheap.h"

I'm slightly annoyed by the extra amount of information that gets
added to reorderbuffer.h for stuff that's only local to
reorderbuffer.c, but that's not something new in this area, so..
--
Michael




Re: Improve eviction algorithm in ReorderBuffer

2024-04-10 Thread Heikki Linnakangas

On 10/04/2024 08:31, Amit Kapila wrote:

On Wed, Apr 10, 2024 at 11:00 AM Heikki Linnakangas  wrote:


On 10/04/2024 07:45, Michael Paquier wrote:

On Tue, Apr 09, 2024 at 09:16:53PM -0700, Jeff Davis wrote:

On Wed, 2024-04-10 at 12:13 +0900, Michael Paquier wrote:

Wouldn't the best way forward be to revert
5bec1d6bc5e3 and revisit the whole in v18?


Also consider commits b840508644 and bcb14f4abc.


Indeed.  These are also linked.


I don't feel the urge to revert this:

- It's not broken as such, we're just discussing better ways to
implement it. We could also do nothing, and revisit this in v18. The
only must-fix issue is some compiler warnings IIUC.

- It's a pretty localized change in reorderbuffer.c, so it's not in the
way of other patches or reverts. Nothing else depends on the binaryheap
changes yet either.

- It seems straightforward to repeat the performance tests with whatever
alternative implementations we want to consider.

My #1 choice would be to write a patch to switch the pairing heap,
performance test that, and revert the binary heap changes.



+1.


To move this forward, here's a patch to switch to a pairing heap. In my 
very quick testing, with the performance test cases posted earlier in 
this thread [1] [2], I'm seeing no meaningful performance difference 
between this and what's in master currently.


Sawada-san, what do you think of this? To be sure, if you could also 
repeat the performance tests you performed earlier, that'd be great. If 
you agree with this direction, and you're happy with this patch, feel 
free to take it from here and commit this, and also revert commits 
b840508644 and bcb14f4abc.


--
Heikki Linnakangas
Neon (https://neon.tech)
From c883d03bd221341b0bbb315d376d9e125b84329a Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 10 Apr 2024 08:09:38 +0300
Subject: [PATCH 1/1] Replace binaryheap + index with pairingheap

A pairing heap can perform the same operations as the binary heap +
index, with as good or better algorithmic complexity, and that's an
existing data structure so that we don't need to invent anything new
compared to v16. This commit makes the new binaryheap functionality
that was added in commits b840508644 and bcb14f4abc unnecessary, but
they will be reverted separately.

Remove the optimization to only build and maintain the heap when the
amount of memory used is close to the limit, because the bookkeeping
overhead with the pairing heap seems to be small enough that it
doesn't matter in practice.
---
 .../replication/logical/reorderbuffer.c   | 186 +++---
 src/include/replication/reorderbuffer.h   |   8 +-
 2 files changed, 29 insertions(+), 165 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 5cf28d4df42..ab2ddabfadd 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -68,19 +68,9 @@
  *	  memory gets actually freed.
  *
  *	  We use a max-heap with transaction size as the key to efficiently find
- *	  the largest transaction. While the max-heap is empty, we don't update
- *	  the max-heap when updating the memory counter. Therefore, we can get
- *	  the largest transaction in O(N) time, where N is the number of
- *	  transactions including top-level transactions and subtransactions.
- *
- *	  We build the max-heap just before selecting the largest transactions
- *	  if the number of transactions being decoded is higher than the threshold,
- *	  MAX_HEAP_TXN_COUNT_THRESHOLD. After building the max-heap, we also
- *	  update the max-heap when updating the memory counter. The intention is
- *	  to efficiently find the largest transaction in O(1) time instead of
- *	  incurring the cost of memory counter updates (O(log N)). Once the number
- *	  of transactions got lower than the threshold, we reset the max-heap
- *	  (refer to ReorderBufferMaybeResetMaxHeap() for details).
+ *	  the largest transaction. We update the max-heap whenever the memory
+ *	  counter is updated; however transactions with size 0 are not stored in
+ *	  the heap, because they have no changes to evict.
  *
  *	  We still rely on max_changes_in_memory when loading serialized changes
  *	  back into memory. At that point we can't use the memory limit directly
@@ -122,23 +112,6 @@
 #include "utils/rel.h"
 #include "utils/relfilenumbermap.h"
 
-/*
- * Threshold of the total number of top-level and sub transactions that
- * controls whether we use the max-heap for tracking their sizes. Although
- * using the max-heap to select the largest transaction is effective when
- * there are many transactions being decoded, maintaining the max-heap while
- * updating the memory statistics can be costly. Therefore, we use
- * MaxConnections as the threshold so that we use the max-heap only when
- * using subtransactions.
- */
-#define MAX_HEAP_TXN_COUNT_THRESHOLD	MaxConnections
-
-/*
- * A macro to check if 

Re: Improve eviction algorithm in ReorderBuffer

2024-04-10 Thread Jeff Davis
On Wed, 2024-04-10 at 08:30 +0300, Heikki Linnakangas wrote:
> My #1 choice would be to write a patch to switch the pairing heap, 
> performance test that, and revert the binary heap changes.

Sounds good to me. I would expect it to perform better than the extra
hash table, if anything.

It also has the advantage that we don't change the API for binaryheap
in 17.

Regards,
Jeff Davis





Re: Improve eviction algorithm in ReorderBuffer

2024-04-09 Thread Amit Kapila
On Wed, Apr 10, 2024 at 11:00 AM Heikki Linnakangas  wrote:
>
> On 10/04/2024 07:45, Michael Paquier wrote:
> > On Tue, Apr 09, 2024 at 09:16:53PM -0700, Jeff Davis wrote:
> >> On Wed, 2024-04-10 at 12:13 +0900, Michael Paquier wrote:
> >>> Wouldn't the best way forward be to revert
> >>> 5bec1d6bc5e3 and revisit the whole in v18?
> >>
> >> Also consider commits b840508644 and bcb14f4abc.
> >
> > Indeed.  These are also linked.
>
> I don't feel the urge to revert this:
>
> - It's not broken as such, we're just discussing better ways to
> implement it. We could also do nothing, and revisit this in v18. The
> only must-fix issue is some compiler warnings IIUC.
>
> - It's a pretty localized change in reorderbuffer.c, so it's not in the
> way of other patches or reverts. Nothing else depends on the binaryheap
> changes yet either.
>
> - It seems straightforward to repeat the performance tests with whatever
> alternative implementations we want to consider.
>
> My #1 choice would be to write a patch to switch the pairing heap,
> performance test that, and revert the binary heap changes.
>

+1.

-- 
With Regards,
Amit Kapila.




Re: Improve eviction algorithm in ReorderBuffer

2024-04-09 Thread Heikki Linnakangas

On 10/04/2024 07:45, Michael Paquier wrote:

On Tue, Apr 09, 2024 at 09:16:53PM -0700, Jeff Davis wrote:

On Wed, 2024-04-10 at 12:13 +0900, Michael Paquier wrote:

Wouldn't the best way forward be to revert
5bec1d6bc5e3 and revisit the whole in v18?


Also consider commits b840508644 and bcb14f4abc.


Indeed.  These are also linked.


I don't feel the urge to revert this:

- It's not broken as such, we're just discussing better ways to 
implement it. We could also do nothing, and revisit this in v18. The 
only must-fix issue is some compiler warnings IIUC.


- It's a pretty localized change in reorderbuffer.c, so it's not in the 
way of other patches or reverts. Nothing else depends on the binaryheap 
changes yet either.


- It seems straightforward to repeat the performance tests with whatever 
alternative implementations we want to consider.


My #1 choice would be to write a patch to switch the pairing heap, 
performance test that, and revert the binary heap changes.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Improve eviction algorithm in ReorderBuffer

2024-04-09 Thread Michael Paquier
On Tue, Apr 09, 2024 at 09:16:53PM -0700, Jeff Davis wrote:
> On Wed, 2024-04-10 at 12:13 +0900, Michael Paquier wrote:
>> Wouldn't the best way forward be to revert
>> 5bec1d6bc5e3 and revisit the whole in v18?
> 
> Also consider commits b840508644 and bcb14f4abc.

Indeed.  These are also linked.
--
Michael




Re: Improve eviction algorithm in ReorderBuffer

2024-04-09 Thread Jeff Davis
On Wed, 2024-04-10 at 12:13 +0900, Michael Paquier wrote:

> Wouldn't the best way forward be to revert
> 5bec1d6bc5e3 and revisit the whole in v18?

That's a reasonable conclusion. Also consider commits b840508644 and
bcb14f4abc.

I had tried to come up with a narrower fix, and I think it's already
been implemented here in approach 2:

https://www.postgresql.org/message-id/CAD21AoAtf12e9Z9NLBuaO1GjHMMo16_8R-yBu9Q9jrk2QLqMEA%40mail.gmail.com

but it does feel wrong to introduce an unnecessary hash table in 17
when we know it's not the right solution.

Regards,
Jeff Davis






Re: Improve eviction algorithm in ReorderBuffer

2024-04-09 Thread Michael Paquier
On Tue, Apr 09, 2024 at 06:24:43PM -0700, Jeff Davis wrote:
> I had suggested that the heap could track the element indexes for
> efficient update/removal, but that would be a change to the
> binaryheap.h API, which would require some discussion (and possibly not
> be acceptable post-freeze).
> 
> But I think you're right: a pairing heap already solves the problem
> without modification. (Note: our pairing heap API doesn't explicitly
> support updating a key, so I think it would need to be done with
> remove/add.) So we might as well just do that right now rather than
> trying to fix the way the hash table is being used or trying to extend
> the binaryheap API.
> 
> Of course, we should measure to be sure there aren't bad cases around
> updating/removing a key, but I don't see a fundamental reason that it
> should be worse.

This is going to require a rewrite of 5bec1d6bc5e3 with a new
performance study, which strikes me as something that we'd better not
do after feature freeze.  Wouldn't the best way forward be to revert
5bec1d6bc5e3 and revisit the whole in v18?

I have added an open item, for now.
--
Michael




Re: Improve eviction algorithm in ReorderBuffer

2024-04-09 Thread Jeff Davis
On Tue, 2024-04-09 at 23:49 +0300, Heikki Linnakangas wrote:
> 
> I wonder why this doesn't use the existing pairing heap 
> implementation? I would assume that to be at least as good as the
> binary 
> heap + hash table

I agree that an additional hash table is not needed -- there's already
a hash table to do a lookup based on xid in reorderbuffer.c.

I had suggested that the heap could track the element indexes for
efficient update/removal, but that would be a change to the
binaryheap.h API, which would require some discussion (and possibly not
be acceptable post-freeze).

But I think you're right: a pairing heap already solves the problem
without modification. (Note: our pairing heap API doesn't explicitly
support updating a key, so I think it would need to be done with
remove/add.) So we might as well just do that right now rather than
trying to fix the way the hash table is being used or trying to extend
the binaryheap API.

Of course, we should measure to be sure there aren't bad cases around
updating/removing a key, but I don't see a fundamental reason that it
should be worse.

Regards,
Jeff Davis





Re: Improve eviction algorithm in ReorderBuffer

2024-04-09 Thread Heikki Linnakangas

On 09/04/2024 21:04, Jeff Davis wrote:

On Fri, 2024-04-05 at 16:58 +0900, Masahiko Sawada wrote:

I have some further comments and I believe changes are required for
v17.


I also noticed that the simplehash is declared in binaryheap.h with
"SH_SCOPE extern", which seems wrong. Attached is a rough patch to
bring the declarations into binaryheap.c.

Note that the attached patch uses "SH_SCOPE static", which makes sense
to me in this case, but causes a bunch of warnings in gcc. I will post
separately about silencing that warning, but for now you can either
use:

SH_SCOPE static inline

which is probably fine, but will encourage the compiler to inline more
code, when not all callers even use the hash table. Alternatively, you
can do:

SH_SCOPE static pg_attribute_unused()

which looks a bit wrong to me, but seems to silence the warnings, and
lets the compiler decide whether or not to inline.

Also probably needs comment updates, etc.


Sorry I'm late to the party, I didn't pay attention to this thread 
earlier. But I wonder why this doesn't use the existing pairing heap 
implementation? I would assume that to be at least as good as the binary 
heap + hash table. And it's cheap to insert into (O(1)), so we could 
probably remove the MAX_HEAP_TXN_COUNT_THRESHOLD, and always keep the 
heap up-to-date.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Improve eviction algorithm in ReorderBuffer

2024-04-09 Thread Jeff Davis
On Fri, 2024-04-05 at 16:58 +0900, Masahiko Sawada wrote:
> > I have some further comments and I believe changes are required for
> > v17.

I also noticed that the simplehash is declared in binaryheap.h with
"SH_SCOPE extern", which seems wrong. Attached is a rough patch to
bring the declarations into binaryheap.c.

Note that the attached patch uses "SH_SCOPE static", which makes sense
to me in this case, but causes a bunch of warnings in gcc. I will post
separately about silencing that warning, but for now you can either
use:

   SH_SCOPE static inline

which is probably fine, but will encourage the compiler to inline more
code, when not all callers even use the hash table. Alternatively, you
can do:

   SH_SCOPE static pg_attribute_unused()

which looks a bit wrong to me, but seems to silence the warnings, and
lets the compiler decide whether or not to inline.

Also probably needs comment updates, etc.

Regards,
Jeff Davis

From 08dcf21646e4ded22b10fd0ed536d3bbf6fc1328 Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Tue, 9 Apr 2024 10:45:00 -0700
Subject: [PATCH v1] binaryheap: move hash table out of header into
 binaryheap.c.

Commit b840508644 declared the internal hash table in the header with
scope "extern", which was unnecessary.
---
 src/common/binaryheap.c  | 15 ++-
 src/include/lib/binaryheap.h | 25 +
 2 files changed, 15 insertions(+), 25 deletions(-)

diff --git a/src/common/binaryheap.c b/src/common/binaryheap.c
index c20ed50acc..2501dad6f2 100644
--- a/src/common/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -25,6 +25,18 @@
 #include "common/hashfn.h"
 #include "lib/binaryheap.h"
 
+/*
+ * Struct for a hash table element to store the node's index in the bh_nodes
+ * array.
+ */
+typedef struct bh_nodeidx_entry
+{
+	bh_node_type key;
+	int			index;			/* entry's index within the node array */
+	char		status;			/* hash status */
+	uint32		hash;			/* hash values (cached) */
+} bh_nodeidx_entry;
+
 /*
  * Define parameters for hash table code generation. The interface is *also*
  * declared in binaryheap.h (to generate the types, which are externally
@@ -37,12 +49,13 @@
 #define SH_HASH_KEY(tb, key) \
	hash_bytes((const unsigned char *) &key, sizeof(bh_node_type))
 #define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(bh_node_type)) == 0)
-#define SH_SCOPE extern
+#define SH_SCOPE static
 #ifdef FRONTEND
 #define SH_RAW_ALLOCATOR pg_malloc0
 #endif
 #define SH_STORE_HASH
 #define SH_GET_HASH(tb, a) a->hash
+#define SH_DECLARE
 #define SH_DEFINE
 #include "lib/simplehash.h"
 
diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
index 4c1a1bb274..8b47132fc3 100644
--- a/src/include/lib/binaryheap.h
+++ b/src/include/lib/binaryheap.h
@@ -29,29 +29,6 @@ typedef Datum bh_node_type;
  */
 typedef int (*binaryheap_comparator) (bh_node_type a, bh_node_type b, void *arg);
 
-/*
- * Struct for a hash table element to store the node's index in the bh_nodes
- * array.
- */
-typedef struct bh_nodeidx_entry
-{
-	bh_node_type key;
-	int			index;			/* entry's index within the node array */
-	char		status;			/* hash status */
-	uint32		hash;			/* hash values (cached) */
-} bh_nodeidx_entry;
-
-/* Define parameters necessary to generate the hash table interface. */
-#define SH_PREFIX bh_nodeidx
-#define SH_ELEMENT_TYPE bh_nodeidx_entry
-#define SH_KEY_TYPE bh_node_type
-#define SH_SCOPE extern
-#ifdef FRONTEND
-#define SH_RAW_ALLOCATOR pg_malloc0
-#endif
-#define SH_DECLARE
-#include "lib/simplehash.h"
-
 /*
  * binaryheap
  *
@@ -76,7 +53,7 @@ typedef struct binaryheap
 	 * node's index in bh_nodes. This enables the caller to perform
 	 * binaryheap_remove_node_ptr(), binaryheap_update_up/down in O(log n).
 	 */
-	bh_nodeidx_hash *bh_nodeidx;
+	struct bh_nodeidx_hash *bh_nodeidx;
 } binaryheap;
 
 extern binaryheap *binaryheap_allocate(int num_nodes,
-- 
2.34.1



Re: Improve eviction algorithm in ReorderBuffer

2024-04-08 Thread Jeff Davis
On Mon, 2024-04-08 at 21:29 +0900, Masahiko Sawada wrote:
> For v17, changes for #2 are smaller, but I'm concerned
> that the new API that requires a hash function to be able to use
> binaryheap_update_{up|down} might not be user friendly.

The only API change in 02 is accepting a hash callback rather than a
boolean in binaryheap_allocate(), so I don't see that as worse than
what's there now. It also directly fixes my complaint (hashing the
pointer) and does nothing more, so I think it's the right fix for now.

I do think that the API can be better (templated like simplehash), but
I don't think right now is a great time to change it.

> When it comes to performance overhead, I mentioned that there is some
> regression in the current binaryheap even without indexing.

As far as I can tell, you are just adding a single branch in that path,
and I would expect it to be a predictable branch, right?

Thank you for testing, but small differences in a microbenchmark aren't
terribly worrying for me. If other call sites are that sensitive to
binaryheap performance, the right answer is to have a templated version
that would not only avoid this unnecessary branch, but also inline the
comparator (which probably matters more).

Regards,
Jeff Davis





Re: Improve eviction algorithm in ReorderBuffer

2024-04-08 Thread Masahiko Sawada
On Sat, Apr 6, 2024 at 5:44 AM Jeff Davis  wrote:
>
> >
> > It sounds like a data structure that mixes the hash table and the
> > binary heap and we use it as the main storage (e.g. for
> > ReorderBufferTXN) instead of using the binary heap as the secondary
> > data structure. IIUC with your idea, the indexed binary heap has a
> > hash table to store elements each of which has its index within the
> > heap node array. I guess it's better to create it as a new data
> > structure rather than extending the existing binaryheap, since APIs
> > could be very different. I might be missing something, though.
>
> You are right that this approach starts to feel like a new data
> structure and is not v17 material.
>
> I am interested in this for v18 though -- we could make the API more
> like simplehash to be more flexible when using values (rather than
> pointers) and to be able to inline the comparator.

Interesting project. It would be great if we could support increasing
and decreasing the key as APIs. The current
binaryheap_update_{up|down} APIs are not very user-friendly.

>
> > > * remove the hash table from binaryheap.c
> > >
> > > * supply a new callback to the binary heap with type like:
> > >
> > >   typedef void (*binaryheap_update_index)(
> > > bh_node_type node,
> > > int new_element_index);
> > >
> > > * make the remove, update_up, and update_down methods take the
> > > element
> > > index rather than the pointer
>
> ...
>
> > This shows that the current indexed binaryheap is much slower than
> > the
> > other two implementations, and the xx_binaryheap has a good number in
> > spite of also being indexed.
>
> xx_binaryheap isn't indexed though, right?

Well, yes. To be exact, xx_binaryheap isn't indexed, but the element
indexes are stored in the elements themselves (see the test_elem
struct), so the caller can still update the key using
xx_binaryheap_update_{up|down}.
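(To illustrate that scheme: a minimal sketch where the heap calls back into the caller whenever an element moves, so each element tracks its own position. The names here — test_elem, update_index_cb, xx_heap — are assumptions based on the thread, not the actual benchmark code.)

```c
#include <assert.h>

/* Element that remembers its own heap position via a callback. */
typedef struct test_elem
{
	int			key;
	int			index;			/* maintained by the callback below */
} test_elem;

typedef void (*update_index_cb) (void *elem, int new_index);

static void
elem_update_index(void *elem, int new_index)
{
	((test_elem *) elem)->index = new_index;
}

typedef struct xx_heap
{
	void	   *nodes[16];
	int			count;
	update_index_cb update_index;
} xx_heap;

/* Place elem at pos and notify the caller of its new position. */
static void
xx_set_node(xx_heap *h, void *elem, int pos)
{
	h->nodes[pos] = elem;
	h->update_index(elem, pos);
}

/* Insert into a max-heap keyed on test_elem.key. */
static void
xx_add(xx_heap *h, void *elem)
{
	int			pos = h->count++;

	xx_set_node(h, elem, pos);
	while (pos > 0)
	{
		int			parent = (pos - 1) / 2;
		void	   *tmp;

		if (((test_elem *) h->nodes[parent])->key >=
			((test_elem *) h->nodes[pos])->key)
			break;
		tmp = h->nodes[parent];
		xx_set_node(h, h->nodes[pos], parent);
		xx_set_node(h, tmp, pos);
		pos = parent;
	}
}
```

With the index stored in the element, remove and update_{up|down} can take the element index directly, with no hash lookup.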

>
> > When it comes to
> > implementing the above idea (i.e. changing binaryheap to
> > xx_binaryheap), it was simple since we just replace the code where we
> > update the hash table with the code where we call the callback, if we
> > get the consensus on API change.
>
> That seems reasonable to me.
>
> > The fact that we use simplehash for the internal hash table might
> > make
> > this idea complex. If I understand your suggestion correctly, the
> > caller needs to tell the hash table the hash function when creating a
> > binaryheap but the hash function needs to be specified at a compile
> > time. We can use a dynahash instead but it would make the binaryheap
> > slow further.
>
> simplehash.h supports private_data, which makes it easier to track a
> callback.
>
> In binaryheap.c, that would look something like:
>
>   static inline uint32
>   binaryheap_hash(bh_nodeidx_hash *tab, uint32 key)
>   {
>  binaryheap_hashfunc hashfunc = tab->private_data;
>  return hashfunc(key);
>   }
>
>   ...
>   #define SH_HASH_KEY(tb, key) binaryheap_hash(tb, key)
>   ...
>
>   binaryheap_allocate(int num_nodes, binaryheap_comparator compare,
>   void *arg, binaryheap_hashfunc hashfunc)
>   {
> ...
> if (hashfunc != NULL)
> {
>/* could have a new structure, but we only need to
> * store one pointer, so don't bother with palloc/pfree */
>void *private_data = (void *)hashfunc;
>heap->bh_nodeidx = bh_nodeidx_create(..., private_data);
>...
>
>
> And in reorderbuffer.c, define the callback like:
>
>   static uint32
>   reorderbuffer_xid_hash(TransactionId xid)
>   {
> /* fasthash32 is 'static inline' so may
>  * be faster than hash_bytes()? */
> return fasthash32(&xid, sizeof(TransactionId), 0);
>   }
>

Thanks, that's a good idea.

>
>
> In summary, there are two viable approaches for addressing the concerns
> in v17:
>
> 1. Provide callback to update ReorderBufferTXN->heap_element_index, and
> use that index (rather than the pointer) for updating the heap key
> (transaction size) or removing elements from the heap.
>
> 2. Provide callback for hashing, so that binaryheap.c can hash the xid
> value rather than the pointer.
>
> I don't have a strong opinion about which one to use. I prefer
> something closer to #1 for v18, but for v17 I suggest whichever one
> comes out simpler.

I've implemented prototypes of both ideas, and attached the draft patches.

I agree with you that something closer to #1 is for v18. Probably we
can implement the #1 idea while making binaryheap codes template like
simplehash.h. For v17, changes for #2 are smaller, but I'm concerned
that the new API that requires a hash function to be able to use
binaryheap_update_{up|down} might not be user friendly. In terms of
APIs, I prefer #1 idea. And changes for #1 can make the binaryheap
code simple, although it requires adding a variable in
ReorderBufferTXN instead. But overall, it can remove the hash table
and some functions so it looks better to me.

When it comes to performance overhead, I mentioned 

Re: Improve eviction algorithm in ReorderBuffer

2024-04-05 Thread Jeff Davis
On Fri, 2024-04-05 at 16:58 +0900, Masahiko Sawada wrote:
> IIUC for example in ReorderBuffer, the heap key is transaction size
> and the hash key is xid.

Yes.

> 
> I see your point. It assumes that the bh_node_type is a pointer or at
> least unique. So it cannot work with Datum being a value.

Right. One option might just be to add some comments explaining the API
and limitations, but in general I feel it's confusing to hash a pointer
without a good reason.

> 
> It sounds like a data structure that mixes the hash table and the
> binary heap and we use it as the main storage (e.g. for
> ReorderBufferTXN) instead of using the binary heap as the secondary
> data structure. IIUC with your idea, the indexed binary heap has a
> hash table to store elements each of which has its index within the
> heap node array. I guess it's better to create it as a new data
> structure rather than extending the existing binaryheap, since APIs
> could be very different. I might be missing something, though.

You are right that this approach starts to feel like a new data
structure and is not v17 material.

I am interested in this for v18 though -- we could make the API more
like simplehash to be more flexible when using values (rather than
pointers) and to be able to inline the comparator.

> > * remove the hash table from binaryheap.c
> > 
> > * supply a new callback to the binary heap with type like:
> > 
> >   typedef void (*binaryheap_update_index)(
> >     bh_node_type node,
> >     int new_element_index);
> > 
> > * make the remove, update_up, and update_down methods take the
> > element
> > index rather than the pointer

...

> This shows that the current indexed binaryheap is much slower than
> the
> other two implementations, and the xx_binaryheap has a good number in
> spite of also being indexed.

xx_binaryheap isn't indexed though, right?

> When it comes to
> implementing the above idea (i.e. changing binaryheap to
> xx_binaryheap), it was simple since we just replace the code where we
> update the hash table with the code where we call the callback, if we
> get the consensus on API change.

That seems reasonable to me.

> The fact that we use simplehash for the internal hash table might
> make
> this idea complex. If I understand your suggestion correctly, the
> caller needs to tell the hash table the hash function when creating a
> binaryheap but the hash function needs to be specified at a compile
> time. We can use a dynahash instead but it would make the binaryheap
> slow further.

simplehash.h supports private_data, which makes it easier to track a
callback.

In binaryheap.c, that would look something like:

  static inline uint32
  binaryheap_hash(bh_nodeidx_hash *tab, uint32 key)
  {
 binaryheap_hashfunc hashfunc = tab->private_data;
 return hashfunc(key);
  }

  ...
  #define SH_HASH_KEY(tb, key) binaryheap_hash(tb, key)
  ...

  binaryheap_allocate(int num_nodes, binaryheap_comparator compare,
  void *arg, binaryheap_hashfunc hashfunc)
  {
...
if (hashfunc != NULL)
{
   /* could have a new structure, but we only need to
* store one pointer, so don't bother with palloc/pfree */
   void *private_data = (void *)hashfunc;
   heap->bh_nodeidx = bh_nodeidx_create(..., private_data);
   ...


And in reorderbuffer.c, define the callback like:

  static uint32
  reorderbuffer_xid_hash(TransactionId xid)
  {
/* fasthash32 is 'static inline' so may
 * be faster than hash_bytes()? */
return fasthash32(&xid, sizeof(TransactionId), 0);
  }



In summary, there are two viable approaches for addressing the concerns
in v17:

1. Provide callback to update ReorderBufferTXN->heap_element_index, and
use that index (rather than the pointer) for updating the heap key
(transaction size) or removing elements from the heap.

2. Provide callback for hashing, so that binaryheap.c can hash the xid
value rather than the pointer.

I don't have a strong opinion about which one to use. I prefer
something closer to #1 for v18, but for v17 I suggest whichever one
comes out simpler.

Regards,
Jeff Davis





Re: Improve eviction algorithm in ReorderBuffer

2024-04-05 Thread Masahiko Sawada
On Fri, Apr 5, 2024 at 2:55 AM Jeff Davis  wrote:
>
> On Thu, 2024-04-04 at 17:28 +0900, Masahiko Sawada wrote:
> > > Perhaps it's not worth the effort though, if performance is already
> > > good enough?
> >
> > Yeah, it would be better to measure the overhead first. I'll do that.
>
> I have some further comments and I believe changes are required for
> v17.
>
> An indexed binary heap API requires both a comparator and a hash
> function to be specified, and has two different kinds of keys: the heap
> key (mutable) and the hash key (immutable). It provides heap methods
> and hashtable methods, and keeps the two internal structures (heap and
> HT) in sync.

IIUC for example in ReorderBuffer, the heap key is transaction size
and the hash key is xid.

>
> The implementation in b840508644 uses the bh_node_type as the hash key,
> which is just a Datum, and it just hashes the bytes. I believe the
> implicit assumption is that the Datum is a pointer -- I'm not sure how
> one would use that API if the Datum were a value. Hashing a pointer
> seems strange to me and, while I see why you did it that way, I think
> it reflects that the API boundaries are not quite right.

I see your point. It assumes that the bh_node_type is a pointer or at
least unique. So it cannot work with Datum being a value.

>
> One consequence of using the pointer as the hash key is that you need
> to find the pointer first: you can't change or remove elements based on
> the transaction ID, you have to get the ReorderBufferTXN pointer by
> finding it in another structure, first. Currently, that's being done by
> searching ReorderBuffer->by_txn. So we actually have two hash tables
> for essentially the same purpose: one with xid as the key, and the
> other with the pointer as the key. That makes no sense -- let's have a
> proper indexed binary heap to look things up by xid (the internal HT)
> or by transaction size (using the internal heap).
>
> I suggest:
>
>   * Make a proper indexed binary heap API that accepts a hash function
> and provides both heap methods and HT methods that operate based on
> values (transaction size and transaction ID, respectively).
>   * Get rid of ReorderBuffer->by_txn and use the indexed binary heap
> instead.
>
> This will be a net simplification in reorderbuffer.c, which is good,
> because that file makes use of a *lot* of data structures.
>

It sounds like a data structure that mixes the hash table and the
binary heap and we use it as the main storage (e.g. for
ReorderBufferTXN) instead of using the binary heap as the secondary
data structure. IIUC with your idea, the indexed binary heap has a
hash table to store elements each of which has its index within the
heap node array. I guess it's better to create it as a new data
structure rather than extending the existing binaryheap, since APIs
could be very different. I might be missing something, though.

On Fri, Apr 5, 2024 at 3:55 AM Jeff Davis  wrote:
>
> On Thu, 2024-04-04 at 10:55 -0700, Jeff Davis wrote:
> >   * Make a proper indexed binary heap API that accepts a hash
> > function
> > and provides both heap methods and HT methods that operate based on
> > values (transaction size and transaction ID, respectively).
> >   * Get rid of ReorderBuffer->by_txn and use the indexed binary heap
> > instead.
>
> An alternative idea:
>
> * remove the hash table from binaryheap.c
>
> * supply a new callback to the binary heap with type like:
>
>   typedef void (*binaryheap_update_index)(
> bh_node_type node,
> int new_element_index);
>
> * make the remove, update_up, and update_down methods take the element
> index rather than the pointer
>
> reorderbuffer.c would then do something like:
>
>   void
>   txn_update_heap_index(ReorderBufferTXN *txn, int new_element_index)
>   {
>  txn->heap_element_index = new_element_index;
>   }
>
>   ...
>
>   txn_heap = binaryheap_allocate(..., txn_update_heap_index, ...);
>
> and then binaryheap.c would effectively maintain
> txn->heap_element_index, so reorderbuffer.c can pass that to the APIs
> that require the element index.

Thank you for the idea. I was thinking the same idea when considering
your previous comment. With this idea, we still use the binaryheap for
ReorderBuffer as the second data structure. Since we can implement
this idea with relatively small changes to the current binaryheap,
I've implemented it and measured performances.

I've attached a patch that adds an extension for benchmarking
binaryheap implementations. binaryheap_bench.c is the main test
module. To make the comparison between different binaryheap
implementations, the extension includes two different binaryheap
implementations. Therefore, binaryheap_bench.c uses three different
binaryheap implementation in total as the comment on top of the file
says:

/*
 * This benchmark tool uses three binary heap implementations.
 *
 * "binaryheap" is the current binaryheap implementation in PostgreSQL. That
 * is, it internally has a hash table to track 

Re: Improve eviction algorithm in ReorderBuffer

2024-04-04 Thread Jeff Davis
On Thu, 2024-04-04 at 10:55 -0700, Jeff Davis wrote:
>   * Make a proper indexed binary heap API that accepts a hash
> function
> and provides both heap methods and HT methods that operate based on
> values (transaction size and transaction ID, respectively).
>   * Get rid of ReorderBuffer->by_txn and use the indexed binary heap
> instead.

An alternative idea:

* remove the hash table from binaryheap.c

* supply a new callback to the binary heap with type like:

  typedef void (*binaryheap_update_index)(
bh_node_type node,
int new_element_index);

* make the remove, update_up, and update_down methods take the element
index rather than the pointer

reorderbuffer.c would then do something like:

  void
  txn_update_heap_index(ReorderBufferTXN *txn, int new_element_index)
  {
 txn->heap_element_index = new_element_index;
  }

  ...

  txn_heap = binaryheap_allocate(..., txn_update_heap_index, ...);

and then binaryheap.c would effectively maintain
txn->heap_element_index, so reorderbuffer.c can pass that to the APIs
that require the element index.


Another alternative is to keep the hash table in binaryheap.c, and
supply a hash function that hashes the xid. That leaves us with two
hash tables still, but it would be cleaner than hashing the pointer.
That might be best for right now, and we can consider these other ideas
later.

Regards,
Jeff Davis





Re: Improve eviction algorithm in ReorderBuffer

2024-04-04 Thread Jeff Davis
On Thu, 2024-04-04 at 17:28 +0900, Masahiko Sawada wrote:
> > Perhaps it's not worth the effort though, if performance is already
> > good enough?
> 
> Yeah, it would be better to measure the overhead first. I'll do that.

I have some further comments and I believe changes are required for
v17.

An indexed binary heap API requires both a comparator and a hash
function to be specified, and has two different kinds of keys: the heap
key (mutable) and the hash key (immutable). It provides heap methods
and hashtable methods, and keeps the two internal structures (heap and
HT) in sync.

The implementation in b840508644 uses the bh_node_type as the hash key,
which is just a Datum, and it just hashes the bytes. I believe the
implicit assumption is that the Datum is a pointer -- I'm not sure how
one would use that API if the Datum were a value. Hashing a pointer
seems strange to me and, while I see why you did it that way, I think
it reflects that the API boundaries are not quite right.

One consequence of using the pointer as the hash key is that you need
to find the pointer first: you can't change or remove elements based on
the transaction ID, you have to get the ReorderBufferTXN pointer by
finding it in another structure, first. Currently, that's being done by
searching ReorderBuffer->by_txn. So we actually have two hash tables
for essentially the same purpose: one with xid as the key, and the
other with the pointer as the key. That makes no sense -- let's have a
proper indexed binary heap to look things up by xid (the internal HT)
or by transaction size (using the internal heap).

I suggest:

  * Make a proper indexed binary heap API that accepts a hash function
and provides both heap methods and HT methods that operate based on
values (transaction size and transaction ID, respectively).
  * Get rid of ReorderBuffer->by_txn and use the indexed binary heap
instead.

This will be a net simplification in reorderbuffer.c, which is good,
because that file makes use of a *lot* of data structures.

Regards,
Jeff Davis





Re: Improve eviction algorithm in ReorderBuffer

2024-04-04 Thread Masahiko Sawada
On Thu, Apr 4, 2024 at 1:54 PM Jeff Davis  wrote:
>
> On Thu, 2024-04-04 at 09:31 +0900, Masahiko Sawada wrote:
> > IIUC, with your suggestion, sift_{up|down} needs to update the
> > heap_index field as well. Does it mean that the caller needs to pass
> > the address of heap_index down to sift_{up|down}?
>
> I'm not sure quite how binaryheap should be changed. Bringing the heap
> implementation into reorderbuffer.c would obviously work, but that
> would be more code.

Right.

>  Another option might be to make the API of
> binaryheap look a little more like simplehash, where some #defines
> control optional behavior and can tell the implementation where to find
> fields in the structure.

Interesting idea.

>
> Perhaps it's not worth the effort though, if performance is already
> good enough?

Yeah, it would be better to measure the overhead first. I'll do that.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Improve eviction algorithm in ReorderBuffer

2024-04-03 Thread Jeff Davis
On Thu, 2024-04-04 at 09:31 +0900, Masahiko Sawada wrote:
> IIUC, with your suggestion, sift_{up|down} needs to update the
> heap_index field as well. Does it mean that the caller needs to pass
> the address of heap_index down to sift_{up|down}?

I'm not sure quite how binaryheap should be changed. Bringing the heap
implementation into reorderbuffer.c would obviously work, but that
would be more code. Another option might be to make the API of
binaryheap look a little more like simplehash, where some #defines
control optional behavior and can tell the implementation where to find
fields in the structure.

Perhaps it's not worth the effort though, if performance is already
good enough?

Regards,
Jeff Davis





Re: Improve eviction algorithm in ReorderBuffer

2024-04-03 Thread Masahiko Sawada
Hi,

On Thu, Apr 4, 2024 at 2:32 AM Jeff Davis  wrote:
>
> On Wed, 2024-04-03 at 01:45 -0700, Jeff Davis wrote:
> > I suggest that you add a "heap_index" field to ReorderBufferTXN that
> > would point to the index into the heap's array (the same as
> > bh_nodeidx_entry.index in your patch). Each time an element moves
> > within the heap array, just follow the pointer to the
> > ReorderBufferTXN
> > object and update the heap_index -- no hash lookup required.
>
> It looks like my email was slightly too late, as the work was already
> committed.

Thank you for the suggestions! I should have informed it earlier.

>
> My suggestion is not required for 17, and so it's fine if this waits
> until the next CF. If it turns out to be a win we can consider
> backporting to 17 just to keep the code consistent, otherwise it can go
> in 18.

IIUC, with your suggestion, sift_{up|down} needs to update the
heap_index field as well. Does it mean that the caller needs to pass
the address of heap_index down to sift_{up|down}?

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Improve eviction algorithm in ReorderBuffer

2024-04-03 Thread Jeff Davis
On Wed, 2024-04-03 at 01:45 -0700, Jeff Davis wrote:
> I suggest that you add a "heap_index" field to ReorderBufferTXN that
> would point to the index into the heap's array (the same as
> bh_nodeidx_entry.index in your patch). Each time an element moves
> within the heap array, just follow the pointer to the
> ReorderBufferTXN
> object and update the heap_index -- no hash lookup required.

It looks like my email was slightly too late, as the work was already
committed.

My suggestion is not required for 17, and so it's fine if this waits
until the next CF. If it turns out to be a win we can consider
backporting to 17 just to keep the code consistent, otherwise it can go
in 18.

Regards,
Jeff Davis





Re: Improve eviction algorithm in ReorderBuffer

2024-04-03 Thread Jeff Davis
On Mon, 2024-04-01 at 12:42 +0900, Masahiko Sawada wrote:
> While reviewing the patches, I realized the comment of
> binearyheap_allocate() should also be updated. So I've attached the
> new patches.

In sift_{up|down}, each loop iteration calls set_node(), and each call
to set_node does a hash lookup. I didn't measure it, but that feels
wasteful.

I don't even think you really need the hash table. The key to the hash
table is a pointer, so it's not really doing anything that couldn't be
done more efficiently by just following the pointer.

I suggest that you add a "heap_index" field to ReorderBufferTXN that
would point to the index into the heap's array (the same as
bh_nodeidx_entry.index in your patch). Each time an element moves
within the heap array, just follow the pointer to the ReorderBufferTXN
object and update the heap_index -- no hash lookup required.

That's not easy to do with the current binaryheap API. But a binary
heap is not a terribly complex structure, so you can just do an inline
implementation of it where sift_{up|down} know to update the heap_index
field of the ReorderBufferTXN.
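(A minimal standalone sketch of that suggestion: set_node() writes the element's current position into the element itself, so every move during sift_{up|down} is a direct pointer-follow rather than a hash lookup. HeapElem and its fields are illustrative stand-ins for ReorderBufferTXN, not actual PostgreSQL code.)

```c
#include <assert.h>

/* Illustrative element; stands in for ReorderBufferTXN. */
typedef struct HeapElem
{
	int			size;			/* heap key, e.g. transaction size */
	int			heap_index;		/* maintained by the heap itself */
} HeapElem;

typedef struct MaxHeap
{
	HeapElem  **nodes;
	int			count;
} MaxHeap;

/* Place elem at pos and record the position in the element itself. */
static void
set_node(MaxHeap *heap, HeapElem *elem, int pos)
{
	heap->nodes[pos] = elem;
	elem->heap_index = pos;		/* no hash lookup required */
}

static void
sift_up(MaxHeap *heap, int pos)
{
	HeapElem   *elem = heap->nodes[pos];

	while (pos > 0)
	{
		int			parent = (pos - 1) / 2;

		if (heap->nodes[parent]->size >= elem->size)
			break;
		set_node(heap, heap->nodes[parent], pos);
		pos = parent;
	}
	set_node(heap, elem, pos);
}

static void
heap_add(MaxHeap *heap, HeapElem *elem)
{
	set_node(heap, elem, heap->count++);
	sift_up(heap, heap->count - 1);
}
```

Since heap_index is always current, the caller can remove or re-key an element in O(log n) starting directly from elem->heap_index.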

Regards,
Jeff Davis





Re: Improve eviction algorithm in ReorderBuffer

2024-03-31 Thread Masahiko Sawada
On Mon, Apr 1, 2024 at 11:26 AM Masahiko Sawada  wrote:
>
> On Fri, Mar 29, 2024 at 7:37 PM Amit Kapila  wrote:
> >
> > On Fri, Mar 29, 2024 at 12:13 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Fri, Mar 29, 2024 at 2:09 PM vignesh C  wrote:
> > > >
> > > > On Tue, 26 Mar 2024 at 10:05, Masahiko Sawada  
> > > > wrote:
> > > > >
> > > > > On Thu, Mar 14, 2024 at 12:02 PM Masahiko Sawada 
> > > > >  wrote:
> > > > > >
> > > > > >
> > > > > > I've attached new version patches.
> > > > >
> > > > > Since the previous patch conflicts with the current HEAD, I've
> > > > > attached the rebased patches.
> > > >
> > > > Thanks for the updated patch.
> > > > One comment:
> > > > I felt we can mention the improvement where we update memory
> > > > accounting info at transaction level instead of per change level which
> > > > is done in ReorderBufferCleanupTXN, ReorderBufferTruncateTXN, and
> > > > ReorderBufferSerializeTXN also in the commit message:
> > >
> > > Agreed.
> > >
> > > I think the patch is in good shape. I'll push the patch with the
> > > suggestion next week, barring any objections.
> > >
> >
> > Few minor comments:
> > 1.
> > @@ -3636,6 +3801,8 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
> >   Assert(txn->nentries_mem == 0);
> >   }
> >
> > + ReorderBufferMaybeResetMaxHeap(rb);
> > +
> >
> > Can we write a comment about why this reset is required here?
> > Otherwise, the reason is not apparent.
>
> Yes, added.
>
> >
> > 2.
> > Although using max-heap to select the largest
> > + * transaction is effective when there are many transactions being decoded,
> > + * there is generally no need to use it as long as all transactions being
> > + * decoded are top-level transactions. Therefore, we use MaxConnections as 
> > the
> > + * threshold so we can prevent switching to the state unless we use
> > + * subtransactions.
> > + */
> > +#define MAX_HEAP_TXN_COUNT_THRESHOLD MaxConnections
> >
> > Isn't using max-heap equally effective in finding the largest
> > transaction whether there are top-level or top-level plus
> > subtransactions? This comment indicates it is only effective when
> > there are subtransactions.
>
> You're right. Updated the comment.
>
> I've attached the updated patches.
>

While reviewing the patches, I realized the comment of
binaryheap_allocate() should also be updated. So I've attached the
new patches.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v12-0001-Make-binaryheap-enlargeable.patch
Description: Binary data


v12-0002-Add-functions-to-binaryheap-for-efficient-key-re.patch
Description: Binary data


v12-0003-Improve-eviction-algorithm-in-Reorderbuffer-usin.patch
Description: Binary data


Re: Improve eviction algorithm in ReorderBuffer

2024-03-31 Thread Masahiko Sawada
On Fri, Mar 29, 2024 at 8:48 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Sawada-san,
>
> > Agreed.
> >
> > I think the patch is in good shape. I'll push the patch with the
> > suggestion next week, barring any objections.
>
> Thanks for working on this. Agreed it is committable.
> Few minor comments:

Thank you for the comments!

>
> ```
> + * Either txn or change must be non-NULL at least. We update the memory
> + * counter of txn if it's non-NULL, otherwise change->txn.
> ```
>
> IIUC no one checks the restriction. Should we add Assert() for it, e.g.:
> Assert(txn || change)?

Agreed to add it.

>
> ```
> +/* make sure enough space for a new node */
> ...
> +/* make sure enough space for a new node */
> ```
>
> Should be started with upper case?

I don't think we need to change it. There are other comments in the
same file that are one line and start with lowercase.

I've just submitted the updated patches[1]

Regards,

[1] 
https://www.postgresql.org/message-id/CAD21AoA6%3D%2BtL%3DbtB_s9N%2BcZK7tKz1W%3DPQyNq72nzjUcdyE%2BwZw%40mail.gmail.com

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Improve eviction algorithm in ReorderBuffer

2024-03-31 Thread Masahiko Sawada
On Fri, Mar 29, 2024 at 7:37 PM Amit Kapila  wrote:
>
> On Fri, Mar 29, 2024 at 12:13 PM Masahiko Sawada  
> wrote:
> >
> > On Fri, Mar 29, 2024 at 2:09 PM vignesh C  wrote:
> > >
> > > On Tue, 26 Mar 2024 at 10:05, Masahiko Sawada  
> > > wrote:
> > > >
> > > > On Thu, Mar 14, 2024 at 12:02 PM Masahiko Sawada 
> > > >  wrote:
> > > > >
> > > > >
> > > > > I've attached new version patches.
> > > >
> > > > Since the previous patch conflicts with the current HEAD, I've
> > > > attached the rebased patches.
> > >
> > > Thanks for the updated patch.
> > > One comment:
> > > I felt we can mention the improvement where we update memory
> > > accounting info at transaction level instead of per change level which
> > > is done in ReorderBufferCleanupTXN, ReorderBufferTruncateTXN, and
> > > ReorderBufferSerializeTXN also in the commit message:
> >
> > Agreed.
> >
> > I think the patch is in good shape. I'll push the patch with the
> > suggestion next week, barring any objections.
> >
>
> Few minor comments:
> 1.
> @@ -3636,6 +3801,8 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
>   Assert(txn->nentries_mem == 0);
>   }
>
> + ReorderBufferMaybeResetMaxHeap(rb);
> +
>
> Can we write a comment about why this reset is required here?
> Otherwise, the reason is not apparent.

Yes, added.

>
> 2.
> Although using max-heap to select the largest
> + * transaction is effective when there are many transactions being decoded,
> + * there is generally no need to use it as long as all transactions being
> + * decoded are top-level transactions. Therefore, we use MaxConnections as 
> the
> + * threshold so we can prevent switching to the state unless we use
> + * subtransactions.
> + */
> +#define MAX_HEAP_TXN_COUNT_THRESHOLD MaxConnections
>
> Isn't using max-heap equally effective in finding the largest
> transaction whether there are top-level or top-level plus
> subtransactions? This comment indicates it is only effective when
> there are subtransactions.

You're right. Updated the comment.

I've attached the updated patches.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v11-0002-Add-functions-to-binaryheap-for-efficient-key-re.patch
Description: Binary data


v11-0001-Make-binaryheap-enlargeable.patch
Description: Binary data


v11-0003-Improve-eviction-algorithm-in-Reorderbuffer-usin.patch
Description: Binary data


RE: Improve eviction algorithm in ReorderBuffer

2024-03-29 Thread Hayato Kuroda (Fujitsu)
Dear Sawada-san,

> Agreed.
> 
> I think the patch is in good shape. I'll push the patch with the
> suggestion next week, barring any objections.

Thanks for working on this. Agreed it is committable.
Few minor comments:

```
+ * Either txn or change must be non-NULL at least. We update the memory
+ * counter of txn if it's non-NULL, otherwise change->txn.
```

IIUC no one checks the restriction. Should we add Assert() for it, e.g.:
Assert(txn || change)? 

```
+/* make sure enough space for a new node */
...
+/* make sure enough space for a new node */
```

Should be started with upper case?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



Re: Improve eviction algorithm in ReorderBuffer

2024-03-29 Thread Amit Kapila
On Fri, Mar 29, 2024 at 12:13 PM Masahiko Sawada  wrote:
>
> On Fri, Mar 29, 2024 at 2:09 PM vignesh C  wrote:
> >
> > On Tue, 26 Mar 2024 at 10:05, Masahiko Sawada  wrote:
> > >
> > > On Thu, Mar 14, 2024 at 12:02 PM Masahiko Sawada  
> > > wrote:
> > > >
> > > >
> > > > I've attached new version patches.
> > >
> > > Since the previous patch conflicts with the current HEAD, I've
> > > attached the rebased patches.
> >
> > Thanks for the updated patch.
> > One comment:
> > I felt we can mention the improvement where we update memory
> > accounting info at transaction level instead of per change level which
> > is done in ReorderBufferCleanupTXN, ReorderBufferTruncateTXN, and
> > ReorderBufferSerializeTXN also in the commit message:
>
> Agreed.
>
> I think the patch is in good shape. I'll push the patch with the
> suggestion next week, barring any objections.
>

Few minor comments:
1.
@@ -3636,6 +3801,8 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
  Assert(txn->nentries_mem == 0);
  }

+ ReorderBufferMaybeResetMaxHeap(rb);
+

Can we write a comment about why this reset is required here?
Otherwise, the reason is not apparent.

2.
Although using max-heap to select the largest
+ * transaction is effective when there are many transactions being decoded,
+ * there is generally no need to use it as long as all transactions being
+ * decoded are top-level transactions. Therefore, we use MaxConnections as the
+ * threshold so we can prevent switching to the state unless we use
+ * subtransactions.
+ */
+#define MAX_HEAP_TXN_COUNT_THRESHOLD MaxConnections

Isn't using max-heap equally effective in finding the largest
transaction whether there are top-level or top-level plus
subtransactions? This comment indicates it is only effective when
there are subtransactions.

-- 
With Regards,
Amit Kapila.




Re: Improve eviction algorithm in ReorderBuffer

2024-03-29 Thread Masahiko Sawada
On Fri, Mar 29, 2024 at 2:09 PM vignesh C  wrote:
>
> On Tue, 26 Mar 2024 at 10:05, Masahiko Sawada  wrote:
> >
> > On Thu, Mar 14, 2024 at 12:02 PM Masahiko Sawada  
> > wrote:
> > >
> > >
> > > I've attached new version patches.
> >
> > Since the previous patch conflicts with the current HEAD, I've
> > attached the rebased patches.
>
> Thanks for the updated patch.
> One comment:
> I felt we can mention the improvement where we update memory
> accounting info at transaction level instead of per change level which
> is done in ReorderBufferCleanupTXN, ReorderBufferTruncateTXN, and
> ReorderBufferSerializeTXN also in the commit message:

Agreed.

I think the patch is in good shape. I'll push the patch with the
suggestion next week, barring any objections.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Improve eviction algorithm in ReorderBuffer

2024-03-28 Thread vignesh C
On Tue, 26 Mar 2024 at 10:05, Masahiko Sawada  wrote:
>
> On Thu, Mar 14, 2024 at 12:02 PM Masahiko Sawada  
> wrote:
> >
> >
> > I've attached new version patches.
>
> Since the previous patch conflicts with the current HEAD, I've
> attached the rebased patches.

Thanks for the updated patch.
One comment:
I felt we can mention the improvement where we update memory
accounting info at transaction level instead of per change level which
is done in ReorderBufferCleanupTXN, ReorderBufferTruncateTXN, and
ReorderBufferSerializeTXN also in the commit message:
@@ -1527,7 +1573,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn)
/* Check we're not mixing changes from different
transactions. */
Assert(change->txn == txn);

-   ReorderBufferReturnChange(rb, change, true);
+   ReorderBufferReturnChange(rb, change, false);
}

/*
@@ -1586,8 +1632,13 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn)
if (rbtxn_is_serialized(txn))
ReorderBufferRestoreCleanup(rb, txn);

+   /* Update the memory counter */
+   ReorderBufferChangeMemoryUpdate(rb, NULL, txn, false, txn->size);

Regards,
Vignesh




Re: Improve eviction algorithm in ReorderBuffer

2024-03-27 Thread Shubham Khanna
On Tue, Mar 5, 2024 at 8:50 AM vignesh C  wrote:
>
> On Wed, 28 Feb 2024 at 11:40, Amit Kapila  wrote:
> >
> > On Mon, Feb 26, 2024 at 7:54 PM Masahiko Sawada  
> > wrote:
> > >
> >
> > A few comments on 0003:
> > ===
> > 1.
> > +/*
> > + * Threshold of the total number of top-level and sub transactions
> > that controls
> > + * whether we switch the memory track state. While the MAINTAIN_HEAP state 
> > is
> > + * effective when there are many transactions being decoded, in many 
> > systems
> > + * there is generally no need to use it as long as all transactions
> > being decoded
> > + * are top-level transactions. Therefore, we use MaxConnections as
> > the threshold
> > + * so we can prevent switch to the state unless we use subtransactions.
> > + */
> > +#define REORDER_BUFFER_MEM_TRACK_THRESHOLD MaxConnections
> >
> > The comment seems to imply that MAINTAIN_HEAP is useful for large
> > number of transactions but ReorderBufferLargestTXN() switches to this
> > state even when there is one transaction. So, basically we use the
> > binary_heap technique to get the largest even when we have one
> > transaction but we don't maintain that heap unless we have
> > REORDER_BUFFER_MEM_TRACK_THRESHOLD number of transactions are
> > in-progress. This means there is some additional work when (build and
> > reset heap each time when we pick largest xact) we have fewer
> > transactions in the system but that may not be impacting us because of
> > other costs involved like serializing all the changes. I think once we
> > can try to stress test this by setting
> > debug_logical_replication_streaming to 'immediate' to see if the new
> > mechanism has any overhead.
>
> I ran the test with a transaction having many inserts:
>
>        | 5000  | 10000 | 20000 | 100000 | 1000000 | 10000000
> -------|-------|-------|-------|--------|---------|----------
> Head   | 26.31 | 48.84 | 93.65 | 480.05 | 4808.29 | 47020.16
> Patch  | 26.35 | 50.8  | 97.99 | 484.8  | 4856.95 | 48108.89
>
> The same test with debug_logical_replication_streaming = 'immediate':
>
>        | 5000  | 10000  | 20000  | 100000  | 1000000  | 10000000
> -------|-------|--------|--------|---------|----------|-----------
> Head   | 59.29 | 115.84 | 227.21 | 1156.08 | 11367.42 | 113986.14
> Patch  | 62.45 | 120.48 | 240.56 | 1185.12 | 11855.37 | 119921.81
>
> The execution time is in milliseconds. The column header indicates the
> number of inserts in the transaction.
> In this case I noticed that the test execution with patch was taking
> slightly more time.

I have run the tests for which Vignesh had reported an issue; the
results with the latest patch are given below:

Without debug_logical_replication_streaming = 'immediate':

Record | 10000000  | 1000000  | 100000  | 20000 | 10000  | 5000
-------|-----------|----------|---------|-------|--------|--------
Head   | 47563.759 | 4917.057 | 478.923 | 97.28 | 50.368 | 25.917
Patch  | 47445.733 | 4722.874 | 472.817 | 95.15 | 48.801 | 26.168
%imp   | 0.248     | 3.949    | 1.274   | 2.189 | 3.111  | -0.968

With debug_logical_replication_streaming = 'immediate':

Record | 10000000   | 1000000   | 100000   | 20000   | 10000   | 5000
-------|------------|-----------|----------|---------|---------|--------
Head   | 106281.236 | 10669.992 | 1073.815 | 214.287 | 107.62  | 54.947
Patch  | 103108.673 | 10603.139 | 1064.98  | 210.229 | 106.321 | 54.218
%imp   | 2.985      | 0.626     | 0.822    | 1.893   | 1.207   | 1.326

The execution time is in milliseconds, and the column header indicates
the number of inserts in the transaction. The test results show that
the regression has been resolved with the new patch.

Thanks and Regards,
Shubham Khanna.




Re: Improve eviction algorithm in ReorderBuffer

2024-03-25 Thread Masahiko Sawada
On Thu, Mar 14, 2024 at 12:02 PM Masahiko Sawada  wrote:
>
>
> I've attached new version patches.

Since the previous patch conflicts with the current HEAD, I've
attached the rebased patches.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v10-0001-Make-binaryheap-enlargeable.patch
Description: Binary data


v10-0003-Improve-eviction-algorithm-in-Reorderbuffer-usin.patch
Description: Binary data


v10-0002-Add-functions-to-binaryheap-for-efficient-key-re.patch
Description: Binary data


Re: Improve eviction algorithm in ReorderBuffer

2024-03-13 Thread Masahiko Sawada
On Wed, Mar 13, 2024 at 11:23 AM Peter Smith  wrote:
>
> On Wed, Mar 13, 2024 at 12:48 PM Masahiko Sawada  
> wrote:
> >
> > On Wed, Mar 13, 2024 at 10:15 AM Peter Smith  wrote:
> > >
> > > On Tue, Mar 12, 2024 at 4:23 PM Masahiko Sawada  
> > > wrote:
> > > >
> > > > On Fri, Mar 8, 2024 at 12:58 PM Peter Smith  
> > > > wrote:
> > > > >
> > > ...
> > > > > > > 5.
> > > > > > > + *
> > > > > > > + * If 'indexed' is true, we create a hash table to track of each 
> > > > > > > node's
> > > > > > > + * index in the heap, enabling to perform some operations such 
> > > > > > > as removing
> > > > > > > + * the node from the heap.
> > > > > > >   */
> > > > > > >  binaryheap *
> > > > > > > -binaryheap_allocate(int capacity, binaryheap_comparator compare, 
> > > > > > > void *arg)
> > > > > > > +binaryheap_allocate(int capacity, binaryheap_comparator compare,
> > > > > > > + bool indexed, void *arg)
> > > > > > >
> > > > > > > BEFORE
> > > > > > > ... enabling to perform some operations such as removing the node 
> > > > > > > from the heap.
> > > > > > >
> > > > > > > SUGGESTION
> > > > > > > ... to help make operations such as removing nodes more efficient.
> > > > > > >
> > > > > >
> > > > > > But these operations literally require the indexed binary heap as we
> > > > > > have an assertion:
> > > > > >
> > > > > > void
> > > > > > binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d)
> > > > > > {
> > > > > > bh_nodeidx_entry *ent;
> > > > > >
> > > > > > Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
> > > > > > Assert(heap->bh_indexed);
> > > > > >
> > > > >
> > > > > I didn’t quite understand -- the operations mentioned are "operations
> > > > > such as removing the node", but binaryheap_remove_node() also removes
> > > > > a node from the heap. So I still felt the comment wording of the patch
> > > > > is not quite correct.
> > > >
> > > > Now I understand your point. That's a valid point.
> > > >
> > > > >
> > > > > Now, if the removal of a node from an indexed heap can *only* be done
> > > > > using binaryheap_remove_node_ptr() then:
> > > > > - the other removal functions (binaryheap_remove_*) probably need some
> > > > > comments to make sure nobody is tempted to call them directly for an
> > > > > indexed heap.
> > > > > - maybe some refactoring and assertions are needed to ensure those
> > > > > *cannot* be called directly for an indexed heap.
> > > > >
> > > >
> > > > If the 'index' is true, the caller can not only use the existing
> > > > functions but also newly added functions such as
> > > > binaryheap_remove_node_ptr() and binaryheap_update_up() etc. How about
> > > > something like below?
> > > >
> > >
> > > You said: "can not only use the existing functions but also..."
> > >
> > > Hmm. Is that right? IIUC those existing "remove" functions should NOT
> > > be called directly if the heap was "indexed" because they'll delete
> > > the node from the heap OK, but any corresponding index for that
> > > deleted node will be left lying around -- i.e. everything gets out of
> > > sync. This was the reason for my original concern.
> > >
> >
> > All existing binaryheap functions should be available even if the
> > binaryheap is 'indexed'. For instance, with the patch,
> > binaryheap_remove_node() is:
> >
> > void
> > binaryheap_remove_node(binaryheap *heap, int n)
> > {
> > int cmp;
> >
> > Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
> > Assert(n >= 0 && n < heap->bh_size);
> >
> > /* compare last node to the one that is being removed */
> > cmp = heap->bh_compare(heap->bh_nodes[--heap->bh_size],
> >heap->bh_nodes[n],
> >heap->bh_arg);
> >
> > /* remove the last node, placing it in the vacated entry */
> > replace_node(heap, n, heap->bh_nodes[heap->bh_size]);
> >
> > /* sift as needed to preserve the heap property */
> > if (cmp > 0)
> > sift_up(heap, n);
> > else if (cmp < 0)
> > sift_down(heap, n);
> > }
> >
> > The replace_node(), sift_up() and sift_down() update node's index as
> > well if the binaryheap is indexed. When deleting the node from the
> > binaryheap, it will also delete its index from the hash table.
> >
>
> I see now. Thanks for the information.
>
> ~~~
>
> Some more review comments for v8-0002
>
> ==
>
> 1.
> +/*
> + * Remove the node's index from the hash table if the heap is indexed.
> + */
> +static bool
> +delete_nodeidx(binaryheap *heap, bh_node_type node)
> +{
> + if (!binaryheap_indexed(heap))
> + return false;
> +
> + return bh_nodeidx_delete(heap->bh_nodeidx, node);
> +}
>
> I wasn't sure if having this function was a good idea. Yes, it makes
> code more readable, but I felt the heap code ought to be as efficient
> as possible so maybe it is better for the index check to be done at
> the caller, instead of incurring any overhead of function calls that
> might do nothing.
>
> SUGGESTION
> if 

Re: Improve eviction algorithm in ReorderBuffer

2024-03-13 Thread Masahiko Sawada
orderBuffer *rb)
>
> /We will run a heap assembly step at the end, which is more
> efficient./The heap assembly step is deferred until the end, for
> efficiency./

Fixed.

>
> ~~~
>
> 8. ReorderBufferLargestTXN
>
> + if (hash_get_num_entries(rb->by_txn) < REORDER_BUFFER_MEM_TRACK_THRESHOLD)
> + {
> + HASH_SEQ_STATUS hash_seq;
> + ReorderBufferTXNByIdEnt *ent;
> +
> + hash_seq_init(&hash_seq, rb->by_txn);
> + while ((ent = hash_seq_search(&hash_seq)) != NULL)
> + {
> + ReorderBufferTXN *txn = ent->txn;
> +
> + /* if the current transaction is larger, remember it */
> + if ((!largest) || (txn->size > largest->size))
> + largest = txn;
> + }
> +
> + Assert(largest);
> + }
>
> That Assert(largest) seems redundant because there is anyway another
> Assert(largest) immediately after this code.

Removed.

>
> ~~~
>
> 9.
> + /* Get the largest transaction from the max-heap */
> + if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP)
> + {
> + Assert(binaryheap_size(rb->txn_heap) > 0);
> + largest = (ReorderBufferTXN *)
> + DatumGetPointer(binaryheap_first(rb->txn_heap));
>   }
> Assert(binaryheap_size(rb->txn_heap) > 0); seemed like slightly less
> readable way of saying:
>
> Assert(!binaryheap_empty(rb->txn_heap));

Fixed.

>
> ~~~
>
> 10.
> +
> +/*
> + * Compare between sizes of two transactions. This is for a binary heap
> + * comparison function.
> + */
> +static int
> +ReorderBufferTXNSizeCompare(Datum a, Datum b, void *arg)
>
> 10a.
> /Compare between sizes of two transactions./Compare two transactions by size./

Fixed.

>
> ~~~
>
> 10b.
> IMO this comparator function belongs just before the
> ReorderBufferAllocate() function since that is the only place where it
> is used.

I think it's better to keep it close to the new max-heap-related functions.

>
> ==
> src/include/replication/reorderbuffer.h
>
> 11.
> +/* State of how to track the memory usage of each transaction being decoded 
> */
> +typedef enum ReorderBufferMemTrackState
> +{
> + /*
> + * We don't update max-heap while updating the memory counter. The
> + * max-heap is built before use.
> + */
> + REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP,
> +
> + /*
> + * We also update the max-heap when updating the memory counter so the
> + * heap property is always preserved.
> + */
> + REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP,
> +} ReorderBufferMemTrackState;
> +
>
> In my GENERAL review comment #0, I suggested the removal of this
> entire enum. e.g. It could be replaced with a boolean field
> 'track_txn_sizes'
>
> TBH, I think there is a better way to handle this "state". IIUC
> - the txn_heap is always allocated up-front.
> - you only "build" it when > threshold and
> - when it drops < 0.9 x threshold you reset it.
>
> Therefore, AFAICT you do not need to maintain any “switch states” at
> all; you simply need to check binaryheap_empty(txn_heap), right?
> * If the heap is empty…. It means you are NOT tracking, so don’t use it
> * If the heap is NOT empty …. It means you ARE tracking, so use it.
>
> ~
>
> Using my idea to remove the state flag will have the side effect of
> simplifying many other parts of this patch. For example
>
> BEFORE
> +static void
> +ReorderBufferMaybeChangeNoMaxHeap(ReorderBuffer *rb)
> +{
> + if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP)
> + return;
> +
> ...
> + if (binaryheap_size(rb->txn_heap) < REORDER_BUFFER_MEM_TRACK_THRESHOLD * 
> 0.9)
> + {
> + rb->memtrack_state = REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP;
> + binaryheap_reset(rb->txn_heap);
> + }
> +}
>
> AFTER
> +static void
> +ReorderBufferMaybeChangeNoMaxHeap(ReorderBuffer *rb)
> +{
> + if (binaryheap_empty(rb->txn_heap))
> + return;
> +
> ...
> + if (binaryheap_size(rb->txn_heap) < REORDER_BUFFER_MEM_TRACK_THRESHOLD * 
> 0.9)
> + binaryheap_reset(rb->txn_heap);
> +}

Agreed. I removed the enum and changed the logic.

>
> ~~~
>
> 12. struct ReorderBuffer
>
> + /* Max-heap for sizes of all top-level and sub transactions */
> + ReorderBufferMemTrackState memtrack_state;
> + binaryheap *txn_heap;
> +
>
> 12a.
> Why is this being referred to in the commit message and code comments
> as "max-heap" when the field is not called by that same name? Won't it
> be better to give the field a better name -- e.g. "txn_maxheap" or
> similar?

I'm not sure it helps readability. Other code that uses binaryheap has
neither max nor min in the field name.

>
> ~
>
> 12b.
> This comment should also say that the heap is ordered by tx size --
> (e.g. the comparator is ReorderBufferTXNSizeCompare)

It seems to me the comment "/* Max-heap for sizes of all top-level and
sub transactions */" already mentions that, no? I'm not sure we need
to refer to the actual function name here.

I've attached new version patches.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v9-0001-Make-binaryheap-enlargeable.patch
Description: Binary data


v9-0002-Add-functions-to-binaryheap-for-efficient-key-rem.patch
Description: Binary data


v9-0003-Improve-eviction-algorithm-in-Reorderbuffer-using.patch
Description: Binary data


Re: Improve eviction algorithm in ReorderBuffer

2024-03-12 Thread Peter Smith
On Wed, Mar 13, 2024 at 12:48 PM Masahiko Sawada  wrote:
>
> On Wed, Mar 13, 2024 at 10:15 AM Peter Smith  wrote:
> >
> > On Tue, Mar 12, 2024 at 4:23 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Fri, Mar 8, 2024 at 12:58 PM Peter Smith  wrote:
> > > >
> > ...
> > > > > > 5.
> > > > > > + *
> > > > > > + * If 'indexed' is true, we create a hash table to track of each 
> > > > > > node's
> > > > > > + * index in the heap, enabling to perform some operations such as 
> > > > > > removing
> > > > > > + * the node from the heap.
> > > > > >   */
> > > > > >  binaryheap *
> > > > > > -binaryheap_allocate(int capacity, binaryheap_comparator compare, 
> > > > > > void *arg)
> > > > > > +binaryheap_allocate(int capacity, binaryheap_comparator compare,
> > > > > > + bool indexed, void *arg)
> > > > > >
> > > > > > BEFORE
> > > > > > ... enabling to perform some operations such as removing the node 
> > > > > > from the heap.
> > > > > >
> > > > > > SUGGESTION
> > > > > > ... to help make operations such as removing nodes more efficient.
> > > > > >
> > > > >
> > > > > But these operations literally require the indexed binary heap as we
> > > > > have an assertion:
> > > > >
> > > > > void
> > > > > binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d)
> > > > > {
> > > > > bh_nodeidx_entry *ent;
> > > > >
> > > > > Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
> > > > > Assert(heap->bh_indexed);
> > > > >
> > > >
> > > > I didn’t quite understand -- the operations mentioned are "operations
> > > > such as removing the node", but binaryheap_remove_node() also removes
> > > > a node from the heap. So I still felt the comment wording of the patch
> > > > is not quite correct.
> > >
> > > Now I understand your point. That's a valid point.
> > >
> > > >
> > > > Now, if the removal of a node from an indexed heap can *only* be done
> > > > using binaryheap_remove_node_ptr() then:
> > > > - the other removal functions (binaryheap_remove_*) probably need some
> > > > comments to make sure nobody is tempted to call them directly for an
> > > > indexed heap.
> > > > - maybe some refactoring and assertions are needed to ensure those
> > > > *cannot* be called directly for an indexed heap.
> > > >
> > >
> > > If the 'index' is true, the caller can not only use the existing
> > > functions but also newly added functions such as
> > > binaryheap_remove_node_ptr() and binaryheap_update_up() etc. How about
> > > something like below?
> > >
> >
> > You said: "can not only use the existing functions but also..."
> >
> > Hmm. Is that right? IIUC those existing "remove" functions should NOT
> > be called directly if the heap was "indexed" because they'll delete
> > the node from the heap OK, but any corresponding index for that
> > deleted node will be left lying around -- i.e. everything gets out of
> > sync. This was the reason for my original concern.
> >
>
> All existing binaryheap functions should be available even if the
> binaryheap is 'indexed'. For instance, with the patch,
> binaryheap_remove_node() is:
>
> void
> binaryheap_remove_node(binaryheap *heap, int n)
> {
> int cmp;
>
> Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
> Assert(n >= 0 && n < heap->bh_size);
>
> /* compare last node to the one that is being removed */
> cmp = heap->bh_compare(heap->bh_nodes[--heap->bh_size],
>heap->bh_nodes[n],
>heap->bh_arg);
>
> /* remove the last node, placing it in the vacated entry */
> replace_node(heap, n, heap->bh_nodes[heap->bh_size]);
>
> /* sift as needed to preserve the heap property */
> if (cmp > 0)
> sift_up(heap, n);
> else if (cmp < 0)
> sift_down(heap, n);
> }
>
> The replace_node(), sift_up() and sift_down() update node's index as
> well if the binaryheap is indexed. When deleting the node from the
> binaryheap, it will also delete its index from the hash table.
>

I see now. Thanks for the information.

~~~

Some more review comments for v8-0002

==

1.
+/*
+ * Remove the node's index from the hash table if the heap is indexed.
+ */
+static bool
+delete_nodeidx(binaryheap *heap, bh_node_type node)
+{
+ if (!binaryheap_indexed(heap))
+ return false;
+
+ return bh_nodeidx_delete(heap->bh_nodeidx, node);
+}

I wasn't sure if having this function was a good idea. Yes, it makes
code more readable, but I felt the heap code ought to be as efficient
as possible so maybe it is better for the index check to be done at
the caller, instead of incurring any overhead of function calls that
might do nothing.

SUGGESTION
if (binaryheap_indexed(heap))
  found = bh_nodeidx_delete(heap->bh_nodeidx, node);

~~~

2.
+/*
+ * binaryheap_update_up
+ *
+ * Sift the given node up after the node's key is updated. The caller must
+ * ensure that the given node is in the heap. O(log n) worst case.
+ *
+ * This function can be used only if the heap is indexed.
+ */
+void

Re: Improve eviction algorithm in ReorderBuffer

2024-03-12 Thread Masahiko Sawada
On Wed, Mar 13, 2024 at 10:15 AM Peter Smith  wrote:
>
> On Tue, Mar 12, 2024 at 4:23 PM Masahiko Sawada  wrote:
> >
> > On Fri, Mar 8, 2024 at 12:58 PM Peter Smith  wrote:
> > >
> ...
> > > > > 5.
> > > > > + *
> > > > > + * If 'indexed' is true, we create a hash table to track of each 
> > > > > node's
> > > > > + * index in the heap, enabling to perform some operations such as 
> > > > > removing
> > > > > + * the node from the heap.
> > > > >   */
> > > > >  binaryheap *
> > > > > -binaryheap_allocate(int capacity, binaryheap_comparator compare, 
> > > > > void *arg)
> > > > > +binaryheap_allocate(int capacity, binaryheap_comparator compare,
> > > > > + bool indexed, void *arg)
> > > > >
> > > > > BEFORE
> > > > > ... enabling to perform some operations such as removing the node 
> > > > > from the heap.
> > > > >
> > > > > SUGGESTION
> > > > > ... to help make operations such as removing nodes more efficient.
> > > > >
> > > >
> > > > But these operations literally require the indexed binary heap as we
> > > > have an assertion:
> > > >
> > > > void
> > > > binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d)
> > > > {
> > > > bh_nodeidx_entry *ent;
> > > >
> > > > Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
> > > > Assert(heap->bh_indexed);
> > > >
> > >
> > > I didn’t quite understand -- the operations mentioned are "operations
> > > such as removing the node", but binaryheap_remove_node() also removes
> > > a node from the heap. So I still felt the comment wording of the patch
> > > is not quite correct.
> >
> > Now I understand your point. That's a valid point.
> >
> > >
> > > Now, if the removal of a node from an indexed heap can *only* be done
> > > using binaryheap_remove_node_ptr() then:
> > > - the other removal functions (binaryheap_remove_*) probably need some
> > > comments to make sure nobody is tempted to call them directly for an
> > > indexed heap.
> > > - maybe some refactoring and assertions are needed to ensure those
> > > *cannot* be called directly for an indexed heap.
> > >
> >
> > If the 'index' is true, the caller can not only use the existing
> > functions but also newly added functions such as
> > binaryheap_remove_node_ptr() and binaryheap_update_up() etc. How about
> > something like below?
> >
>
> You said: "can not only use the existing functions but also..."
>
> Hmm. Is that right? IIUC those existing "remove" functions should NOT
> be called directly if the heap was "indexed" because they'll delete
> the node from the heap OK, but any corresponding index for that
> deleted node will be left lying around -- i.e. everything gets out of
> sync. This was the reason for my original concern.
>

All existing binaryheap functions should be available even if the
binaryheap is 'indexed'. For instance, with the patch,
binaryheap_remove_node() is:

void
binaryheap_remove_node(binaryheap *heap, int n)
{
    int         cmp;

    Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
    Assert(n >= 0 && n < heap->bh_size);

    /* compare last node to the one that is being removed */
    cmp = heap->bh_compare(heap->bh_nodes[--heap->bh_size],
                           heap->bh_nodes[n],
                           heap->bh_arg);

    /* remove the last node, placing it in the vacated entry */
    replace_node(heap, n, heap->bh_nodes[heap->bh_size]);

    /* sift as needed to preserve the heap property */
    if (cmp > 0)
        sift_up(heap, n);
    else if (cmp < 0)
        sift_down(heap, n);
}

The replace_node(), sift_up() and sift_down() update node's index as
well if the binaryheap is indexed. When deleting the node from the
binaryheap, it will also delete its index from the hash table.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Improve eviction algorithm in ReorderBuffer

2024-03-12 Thread Peter Smith
On Tue, Mar 12, 2024 at 4:23 PM Masahiko Sawada  wrote:
>
> On Fri, Mar 8, 2024 at 12:58 PM Peter Smith  wrote:
> >
...
> > > > 5.
> > > > + *
> > > > + * If 'indexed' is true, we create a hash table to track of each node's
> > > > + * index in the heap, enabling to perform some operations such as 
> > > > removing
> > > > + * the node from the heap.
> > > >   */
> > > >  binaryheap *
> > > > -binaryheap_allocate(int capacity, binaryheap_comparator compare, void 
> > > > *arg)
> > > > +binaryheap_allocate(int capacity, binaryheap_comparator compare,
> > > > + bool indexed, void *arg)
> > > >
> > > > BEFORE
> > > > ... enabling to perform some operations such as removing the node from 
> > > > the heap.
> > > >
> > > > SUGGESTION
> > > > ... to help make operations such as removing nodes more efficient.
> > > >
> > >
> > > But these operations literally require the indexed binary heap as we
> > > have an assertion:
> > >
> > > void
> > > binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d)
> > > {
> > > bh_nodeidx_entry *ent;
> > >
> > > Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
> > > Assert(heap->bh_indexed);
> > >
> >
> > I didn’t quite understand -- the operations mentioned are "operations
> > such as removing the node", but binaryheap_remove_node() also removes
> > a node from the heap. So I still felt the comment wording of the patch
> > is not quite correct.
>
> Now I understand your point. That's a valid point.
>
> >
> > Now, if the removal of a node from an indexed heap can *only* be done
> > using binaryheap_remove_node_ptr() then:
> > - the other removal functions (binaryheap_remove_*) probably need some
> > comments to make sure nobody is tempted to call them directly for an
> > indexed heap.
> > - maybe some refactoring and assertions are needed to ensure those
> > *cannot* be called directly for an indexed heap.
> >
>
> If the 'index' is true, the caller can not only use the existing
> functions but also newly added functions such as
> binaryheap_remove_node_ptr() and binaryheap_update_up() etc. How about
> something like below?
>

You said: "can not only use the existing functions but also..."

Hmm. Is that right? IIUC those existing "remove" functions should NOT
be called directly if the heap was "indexed" because they'll delete
the node from the heap OK, but any corresponding index for that
deleted node will be left lying around -- i.e. everything gets out of
sync. This was the reason for my original concern.

>  * If 'indexed' is true, we create a hash table to track each node's
>  * index in the heap, enabling to perform some operations such as
>  * binaryheap_remove_node_ptr() etc.
>

Yeah, something like that... I'll wait for the next patch version
before commenting further.

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Improve eviction algorithm in ReorderBuffer

2024-03-11 Thread Masahiko Sawada
On Fri, Mar 8, 2024 at 12:58 PM Peter Smith  wrote:
>
> On Thu, Mar 7, 2024 at 2:16 PM Masahiko Sawada  wrote:
> >
> > On Tue, Mar 5, 2024 at 3:28 PM Peter Smith  wrote:
> > >
>
> > > 4a.
> > > The comment in simplehash.h says
> > >  *   The following parameters are only relevant when SH_DEFINE is defined:
> > >  *   - SH_KEY - ...
> > >  *   - SH_EQUAL(table, a, b) - ...
> > >  *   - SH_HASH_KEY(table, key) - ...
> > >  *   - SH_STORE_HASH - ...
> > >  *   - SH_GET_HASH(tb, a) - ...
> > >
> > > So maybe it is nicer to reorder the #defines in that same order?
> > >
> > > SUGGESTION:
> > > +#define SH_PREFIX bh_nodeidx
> > > +#define SH_ELEMENT_TYPE bh_nodeidx_entry
> > > +#define SH_KEY_TYPE bh_node_type
> > > +#define SH_SCOPE extern
> > > +#ifdef FRONTEND
> > > +#define SH_RAW_ALLOCATOR pg_malloc0
> > > +#endif
> > >
> > > +#define SH_DEFINE
> > > +#define SH_KEY key
> > > +#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(bh_node_type)) == 0)
> > > +#define SH_HASH_KEY(tb, key) \
> > > + hash_bytes((const unsigned char *) &key, sizeof(bh_node_type))
> > > +#include "lib/simplehash.h"
> >
> > I'm really not sure it helps increase readability. For instance, for
> > me it's readable if SH_DEFINE and SH_DECLARE come to the last before
> > #include since it's more obvious whether we want to declare, define or
> > both. Other simplehash.h users also do so.
> >
>
> OK.
>
> > > 5.
> > > + *
> > > + * If 'indexed' is true, we create a hash table to track of each node's
> > > + * index in the heap, enabling to perform some operations such as 
> > > removing
> > > + * the node from the heap.
> > >   */
> > >  binaryheap *
> > > -binaryheap_allocate(int capacity, binaryheap_comparator compare, void 
> > > *arg)
> > > +binaryheap_allocate(int capacity, binaryheap_comparator compare,
> > > + bool indexed, void *arg)
> > >
> > > BEFORE
> > > ... enabling to perform some operations such as removing the node from 
> > > the heap.
> > >
> > > SUGGESTION
> > > ... to help make operations such as removing nodes more efficient.
> > >
> >
> > But these operations literally require the indexed binary heap as we
> > have an assertion:
> >
> > void
> > binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d)
> > {
> > bh_nodeidx_entry *ent;
> >
> > Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
> > Assert(heap->bh_indexed);
> >
>
> I didn’t quite understand -- the operations mentioned are "operations
> such as removing the node", but binaryheap_remove_node() also removes
> a node from the heap. So I still felt the comment wording of the patch
> is not quite correct.

Now I understand your point. That's a valid point.

>
> Now, if the removal of a node from an indexed heap can *only* be done
> using binaryheap_remove_node_ptr() then:
> - the other removal functions (binaryheap_remove_*) probably need some
> comments to make sure nobody is tempted to call them directly for an
> indexed heap.
> - maybe some refactoring and assertions are needed to ensure those
> *cannot* be called directly for an indexed heap.
>

If the 'index' is true, the caller can not only use the existing
functions but also newly added functions such as
binaryheap_remove_node_ptr() and binaryheap_update_up() etc. How about
something like below?

 * If 'indexed' is true, we create a hash table to track each node's
 * index in the heap, enabling to perform some operations such as
 * binaryheap_remove_node_ptr() etc.

>
> And, here are some review comments for v8-0002.
>
> ==
> 1. delete_nodeidx
>
> +/*
> + * Remove the node's index from the hash table if the heap is indexed.
> + */
> +static bool
> +delete_nodeidx(binaryheap *heap, bh_node_type node)
> +{
> + if (!binaryheap_indexed(heap))
> + return false;
> +
> + return bh_nodeidx_delete(heap->bh_nodeidx, node);
> +}
>
> 1a.
> In v8 this function was changed to now return bool, so, I think the
> function comment should explain the meaning of that return value.
>
> ~
>
> 1b.
> I felt the function body is better expressed positively: "If this then
> do that", instead of "If not this then do nothing otherwise do that"
>
> SUGGESTION
> if (binaryheap_indexed(heap))
>   return bh_nodeidx_delete(heap->bh_nodeidx, node);
>
> return false;
>
> ~~~
>
> 2.
> +static void
> +replace_node(binaryheap *heap, int index, bh_node_type new_node)
> +{
> + bool found PG_USED_FOR_ASSERTS_ONLY;
> +
> + /* Quick return if not necessary to move */
> + if (heap->bh_nodes[index] == new_node)
> + return;
> +
> + /*
> + * Remove overwritten node's index. The overwritten node's position must
> + * have been tracked, if enabled.
> + */
> + found = delete_nodeidx(heap, heap->bh_nodes[index]);
> + Assert(!binaryheap_indexed(heap) || found);
> +
> + /*
> + * Replace it with the given new node. This node's position must also be
> + * tracked as we assume to replace the node by the existing node.
> + */
> + found = set_node(heap, new_node, index);
> + Assert(!binaryheap_indexed(heap) || found);
> +}
>
> 2a.
> 

Re: Improve eviction algorithm in ReorderBuffer

2024-03-11 Thread Peter Smith
Here are some review comments for v8-0003

==
0. GENERAL -- why the state enum?

This patch introduced a new ReorderBufferMemTrackState with 2 states
(REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP,
REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP)

It's the same as having a boolean flag OFF/ON, so I didn't see any
benefit of the enum instead of a simple boolean flag like
'track_txn_sizes'.

NOTE: Below in this post (see #11) I would like to propose another
idea, which can simplify much further, eliminating the need for the
state boolean. If adopted that will impact lots of these other review
comments.

==
Commit Message

1.
Previously, when selecting the transaction to evict during logical
decoding, we check all transactions to find the largest
transaction. Which could lead to a significant replication lag
especially in case where there are many subtransactions.

~

/Which could/This could/

/in case/in the case/

==
.../replication/logical/reorderbuffer.c

2.
 *   We use a max-heap with transaction size as the key to efficiently find
 *   the largest transaction. The max-heap state is managed in two states:
 *   REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP and
REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP.

/The max-heap state is managed in two states:/The max-heap is managed
in two states:/

~~~

3.
+/*
+ * Threshold of the total number of top-level and sub transactions
that controls
+ * whether we switch the memory track state. While using max-heap to select
+ * the largest transaction is effective when there are many transactions being
+ * decoded, in many systems there is generally no need to use it as long as all
+ * transactions being decoded are top-level transactions. Therefore, we use
+ * MaxConnections as the threshold* so we can prevent switch to the
state unless
+ * we use subtransactions.
+ */
+#define REORDER_BUFFER_MEM_TRACK_THRESHOLD MaxConnections

3a.
/memory track state./memory tracking state./

/While using max-heap/Although using max-heap/

"in many systems" (are these words adding anything?)

/threshold*/threshold/

/so we can prevent switch/so we can prevent switching/

~

3b.
There's nothing really in this name to indicate the units of the
threshold. Consider if there is some more informative name for this
macro: e.g.
MAXHEAP_TX_COUNT_THRESHOLD (?)

~~~

4.
+ /*
+ * Don't start with a lower number than
+ * REORDER_BUFFER_MEM_TRACK_THRESHOLD, since we add at least
+ * REORDER_BUFFER_MEM_TRACK_THRESHOLD entries at once.
+ */
+ buffer->memtrack_state = REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP;
+ buffer->txn_heap = binaryheap_allocate(REORDER_BUFFER_MEM_TRACK_THRESHOLD * 2,
+ReorderBufferTXNSizeCompare,
+true, NULL);
+

IIUC the comment intends to say:

Allocate the initial heap size greater than THRESHOLD because the
txn_heap will not be used until the threshold is exceeded.

Also, maybe the comment should make a point of saying "Note: the
binary heap is INDEXED for faster manipulations". or something
similar.

~~~

5.
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
  ReorderBufferChange *change,
+ ReorderBufferTXN *txn,
  bool addition, Size sz)
 {
- ReorderBufferTXN *txn;
  ReorderBufferTXN *toptxn;

- Assert(change->txn);
-

There now seems to be some trick where the passed 'change' could be NULL,
which was not possible before. e.g., when change is NULL then 'txn' is
not NULL, and vice versa. Some explanation about this logic and the
meaning of these parameters should be written in this function
comment.

~

6.
+ txn = txn != NULL ? txn : change->txn;

IMO it's more natural to code the ternary using the same order as the
parameters:

e.g. txn = change ? change->txn : txn;

~~~

7.
/*
 * Build the max-heap and switch the state. We will run a heap assembly step
 * at the end, which is more efficient.
 */
static void
ReorderBufferBuildMaxHeap(ReorderBuffer *rb)

/We will run a heap assembly step at the end, which is more
efficient./The heap assembly step is deferred until the end, for
efficiency./

~~~

8. ReorderBufferLargestTXN

+ if (hash_get_num_entries(rb->by_txn) < REORDER_BUFFER_MEM_TRACK_THRESHOLD)
+ {
+ HASH_SEQ_STATUS hash_seq;
+ ReorderBufferTXNByIdEnt *ent;
+
+ hash_seq_init(&hash_seq, rb->by_txn);
+ while ((ent = hash_seq_search(&hash_seq)) != NULL)
+ {
+ ReorderBufferTXN *txn = ent->txn;
+
+ /* if the current transaction is larger, remember it */
+ if ((!largest) || (txn->size > largest->size))
+ largest = txn;
+ }
+
+ Assert(largest);
+ }

That Assert(largest) seems redundant because there is anyway another
Assert(largest) immediately after this code.

~~~

9.
+ /* Get the largest transaction from the max-heap */
+ if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP)
+ {
+ Assert(binaryheap_size(rb->txn_heap) > 0);
+ largest = (ReorderBufferTXN *)
+ DatumGetPointer(binaryheap_first(rb->txn_heap));
  }
Assert(binaryheap_size(rb->txn_heap) > 0); seemed like a slightly less
readable way of saying:

Assert(!binaryheap_empty(rb->txn_heap));

~~~

10.
+
+/*
+ * Compare 

Re: Improve eviction algorithm in ReorderBuffer

2024-03-07 Thread Peter Smith
On Thu, Mar 7, 2024 at 2:16 PM Masahiko Sawada  wrote:
>
> On Tue, Mar 5, 2024 at 3:28 PM Peter Smith  wrote:
> >

> > 4a.
> > The comment in simplehash.h says
> >  *   The following parameters are only relevant when SH_DEFINE is defined:
> >  *   - SH_KEY - ...
> >  *   - SH_EQUAL(table, a, b) - ...
> >  *   - SH_HASH_KEY(table, key) - ...
> >  *   - SH_STORE_HASH - ...
> >  *   - SH_GET_HASH(tb, a) - ...
> >
> > So maybe it is nicer to reorder the #defines in that same order?
> >
> > SUGGESTION:
> > +#define SH_PREFIX bh_nodeidx
> > +#define SH_ELEMENT_TYPE bh_nodeidx_entry
> > +#define SH_KEY_TYPE bh_node_type
> > +#define SH_SCOPE extern
> > +#ifdef FRONTEND
> > +#define SH_RAW_ALLOCATOR pg_malloc0
> > +#endif
> >
> > +#define SH_DEFINE
> > +#define SH_KEY key
> > +#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(bh_node_type)) == 0)
> > +#define SH_HASH_KEY(tb, key) \
> > + hash_bytes((const unsigned char *) &key, sizeof(bh_node_type))
> > +#include "lib/simplehash.h"
>
> I'm really not sure it helps increase readability. For instance, for
> me it's readable if SH_DEFINE and SH_DECLARE come to the last before
> #include since it's more obvious whether we want to declare, define or
> both. Other simplehash.h users also do so.
>

OK.

> > 5.
> > + *
> > + * If 'indexed' is true, we create a hash table to track of each node's
> > + * index in the heap, enabling to perform some operations such as removing
> > + * the node from the heap.
> >   */
> >  binaryheap *
> > -binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
> > +binaryheap_allocate(int capacity, binaryheap_comparator compare,
> > + bool indexed, void *arg)
> >
> > BEFORE
> > ... enabling to perform some operations such as removing the node from the 
> > heap.
> >
> > SUGGESTION
> > ... to help make operations such as removing nodes more efficient.
> >
>
> But these operations literally require the indexed binary heap as we
> have an assertion:
>
> void
> binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d)
> {
> bh_nodeidx_entry *ent;
>
> Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
> Assert(heap->bh_indexed);
>

I didn’t quite understand -- the operations mentioned are "operations
such as removing the node", but binaryheap_remove_node() also removes
a node from the heap. So I still felt the comment wording of the patch
is not quite correct.

Now, if the removal of a node from an indexed heap can *only* be done
using binaryheap_remove_node_ptr() then:
- the other removal functions (binaryheap_remove_*) probably need some
comments to make sure nobody is tempted to call them directly for an
indexed heap.
- maybe some refactoring and assertions are needed to ensure those
*cannot* be called directly for an indexed heap.

> > 7b.
> > IMO the parameters would be better the other way around (e.g. 'index'
> > before the 'node') because that's what the assignments look like:
> >
> >
> > heap->bh_nodes[heap->bh_size] = d;
> >
> > becomes:
> > bh_set_node(heap, heap->bh_size, d);
> >
>
> I think it assumes heap->bh_nodes is an array. But if we change it in
> the future, it will no longer make sense. I think it would make more
> sense if we define the parameters in an order like "we set the 'node'
> at 'index'". What do you think?

YMMV. The patch code is also OK by me if you prefer it.

//

And, here are some review comments for v8-0002.

==
1. delete_nodeidx

+/*
+ * Remove the node's index from the hash table if the heap is indexed.
+ */
+static bool
+delete_nodeidx(binaryheap *heap, bh_node_type node)
+{
+ if (!binaryheap_indexed(heap))
+ return false;
+
+ return bh_nodeidx_delete(heap->bh_nodeidx, node);
+}

1a.
In v8 this function was changed to now return bool, so, I think the
function comment should explain the meaning of that return value.

~

1b.
I felt the function body is better expressed positively: "If this then
do that", instead of "If not this then do nothing otherwise do that"

SUGGESTION
if (binaryheap_indexed(heap))
  return bh_nodeidx_delete(heap->bh_nodeidx, node);

return false;

~~~

2.
+static void
+replace_node(binaryheap *heap, int index, bh_node_type new_node)
+{
+ bool found PG_USED_FOR_ASSERTS_ONLY;
+
+ /* Quick return if not necessary to move */
+ if (heap->bh_nodes[index] == new_node)
+ return;
+
+ /*
+ * Remove overwritten node's index. The overwritten node's position must
+ * have been tracked, if enabled.
+ */
+ found = delete_nodeidx(heap, heap->bh_nodes[index]);
+ Assert(!binaryheap_indexed(heap) || found);
+
+ /*
+ * Replace it with the given new node. This node's position must also be
+ * tracked as we assume to replace the node by the existing node.
+ */
+ found = set_node(heap, new_node, index);
+ Assert(!binaryheap_indexed(heap) || found);
+}

2a.
/Remove overwritten/Remove the overwritten/
/replace the node by the existing node/replace the node with the existing node/

~

2b.
It might be helpful to declare another local var...
bh_node_type 

Re: Improve eviction algorithm in ReorderBuffer

2024-03-06 Thread Masahiko Sawada
On Tue, Mar 5, 2024 at 3:28 PM Peter Smith  wrote:
>
> Hi, here are some review comments for v7-0002
>
> ==
> Commit Message
>
> 1.
> This commit adds a hash table to binaryheap in order to track of
> positions of each nodes in the binaryheap. That way, by using newly
> added functions such as binaryheap_update_up() etc., both updating a
> key and removing a node can be done in O(1) on an average and O(log n)
> in worst case. This is known as the indexed binary heap. The caller
> can specify to use the indexed binaryheap by passing indexed = true.
>
> ~
>
> /to track of positions of each nodes/to track the position of each node/
>
> ~~~
>
> 2.
> There is no user of it but it will be used by a upcoming patch.
>
> ~
>
> The current code does not use the new indexing logic, but it will be
> used by an upcoming patch.

Fixed.

>
> ==
> src/common/binaryheap.c
>
> 3.
> +/*
> + * Define parameters for hash table code generation. The interface is *also*"
> + * declared in binaryheaph.h (to generate the types, which are externally
> + * visible).
> + */
>
> Typo: *also*"

Fixed.

>
> ~~~
>
> 4.
> +#define SH_PREFIX bh_nodeidx
> +#define SH_ELEMENT_TYPE bh_nodeidx_entry
> +#define SH_KEY_TYPE bh_node_type
> +#define SH_KEY key
> +#define SH_HASH_KEY(tb, key) \
> + hash_bytes((const unsigned char *) &key, sizeof(bh_node_type))
> +#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(bh_node_type)) == 0)
> +#define SH_SCOPE extern
> +#ifdef FRONTEND
> +#define SH_RAW_ALLOCATOR pg_malloc0
> +#endif
> +#define SH_DEFINE
> +#include "lib/simplehash.h"
>
> 4a.
> The comment in simplehash.h says
>  *   The following parameters are only relevant when SH_DEFINE is defined:
>  *   - SH_KEY - ...
>  *   - SH_EQUAL(table, a, b) - ...
>  *   - SH_HASH_KEY(table, key) - ...
>  *   - SH_STORE_HASH - ...
>  *   - SH_GET_HASH(tb, a) - ...
>
> So maybe it is nicer to reorder the #defines in that same order?
>
> SUGGESTION:
> +#define SH_PREFIX bh_nodeidx
> +#define SH_ELEMENT_TYPE bh_nodeidx_entry
> +#define SH_KEY_TYPE bh_node_type
> +#define SH_SCOPE extern
> +#ifdef FRONTEND
> +#define SH_RAW_ALLOCATOR pg_malloc0
> +#endif
>
> +#define SH_DEFINE
> +#define SH_KEY key
> > +#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(bh_node_type)) == 0)
> > +#define SH_HASH_KEY(tb, key) \
> > + hash_bytes((const unsigned char *) &key, sizeof(bh_node_type))
> +#include "lib/simplehash.h"

I'm really not sure it helps increase readability. For instance, for
me it's readable if SH_DEFINE and SH_DECLARE come to the last before
#include since it's more obvious whether we want to declare, define or
both. Other simplehash.h users also do so.

>
> ~~
>
> 4b.
> The comment in simplehash.h says that "it's preferable, if possible,
> to store the element's hash in the element's data type", so should
> SH_STORE_HASH and SH_GET_HASH also be defined here?

Good catch. I've used these macros.

>
> ~~~
>
> 5.
> + *
> + * If 'indexed' is true, we create a hash table to track of each node's
> + * index in the heap, enabling to perform some operations such as removing
> + * the node from the heap.
>   */
>  binaryheap *
> -binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
> +binaryheap_allocate(int capacity, binaryheap_comparator compare,
> + bool indexed, void *arg)
>
> BEFORE
> ... enabling to perform some operations such as removing the node from the 
> heap.
>
> SUGGESTION
> ... to help make operations such as removing nodes more efficient.
>

But these operations literally require the indexed binary heap as we
have an assertion:

void
binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d)
{
bh_nodeidx_entry *ent;

Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
Assert(heap->bh_indexed);

> ~~~
>
> 6.
> + heap->bh_indexed = indexed;
> + if (heap->bh_indexed)
> + {
> +#ifdef FRONTEND
> + heap->bh_nodeidx = bh_nodeidx_create(capacity, NULL);
> +#else
> + heap->bh_nodeidx = bh_nodeidx_create(CurrentMemoryContext, capacity,
> + NULL);
> +#endif
> + }
> +
>
> The heap allocation just uses palloc instead of palloc0 so it might be
> better to assign "heap->bh_nodeidx = NULL;" up-front, just so you will
> never get a situation where bh_indexed is false but bh_nodeidx has
> some (garbage) value.

Fixed.

>
> ~~~
>
> 7.
> +/*
> + * Set the given node at the 'index', and updates its position accordingly.
> + *
> + * Return true if the node's index is already tracked.
> + */
> +static bool
> +bh_set_node(binaryheap *heap, bh_node_type node, int index)
>
> 7a.
> I felt the 1st sentence should be more like:
>
> SUGGESTION
> Set the given node at the 'index' and track it if required.

Fixed.

>
> ~
>
> 7b.
> IMO the parameters would be better the other way around (e.g. 'index'
> before the 'node') because that's what the assignments look like:
>
>
> heap->bh_nodes[heap->bh_size] = d;
>
> becomes:
> bh_set_node(heap, heap->bh_size, d);
>

I think it assumes heap->bh_nodes is an array. But if we change it in
the future, it will no longer make sense. I think it would make more
sense if we define the parameters in an order like "we set the 'node'
at 'index'". What do you think?

Re: Improve eviction algorithm in ReorderBuffer

2024-03-06 Thread Masahiko Sawada
On Tue, Mar 5, 2024 at 12:20 PM vignesh C  wrote:
>
> On Wed, 28 Feb 2024 at 11:40, Amit Kapila  wrote:
> >
> > On Mon, Feb 26, 2024 at 7:54 PM Masahiko Sawada  
> > wrote:
> > >
> >
> > A few comments on 0003:
> > ===
> > 1.
> > +/*
> > + * Threshold of the total number of top-level and sub transactions
> > that controls
> > + * whether we switch the memory track state. While the MAINTAIN_HEAP state 
> > is
> > + * effective when there are many transactions being decoded, in many 
> > systems
> > + * there is generally no need to use it as long as all transactions
> > being decoded
> > + * are top-level transactions. Therefore, we use MaxConnections as
> > the threshold
> > + * so we can prevent switch to the state unless we use subtransactions.
> > + */
> > +#define REORDER_BUFFER_MEM_TRACK_THRESHOLD MaxConnections
> >
> > The comment seems to imply that MAINTAIN_HEAP is useful for large
> > number of transactions but ReorderBufferLargestTXN() switches to this
> > state even when there is one transaction. So, basically we use the
> > binary_heap technique to get the largest even when we have one
> > transaction but we don't maintain that heap unless we have
> > REORDER_BUFFER_MEM_TRACK_THRESHOLD number of transactions are
> > in-progress. This means there is some additional work when (build and
> > reset heap each time when we pick largest xact) we have fewer
> > transactions in the system but that may not be impacting us because of
> > other costs involved like serializing all the changes. I think once we
> > can try to stress test this by setting
> > debug_logical_replication_streaming to 'immediate' to see if the new
> > mechanism has any overhead.
>
> I ran the test with a transaction having many inserts:
>
>       | 5000  | 10000 | 20000 | 100000 | 1000000 | 10000000
> ------|-------|-------|-------|--------|---------|---------
> Head  | 26.31 | 48.84 | 93.65 | 480.05 | 4808.29 | 47020.16
> Patch | 26.35 | 50.8  | 97.99 | 484.8  | 4856.95 | 48108.89
>
> The same test with debug_logical_replication_streaming = 'immediate'
>
>       | 5000  | 10000  | 20000  | 100000  | 1000000  | 10000000
> ------|-------|--------|--------|---------|----------|----------
> Head  | 59.29 | 115.84 | 227.21 | 1156.08 | 11367.42 | 113986.14
> Patch | 62.45 | 120.48 | 240.56 | 1185.12 | 11855.37 | 119921.81
>
> The execution time is in milliseconds. The column header indicates the
> number of inserts in the transaction.
> In this case I noticed that the test execution with patch was taking
> slightly more time.
>

Thank you for testing! With 10M records, I can see 2% regression in
the 'buffered' case and 5% regression in the 'immediate' case.

I think that in general it makes sense to postpone using a max-heap
until the number of transactions is higher than the threshold. I've
implemented this idea and here are the results on my environment (with
10M records and debug_logical_replication_streaming = 'immediate'):

HEAD:
68937.887 ms
69450.174 ms
68808.248 ms

v7 patch:
71280.783 ms
71673.101 ms
71330.898 ms

v8 patch:
68918.259 ms
68822.330 ms
68972.452 ms

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v8-0001-Make-binaryheap-enlargeable.patch
Description: Binary data


v8-0002-Add-functions-to-binaryheap-for-efficient-key-rem.patch
Description: Binary data


v8-0003-Improve-eviction-algorithm-in-Reorderbuffer-using.patch
Description: Binary data


Re: Improve eviction algorithm in ReorderBuffer

2024-03-06 Thread Masahiko Sawada
On Tue, Mar 5, 2024 at 11:25 AM Peter Smith  wrote:
>
> Hi, Here are some review comments for v7-0001

Thank you for reviewing the patch.

>
> 1.
> /*
>  * binaryheap_free
>  *
>  * Releases memory used by the given binaryheap.
>  */
> void
> binaryheap_free(binaryheap *heap)
> {
> pfree(heap);
> }
>
>
> Shouldn't the above function (not modified by the patch) also firstly
> free the memory allocated for the heap->bh_nodes?
>
> ~~~
>
> 2.
> +/*
> + * Make sure there is enough space for nodes.
> + */
> +static void
> +bh_enlarge_node_array(binaryheap *heap)
> +{
> + heap->bh_space *= 2;
> + heap->bh_nodes = repalloc(heap->bh_nodes,
> +   sizeof(bh_node_type) * heap->bh_space);
> +}
>
> Strictly speaking, this function doesn't really "Make sure" of
> anything because the caller does the check whether we need more space.
> All that happens here is allocating more space. Maybe this function
> comment should say something like "Double the space allocated for
> nodes."

Agreed with the above two points. I'll fix them in the next version patch.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Improve eviction algorithm in ReorderBuffer

2024-03-04 Thread Peter Smith
Hi, here are some review comments for v7-0002

==
Commit Message

1.
This commit adds a hash table to binaryheap in order to track of
positions of each nodes in the binaryheap. That way, by using newly
added functions such as binaryheap_update_up() etc., both updating a
key and removing a node can be done in O(1) on an average and O(log n)
in worst case. This is known as the indexed binary heap. The caller
can specify to use the indexed binaryheap by passing indexed = true.

~

/to track of positions of each nodes/to track the position of each node/

~~~

2.
There is no user of it but it will be used by a upcoming patch.

~

The current code does not use the new indexing logic, but it will be
used by an upcoming patch.

==
src/common/binaryheap.c

3.
+/*
+ * Define parameters for hash table code generation. The interface is *also*"
+ * declared in binaryheaph.h (to generate the types, which are externally
+ * visible).
+ */

Typo: *also*"

~~~

4.
+#define SH_PREFIX bh_nodeidx
+#define SH_ELEMENT_TYPE bh_nodeidx_entry
+#define SH_KEY_TYPE bh_node_type
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+ hash_bytes((const unsigned char *) &key, sizeof(bh_node_type))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(bh_node_type)) == 0)
+#define SH_SCOPE extern
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DEFINE
+#include "lib/simplehash.h"

4a.
The comment in simplehash.h says
 *   The following parameters are only relevant when SH_DEFINE is defined:
 *   - SH_KEY - ...
 *   - SH_EQUAL(table, a, b) - ...
 *   - SH_HASH_KEY(table, key) - ...
 *   - SH_STORE_HASH - ...
 *   - SH_GET_HASH(tb, a) - ...

So maybe it is nicer to reorder the #defines in that same order?

SUGGESTION:
+#define SH_PREFIX bh_nodeidx
+#define SH_ELEMENT_TYPE bh_nodeidx_entry
+#define SH_KEY_TYPE bh_node_type
+#define SH_SCOPE extern
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif

+#define SH_DEFINE
+#define SH_KEY key
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(bh_node_type)) == 0)
+#define SH_HASH_KEY(tb, key) \
+ hash_bytes((const unsigned char *) &key, sizeof(bh_node_type))
+#include "lib/simplehash.h"

~~

4b.
The comment in simplehash.h says that "it's preferable, if possible,
to store the element's hash in the element's data type", so should
SH_STORE_HASH and SH_GET_HASH also be defined here?

~~~

5.
+ *
+ * If 'indexed' is true, we create a hash table to track of each node's
+ * index in the heap, enabling to perform some operations such as removing
+ * the node from the heap.
  */
 binaryheap *
-binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
+binaryheap_allocate(int capacity, binaryheap_comparator compare,
+ bool indexed, void *arg)

BEFORE
... enabling to perform some operations such as removing the node from the heap.

SUGGESTION
... to help make operations such as removing nodes more efficient.

~~~

6.
+ heap->bh_indexed = indexed;
+ if (heap->bh_indexed)
+ {
+#ifdef FRONTEND
+ heap->bh_nodeidx = bh_nodeidx_create(capacity, NULL);
+#else
+ heap->bh_nodeidx = bh_nodeidx_create(CurrentMemoryContext, capacity,
+ NULL);
+#endif
+ }
+

The heap allocation just uses palloc instead of palloc0 so it might be
better to assign "heap->bh_nodeidx = NULL;" up-front, just so you will
never get a situation where bh_indexed is false but bh_nodeidx has
some (garbage) value.

~~~

7.
+/*
+ * Set the given node at the 'index', and updates its position accordingly.
+ *
+ * Return true if the node's index is already tracked.
+ */
+static bool
+bh_set_node(binaryheap *heap, bh_node_type node, int index)

7a.
I felt the 1st sentence should be more like:

SUGGESTION
Set the given node at the 'index' and track it if required.

~

7b.
IMO the parameters would be better the other way around (e.g. 'index'
before the 'node') because that's what the assignments look like:


heap->bh_nodes[heap->bh_size] = d;

becomes:
bh_set_node(heap, heap->bh_size, d);

~~~

8.
+static bool
+bh_set_node(binaryheap *heap, bh_node_type node, int index)
+{
+ bh_nodeidx_entry *ent;
+ bool found = false;
+
+ /* Set the node to the nodes array */
+ heap->bh_nodes[index] = node;
+
+ if (heap->bh_indexed)
+ {
+ /* Remember its index in the nodes array */
+ ent = bh_nodeidx_insert(heap->bh_nodeidx, node, &found);
+ ent->idx = index;
+ }
+
+ return found;
+}

8a.
That 'ent' declaration can be moved to the inner block scope, so it is
closer to where it is needed.

~

8b.
+ /* Remember its index in the nodes array */

The comment is worded a bit ambiguously. IMO a simpler comment would
be: "/* Keep track of the node index. */"

~~~

9.
+static void
+bh_delete_nodeidx(binaryheap *heap, bh_node_type node)
+{
+ if (!heap->bh_indexed)
+ return;
+
+ (void) bh_nodeidx_delete(heap->bh_nodeidx, node);
+}

Since there is only 1 statement IMO it is simpler to write this
function like below:

if (heap->bh_indexed)
  (void) bh_nodeidx_delete(heap->bh_nodeidx, node);

~~~

10.
+/*
+ * Replace the node at 'idx' with the 

Re: Improve eviction algorithm in ReorderBuffer

2024-03-04 Thread vignesh C
On Wed, 28 Feb 2024 at 11:40, Amit Kapila  wrote:
>
> On Mon, Feb 26, 2024 at 7:54 PM Masahiko Sawada  wrote:
> >
>
> A few comments on 0003:
> ===
> 1.
> +/*
> + * Threshold of the total number of top-level and sub transactions
> that controls
> + * whether we switch the memory track state. While the MAINTAIN_HEAP state is
> + * effective when there are many transactions being decoded, in many systems
> + * there is generally no need to use it as long as all transactions
> being decoded
> + * are top-level transactions. Therefore, we use MaxConnections as
> the threshold
> + * so we can prevent switch to the state unless we use subtransactions.
> + */
> +#define REORDER_BUFFER_MEM_TRACK_THRESHOLD MaxConnections
>
> The comment seems to imply that MAINTAIN_HEAP is useful for large
> number of transactions but ReorderBufferLargestTXN() switches to this
> state even when there is one transaction. So, basically we use the
> binary_heap technique to get the largest even when we have one
> transaction but we don't maintain that heap unless we have
> REORDER_BUFFER_MEM_TRACK_THRESHOLD number of transactions are
> in-progress. This means there is some additional work when (build and
> reset heap each time when we pick largest xact) we have fewer
> transactions in the system but that may not be impacting us because of
> other costs involved like serializing all the changes. I think once we
> can try to stress test this by setting
> debug_logical_replication_streaming to 'immediate' to see if the new
> mechanism has any overhead.

I ran the test with a transaction having many inserts:

      | 5000  | 10000 | 20000 | 100000 | 1000000 | 10000000
------|-------|-------|-------|--------|---------|---------
Head  | 26.31 | 48.84 | 93.65 | 480.05 | 4808.29 | 47020.16
Patch | 26.35 | 50.8  | 97.99 | 484.8  | 4856.95 | 48108.89

The same test with debug_logical_replication_streaming = 'immediate'

      | 5000  | 10000  | 20000  | 100000  | 1000000  | 10000000
------|-------|--------|--------|---------|----------|----------
Head  | 59.29 | 115.84 | 227.21 | 1156.08 | 11367.42 | 113986.14
Patch | 62.45 | 120.48 | 240.56 | 1185.12 | 11855.37 | 119921.81

The execution time is in milliseconds. The column header indicates the
number of inserts in the transaction.
In this case I noticed that the test execution with patch was taking
slightly more time.

Regards,
Vignesh




Re: Improve eviction algorithm in ReorderBuffer

2024-03-04 Thread Peter Smith
Hi, Here are some review comments for v7-0001

1.
/*
 * binaryheap_free
 *
 * Releases memory used by the given binaryheap.
 */
void
binaryheap_free(binaryheap *heap)
{
pfree(heap);
}


Shouldn't the above function (not modified by the patch) also firstly
free the memory allocated for the heap->bh_nodes?

~~~

2.
+/*
+ * Make sure there is enough space for nodes.
+ */
+static void
+bh_enlarge_node_array(binaryheap *heap)
+{
+ heap->bh_space *= 2;
+ heap->bh_nodes = repalloc(heap->bh_nodes,
+   sizeof(bh_node_type) * heap->bh_space);
+}

Strictly speaking, this function doesn't really "Make sure" of
anything because the caller does the check whether we need more space.
All that happens here is allocating more space. Maybe this function
comment should say something like "Double the space allocated for
nodes."

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Improve eviction algorithm in ReorderBuffer

2024-02-28 Thread Masahiko Sawada
On Wed, Feb 28, 2024 at 3:10 PM Amit Kapila  wrote:
>
> On Mon, Feb 26, 2024 at 7:54 PM Masahiko Sawada  wrote:
> >
>

Thank you for the comments!

> A few comments on 0003:
> ===
> 1.
> +/*
> + * Threshold of the total number of top-level and sub transactions
> that controls
> + * whether we switch the memory track state. While the MAINTAIN_HEAP state is
> + * effective when there are many transactions being decoded, in many systems
> + * there is generally no need to use it as long as all transactions
> being decoded
> + * are top-level transactions. Therefore, we use MaxConnections as
> the threshold
> + * so we can prevent switch to the state unless we use subtransactions.
> + */
> +#define REORDER_BUFFER_MEM_TRACK_THRESHOLD MaxConnections
>
> The comment seems to imply that MAINTAIN_HEAP is useful for large
> number of transactions but ReorderBufferLargestTXN() switches to this
> state even when there is one transaction. So, basically we use the
> binary_heap technique to get the largest even when we have one
> transaction but we don't maintain that heap unless we have
> REORDER_BUFFER_MEM_TRACK_THRESHOLD number of transactions are
> in-progress.

Right.

> This means there is some additional work when (build and
> reset heap each time when we pick largest xact) we have fewer
> transactions in the system but that may not be impacting us because of
> other costs involved like serializing all the changes. I think once we
> can try to stress test this by setting
> debug_logical_replication_streaming to 'immediate' to see if the new
> mechanism has any overhead.

Agreed.

I've done performance tests that decodes 10k small transactions
(pgbench transactions) with debug_logical_replication_streaming =
'immediate':

master: 6263.022 ms
patched: 6403.873 ms

I don't see noticeable regressions.

>
> 2. Can we somehow measure the additional memory that will be consumed
> by each backend/walsender to maintain transactions? Because I think
> this also won't be accounted for logical_decoding_work_mem, so if this
> is large, there could be a chance of more complaints on us for not
> honoring logical_decoding_work_mem.

Good point.

We initialize the binaryheap with MaxConnections * 2 entries and the
binaryheap entries are pointers. So we use additional (8 * 100 * 2)
bytes with the default max_connections setting even when there is one
transaction, and could use more memory when adding more transactions.

I think there is still room for considering how to determine the
threshold and the number of initial entries. Using MaxConnections
seems to work but it always uses the current MaxConnections value
instead of the value that was set at the time the WAL records were
written. As for the initial number of entries in binaryheap, I think
we can use the threshold value as the initial number of entries instead of
(threshold * 2). Or we might want to use the same value, 1000, as the
one we use for buffer->by_txn hash table.

>
> 3.
> @@ -3707,11 +3817,14 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb,
> ReorderBufferTXN *txn)
>
>   ReorderBufferSerializeChange(rb, txn, fd, change);
>   dlist_delete(&change->node);
> - ReorderBufferReturnChange(rb, change, true);
> + ReorderBufferReturnChange(rb, change, false);
>
>   spilled++;
>   }
>
> + /* Update the memory counter */
> + ReorderBufferChangeMemoryUpdate(rb, NULL, txn, false, txn->size);
>
> In ReorderBufferSerializeTXN(), we already use a size variable for
> txn->size, we can probably use that for the sake of consistency.

Agreed, will fix it.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Improve eviction algorithm in ReorderBuffer

2024-02-27 Thread Amit Kapila
On Mon, Feb 26, 2024 at 7:54 PM Masahiko Sawada  wrote:
>

A few comments on 0003:
===
1.
+/*
+ * Threshold of the total number of top-level and sub transactions
that controls
+ * whether we switch the memory track state. While the MAINTAIN_HEAP state is
+ * effective when there are many transactions being decoded, in many systems
+ * there is generally no need to use it as long as all transactions
being decoded
+ * are top-level transactions. Therefore, we use MaxConnections as
the threshold
+ * so we can prevent switch to the state unless we use subtransactions.
+ */
+#define REORDER_BUFFER_MEM_TRACK_THRESHOLD MaxConnections

The comment seems to imply that MAINTAIN_HEAP is useful for a large
number of transactions, but ReorderBufferLargestTXN() switches to this
state even when there is one transaction. So, basically we use the
binary_heap technique to get the largest transaction even when we have
only one, but we don't maintain that heap unless
REORDER_BUFFER_MEM_TRACK_THRESHOLD transactions are in progress. This
means there is some additional work (building and resetting the heap
each time we pick the largest xact) when we have fewer transactions in
the system, but that may not be impacting us because of other costs
involved like serializing all the changes. I think we can try to stress
test this by setting debug_logical_replication_streaming to 'immediate'
to see if the new mechanism has any overhead.

2. Can we somehow measure the additional memory that will be consumed
by each backend/walsender to maintain transactions? Because I think
this also won't be accounted for logical_decoding_work_mem, so if this
is large, there could be a chance of more complaints on us for not
honoring logical_decoding_work_mem.

3.
@@ -3707,11 +3817,14 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn)

  ReorderBufferSerializeChange(rb, txn, fd, change);
  dlist_delete(&change->node);
- ReorderBufferReturnChange(rb, change, true);
+ ReorderBufferReturnChange(rb, change, false);

  spilled++;
  }

+ /* Update the memory counter */
+ ReorderBufferChangeMemoryUpdate(rb, NULL, txn, false, txn->size);

In ReorderBufferSerializeTXN(), we already use a size variable for
txn->size, we can probably use that for the sake of consistency.

-- 
With Regards,
Amit Kapila.




Re: Improve eviction algorithm in ReorderBuffer

2024-02-27 Thread vignesh C
On Mon, 26 Feb 2024 at 12:33, Masahiko Sawada  wrote:
>
> On Fri, Feb 23, 2024 at 6:24 PM vignesh C  wrote:
> >
> > On Fri, 9 Feb 2024 at 20:51, Masahiko Sawada  wrote:
> > >
> > >
> > > I think this performance regression is not acceptable. In this
> > > workload, one transaction has 10k subtransactions and the logical
> > > decoding becomes quite slow if logical_decoding_work_mem is not big
> > > enough. Therefore, it's a legitimate and common approach to increase
> > > logical_decoding_work_mem to speed up the decoding. However, with this
> > > patch, the decoding becomes slower than today. It's a bad idea in
> > > general to optimize an extreme case while sacrificing the normal (or
> > > more common) cases.
> > >
> >
> > Since this same function is used by pg_dump sorting TopoSort functions
> > also, we can just verify once if there is no performance impact with
> > large number of objects during dump sorting:
>
> Okay. I've run the pg_dump regression tests with --timer flag (note
> that pg_dump doesn't use indexed binary heap):
>
> master:
> [16:00:25] t/001_basic.pl  ok  151 ms ( 0.00 usr
> 0.00 sys +  0.09 cusr  0.06 csys =  0.15 CPU)
> [16:00:25] t/002_pg_dump.pl .. ok10157 ms ( 0.23 usr
> 0.01 sys +  1.48 cusr  0.37 csys =  2.09 CPU)
> [16:00:36] t/003_pg_dump_with_server.pl .. ok  504 ms ( 0.00 usr
> 0.01 sys +  0.10 cusr  0.07 csys =  0.18 CPU)
> [16:00:36] t/004_pg_dump_parallel.pl . ok 1044 ms ( 0.00 usr
> 0.00 sys +  0.12 cusr  0.08 csys =  0.20 CPU)
> [16:00:37] t/005_pg_dump_filterfile.pl ... ok 2390 ms ( 0.00 usr
> 0.00 sys +  0.34 cusr  0.19 csys =  0.53 CPU)
> [16:00:40] t/010_dump_connstr.pl . ok 4813 ms ( 0.01 usr
> 0.00 sys +  2.13 cusr  0.45 csys =  2.59 CPU)
>
> patched:
> [15:59:47] t/001_basic.pl  ok  150 ms ( 0.00 usr
> 0.00 sys +  0.08 cusr  0.07 csys =  0.15 CPU)
> [15:59:47] t/002_pg_dump.pl .. ok10057 ms ( 0.23 usr
> 0.02 sys +  1.49 cusr  0.36 csys =  2.10 CPU)
> [15:59:57] t/003_pg_dump_with_server.pl .. ok  509 ms ( 0.00 usr
> 0.00 sys +  0.09 cusr  0.08 csys =  0.17 CPU)
> [15:59:58] t/004_pg_dump_parallel.pl . ok 1048 ms ( 0.01 usr
> 0.00 sys +  0.11 cusr  0.11 csys =  0.23 CPU)
> [15:59:59] t/005_pg_dump_filterfile.pl ... ok 2398 ms ( 0.00 usr
> 0.00 sys +  0.34 cusr  0.20 csys =  0.54 CPU)
> [16:00:01] t/010_dump_connstr.pl . ok 4762 ms ( 0.01 usr
> 0.00 sys +  2.15 cusr  0.42 csys =  2.58 CPU)
>
> There is no noticeable difference between the two results.

Thanks for verifying it, I have also run in my environment and found
no noticeable difference between them:
Head:
[07:29:41] t/001_basic.pl  ok  332 ms
[07:29:41] t/002_pg_dump.pl .. ok11029 ms
[07:29:52] t/003_pg_dump_with_server.pl .. ok  705 ms
[07:29:53] t/004_pg_dump_parallel.pl . ok 1198 ms
[07:29:54] t/005_pg_dump_filterfile.pl ... ok 2822 ms
[07:29:57] t/010_dump_connstr.pl . ok 5582 ms

With Patch:
[07:42:16] t/001_basic.pl  ok  328 ms
[07:42:17] t/002_pg_dump.pl .. ok11044 ms
[07:42:28] t/003_pg_dump_with_server.pl .. ok  719 ms
[07:42:29] t/004_pg_dump_parallel.pl . ok 1188 ms
[07:42:30] t/005_pg_dump_filterfile.pl ... ok 2816 ms
[07:42:33] t/010_dump_connstr.pl . ok 5609 ms

Regards,
Vignesh




Re: Improve eviction algorithm in ReorderBuffer

2024-02-26 Thread Masahiko Sawada
> >> some of the data. I suspect we can easily get into a situation where we
> >> evict the largest transaction, but that doesn't actually reduce the
> >> memory usage at all, because the memory context blocks are shared with
> >> some other transactions and don't get 100% empty (so we can't release
> >> them). But it's actually worse, because GenerationContext does not even
> >> reuse this memory. So do we even gain anything by the eviction?
> >>
> >> When the earlier patch versions also considered age of the transaction,
> >> to try evicting the older ones first, I think that was interesting. I
> >> think we may want to do something like this even with the binary heap.
> >
> > Thank you for raising this issue. This is one of the highest priority
> > items in my backlog. We've seen cases where the logical decoding uses
> > much more memory than logical_decoding_work_mem value[1][2] (e.g. it
> > used 4GB memory even though the logical_decoding_work_mem was 256kB).
> > I think that the problem would still happen even with this improvement
> > on the eviction.
> >
> > I believe these are separate problems we can address, and evicting
> > large transactions first would still be the right strategy. We might
> > want to improve how we store changes in memory contexts. For example,
> > it might be worth having per-transaction memory context so that we can
> > actually free memory blocks by the eviction. We can discuss it in a
> > separate thread.
> >
>
> Yes, I think using per-transaction context for large transactions might
> work. I don't think we want too many contexts, so we'd start with the
> shared context, and then at some point (when the transaction exceeds say
> 5% of the memory limit) we'd switch it to a separate one.
>
> But that's a matter for a separate patch, so let's discuss elsewhere.

+1

>
> >>
> >> For example, a system may be doing a lot of eviction / spilling with
> >> logical_decoding_work_mem=64MB, but setting 128MB may completely
> >> eliminate that. Of course, if there are large transactions, this may not
> >> be possible (the GUC would have to exceed RAM). But I don't think that's
> >> very common, the incidents that I've observed were often resolved by
> >> bumping the logical_decoding_work_mem by a little bit.
> >>
> >> I wonder if there's something we might do to help users to tune this. We
> >> should be able to measure the "peak" memory usage (how much memory we'd
> >> need to not spill), so maybe we could log that as a WARNING, similarly
> >> to checkpoints - there we only log "checkpoints too frequent, tune WAL
> >> limits", but perhaps we might do more here?  Or maybe we could add the
> >> watermark to the system catalog?
> >
> > Interesting ideas.
> >
> > The statistics such as spill_count shown  in pg_stat_replication_slots
> > view could already give hints to users to increase the
> > logical_decoding_work_mem. In addition to that, it's an interesting
> > idea to have the high water mark in the view.
> >
>
> The spill statistics are useful, but I'm not sure it can answer the main
> question:
>
> How high would the memory limit need to be to not spill?
>
> Maybe there's something we can measure / log to help with this.

Right. I like the idea of the high watermark. The
pg_stat_replication_slots would be the place to store such
information. Since the reorder buffer evicts or streams transactions
anyway based on logical_decoding_work_mem, we would probably need to
compute the maximum amount of data the reorder buffer would hold at one
point in time assuming no transactions were evicted or streamed.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v7-0003-Improve-eviction-algorithm-in-Reorderbuffer-using.patch
Description: Binary data


v7-0001-Make-binaryheap-enlargeable.patch
Description: Binary data


v7-0002-Add-functions-to-binaryheap-for-efficient-key-rem.patch
Description: Binary data


Re: Improve eviction algorithm in ReorderBuffer

2024-02-26 Thread Tomas Vondra



On 2/26/24 07:46, Masahiko Sawada wrote:
> On Sat, Feb 24, 2024 at 1:29 AM Tomas Vondra
>  wrote:
>>...
>>
>> overall design
>> --
>>
>> As for the design, I agree with the approach of using a binaryheap to
>> track transactions by size. When going over the thread history,
>> describing the initial approach with only keeping "large" transactions
>> above some threshold (e.g. 10%), I was really concerned that'll either
>> lead to abrupt changes in behavior (when transactions move just around
>> the 10%), or won't help with many common cases (with most transactions
>> being below the limit).
>>
>> I was going to suggest some sort of "binning" - keeping lists for
>> transactions of similar size (e.g. <1kB, 1-2kB, 2-4kB, 4-8kB, ...) and
>> evicting transactions from a list, i.e. based on approximate size. But
>> if the indexed binary heap seems to be cheap enough, I think it's a
>> better solution.
> 
> I've also considered the binning idea. But it was not clear to me how
> it works well in a case where all transactions belong to the
> particular class. For example, if we need to free up 1MB memory, we
> could end up evicting 2000 transactions consuming 50 bytes instead of
> > 100 transactions consuming 1000 bytes each, with the result that we end up
> serializing more transactions. Also, I'm concerned about the cost of
> maintaining the binning lists.
> 

I don't think the list maintenance would be very costly - in particular,
the lists would not need to be sorted by size. You're right in some
extreme cases we might evict the smallest transactions in the list. I
think on average we'd evict transactions with average size, which seems
OK for this use case.

Anyway, I don't think we need to be distracted with this. I mentioned it
merely to show it was considered, but the heap seems to work well
enough, and in the end is even simpler because the complexity is hidden
outside reorderbuffer.

>>
>> The one thing I'm a bit concerned about is the threshold used to start
>> using binary heap - these thresholds with binary decisions may easily
>> lead to a "cliff" and robustness issues, i.e. abrupt change in behavior
>> with significant runtime change (e.g. you add/remove one transaction and
>> the code takes a much more expensive path). The value (1024) seems
>> rather arbitrary, I wonder if there's something to justify that choice.
> 
> True. 1024 seems small to me. In my environment, I started to see a
> big difference from around 4 transactions. But it varies depending
> on environments and workloads.
> 
> I think that this performance problem we're addressing doesn't
> normally happen as long as all transactions being decoded are
> top-level transactions. Otherwise, we also need to improve
> ReorderBufferLargestStreamableTopTXN(). Given this fact, I think
> max_connections = 1024 is a possible value in some systems, and I've
> observed such systems sometimes. On the other hand, I've observed >
> 5000 in just a few cases, and having more than 5000 transactions in
> ReorderBuffer seems unlikely to happen without subtransactions. I
> think we can say it's an extreme case, the number is still an
> arbitrary number though.
> 
> Or probably we can compute the threshold based on max_connections,
> e.g., max_connections * 10. That way, we can ensure that users won't
> incur the max-heap maintenance costs as long as they don't use
> subtransactions.
> 

Tying this to max_connections seems like an interesting option. It'd
make this adaptive to a system. I haven't thought about the exact value
(m_c * 10), but it seems better than arbitrary hard-coded values.

>>
>> In any case, I agree it'd be good to have some dampening factor, to
>> reduce the risk of thrashing because of adding/removing a single
>> transaction to the decoding.
>>
>>
>> related stuff / GenerationContext
>> -
>>
>> It's not the fault of this patch, but this reminds me I have some doubts
>> about how the eviction interferes with using the GenerationContext for
>> some of the data. I suspect we can easily get into a situation where we
>> evict the largest transaction, but that doesn't actually reduce the
>> memory usage at all, because the memory context blocks are shared with
>> some other transactions and don't get 100% empty (so we can't release
>> them). But it's actually worse, because GenerationContext does not even
>> reuse this memory. So do we even gain anything by the eviction?
>>
>> When the earlier patch versions also considered age of the transaction,
>> to try evicting the older ones first, I think that was interesting. I
>> think we may want to do something like this even with the binary heap.
> 
> Thank you for raising this issue. This is one of the highest priority
> items in my backlog. We've seen cases where the logical decoding uses
> much more memory than logical_decoding_work_mem value[1][2] (e.g. it
> used 4GB memory even though the logical_decoding_work_mem was 256kB).
> I think that the problem 

Re: Improve eviction algorithm in ReorderBuffer

2024-02-25 Thread Masahiko Sawada
On Fri, Feb 23, 2024 at 6:24 PM vignesh C  wrote:
>
> On Fri, 9 Feb 2024 at 20:51, Masahiko Sawada  wrote:
> >
> >
> > I think this performance regression is not acceptable. In this
> > workload, one transaction has 10k subtransactions and the logical
> > decoding becomes quite slow if logical_decoding_work_mem is not big
> > enough. Therefore, it's a legitimate and common approach to increase
> > logical_decoding_work_mem to speed up the decoding. However, with this
> > patch, the decoding becomes slower than today. It's a bad idea in
> > general to optimize an extreme case while sacrificing the normal (or
> > more common) cases.
> >
>
> Since this same function is used by pg_dump sorting TopoSort functions
> also, we can just verify once if there is no performance impact with
> large number of objects during dump sorting:

Okay. I've run the pg_dump regression tests with --timer flag (note
that pg_dump doesn't use indexed binary heap):

master:
[16:00:25] t/001_basic.pl  ok  151 ms ( 0.00 usr
0.00 sys +  0.09 cusr  0.06 csys =  0.15 CPU)
[16:00:25] t/002_pg_dump.pl .. ok10157 ms ( 0.23 usr
0.01 sys +  1.48 cusr  0.37 csys =  2.09 CPU)
[16:00:36] t/003_pg_dump_with_server.pl .. ok  504 ms ( 0.00 usr
0.01 sys +  0.10 cusr  0.07 csys =  0.18 CPU)
[16:00:36] t/004_pg_dump_parallel.pl . ok 1044 ms ( 0.00 usr
0.00 sys +  0.12 cusr  0.08 csys =  0.20 CPU)
[16:00:37] t/005_pg_dump_filterfile.pl ... ok 2390 ms ( 0.00 usr
0.00 sys +  0.34 cusr  0.19 csys =  0.53 CPU)
[16:00:40] t/010_dump_connstr.pl . ok 4813 ms ( 0.01 usr
0.00 sys +  2.13 cusr  0.45 csys =  2.59 CPU)

patched:
[15:59:47] t/001_basic.pl  ok  150 ms ( 0.00 usr
0.00 sys +  0.08 cusr  0.07 csys =  0.15 CPU)
[15:59:47] t/002_pg_dump.pl .. ok10057 ms ( 0.23 usr
0.02 sys +  1.49 cusr  0.36 csys =  2.10 CPU)
[15:59:57] t/003_pg_dump_with_server.pl .. ok  509 ms ( 0.00 usr
0.00 sys +  0.09 cusr  0.08 csys =  0.17 CPU)
[15:59:58] t/004_pg_dump_parallel.pl . ok 1048 ms ( 0.01 usr
0.00 sys +  0.11 cusr  0.11 csys =  0.23 CPU)
[15:59:59] t/005_pg_dump_filterfile.pl ... ok 2398 ms ( 0.00 usr
0.00 sys +  0.34 cusr  0.20 csys =  0.54 CPU)
[16:00:01] t/010_dump_connstr.pl . ok 4762 ms ( 0.01 usr
0.00 sys +  2.15 cusr  0.42 csys =  2.58 CPU)

There is no noticeable difference between the two results.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Improve eviction algorithm in ReorderBuffer

2024-02-25 Thread Masahiko Sawada
On Sat, Feb 24, 2024 at 1:29 AM Tomas Vondra
 wrote:
>
> Hi,
>
> I did a basic review and testing of this patch today. Overall I think
> the patch is in very good shape - I agree with the tradeoffs it makes,
> and I like the approach in general. I do have a couple minor comments
> about the code, and then maybe a couple thoughts about the approach.

Thank you for the review comments and tests!

>
>
> First, some comments - I'll put them here, but I also kept them in
> "review" commits, because that makes it easier to show the exact place
> in the code the comment is about.
>
> 1) binaryheap_allocate got a new "indexed" argument, but the comment is
> not updated to document it

Fixed.

>
> 2) I think it's preferable to use descriptive argument names for
> bh_set_node. I don't think there's a good reason to keep it short.

Agreed.

>
> 3) In a couple places we have code like this:
>
> if (heap->bh_indexed)
> bh_nodeidx_delete(heap->bh_nodeidx, result);
>
> Maybe it'd be better to have the if condition in bh_nodeidx_delete, so
> that it can be called without it.

Fixed.

>
> 4) Could we check the "found" flag in bh_set_node, somehow? I mean, we
> either expect to find the node (update of already tracked transaction)
> or not (when inserting it). The life cycle may be non-trivial (node
> added, updated and removed, ...), would be useful assert I think.

Agreed.

>
> 5) Do we actually need the various mem_freed local variables in a couple
> places, when we expect the value to be equal to txn->size (there's even
> assert enforcing that)?

You're right.

>
> 6) ReorderBufferCleanupTXN has a comment about maybe not using the same
> threshold both to enable & disable usage of the binaryheap. I agree with
> that, otherwise we could easily end up "thrashing" if we add/remove
> transactions right around the threshold. I think 90-95% for disabling
> the heap would work fine.

Agreed.

>
> 7) The code disabling binaryheap (based on the threshold) is copied in a
> couple places, perhaps it should be a separate function called from
> those places.

Fixed.

>
> 8) Similarly to (3), maybe ReorderBufferTXNMemoryUpdate should do the
> memory size check internally, to make the calls simpler.

Agreed.

>
> 9) The ReorderBufferChangeMemoryUpdate / ReorderBufferTXNMemoryUpdate
> split maybe not very clear. It's not clear to me why it's divided like
> this, or why we can't simply call ReorderBufferTXNMemoryUpdate directly.

I think that now we have two use cases: updating the memory counter
after freeing an individual change, and updating the memory counter
after freeing all changes of a transaction (i.e., setting the counter
to 0). In the former case, we need to check whether the change is
REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID, but we don't need to pass the
transaction since the change has a reference to its transaction. In the
latter case, we don't need the change but do need to pass the
transaction. If we did both things in one function, the function would
have two arguments, change and txn, and the callers would set whichever
one they know. I've updated the patch accordingly.

BTW it might be worth considering to create a separate patch for the
updates around ReorderBufferChangeMemoryUpdate() that batches the
memory counter updates as it seems an independent change from the
max-heap stuff.

>
> performance
> ---
>
> I did some benchmarks, to see the behavior in simple good/bad cases (see
> the attached scripts.tgz). "large" is one large transaction inserting 1M
> rows, small is 64k single-row inserts, and subxacts is the original case
> with ~100k subxacts. Finally, subxacts-small is many transactions with
> 128 subxacts each (the main transactions are concurrent).
>
> The results are pretty good, I think:
>
>        test       master   patched
>   -------------------------------------
>       large         2587      2459    95%
>       small          956       856    89%
>    subxacts         1389      1529   112%
>    subxacts-small  13632     13187    97%

Thank you for doing the performance test. I ran the same script you
shared on my machine just in case and got similar results:

                 master   patched
large:             2831      2827
small:             1226      1222
subxacts:        134076      2744
subxacts-small:   23384     23127

In my case, the differences seem to be within a noise range.

>
> This is timing (ms) with logical_work_mem=4MB. I also tried with 64MB,
> where the subxact timing goes way down, but the overall conclusions do
> not change.
>
> I was a bit surprised I haven't seen any clear regression, but in the
> end that's a good thing, right? There's a couple results in this thread
> showing ~10% regression, but I've been unable to reproduce those.
> Perhaps the newer patch versions fix that, I guess.

Yes, the 10% regression is fixed in the v4 patch. We don't update the
max-heap at all until the number of transactions reaches the threshold
so 

Re: Improve eviction algorithm in ReorderBuffer

2024-02-23 Thread Tomas Vondra
Hi,

I did a basic review and testing of this patch today. Overall I think
the patch is in very good shape - I agree with the tradeoffs it makes,
and I like the approach in general. I do have a couple minor comments
about the code, and then maybe a couple thoughts about the approach.


First, some comments - I'll put them here, but I also kept them in
"review" commits, because that makes it easier to show the exact place
in the code the comment is about.

1) binaryheap_allocate got a new "indexed" argument, but the comment is
not updated to document it

2) I think it's preferable to use descriptive argument names for
bh_set_node. I don't think there's a good reason to keep it short.

3) In a couple places we have code like this:

if (heap->bh_indexed)
bh_nodeidx_delete(heap->bh_nodeidx, result);

Maybe it'd be better to have the if condition in bh_nodeidx_delete, so
that it can be called without it.

4) Could we check the "found" flag in bh_set_node, somehow? I mean, we
either expect to find the node (update of already tracked transaction)
or not (when inserting it). The life cycle may be non-trivial (node
added, updated and removed, ...), would be useful assert I think.

5) Do we actually need the various mem_freed local variables in a couple
places, when we expect the value to be equal to txn->size (there's even
assert enforcing that)?

6) ReorderBufferCleanupTXN has a comment about maybe not using the same
threshold both to enable & disable usage of the binaryheap. I agree with
that, otherwise we could easily end up "thrashing" if we add/remove
transactions right around the threshold. I think 90-95% for disabling
the heap would work fine.

7) The code disabling binaryheap (based on the threshold) is copied in a
couple places, perhaps it should be a separate function called from
those places.

8) Similarly to (3), maybe ReorderBufferTXNMemoryUpdate should do the
memory size check internally, to make the calls simpler.

9) The ReorderBufferChangeMemoryUpdate / ReorderBufferTXNMemoryUpdate
split maybe not very clear. It's not clear to me why it's divided like
this, or why we can't simply call ReorderBufferTXNMemoryUpdate directly.


performance
---

I did some benchmarks, to see the behavior in simple good/bad cases (see
the attached scripts.tgz). "large" is one large transaction inserting 1M
rows, small is 64k single-row inserts, and subxacts is the original case
with ~100k subxacts. Finally, subxacts-small is many transactions with
128 subxacts each (the main transactions are concurrent).

The results are pretty good, I think:

       test       master   patched
  -------------------------------------
      large         2587      2459    95%
      small          956       856    89%
   subxacts         1389      1529   112%
   subxacts-small  13632     13187    97%

This is timing (ms) with logical_work_mem=4MB. I also tried with 64MB,
where the subxact timing goes way down, but the overall conclusions do
not change.

I was a bit surprised I haven't seen any clear regression, but in the
end that's a good thing, right? There's a couple results in this thread
showing ~10% regression, but I've been unable to reproduce those.
Perhaps the newer patch versions fix that, I guess.

Anyway, I think that at some point we'd have to accept that some cases
may have slight regression. I think that's inherent for almost any
heuristics - there's always going to be some rare case that defeats it.
What's important is that the case needs to be rare and/or the impact
very limited. And I think that's true here.


overall design
--

As for the design, I agree with the approach of using a binaryheap to
track transactions by size. When going over the thread history,
describing the initial approach with only keeping "large" transactions
above some threshold (e.g. 10%), I was really concerned that'll either
lead to abrupt changes in behavior (when transactions move just around
the 10%), or won't help with many common cases (with most transactions
being below the limit).

I was going to suggest some sort of "binning" - keeping lists for
transactions of similar size (e.g. <1kB, 1-2kB, 2-4kB, 4-8kB, ...) and
evicting transactions from a list, i.e. based on approximate size. But
if the indexed binary heap seems to be cheap enough, I think it's a
better solution.

The one thing I'm a bit concerned about is the threshold used to start
using binary heap - these thresholds with binary decisions may easily
lead to a "cliff" and robustness issues, i.e. abrupt change in behavior
with significant runtime change (e.g. you add/remove one transaction and
the code takes a much more expensive path). The value (1024) seems
rather arbitrary, I wonder if there's something to justify that choice.

In any case, I agree it'd be good to have some dampening factor, to
reduce the risk of thrashing because of adding/removing a single
transaction to the 

Re: Improve eviction algorithm in ReorderBuffer

2024-02-23 Thread vignesh C
On Fri, 9 Feb 2024 at 20:51, Masahiko Sawada  wrote:
>
> On Thu, Feb 8, 2024 at 6:33 PM Hayato Kuroda (Fujitsu)
>  wrote:
> >
> > Dear Sawada-san,
> >
> > Thanks for making v3 patchset. I have also benchmarked the case [1].
> > Below results are the average of 5th, there are almost the same result
> > even when median is used for the comparison. On my env, the regression
> > cannot be seen.
> >
> > HEAD (1e285a5)  HEAD + v3 patches   difference
> > 10910.722 ms10714.540 msaround 1.8%
> >
>
> Thank you for doing the performance test!
>
> > Also, here are mino comments for v3 set.
> >
> > 01.
> > bh_nodeidx_entry and ReorderBufferMemTrackState is missing in typedefs.list.
>
> Will add them.
>
> >
> > 02. ReorderBufferTXNSizeCompare
> > Should we assert {ta, tb} are not NULL?
>
> Not sure we really need it as other binaryheap users don't have such checks.
>
> On Tue, Feb 6, 2024 at 2:45 PM Hayato Kuroda (Fujitsu)
>  wrote:
> >
> > > I've run a benchmark test that I shared before[1]. Here are results of
> > > decoding a transaction that has 1M subtransaction each of which has 1
> > > INSERT:
> > >
> > > HEAD:
> > > 1810.192 ms
> > >
> > > HEAD w/ patch:
> > > 2001.094 ms
> > >
> > > I set a large enough value to logical_decoding_work_mem not to evict
> > > any transactions. I can see about 10% performance regression in
> > > this case.
> >
> > Thanks for running. I think this workload is the worst and an extreme case 
> > which
> > would not occur on a real system (such a system should be fixed), 
> > so we
> > can say that the regression is up to -10%. I felt it could be negligible 
> > but how
> > do other think?
>
> I think this performance regression is not acceptable. In this
> workload, one transaction has 10k subtransactions and the logical
> decoding becomes quite slow if logical_decoding_work_mem is not big
> enough. Therefore, it's a legitimate and common approach to increase
> logical_decoding_work_mem to speed up the decoding. However, with this
> patch, the decoding becomes slower than today. It's a bad idea in
> general to optimize an extreme case while sacrificing the normal (or
> more common) cases.
>

Since this same function is used by pg_dump sorting TopoSort functions
also, we can just verify once if there is no performance impact with
large number of objects during dump sorting:
 binaryheap *
 binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
 {
-   int sz;
binaryheap *heap;

-   sz = offsetof(binaryheap, bh_nodes) + sizeof(bh_node_type) * capacity;
-   heap = (binaryheap *) palloc(sz);
+   heap = (binaryheap *) palloc(sizeof(binaryheap));
heap->bh_space = capacity;
heap->bh_compare = compare;
heap->bh_arg = arg;

heap->bh_size = 0;
heap->bh_has_heap_property = true;
+   heap->bh_nodes = (bh_node_type *) palloc(sizeof(bh_node_type)
* capacity);

Regards,
Vignesh




Re: Improve eviction algorithm in ReorderBuffer

2024-02-14 Thread Ajin Cherian
On Sat, Feb 10, 2024 at 2:23 AM Masahiko Sawada 
wrote:

> On Fri, Feb 9, 2024 at 7:35 PM Ajin Cherian  wrote:
> >
> >
> >
> > On Tue, Feb 6, 2024 at 5:06 PM Masahiko Sawada 
> wrote:
> >>
> >>
> >> I've attached the new version patch set.
> >>
> >> Regards,
> >>
> >>
> >> --
> >> Masahiko Sawada
> >> Amazon Web Services: https://aws.amazon.com
> >
> >
> > Thanks for the patch. I reviewed that patch and did minimal testing and
> it seems to show the speed up as claimed. Some minor comments:
>
> Thank you for the comments!
>
> > patch 0001:
> >
> > +static void
> > +bh_enlarge_node_array(binaryheap *heap)
> > +{
> > + if (heap->bh_size < heap->bh_space)
> > + return;
> >
> > why not check "if (heap->bh_size >= heap->bh_space)" outside this
> function to avoid calling this function when not necessary? This check was
> there in code before the patch.
>
> Agreed.
>
> >
> > patch 0003:
> >
> > +/*
> > + * The threshold of the number of transactions in the max-heap
> (rb->txn_heap)
> > + * to switch the state.
> > + */
> > +#define REORDE_BUFFER_MEM_TRACK_THRESHOLD 1024
> >
> > Typo: I think you meant REORDER_ and not REORDE_
>
> Fixed.
>
> These comments are addressed in the v4 patch set I just shared[1].
>
>
> These changes look good to me. I've done some tests with a few varying
levels of subtransaction and I could see that the patch was at least 5%
better in all of them.

regards,
Ajin Cherian
Fujitsu Australia


Re: Improve eviction algorithm in ReorderBuffer

2024-02-09 Thread Masahiko Sawada
On Fri, Feb 9, 2024 at 7:35 PM Ajin Cherian  wrote:
>
>
>
> On Tue, Feb 6, 2024 at 5:06 PM Masahiko Sawada  wrote:
>>
>>
>> I've attached the new version patch set.
>>
>> Regards,
>>
>>
>> --
>> Masahiko Sawada
>> Amazon Web Services: https://aws.amazon.com
>
>
> Thanks for the patch. I reviewed that patch and did minimal testing and it 
> seems to show the speed up as claimed. Some minor comments:

Thank you for the comments!

> patch 0001:
>
> +static void
> +bh_enlarge_node_array(binaryheap *heap)
> +{
> + if (heap->bh_size < heap->bh_space)
> + return;
>
> why not check "if (heap->bh_size >= heap->bh_space)" outside this function to 
> avoid calling this function when not necessary? This check was there in code 
> before the patch.

Agreed.

>
> patch 0003:
>
> +/*
> + * The threshold of the number of transactions in the max-heap (rb->txn_heap)
> + * to switch the state.
> + */
> +#define REORDE_BUFFER_MEM_TRACK_THRESHOLD 1024
>
> Typo: I think you meant REORDER_ and not REORDE_

Fixed.

These comments are addressed in the v4 patch set I just shared[1].

Regards,

[1] 
https://www.postgresql.org/message-id/CAD21AoDhuybyryVkmVkgPY8uVrjGLYchL8EY8-rBm1hbZJpwLw%40mail.gmail.com

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Improve eviction algorithm in ReorderBuffer

2024-02-09 Thread Masahiko Sawada
On Thu, Feb 8, 2024 at 6:33 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Sawada-san,
>
> Thanks for making the v3 patchset. I have also benchmarked the case [1].
> The results below are the average of 5 runs; they are almost the same
> even when the median is used for the comparison. On my env, the
> regression cannot be seen.
>
> HEAD (1e285a5)    HEAD + v3 patches    difference
> 10910.722 ms      10714.540 ms         around 1.8%
>

Thank you for doing the performance test!

> Also, here are minor comments for the v3 set.
>
> 01.
> bh_nodeidx_entry and ReorderBufferMemTrackState is missing in typedefs.list.

Will add them.

>
> 02. ReorderBufferTXNSizeCompare
> Should we assert {ta, tb} are not NULL?

Not sure we really need it as other binaryheap users don't have such checks.

On Tue, Feb 6, 2024 at 2:45 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> > I've run a benchmark test that I shared before[1]. Here are results of
> > decoding a transaction that has 1M subtransactions, each of which has 1
> > INSERT:
> >
> > HEAD:
> > 1810.192 ms
> >
> > HEAD w/ patch:
> > 2001.094 ms
> >
> > I set a large enough value to logical_decoding_work_mem not to evict
> > any transactions. I can see about a 10% performance regression in
> > this case.
>
> Thanks for running. I think this workload is the worst and most extreme
> case, one that would not occur on a real system (such a system should be
> fixed), so we can say that the regression is up to -10%. I felt it could
> be negligible, but what do others think?

I think this performance regression is not acceptable. In this
workload, one transaction has 10k subtransactions and the logical
decoding becomes quite slow if logical_decoding_work_mem is not big
enough. Therefore, it's a legitimate and common approach to increase
logical_decoding_work_mem to speed up the decoding. However, with this
patch, the decoding becomes slower than today. It's a bad idea in
general to optimize an extreme case while sacrificing the normal (or
more common) cases.

Therefore, I've improved the algorithm so that we don't touch the
max-heap at all if the number of transactions is small enough. I've
run a benchmark test with two workloads:

workload-1, decode single transaction with 800k tuples (normal.sql):

* without spill
HEAD: 13235.136 ms
v3 patch: 14320.082 ms
v4 patch: 13300.665 ms

* with spill
HEAD: 22970.204 ms
v3 patch: 23625.649 ms
v4 patch: 23304.366 ms

workload-2, decode one transaction with 100k subtransaction (many-subtxn.sql):

* without spill
HEAD: 345.718 ms
v3 patch: 409.686 ms
v4 patch: 353.026 ms

* with spill
HEAD: 136718.313 ms
v3 patch: 2675.539 ms
v4 patch: 2734.981 ms
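The two-strategy idea can be sketched as below. This is only an illustration of the state machine, not the patch itself: all names and the tiny threshold are hypothetical, and a single O(n) scan stands in for the lazy heapify. The point is that while the number of transactions stays under the threshold, updating a memory counter is O(1) because no heap order is maintained; order only matters (and is established) when we actually need to pick a victim.

```c
#include <assert.h>
#include <stddef.h>

#define MEM_TRACK_THRESHOLD 4           /* illustrative; the patch uses 1024 */

typedef enum { NO_MAINTAIN_HEAP, MAINTAIN_HEAP } TrackState;

typedef struct
{
    size_t      sizes[64];              /* stand-in for the max-heap's nodes */
    int         ntxns;
    TrackState  state;
} MemTracker;

/* While in NO_MAINTAIN_HEAP, updating a counter is O(1): we only record
 * the new size and keep no heap order at all. */
static void
update_counter(MemTracker *t, int idx, size_t newsize)
{
    t->sizes[idx] = newsize;
    if (t->state == MAINTAIN_HEAP)
    {
        /* here the real patch would re-sift the node, O(log n) */
    }
}

/* Picking a victim establishes order lazily; an O(n) scan stands in for
 * heapify here.  Crossing the threshold flips to incremental maintenance. */
static int
pick_largest(MemTracker *t)
{
    int         best = 0;

    if (t->state == NO_MAINTAIN_HEAP && t->ntxns >= MEM_TRACK_THRESHOLD)
        t->state = MAINTAIN_HEAP;

    for (int i = 1; i < t->ntxns; i++)
        if (t->sizes[i] > t->sizes[best])
            best = i;
    return best;
}
```

This is why the regression disappears in workload-2 without spill: the per-change counter updates stay O(1) until eviction is actually needed.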

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v4-0001-Make-binaryheap-enlareable.patch
Description: Binary data


v4-0002-Add-functions-to-binaryheap-to-efficiently-remove.patch
Description: Binary data


v4-0003-Use-max-heap-to-evict-largest-transactions-in-Reo.patch
Description: Binary data


normal.sql
Description: Binary data


many-subtxn.sql
Description: Binary data


Re: Improve eviction algorithm in ReorderBuffer

2024-02-09 Thread Ajin Cherian
On Tue, Feb 6, 2024 at 5:06 PM Masahiko Sawada 
wrote:

>
> I've attached the new version patch set.
>
> Regards,
>
>
> --
> Masahiko Sawada
> Amazon Web Services: https://aws.amazon.com


Thanks for the patch. I reviewed that patch and did minimal testing and it
seems to show the speed up as claimed. Some minor comments:
patch 0001:

+static void
+bh_enlarge_node_array(binaryheap *heap)
+{
+ if (heap->bh_size < heap->bh_space)
+ return;

why not check "if (heap->bh_size >= heap->bh_space)" outside this function
to avoid calling this function when not necessary? This check was there in
code before the patch.

patch 0003:

+/*
+ * The threshold of the number of transactions in the max-heap
(rb->txn_heap)
+ * to switch the state.
+ */
+#define REORDE_BUFFER_MEM_TRACK_THRESHOLD 1024

Typo: I think you meant REORDER_ and not REORDE_

regards,
Ajin Cherian
Fujitsu Australia


RE: Improve eviction algorithm in ReorderBuffer

2024-02-08 Thread Hayato Kuroda (Fujitsu)
Dear Sawada-san,

Thanks for making the v3 patchset. I have also benchmarked the case [1].
The results below are the average of 5 runs; they are almost the same
even when the median is used for the comparison. On my env, the
regression cannot be seen.

HEAD (1e285a5)    HEAD + v3 patches    difference
10910.722 ms      10714.540 ms         around 1.8%

Also, here are minor comments for the v3 set.

01.
bh_nodeidx_entry and ReorderBufferMemTrackState is missing in typedefs.list.

02. ReorderBufferTXNSizeCompare
Should we assert {ta, tb} are not NULL?

[1]: 
https://www.postgresql.org/message-id/CAD21AoB-7mPpKnLmBNfzfavG8AiTwEgAdVMuv%3DjzmAp9ex7eyQ%40mail.gmail.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 




Re: Improve eviction algorithm in ReorderBuffer

2024-02-05 Thread Masahiko Sawada
On Fri, Feb 2, 2024 at 5:16 PM Masahiko Sawada  wrote:
>
> On Fri, Feb 2, 2024 at 1:59 PM Shubham Khanna
>  wrote:
> >
> > On Fri, Jan 26, 2024 at 2:07 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Wed, Dec 20, 2023 at 12:11 PM Amit Kapila  
> > > wrote:
> > > >
> > > > On Wed, Dec 20, 2023 at 6:49 AM Masahiko Sawada  
> > > > wrote:
> > > > >
> > > > > On Tue, Dec 19, 2023 at 8:02 PM Amit Kapila  
> > > > > wrote:
> > > > > >
> > > > > > On Tue, Dec 19, 2023 at 8:31 AM Masahiko Sawada 
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Sun, Dec 17, 2023 at 11:40 AM Amit Kapila 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > The individual transactions shouldn't cross
> > > > > > > > 'logical_decoding_work_mem'. I got a bit confused by your 
> > > > > > > > proposal to
> > > > > > > > maintain the lists: "...splitting it into two lists: 
> > > > > > > > transactions
> > > > > > > > consuming 5% < and 5% >=  of the memory limit, and checking the 
> > > > > > > > 5% >=
> > > > > > > > list preferably.". In the previous sentence, what did you mean 
> > > > > > > > by
> > > > > > > > transactions consuming 5% >= of the memory limit? I got the 
> > > > > > > > impression
> > > > > > > > that you are saying to maintain them in a separate transaction 
> > > > > > > > list
> > > > > > > > which doesn't seems to be the case.
> > > > > > >
> > > > > > > I wanted to mean that there are three lists in total: the first 
> > > > > > > one
> > > > > > > maintain the transactions consuming more than 10% of
> > > > > > > logical_decoding_work_mem,
> > > > > > >
> > > > > >
> > > > > > How can we have multiple transactions in the list consuming more 
> > > > > > than
> > > > > > 10% of logical_decoding_work_mem? Shouldn't we perform serialization
> > > > > > before any xact reaches logical_decoding_work_mem?
> > > > >
> > > > > Well, suppose logical_decoding_work_mem is set to 64MB, transactions
> > > > > consuming more than 6.4MB are added to the list. So for example, it's
> > > > > possible that the list has three transactions each of which are
> > > > > consuming 10MB while the total memory usage in the reorderbuffer is
> > > > > still 30MB (less than logical_decoding_work_mem).
> > > > >
> > > >
> > > > Thanks for the clarification. I misunderstood the list to have
> > > > transactions greater than 70.4 MB (64 + 6.4) in your example. But one
> > > > thing to note is that maintaining these lists by default can also have
> > > > some overhead unless the list of open transactions crosses a certain
> > > > threshold.
> > > >
> > >
> > > On further analysis, I realized that the approach discussed here might
> > > not be the way to go. The idea of dividing transactions into several
> > > subgroups is to divide a large number of entries into multiple
> > > sub-groups so we can reduce the complexity to search for the
> > > particular entry. Since we assume that there are no big differences in
> > > entries' sizes within a sub-group, we can pick the entry to evict in
> > > O(1). However, what we really need to avoid here is that we end up
> > > increasing the number of times to evict entries because serializing an
> > > entry to the disk is more costly than searching an entry on memory in
> > > general.
> > >
> > > I think that it's no problem in a large-entries subgroup but when it
> > > comes to the smallest-entries subgroup, like for entries consuming
> > > less than 5% of the limit, it could end up evicting many entries. For
> > > example, there would be a huge difference between serializing 1 entry
> > > consuming 5% of the memory limit and serializing 5000 entries
> > > consuming 0.001% of the memory limit. Even if we can select 5000
> > > entries quickly, I think the latter would be slower in total. The more
> > > subgroups we create, the more the algorithm gets complex and the
> > > overheads could cause. So I think we need to search for the largest
> > > entry in order to minimize the number of evictions anyway.
> > >
> > > Looking for data structures and algorithms, I think binaryheap with
> > > some improvements could be promising. I mentioned before why we cannot
> > > use the current binaryheap[1]. The missing pieces are efficient ways
> > > to remove the arbitrary entry and to update the arbitrary entry's key.
> > > The current binaryheap provides binaryheap_remove_node(), which is
> > > O(log n), but it requires the entry's position in the binaryheap. We
> > > can know the entry's position just after binaryheap_add_unordered()
> > > but it might be changed after heapify. Searching the node's position
> > > is O(n). So the improvement idea is to add a hash table to the
> > > binaryheap so that it can track the positions for each entry so that
> > > we can remove the arbitrary entry in O(log n) and also update the
> > > arbitrary entry's key in O(log n). This is known as the indexed
> > > priority queue. I've attached the patch for that (0001 and 0002).
> > >
> > > That way, in terms of 

RE: Improve eviction algorithm in ReorderBuffer

2024-02-05 Thread Hayato Kuroda (Fujitsu)
Dear Sawada-san,

> Thank you for the review comments!
> 
> >
> > A comment for 0001:
> >
> > 01.
> > ```
> > +static void
> > +bh_enlarge_node_array(binaryheap *heap)
> > +{
> > +if (heap->bh_size < heap->bh_space)
> > +return;
> > +
> > +heap->bh_space *= 2;
> > +heap->bh_nodes = repalloc(heap->bh_nodes,
> > +  sizeof(bh_node_type) * heap->bh_space);
> > +}
> > ```
> >
> > I'm not sure it is OK to use repalloc() for enlarging bh_nodes. This
> > data structure is a public one, and arbitrary code and extensions can
> > directly refer to bh_nodes. But if the array is repalloc()'d, the
> > pointer would be updated so that their references would become
> > dangling pointers.
> 
> Hmm, I'm not sure this is a case we really need to worry about, and I
> cannot come up with a good use case where extensions refer to bh_nodes
> directly rather than through binaryheap. In the PostgreSQL code, many
> Nodes already have pointers and are exposed.

Actually, me neither. I could not come up with a use case; I just raised
the possibility. If it is not a real issue, we can ignore it.

> 
> >
> > 04.
> > ```
> >  extern binaryheap *binaryheap_allocate(int capacity,
> > binaryheap_comparator compare,
> > -   void *arg);
> > +   bool indexed, void *arg);
> > ```
> >
> > I felt pre-existing API should not be changed. How about adding
> > binaryheap_allocate_extended() or something which can specify the `bool
> indexed`?
> > binaryheap_allocate() sets heap->bh_indexed to false.
> 
> I'm really not sure it's worth inventing a
> binaryheap_allocate_extended() function just for preserving API
> compatibility. I think it's generally a good idea to have
> xxx_extended() function to increase readability and usability, for
> example, for the case where the same (kind of default) arguments are
> passed in most cases and the function is called from many places.
> However, we have a handful of binaryheap_allocate() callers, and I
> believe that it would not hurt the existing callers.

I had (external) extensions that use the binaryheap APIs in mind.
I thought we could avoid imposing the cost of updating their code. But I
understand the change is small, so ... up to you.

> > 05.
> > ```
> > +extern void binaryheap_update_up(binaryheap *heap, bh_node_type d);
> > +extern void binaryheap_update_down(binaryheap *heap, bh_node_type d);
> > ```
> >
> > IIUC, callers must consider whether the node should be shift up/down and use
> > appropriate function, right? I felt it may not be user-friendly.
> 
> Right, I couldn't come up with a better interface.
> 
> Another idea I've considered was that the caller provides a callback
> function where it can compare the old and new keys. For example, in
> reorderbuffer case, we call like:
> 
> binaryheap_update(rb->txn_heap, PointerGetDatum(txn),
> ReorderBufferTXNUpdateCompare, (void *) _size);
> 
> Then in ReorderBufferTXNUpdateCompare(),
> ReorderBufferTXN *txn = (ReorderBufferTXN *) a;
> Size old_size = *(Size *) b;
> (compare txn->size to "b" ...)
> 
> However it seems complicated...
> 

I considered a similar approach: accept the old node as an argument of
the compare function. But it requires further memory allocation. Does
someone have a better idea?

> 
> >
> > 08.
> > IIUC, if more than 1024 transactions are running but they have small
> > amounts of changes, the performance may be degraded, right? Do you
> > have a result in such a case?
> 
> I've run a benchmark test that I shared before[1]. Here are results of
> decoding a transaction that has 1M subtransactions, each of which has 1
> INSERT:
> 
> HEAD:
> 1810.192 ms
> 
> HEAD w/ patch:
> 2001.094 ms
> 
> I set a large enough value to logical_decoding_work_mem not to evict
> any transactions. I can see about a 10% performance regression in
> this case.

Thanks for running. I think this workload is the worst and most extreme
case, one that would not occur on a real system (such a system should be
fixed), so we can say that the regression is up to -10%. I felt it could
be negligible, but what do others think?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



Re: Improve eviction algorithm in ReorderBuffer

2024-02-02 Thread Masahiko Sawada
On Fri, Feb 2, 2024 at 1:59 PM Shubham Khanna
 wrote:
>
> On Fri, Jan 26, 2024 at 2:07 PM Masahiko Sawada  wrote:
> >
> > On Wed, Dec 20, 2023 at 12:11 PM Amit Kapila  
> > wrote:
> > >
> > > On Wed, Dec 20, 2023 at 6:49 AM Masahiko Sawada  
> > > wrote:
> > > >
> > > > On Tue, Dec 19, 2023 at 8:02 PM Amit Kapila  
> > > > wrote:
> > > > >
> > > > > On Tue, Dec 19, 2023 at 8:31 AM Masahiko Sawada 
> > > > >  wrote:
> > > > > >
> > > > > > On Sun, Dec 17, 2023 at 11:40 AM Amit Kapila 
> > > > > >  wrote:
> > > > > > >
> > > > > > >
> > > > > > > The individual transactions shouldn't cross
> > > > > > > 'logical_decoding_work_mem'. I got a bit confused by your 
> > > > > > > proposal to
> > > > > > > maintain the lists: "...splitting it into two lists: transactions
> > > > > > > consuming 5% < and 5% >=  of the memory limit, and checking the 
> > > > > > > 5% >=
> > > > > > > list preferably.". In the previous sentence, what did you mean by
> > > > > > > transactions consuming 5% >= of the memory limit? I got the 
> > > > > > > impression
> > > > > > > that you are saying to maintain them in a separate transaction 
> > > > > > > list
> > > > > > > which doesn't seems to be the case.
> > > > > >
> > > > > > I wanted to mean that there are three lists in total: the first one
> > > > > > maintain the transactions consuming more than 10% of
> > > > > > logical_decoding_work_mem,
> > > > > >
> > > > >
> > > > > How can we have multiple transactions in the list consuming more than
> > > > > 10% of logical_decoding_work_mem? Shouldn't we perform serialization
> > > > > before any xact reaches logical_decoding_work_mem?
> > > >
> > > > Well, suppose logical_decoding_work_mem is set to 64MB, transactions
> > > > consuming more than 6.4MB are added to the list. So for example, it's
> > > > possible that the list has three transactions each of which are
> > > > consuming 10MB while the total memory usage in the reorderbuffer is
> > > > still 30MB (less than logical_decoding_work_mem).
> > > >
> > >
> > > Thanks for the clarification. I misunderstood the list to have
> > > transactions greater than 70.4 MB (64 + 6.4) in your example. But one
> > > thing to note is that maintaining these lists by default can also have
> > > some overhead unless the list of open transactions crosses a certain
> > > threshold.
> > >
> >
> > On further analysis, I realized that the approach discussed here might
> > not be the way to go. The idea of dividing transactions into several
> > subgroups is to divide a large number of entries into multiple
> > sub-groups so we can reduce the complexity to search for the
> > particular entry. Since we assume that there are no big differences in
> > entries' sizes within a sub-group, we can pick the entry to evict in
> > O(1). However, what we really need to avoid here is that we end up
> > increasing the number of times to evict entries because serializing an
> > entry to the disk is more costly than searching an entry on memory in
> > general.
> >
> > I think that it's no problem in a large-entries subgroup but when it
> > comes to the smallest-entries subgroup, like for entries consuming
> > less than 5% of the limit, it could end up evicting many entries. For
> > example, there would be a huge difference between serializing 1 entry
> > consuming 5% of the memory limit and serializing 5000 entries
> > consuming 0.001% of the memory limit. Even if we can select 5000
> > entries quickly, I think the latter would be slower in total. The more
> > subgroups we create, the more the algorithm gets complex and the
> > overheads could cause. So I think we need to search for the largest
> > entry in order to minimize the number of evictions anyway.
> >
> > Looking for data structures and algorithms, I think binaryheap with
> > some improvements could be promising. I mentioned before why we cannot
> > use the current binaryheap[1]. The missing pieces are efficient ways
> > to remove the arbitrary entry and to update the arbitrary entry's key.
> > The current binaryheap provides binaryheap_remove_node(), which is
> > O(log n), but it requires the entry's position in the binaryheap. We
> > can know the entry's position just after binaryheap_add_unordered()
> > but it might be changed after heapify. Searching the node's position
> > is O(n). So the improvement idea is to add a hash table to the
> > binaryheap so that it can track the positions for each entry so that
> > we can remove the arbitrary entry in O(log n) and also update the
> > arbitrary entry's key in O(log n). This is known as the indexed
> > priority queue. I've attached the patch for that (0001 and 0002).
> >
> > That way, in terms of reorderbuffer, we can update and remove the
> > transaction's memory usage in O(log n) (in worst case and O(1) in
> > average) and then pick the largest transaction in O(1). Since we might
> > need to call ReorderBufferSerializeTXN() even in non-streaming case,
> > we need to maintain the binaryheap anyway. I've 
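The indexed-priority-queue idea quoted above can be sketched as follows. This is a minimal illustration, not the patch: the 0002 patch tracks each node's position in a simplehash table keyed by the node, whereas this sketch stores the index inside the element itself for brevity, and all names are hypothetical. Either way, knowing the position makes arbitrary removal and key update O(log n) instead of the O(n) position search in today's binaryheap.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Txn
{
    size_t  size;
    int     heap_idx;   /* tracked position; the real patch keeps this in
                         * a simplehash table rather than the element */
} Txn;

typedef struct Heap
{
    Txn   **nodes;
    int     n;
    int     space;
} Heap;

static void
swap_nodes(Heap *h, int a, int b)
{
    Txn *tmp = h->nodes[a];

    h->nodes[a] = h->nodes[b];
    h->nodes[b] = tmp;
    h->nodes[a]->heap_idx = a;  /* keep tracked positions in sync */
    h->nodes[b]->heap_idx = b;
}

static void
sift_up(Heap *h, int i)
{
    while (i > 0 && h->nodes[(i - 1) / 2]->size < h->nodes[i]->size)
    {
        swap_nodes(h, i, (i - 1) / 2);
        i = (i - 1) / 2;
    }
}

static void
sift_down(Heap *h, int i)
{
    for (;;)
    {
        int m = i, l = 2 * i + 1, r = 2 * i + 2;

        if (l < h->n && h->nodes[l]->size > h->nodes[m]->size) m = l;
        if (r < h->n && h->nodes[r]->size > h->nodes[m]->size) m = r;
        if (m == i) break;
        swap_nodes(h, i, m);
        i = m;
    }
}

static void
heap_add(Heap *h, Txn *t)
{
    if (h->n >= h->space)       /* double the array, as in the 0001 patch */
    {
        h->space *= 2;
        h->nodes = realloc(h->nodes, sizeof(Txn *) * h->space);
    }
    h->nodes[h->n] = t;
    t->heap_idx = h->n;
    sift_up(h, h->n++);
}

/* Remove an arbitrary entry in O(log n), thanks to the tracked index. */
static void
heap_remove(Heap *h, Txn *t)
{
    int i = t->heap_idx;

    swap_nodes(h, i, h->n - 1);
    h->n--;
    if (i < h->n)
    {
        sift_down(h, i);
        sift_up(h, i);
    }
}

/* Update an entry's key in O(log n), re-sifting in the right direction. */
static void
heap_update(Heap *h, Txn *t, size_t newsize)
{
    size_t old = t->size;

    t->size = newsize;
    if (newsize > old)
        sift_up(h, t->heap_idx);
    else
        sift_down(h, t->heap_idx);
}
```

With this, picking the largest transaction to evict is simply reading nodes[0] in O(1).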

Re: Improve eviction algorithm in ReorderBuffer

2024-02-01 Thread Masahiko Sawada
Hi,

On Wed, Jan 31, 2024 at 5:32 PM vignesh C  wrote:
>
> On Tue, 30 Jan 2024 at 13:37, Masahiko Sawada  wrote:
> >
> > On Fri, Jan 26, 2024 at 5:36 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Wed, Dec 20, 2023 at 12:11 PM Amit Kapila  
> > > wrote:
> > > >
> > > > On Wed, Dec 20, 2023 at 6:49 AM Masahiko Sawada  
> > > > wrote:
> > > > >
> > > > > On Tue, Dec 19, 2023 at 8:02 PM Amit Kapila  
> > > > > wrote:
> > > > > >
> > > > > > On Tue, Dec 19, 2023 at 8:31 AM Masahiko Sawada 
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Sun, Dec 17, 2023 at 11:40 AM Amit Kapila 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > The individual transactions shouldn't cross
> > > > > > > > 'logical_decoding_work_mem'. I got a bit confused by your 
> > > > > > > > proposal to
> > > > > > > > maintain the lists: "...splitting it into two lists: 
> > > > > > > > transactions
> > > > > > > > consuming 5% < and 5% >=  of the memory limit, and checking the 
> > > > > > > > 5% >=
> > > > > > > > list preferably.". In the previous sentence, what did you mean 
> > > > > > > > by
> > > > > > > > transactions consuming 5% >= of the memory limit? I got the 
> > > > > > > > impression
> > > > > > > > that you are saying to maintain them in a separate transaction 
> > > > > > > > list
> > > > > > > > which doesn't seems to be the case.
> > > > > > >
> > > > > > > I wanted to mean that there are three lists in total: the first 
> > > > > > > one
> > > > > > > maintain the transactions consuming more than 10% of
> > > > > > > logical_decoding_work_mem,
> > > > > > >
> > > > > >
> > > > > > How can we have multiple transactions in the list consuming more 
> > > > > > than
> > > > > > 10% of logical_decoding_work_mem? Shouldn't we perform serialization
> > > > > > before any xact reaches logical_decoding_work_mem?
> > > > >
> > > > > Well, suppose logical_decoding_work_mem is set to 64MB, transactions
> > > > > consuming more than 6.4MB are added to the list. So for example, it's
> > > > > possible that the list has three transactions each of which are
> > > > > consuming 10MB while the total memory usage in the reorderbuffer is
> > > > > still 30MB (less than logical_decoding_work_mem).
> > > > >
> > > >
> > > > Thanks for the clarification. I misunderstood the list to have
> > > > transactions greater than 70.4 MB (64 + 6.4) in your example. But one
> > > > thing to note is that maintaining these lists by default can also have
> > > > some overhead unless the list of open transactions crosses a certain
> > > > threshold.
> > > >
> > >
> > > On further analysis, I realized that the approach discussed here might
> > > not be the way to go. The idea of dividing transactions into several
> > > subgroups is to divide a large number of entries into multiple
> > > sub-groups so we can reduce the complexity to search for the
> > > particular entry. Since we assume that there are no big differences in
> > > entries' sizes within a sub-group, we can pick the entry to evict in
> > > O(1). However, what we really need to avoid here is that we end up
> > > increasing the number of times to evict entries because serializing an
> > > entry to the disk is more costly than searching an entry on memory in
> > > general.
> > >
> > > I think that it's no problem in a large-entries subgroup but when it
> > > comes to the smallest-entries subgroup, like for entries consuming
> > > less than 5% of the limit, it could end up evicting many entries. For
> > > example, there would be a huge difference between serializing 1 entry
> > > consuming 5% of the memory limit and serializing 5000 entries
> > > consuming 0.001% of the memory limit. Even if we can select 5000
> > > entries quickly, I think the latter would be slower in total. The more
> > > subgroups we create, the more the algorithm gets complex and the
> > > overheads could cause. So I think we need to search for the largest
> > > entry in order to minimize the number of evictions anyway.
> > >
> > > Looking for data structures and algorithms, I think binaryheap with
> > > some improvements could be promising. I mentioned before why we cannot
> > > use the current binaryheap[1]. The missing pieces are efficient ways
> > > to remove the arbitrary entry and to update the arbitrary entry's key.
> > > The current binaryheap provides binaryheap_remove_node(), which is
> > > O(log n), but it requires the entry's position in the binaryheap. We
> > > can know the entry's position just after binaryheap_add_unordered()
> > > but it might be changed after heapify. Searching the node's position
> > > is O(n). So the improvement idea is to add a hash table to the
> > > binaryheap so that it can track the positions for each entry so that
> > > we can remove the arbitrary entry in O(log n) and also update the
> > > arbitrary entry's key in O(log n). This is known as the indexed
> > > priority queue. I've attached the patch for that (0001 and 0002).
> > >
> > > That way, in terms of 

Re: Improve eviction algorithm in ReorderBuffer

2024-02-01 Thread Masahiko Sawada
On Wed, Jan 31, 2024 at 2:18 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Sawada-san,
>
> I have started to read your patches. Here are my initial comments.
> At least, all subscription tests were passed on my env.

Thank you for the review comments!

>
> A comment for 0001:
>
> 01.
> ```
> +static void
> +bh_enlarge_node_array(binaryheap *heap)
> +{
> +if (heap->bh_size < heap->bh_space)
> +return;
> +
> +heap->bh_space *= 2;
> +heap->bh_nodes = repalloc(heap->bh_nodes,
> +  sizeof(bh_node_type) * heap->bh_space);
> +}
> ```
>
> I'm not sure it is OK to use repalloc() for enlarging bh_nodes. This
> data structure is a public one, and arbitrary code and extensions can
> directly refer to bh_nodes. But if the array is repalloc()'d, the
> pointer would be updated so that their references would become
> dangling pointers.

Hmm, I'm not sure this is a case we really need to worry about, and I
cannot come up with a good use case where extensions refer to bh_nodes
directly rather than through binaryheap. In the PostgreSQL code, many
Nodes already have pointers and are exposed.

> I think the internal of the structure should be a private one in this case.
>
> Comments for 0002:
>
> 02.
> ```
> +#include "utils/palloc.h"
> ```
>
> Is it really needed? I'm not sure who refers to it.

Seems not, will remove.

>
> 03.
> ```
> typedef struct bh_nodeidx_entry
> {
> bh_node_typekey;
> charstatus;
> int idx;
> } bh_nodeidx_entry;
> ```
>
> Sorry if it is a stupid question. Can you tell me how "status" is used?
> None of the binaryheap and reorderbuffer components refers to it.

It's required by simplehash.h

>
> 04.
> ```
>  extern binaryheap *binaryheap_allocate(int capacity,
> binaryheap_comparator compare,
> -   void *arg);
> +   bool indexed, void *arg);
> ```
>
> I felt pre-existing API should not be changed. How about adding
> binaryheap_allocate_extended() or something which can specify the `bool 
> indexed`?
> binaryheap_allocate() sets heap->bh_indexed to false.

I'm really not sure it's worth inventing a
binaryheap_allocate_extended() function just for preserving API
compatibility. I think it's generally a good idea to have
xxx_extended() function to increase readability and usability, for
example, for the case where the same (kind of default) arguments are
passed in most cases and the function is called from many places.
However, we have a handful of binaryheap_allocate() callers, and I
believe that it would not hurt the existing callers.

>
> 05.
> ```
> +extern void binaryheap_update_up(binaryheap *heap, bh_node_type d);
> +extern void binaryheap_update_down(binaryheap *heap, bh_node_type d);
> ```
>
> IIUC, callers must consider whether the node should be shift up/down and use
> appropriate function, right? I felt it may not be user-friendly.

Right, I couldn't come up with a better interface.

Another idea I've considered was that the caller provides a callback
function where it can compare the old and new keys. For example, in
reorderbuffer case, we call like:

binaryheap_update(rb->txn_heap, PointerGetDatum(txn),
ReorderBufferTXNUpdateCompare, (void *) _size);

Then in ReorderBufferTXNUpdateCompare(),
ReorderBufferTXN *txn = (ReorderBufferTXN *) a;
Size old_size = *(Size *) b;
(compare txn->size to "b" ...)

However it seems complicated...
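For illustration, the interface question above could also be resolved inside the heap: a single update entry point that compares the new key against the old one and picks the sift direction itself, so callers need not choose between update_up and update_down. This is only a sketch over a plain array max-heap; the names are hypothetical and it is not the patch's API.

```c
#include <assert.h>
#include <stddef.h>

static void
swap(size_t *a, size_t *b)
{
    size_t tmp = *a;

    *a = *b;
    *b = tmp;
}

static void
sift_up(size_t *h, int i)
{
    while (i > 0 && h[(i - 1) / 2] < h[i])
    {
        swap(&h[(i - 1) / 2], &h[i]);
        i = (i - 1) / 2;
    }
}

static void
sift_down(size_t *h, int n, int i)
{
    for (;;)
    {
        int m = i, l = 2 * i + 1, r = 2 * i + 2;

        if (l < n && h[l] > h[m]) m = l;
        if (r < n && h[r] > h[m]) m = r;
        if (m == i) break;
        swap(&h[m], &h[i]);
        i = m;
    }
}

/* One entry point instead of update_up()/update_down(): compare the new
 * key with the old one and pick the sift direction internally. */
static void
heap_update(size_t *h, int n, int i, size_t newkey)
{
    size_t oldkey = h[i];

    h[i] = newkey;
    if (newkey > oldkey)
        sift_up(h, i);
    else if (newkey < oldkey)
        sift_down(h, n, i);
}
```

The old key is available for free here because it is read from the heap slot before overwriting, which sidesteps the callback-and-extra-allocation concern.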

>
> Comments for 0003:
>
> 06.
> ```
> This commit changes the eviction algorithm in ReorderBuffer to use
> max-heap with transaction size,a nd use two strategies depending on
> the number of transactions being decoded.
> ```
>
> s/a nd/ and/
>
> 07.
> ```
> It could be too expensive to pudate max-heap while preserving the heap
> property each time the transaction's memory counter is updated, as it
> could happen very frquently. So when the number of transactions being
> decoded is small, we add the transactions to max-heap but don't
> preserve the heap property, which is O(1). We heapify the max-heap
> just before picking the largest transaction, which is O(n). This
> strategy minimizes the overheads of updating the transaction's memory
> counter.
> ```
>
> s/pudate/update/

Will fix them.

>
> 08.
> IIUC, if more than 1024 transactions are running but they have small
> amounts of changes, the performance may be degraded, right? Do you have
> a result in such a case?

I've run a benchmark test that I shared before[1]. Here are results of
decoding a transaction that has 1M subtransactions, each of which has 1
INSERT:

HEAD:
1810.192 ms

HEAD w/ patch:
2001.094 ms

I set a large enough value to logical_decoding_work_mem not to evict
any transactions. I can see about a 10% performance regression in
this case.

Regards,

[1] 

Re: Improve eviction algorithm in ReorderBuffer

2024-02-01 Thread Shubham Khanna
On Fri, Jan 26, 2024 at 2:07 PM Masahiko Sawada  wrote:
>
> On Wed, Dec 20, 2023 at 12:11 PM Amit Kapila  wrote:
> >
> > On Wed, Dec 20, 2023 at 6:49 AM Masahiko Sawada  
> > wrote:
> > >
> > > On Tue, Dec 19, 2023 at 8:02 PM Amit Kapila  
> > > wrote:
> > > >
> > > > On Tue, Dec 19, 2023 at 8:31 AM Masahiko Sawada  
> > > > wrote:
> > > > >
> > > > > On Sun, Dec 17, 2023 at 11:40 AM Amit Kapila 
> > > > >  wrote:
> > > > > >
> > > > > >
> > > > > > The individual transactions shouldn't cross
> > > > > > 'logical_decoding_work_mem'. I got a bit confused by your proposal 
> > > > > > to
> > > > > > maintain the lists: "...splitting it into two lists: transactions
> > > > > > consuming 5% < and 5% >=  of the memory limit, and checking the 5% 
> > > > > > >=
> > > > > > list preferably.". In the previous sentence, what did you mean by
> > > > > > transactions consuming 5% >= of the memory limit? I got the 
> > > > > > impression
> > > > > > that you are saying to maintain them in a separate transaction list
> > > > > > which doesn't seems to be the case.
> > > > >
> > > > > I wanted to mean that there are three lists in total: the first one
> > > > > maintain the transactions consuming more than 10% of
> > > > > logical_decoding_work_mem,
> > > > >
> > > >
> > > > How can we have multiple transactions in the list consuming more than
> > > > 10% of logical_decoding_work_mem? Shouldn't we perform serialization
> > > > before any xact reaches logical_decoding_work_mem?
> > >
> > > Well, suppose logical_decoding_work_mem is set to 64MB, transactions
> > > consuming more than 6.4MB are added to the list. So for example, it's
> > > possible that the list has three transactions each of which are
> > > consuming 10MB while the total memory usage in the reorderbuffer is
> > > still 30MB (less than logical_decoding_work_mem).
> > >
> >
> > Thanks for the clarification. I misunderstood the list to have
> > transactions greater than 70.4 MB (64 + 6.4) in your example. But one
> > thing to note is that maintaining these lists by default can also have
> > some overhead unless the list of open transactions crosses a certain
> > threshold.
> >
>
> On further analysis, I realized that the approach discussed here might
> not be the way to go. The idea of dividing transactions into several
> subgroups is to divide a large number of entries into multiple
> sub-groups so we can reduce the complexity to search for the
> particular entry. Since we assume that there are no big differences in
> entries' sizes within a sub-group, we can pick the entry to evict in
> O(1). However, what we really need to avoid here is that we end up
> increasing the number of times to evict entries because serializing an
> entry to the disk is more costly than searching an entry on memory in
> general.
>
> I think that it's no problem in a large-entries subgroup but when it
> comes to the smallest-entries subgroup, like for entries consuming
> less than 5% of the limit, it could end up evicting many entries. For
> example, there would be a huge difference between serializing 1 entry
> consuming 5% of the memory limit and serializing 5000 entries
> consuming 0.001% of the memory limit. Even if we can select 5000
> entries quickly, I think the latter would be slower in total. The more
> subgroups we create, the more complex the algorithm gets and the more
> overhead it could cause. So I think we need to search for the largest
> entry in order to minimize the number of evictions anyway.
>
> Looking for data structures and algorithms, I think binaryheap with
> some improvements could be promising. I mentioned before why we cannot
> use the current binaryheap[1]. The missing pieces are efficient ways
> to remove the arbitrary entry and to update the arbitrary entry's key.
> The current binaryheap provides binaryheap_remove_node(), which is
> O(log n), but it requires the entry's position in the binaryheap. We
> can know the entry's position just after binaryheap_add_unordered()
> but it might be changed after heapify. Searching the node's position
> is O(n). So the improvement idea is to add a hash table to the
> binaryheap so that it can track the positions for each entry so that
> we can remove the arbitrary entry in O(log n) and also update the
> arbitrary entry's key in O(log n). This is known as the indexed
> priority queue. I've attached the patch for that (0001 and 0002).
>
> That way, in terms of reorderbuffer, we can update and remove the
> transaction's memory usage in O(log n) (in worst case and O(1) in
> average) and then pick the largest transaction in O(1). Since we might
> need to call ReorderBufferSerializeTXN() even in non-streaming case,
> we need to maintain the binaryheap anyway. I've attached the patch for
> that (0003).
>
> Here is a test script for the many-sub-transactions case:
>
> create table test (c int);
> create or replace function testfn (cnt int) returns void as $$
> begin
>   for i in 1..cnt loop
> begin
>   insert into test 

Re: Improve eviction algorithm in ReorderBuffer

2024-01-31 Thread vignesh C
On Tue, 30 Jan 2024 at 13:37, Masahiko Sawada  wrote:
>
> On Fri, Jan 26, 2024 at 5:36 PM Masahiko Sawada  wrote:
> >
> > On Wed, Dec 20, 2023 at 12:11 PM Amit Kapila  
> > wrote:
> > >
> > > On Wed, Dec 20, 2023 at 6:49 AM Masahiko Sawada  
> > > wrote:
> > > >
> > > > On Tue, Dec 19, 2023 at 8:02 PM Amit Kapila  
> > > > wrote:
> > > > >
> > > > > On Tue, Dec 19, 2023 at 8:31 AM Masahiko Sawada 
> > > > >  wrote:
> > > > > >
> > > > > > On Sun, Dec 17, 2023 at 11:40 AM Amit Kapila 
> > > > > >  wrote:
> > > > > > >
> > > > > > >
> > > > > > > The individual transactions shouldn't cross
> > > > > > > 'logical_decoding_work_mem'. I got a bit confused by your 
> > > > > > > proposal to
> > > > > > > maintain the lists: "...splitting it into two lists: transactions
> > > > > > > consuming 5% < and 5% >=  of the memory limit, and checking the 
> > > > > > > 5% >=
> > > > > > > list preferably.". In the previous sentence, what did you mean by
> > > > > > > transactions consuming 5% >= of the memory limit? I got the 
> > > > > > > impression
> > > > > > > that you are saying to maintain them in a separate transaction 
> > > > > > > list
> > > > > > > which doesn't seems to be the case.
> > > > > >
> > > > > > I wanted to mean that there are three lists in total: the first one
> > > > > > maintain the transactions consuming more than 10% of
> > > > > > logical_decoding_work_mem,
> > > > > >
> > > > >
> > > > > How can we have multiple transactions in the list consuming more than
> > > > > 10% of logical_decoding_work_mem? Shouldn't we perform serialization
> > > > > before any xact reaches logical_decoding_work_mem?
> > > >
> > > > Well, suppose logical_decoding_work_mem is set to 64MB, transactions
> > > > consuming more than 6.4MB are added to the list. So for example, it's
> > > > possible that the list has three transactions each of which are
> > > > consuming 10MB while the total memory usage in the reorderbuffer is
> > > > still 30MB (less than logical_decoding_work_mem).
> > > >
> > >
> > > Thanks for the clarification. I misunderstood the list to have
> > > transactions greater than 70.4 MB (64 + 6.4) in your example. But one
> > > thing to note is that maintaining these lists by default can also have
> > > some overhead unless the list of open transactions crosses a certain
> > > threshold.
> > >
> >
> > On further analysis, I realized that the approach discussed here might
> > not be the way to go. The idea of dividing transactions into several
> > subgroups is to divide a large number of entries into multiple
> > sub-groups so we can reduce the complexity to search for the
> > particular entry. Since we assume that there are no big differences in
> > entries' sizes within a sub-group, we can pick the entry to evict in
> > O(1). However, what we really need to avoid here is that we end up
> > increasing the number of times to evict entries because serializing an
> > entry to the disk is more costly than searching an entry on memory in
> > general.
> >
> > I think that it's no problem in a large-entries subgroup but when it
> > comes to the smallest-entries subgroup, like for entries consuming
> > less than 5% of the limit, it could end up evicting many entries. For
> > example, there would be a huge difference between serializing 1 entry
> > consuming 5% of the memory limit and serializing 5000 entries
> > consuming 0.001% of the memory limit. Even if we can select 5000
> > entries quickly, I think the latter would be slower in total. The more
> > subgroups we create, the more complex the algorithm gets and the more
> > overhead it could cause. So I think we need to search for the largest
> > entry in order to minimize the number of evictions anyway.
> >
> > Looking for data structures and algorithms, I think binaryheap with
> > some improvements could be promising. I mentioned before why we cannot
> > use the current binaryheap[1]. The missing pieces are efficient ways
> > to remove the arbitrary entry and to update the arbitrary entry's key.
> > The current binaryheap provides binaryheap_remove_node(), which is
> > O(log n), but it requires the entry's position in the binaryheap. We
> > can know the entry's position just after binaryheap_add_unordered()
> > but it might be changed after heapify. Searching the node's position
> > is O(n). So the improvement idea is to add a hash table to the
> > binaryheap so that it can track the positions for each entry so that
> > we can remove the arbitrary entry in O(log n) and also update the
> > arbitrary entry's key in O(log n). This is known as the indexed
> > priority queue. I've attached the patch for that (0001 and 0002).
> >
> > That way, in terms of reorderbuffer, we can update and remove the
> > transaction's memory usage in O(log n) (in worst case and O(1) in
> > average) and then pick the largest transaction in O(1). Since we might
> > need to call ReorderBufferSerializeTXN() even in non-streaming case,
> > we need to maintain the binaryheap anyway.
>
> 

RE: Improve eviction algorithm in ReorderBuffer

2024-01-30 Thread Hayato Kuroda (Fujitsu)
Dear Sawada-san,

I have started to read your patches. Here are my initial comments.
At least, all subscription tests passed on my env.

A comment for 0001:

01.
```
+static void
+bh_enlarge_node_array(binaryheap *heap)
+{
+if (heap->bh_size < heap->bh_space)
+return;
+
+heap->bh_space *= 2;
+heap->bh_nodes = repalloc(heap->bh_nodes,
+  sizeof(bh_node_type) * heap->bh_space);
+}
```

I'm not sure it is OK to use repalloc() for enlarging bh_nodes. This data
structure is a public one, and arbitrary code and extensions can refer to
bh_nodes directly. But if the array is repalloc()'d, the pointer may change,
so such references would become dangling pointers.
I think the internals of the structure should be private in this case.

Comments for 0002:

02.
```
+#include "utils/palloc.h"
```

Is it really needed? I'm not sure what refers to it.

03.
```
typedef struct bh_nodeidx_entry
{
bh_node_typekey;
charstatus;
int idx;
} bh_nodeidx_entry;
```

Sorry if it is a stupid question. Can you tell me how "status" is used?
Neither the binaryheap nor the reorderbuffer components refer to it.

04.
```
 extern binaryheap *binaryheap_allocate(int capacity,
binaryheap_comparator compare,
-   void *arg);
+   bool indexed, void *arg);
```

I felt the pre-existing API should not be changed. How about adding
binaryheap_allocate_extended() or something similar which can specify the
`bool indexed` flag? binaryheap_allocate() would then set heap->bh_indexed
to false.

05.
```
+extern void binaryheap_update_up(binaryheap *heap, bh_node_type d);
+extern void binaryheap_update_down(binaryheap *heap, bh_node_type d);
```

IIUC, callers must decide whether the node should be shifted up or down and
call the appropriate function, right? I felt that may not be user-friendly.

Comments for 0003:

06.
```
This commit changes the eviction algorithm in ReorderBuffer to use
max-heap with transaction size,a nd use two strategies depending on
the number of transactions being decoded.
```

s/a nd/ and/

07.
```
It could be too expensive to pudate max-heap while preserving the heap
property each time the transaction's memory counter is updated, as it
could happen very frquently. So when the number of transactions being
decoded is small, we add the transactions to max-heap but don't
preserve the heap property, which is O(1). We heapify the max-heap
just before picking the largest transaction, which is O(n). This
strategy minimizes the overheads of updating the transaction's memory
counter.
```

s/pudate/update/

08.
IIUC, if more than 1024 transactions are running but they have a small
amount of changes, the performance may be degraded, right? Do you have a
result for such a case?
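
For readers, the lazy strategy described in the quoted commit message can be
sketched in a few lines of Python (an illustration only, not the patch's
actual C code; the class and method names here are made up):

```python
import heapq

# Illustrative sketch of the lazy strategy from the quoted commit message:
# transaction-size updates just append and mark the array dirty in O(1),
# and the O(n) heapify is deferred until the largest transaction actually
# has to be picked.
class LazyTxnHeap:
    def __init__(self):
        self.nodes = []     # (-size, txn); sizes negated for heapq's min-heap
        self.dirty = False  # True while the heap property is not guaranteed

    def add(self, txn, size):
        # O(1): append without preserving the heap property
        self.nodes.append((-size, txn))
        self.dirty = True

    def pick_largest(self):
        # Heapify lazily, only right before picking: O(n) once
        if self.dirty:
            heapq.heapify(self.nodes)
            self.dirty = False
        return heapq.heappop(self.nodes)[1]
```

With few transactions this keeps every memory-counter update O(1) at the cost
of an O(n) heapify per pick; with many transactions the heap property would be
maintained on each update instead.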

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



Re: Improve eviction algorithm in ReorderBuffer

2024-01-30 Thread Masahiko Sawada
On Fri, Jan 26, 2024 at 5:36 PM Masahiko Sawada  wrote:
>
> On Wed, Dec 20, 2023 at 12:11 PM Amit Kapila  wrote:
> >
> > On Wed, Dec 20, 2023 at 6:49 AM Masahiko Sawada  
> > wrote:
> > >
> > > On Tue, Dec 19, 2023 at 8:02 PM Amit Kapila  
> > > wrote:
> > > >
> > > > On Tue, Dec 19, 2023 at 8:31 AM Masahiko Sawada  
> > > > wrote:
> > > > >
> > > > > On Sun, Dec 17, 2023 at 11:40 AM Amit Kapila 
> > > > >  wrote:
> > > > > >
> > > > > >
> > > > > > The individual transactions shouldn't cross
> > > > > > 'logical_decoding_work_mem'. I got a bit confused by your proposal 
> > > > > > to
> > > > > > maintain the lists: "...splitting it into two lists: transactions
> > > > > > consuming 5% < and 5% >=  of the memory limit, and checking the 5% 
> > > > > > >=
> > > > > > list preferably.". In the previous sentence, what did you mean by
> > > > > > transactions consuming 5% >= of the memory limit? I got the 
> > > > > > impression
> > > > > > that you are saying to maintain them in a separate transaction list
> > > > > > which doesn't seems to be the case.
> > > > >
> > > > > I wanted to mean that there are three lists in total: the first one
> > > > > maintain the transactions consuming more than 10% of
> > > > > logical_decoding_work_mem,
> > > > >
> > > >
> > > > How can we have multiple transactions in the list consuming more than
> > > > 10% of logical_decoding_work_mem? Shouldn't we perform serialization
> > > > before any xact reaches logical_decoding_work_mem?
> > >
> > > Well, suppose logical_decoding_work_mem is set to 64MB, transactions
> > > consuming more than 6.4MB are added to the list. So for example, it's
> > > possible that the list has three transactions each of which are
> > > consuming 10MB while the total memory usage in the reorderbuffer is
> > > still 30MB (less than logical_decoding_work_mem).
> > >
> >
> > Thanks for the clarification. I misunderstood the list to have
> > transactions greater than 70.4 MB (64 + 6.4) in your example. But one
> > thing to note is that maintaining these lists by default can also have
> > some overhead unless the list of open transactions crosses a certain
> > threshold.
> >
>
> On further analysis, I realized that the approach discussed here might
> not be the way to go. The idea of dividing transactions into several
> subgroups is to divide a large number of entries into multiple
> sub-groups so we can reduce the complexity to search for the
> particular entry. Since we assume that there are no big differences in
> entries' sizes within a sub-group, we can pick the entry to evict in
> O(1). However, what we really need to avoid here is that we end up
> increasing the number of times to evict entries because serializing an
> entry to the disk is more costly than searching an entry on memory in
> general.
>
> I think that it's no problem in a large-entries subgroup but when it
> comes to the smallest-entries subgroup, like for entries consuming
> less than 5% of the limit, it could end up evicting many entries. For
> example, there would be a huge difference between serializing 1 entry
> consuming 5% of the memory limit and serializing 5000 entries
> consuming 0.001% of the memory limit. Even if we can select 5000
> entries quickly, I think the latter would be slower in total. The more
> subgroups we create, the more complex the algorithm gets and the more
> overhead it could cause. So I think we need to search for the largest
> entry in order to minimize the number of evictions anyway.
>
> Looking for data structures and algorithms, I think binaryheap with
> some improvements could be promising. I mentioned before why we cannot
> use the current binaryheap[1]. The missing pieces are efficient ways
> to remove the arbitrary entry and to update the arbitrary entry's key.
> The current binaryheap provides binaryheap_remove_node(), which is
> O(log n), but it requires the entry's position in the binaryheap. We
> can know the entry's position just after binaryheap_add_unordered()
> but it might be changed after heapify. Searching the node's position
> is O(n). So the improvement idea is to add a hash table to the
> binaryheap so that it can track the positions for each entry so that
> we can remove the arbitrary entry in O(log n) and also update the
> arbitrary entry's key in O(log n). This is known as the indexed
> priority queue. I've attached the patch for that (0001 and 0002).
>
> That way, in terms of reorderbuffer, we can update and remove the
> transaction's memory usage in O(log n) (in worst case and O(1) in
> average) and then pick the largest transaction in O(1). Since we might
> need to call ReorderBufferSerializeTXN() even in non-streaming case,
> we need to maintain the binaryheap anyway.

Since updating the max-heap on every memory counter update could lead
to some regressions when the number of transactions being decoded is
small, I've measured a case where the memory counter is updated
frequently:

setup script:
create table test (c int);

Re: Improve eviction algorithm in ReorderBuffer

2024-01-26 Thread Masahiko Sawada
On Wed, Dec 20, 2023 at 12:11 PM Amit Kapila  wrote:
>
> On Wed, Dec 20, 2023 at 6:49 AM Masahiko Sawada  wrote:
> >
> > On Tue, Dec 19, 2023 at 8:02 PM Amit Kapila  wrote:
> > >
> > > On Tue, Dec 19, 2023 at 8:31 AM Masahiko Sawada  
> > > wrote:
> > > >
> > > > On Sun, Dec 17, 2023 at 11:40 AM Amit Kapila  
> > > > wrote:
> > > > >
> > > > >
> > > > > The individual transactions shouldn't cross
> > > > > 'logical_decoding_work_mem'. I got a bit confused by your proposal to
> > > > > maintain the lists: "...splitting it into two lists: transactions
> > > > > consuming 5% < and 5% >=  of the memory limit, and checking the 5% >=
> > > > > list preferably.". In the previous sentence, what did you mean by
> > > > > transactions consuming 5% >= of the memory limit? I got the impression
> > > > > that you are saying to maintain them in a separate transaction list
> > > > > which doesn't seems to be the case.
> > > >
> > > > I wanted to mean that there are three lists in total: the first one
> > > > maintain the transactions consuming more than 10% of
> > > > logical_decoding_work_mem,
> > > >
> > >
> > > How can we have multiple transactions in the list consuming more than
> > > 10% of logical_decoding_work_mem? Shouldn't we perform serialization
> > > before any xact reaches logical_decoding_work_mem?
> >
> > Well, suppose logical_decoding_work_mem is set to 64MB, transactions
> > consuming more than 6.4MB are added to the list. So for example, it's
> > possible that the list has three transactions each of which are
> > consuming 10MB while the total memory usage in the reorderbuffer is
> > still 30MB (less than logical_decoding_work_mem).
> >
>
> Thanks for the clarification. I misunderstood the list to have
> transactions greater than 70.4 MB (64 + 6.4) in your example. But one
> thing to note is that maintaining these lists by default can also have
> some overhead unless the list of open transactions crosses a certain
> threshold.
>

On further analysis, I realized that the approach discussed here might
not be the way to go. The idea of dividing transactions into several
subgroups is to divide a large number of entries into multiple
sub-groups so we can reduce the complexity to search for the
particular entry. Since we assume that there are no big differences in
entries' sizes within a sub-group, we can pick the entry to evict in
O(1). However, what we really need to avoid here is that we end up
increasing the number of times to evict entries because serializing an
entry to the disk is more costly than searching an entry on memory in
general.

I think that it's no problem in a large-entries subgroup but when it
comes to the smallest-entries subgroup, like for entries consuming
less than 5% of the limit, it could end up evicting many entries. For
example, there would be a huge difference between serializing 1 entry
consuming 5% of the memory limit and serializing 5000 entries
consuming 0.001% of the memory limit. Even if we can select 5000
entries quickly, I think the latter would be slower in total. The more
subgroups we create, the more complex the algorithm gets and the more
overhead it could cause. So I think we need to search for the largest
entry in order to minimize the number of evictions anyway.
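
The cost intuition above can be made concrete with a toy model (pure
illustration, not PostgreSQL code; the sizes and the 64MB limit mirror the
examples used in this thread):

```python
# Toy model: count how many serializations (evictions) each selection
# policy needs to free the same amount of memory. Each eviction pays the
# dominant serialization cost, so fewer evictions is better.
def evictions_needed(sizes, need, pick):
    """Count evictions until `need` bytes are freed, choosing entries with `pick`."""
    remaining = sorted(sizes)
    freed = count = 0
    while freed < need and remaining:
        entry = pick(remaining)
        remaining.remove(entry)
        freed += entry
        count += 1
    return count

limit = 64 * 1024 * 1024                               # 64MB, as in the thread's example
sizes = [limit * 5 // 100] + [limit // 100000] * 5000  # one 5% entry, 5000 tiny ones
need = 3_000_000                                       # free roughly 4.5% of the limit

largest_first = evictions_needed(sizes, need, max)     # one serialization suffices
smallest_first = evictions_needed(sizes, need, min)    # thousands of serializations
```

Even a perfect O(1) selection of tiny entries loses to picking the single
largest one, because the serialization count dominates the total cost.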

Looking for data structures and algorithms, I think binaryheap with
some improvements could be promising. I mentioned before why we cannot
use the current binaryheap[1]. The missing pieces are efficient ways
to remove the arbitrary entry and to update the arbitrary entry's key.
The current binaryheap provides binaryheap_remove_node(), which is
O(log n), but it requires the entry's position in the binaryheap. We
can know the entry's position just after binaryheap_add_unordered()
but it might be changed after heapify. Searching the node's position
is O(n). So the improvement idea is to add a hash table to the
binaryheap so that it can track the positions for each entry so that
we can remove the arbitrary entry in O(log n) and also update the
arbitrary entry's key in O(log n). This is known as the indexed
priority queue. I've attached the patch for that (0001 and 0002).
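
To illustrate the indexed priority queue idea, here is a Python sketch
(not the patch's C code; the dict plays the role of the hash table that
tracks each entry's position in the heap array):

```python
# Indexed max-heap: a position map (dict) lets us find any entry's slot in
# O(1), so removing or re-keying an arbitrary entry is O(log n) instead of
# needing an O(n) search for its position first.
class IndexedMaxHeap:
    def __init__(self):
        self.nodes = []    # (key, entry) pairs, heap-ordered by key
        self.pos = {}      # entry -> index in self.nodes

    def _swap(self, i, j):
        self.nodes[i], self.nodes[j] = self.nodes[j], self.nodes[i]
        self.pos[self.nodes[i][1]] = i
        self.pos[self.nodes[j][1]] = j

    def _sift_up(self, i):
        while i > 0 and self.nodes[i][0] > self.nodes[(i - 1) // 2][0]:
            self._swap(i, (i - 1) // 2)
            i = (i - 1) // 2

    def _sift_down(self, i):
        n = len(self.nodes)
        while True:
            largest = i
            for c in (2 * i + 1, 2 * i + 2):
                if c < n and self.nodes[c][0] > self.nodes[largest][0]:
                    largest = c
            if largest == i:
                return
            self._swap(i, largest)
            i = largest

    def add(self, entry, key):            # O(log n)
        self.nodes.append((key, entry))
        self.pos[entry] = len(self.nodes) - 1
        self._sift_up(len(self.nodes) - 1)

    def update_key(self, entry, new_key):  # O(log n): position found via dict
        i = self.pos[entry]
        old = self.nodes[i][0]
        self.nodes[i] = (new_key, entry)
        if new_key > old:
            self._sift_up(i)
        else:
            self._sift_down(i)

    def remove(self, entry):               # O(log n): no O(n) position search
        i = self.pos.pop(entry)
        last = self.nodes.pop()
        if i < len(self.nodes):
            self.nodes[i] = last
            self.pos[last[1]] = i
            if i > 0 and self.nodes[i][0] > self.nodes[(i - 1) // 2][0]:
                self._sift_up(i)
            else:
                self._sift_down(i)

    def top(self):                         # O(1): the largest entry
        return self.nodes[0][1]
```

The only change versus a plain binary heap is that every swap also updates
the position map, which is exactly what the hash table added to binaryheap
does in the patch.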

That way, in terms of reorderbuffer, we can update and remove the
transaction's memory usage in O(log n) (in worst case and O(1) in
average) and then pick the largest transaction in O(1). Since we might
need to call ReorderBufferSerializeTXN() even in non-streaming case,
we need to maintain the binaryheap anyway. I've attached the patch for
that (0003).
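
The resulting eviction path can then be sketched as follows (hypothetical
names, not the patch's identifiers; Python's heapq with negated sizes stands
in for the max-heap, since this one-shot pass only needs pop-largest):

```python
import heapq

# Sketch: once the total memory counter exceeds the limit, repeatedly pick
# the largest transaction from the max-heap and "serialize" it (freeing its
# memory) until usage drops back under the limit.
def reorder_buffer_check_memory(txn_sizes, work_mem_limit):
    """Return the list of transactions evicted, largest first."""
    txn_heap = [(-size, txn) for txn, size in txn_sizes.items()]
    heapq.heapify(txn_heap)
    total = sum(txn_sizes.values())
    evicted = []
    while total > work_mem_limit and txn_heap:
        neg_size, txn = heapq.heappop(txn_heap)  # largest remaining txn
        evicted.append(txn)
        total += neg_size                        # its memory is freed
    return evicted
```

Picking the largest transaction each time minimizes the number of
serializations needed to get back under the limit.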

Here is a test script for the many-sub-transactions case:

create table test (c int);
create or replace function testfn (cnt int) returns void as $$
begin
  for i in 1..cnt loop
begin
  insert into test values (i);
exception when division_by_zero then
  raise notice 'caught error';
  return;
end;
  end loop;
end;
$$
language plpgsql;
select pg_create_logical_replication_slot('s', 'test_decoding');
select testfn(5);
set logical_decoding_work_mem to '4MB';
select count(*) from 

Re: Improve eviction algorithm in ReorderBuffer

2024-01-21 Thread Peter Smith
2024-01 Commitfest.

Hi, This patch has a CF status of "Needs Review" [1], but it seems
like there were some CFbot test failures the last time it was run [2].
Please have a look and post an updated version if necessary.

==
[1] https://commitfest.postgresql.org/46/4699/
[2] https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/4699

Kind Regards,
Peter Smith.




Re: Improve eviction algorithm in ReorderBuffer

2023-12-19 Thread Amit Kapila
On Wed, Dec 20, 2023 at 6:49 AM Masahiko Sawada  wrote:
>
> On Tue, Dec 19, 2023 at 8:02 PM Amit Kapila  wrote:
> >
> > On Tue, Dec 19, 2023 at 8:31 AM Masahiko Sawada  
> > wrote:
> > >
> > > On Sun, Dec 17, 2023 at 11:40 AM Amit Kapila  
> > > wrote:
> > > >
> > > >
> > > > The individual transactions shouldn't cross
> > > > 'logical_decoding_work_mem'. I got a bit confused by your proposal to
> > > > maintain the lists: "...splitting it into two lists: transactions
> > > > consuming 5% < and 5% >=  of the memory limit, and checking the 5% >=
> > > > list preferably.". In the previous sentence, what did you mean by
> > > > transactions consuming 5% >= of the memory limit? I got the impression
> > > > that you are saying to maintain them in a separate transaction list
> > > > which doesn't seems to be the case.
> > >
> > > I wanted to mean that there are three lists in total: the first one
> > > maintain the transactions consuming more than 10% of
> > > logical_decoding_work_mem,
> > >
> >
> > How can we have multiple transactions in the list consuming more than
> > 10% of logical_decoding_work_mem? Shouldn't we perform serialization
> > before any xact reaches logical_decoding_work_mem?
>
> Well, suppose logical_decoding_work_mem is set to 64MB, transactions
> consuming more than 6.4MB are added to the list. So for example, it's
> possible that the list has three transactions each of which are
> consuming 10MB while the total memory usage in the reorderbuffer is
> still 30MB (less than logical_decoding_work_mem).
>

Thanks for the clarification. I misunderstood the list to have
transactions greater than 70.4 MB (64 + 6.4) in your example. But one
thing to note is that maintaining these lists by default can also have
some overhead unless the list of open transactions crosses a certain
threshold.

-- 
With Regards,
Amit Kapila.




Re: Improve eviction algorithm in ReorderBuffer

2023-12-19 Thread Masahiko Sawada
On Tue, Dec 19, 2023 at 8:02 PM Amit Kapila  wrote:
>
> On Tue, Dec 19, 2023 at 8:31 AM Masahiko Sawada  wrote:
> >
> > On Sun, Dec 17, 2023 at 11:40 AM Amit Kapila  
> > wrote:
> > >
> > >
> > > The individual transactions shouldn't cross
> > > 'logical_decoding_work_mem'. I got a bit confused by your proposal to
> > > maintain the lists: "...splitting it into two lists: transactions
> > > consuming 5% < and 5% >=  of the memory limit, and checking the 5% >=
> > > list preferably.". In the previous sentence, what did you mean by
> > > transactions consuming 5% >= of the memory limit? I got the impression
> > > that you are saying to maintain them in a separate transaction list
> > > which doesn't seems to be the case.
> >
> > I wanted to mean that there are three lists in total: the first one
> > maintain the transactions consuming more than 10% of
> > logical_decoding_work_mem,
> >
>
> How can we have multiple transactions in the list consuming more than
> 10% of logical_decoding_work_mem? Shouldn't we perform serialization
> before any xact reaches logical_decoding_work_mem?

Well, suppose logical_decoding_work_mem is set to 64MB, transactions
consuming more than 6.4MB are added to the list. So for example, it's
possible that the list has three transactions each of which are
consuming 10MB while the total memory usage in the reorderbuffer is
still 30MB (less than logical_decoding_work_mem).

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Improve eviction algorithm in ReorderBuffer

2023-12-19 Thread Amit Kapila
On Tue, Dec 19, 2023 at 8:31 AM Masahiko Sawada  wrote:
>
> On Sun, Dec 17, 2023 at 11:40 AM Amit Kapila  wrote:
> >
> >
> > The individual transactions shouldn't cross
> > 'logical_decoding_work_mem'. I got a bit confused by your proposal to
> > maintain the lists: "...splitting it into two lists: transactions
> > consuming 5% < and 5% >=  of the memory limit, and checking the 5% >=
> > list preferably.". In the previous sentence, what did you mean by
> > transactions consuming 5% >= of the memory limit? I got the impression
> > that you are saying to maintain them in a separate transaction list
> > which doesn't seems to be the case.
>
> I wanted to mean that there are three lists in total: the first one
> maintain the transactions consuming more than 10% of
> logical_decoding_work_mem,
>

How can we have multiple transactions in the list consuming more than
10% of logical_decoding_work_mem? Shouldn't we perform serialization
before any xact reaches logical_decoding_work_mem?

-- 
With Regards,
Amit Kapila.




Re: Improve eviction algorithm in ReorderBuffer

2023-12-18 Thread Masahiko Sawada
On Sun, Dec 17, 2023 at 11:40 AM Amit Kapila  wrote:
>
> On Fri, Dec 15, 2023 at 11:29 AM Masahiko Sawada  
> wrote:
> >
> > On Fri, Dec 15, 2023 at 12:37 PM Amit Kapila  
> > wrote:
> > >
> > > On Wed, Dec 13, 2023 at 6:01 AM Masahiko Sawada  
> > > wrote:
> > > >
> > >
> > > IIUC, you are giving preference to multiple list ideas as compared to
> > > (a) because you don't need to adjust the list each time the
> > > transaction size changes, is that right?
> >
> > Right.
> >
> > >  If so, I think there is a
> > > cost to keep that data structure up-to-date but it can help in
> > > reducing the number of times we need to serialize.
> >
> > Yes, there is a trade-off.
> >
> > What I don't want to do is to keep all transactions ordered since it's
> > too costly. The proposed idea uses multiple lists to keep all
> > transactions roughly ordered. The maintenance cost would be cheap
> > since each list is unordered.
> >
> > It might be a good idea to have a threshold to switch how to pick the
> > largest transaction based on the number of transactions in the
> > reorderbuffer. If there are many transactions, we can use the proposed
> > algorithm to find a possibly-largest transaction, otherwise use the
> > current way.
> >
>
> Yeah, that makes sense.
>
> > >
> > > >
> > > > > 1) A scenario where suppose you have one very large transaction that
> > > > > is consuming ~40% of the memory and 5-6 comparatively smaller
> > > > > transactions that are just above 10% of the memory limit.  And now for
> > > > > coming under the memory limit instead of getting 1 large transaction
> > > > > evicted out, we are evicting out multiple times.
> > > >
> > > > Given the large transaction list will have up to 10 transactions, I
> > > > think it's cheap to pick the largest transaction among them. It's O(N)
> > > > but N won't be large.
> > > >
> > > > > 2) Another scenario where all the transactions are under 10% of the
> > > > > memory limit but let's say there are some transactions are consuming
> > > > > around 8-9% of the memory limit each but those are not very old
> > > > > transactions whereas there are certain old transactions which are
> > > > > fairly small and consuming under 1% of memory limit and there are many
> > > > > such transactions.  So how it would affect if we frequently select
> > > > > many of these transactions to come under memory limit instead of
> > > > > selecting a couple of large transactions which are consuming 8-9%?
> > > >
> > > > Yeah, probably we can do something for small transactions (i.e. small
> > > > and on-memory transactions). One idea is to pick the largest
> > > > transaction among them by iterating over all of them. Given that the
> > > > more transactions are evicted, the less transactions the on-memory
> > > > transaction list has, unlike the current algorithm, we would still
> > > > win. Or we could even split it into several sub-lists in order to
> > > > reduce the number of transactions to check. For example, splitting it
> > > > into two lists: transactions consuming 5% < and 5% >=  of the memory
> > > > limit, and checking the 5% >= list preferably.
> > > >
> > >
> > > Which memory limit are you referring to here? Is it 
> > > logical_decoding_work_mem?
> >
> > logical_decoding_work_mem.
> >
> > >
> > > > The cost for
> > > > maintaining these lists could increase, though.
> > > >
> > >
> > > Yeah, can't we maintain a single list of all xacts that are consuming
> > > equal to or greater than the memory limit? Considering that the memory
> > > limit is logical_decoding_work_mem, then I think just picking one
> > > transaction to serialize would be sufficient.
> >
> > IIUC we serialize a transaction when the sum of all transactions'
> > memory usage in the reorderbuffer exceeds logical_decoding_work_mem.
> > In what cases are multiple transactions consuming equal to or greater
> > than the logical_decoding_work_mem?
> >
>
> The individual transactions shouldn't cross
> 'logical_decoding_work_mem'. I got a bit confused by your proposal to
> maintain the lists: "...splitting it into two lists: transactions
> consuming 5% < and 5% >=  of the memory limit, and checking the 5% >=
> list preferably.". In the previous sentence, what did you mean by
> transactions consuming 5% >= of the memory limit? I got the impression
> that you are saying to maintain them in a separate transaction list
> which doesn't seems to be the case.

I wanted to mean that there are three lists in total: the first one
maintains the transactions consuming more than 10% of
logical_decoding_work_mem, the second one maintains other transactions
consuming more than or equal to 5% of logical_decoding_work_mem, and
the third one maintains other transactions consuming more than 0 and
less than 5% of logical_decoding_work_mem.
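
For illustration, that three-list selection could look roughly like this
(a Python sketch with made-up names; rebuilding the buckets on every call
is a simplification, not something any patch does):

```python
# Bucket boundaries follow the text: >10%, [5%, 10%], and (0, 5%) of
# logical_decoding_work_mem. Eviction prefers the highest non-empty bucket,
# where a short O(len) scan finds the largest member cheaply.
WORK_MEM = 64 * 1024 * 1024  # assume logical_decoding_work_mem = 64MB

def bucket_for(size):
    if size > WORK_MEM // 10:        # more than 10% of the limit
        return 0
    if size >= WORK_MEM // 20:       # between 5% and 10%
        return 1
    return 2                         # more than 0 and less than 5%

def pick_txn_to_evict(txn_sizes):
    """Prefer the largest transaction from the highest non-empty bucket."""
    buckets = ([], [], [])
    for txn, size in txn_sizes.items():
        buckets[bucket_for(size)].append((size, txn))
    for b in buckets:
        if b:
            return max(b)[1]         # each bucket stays small, so O(len) is cheap
    return None
```

The trade-off discussed later in the thread applies here: the buckets keep
transactions only roughly ordered, so the pick may not be the global maximum.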

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Improve eviction algorithm in ReorderBuffer

2023-12-16 Thread Amit Kapila
On Fri, Dec 15, 2023 at 11:29 AM Masahiko Sawada  wrote:
>
> On Fri, Dec 15, 2023 at 12:37 PM Amit Kapila  wrote:
> >
> > On Wed, Dec 13, 2023 at 6:01 AM Masahiko Sawada  
> > wrote:
> > >
> >
> > IIUC, you are giving preference to multiple list ideas as compared to
> > (a) because you don't need to adjust the list each time the
> > transaction size changes, is that right?
>
> Right.
>
> >  If so, I think there is a
> > cost to keep that data structure up-to-date but it can help in
> > reducing the number of times we need to serialize.
>
> Yes, there is a trade-off.
>
> What I don't want to do is to keep all transactions ordered since it's
> too costly. The proposed idea uses multiple lists to keep all
> transactions roughly ordered. The maintenance cost would be cheap
> since each list is unordered.
>
> It might be a good idea to have a threshold to switch how to pick the
> largest transaction based on the number of transactions in the
> reorderbuffer. If there are many transactions, we can use the proposed
> algorithm to find a possibly-largest transaction, otherwise use the
> current way.
>

Yeah, that makes sense.

> >
> > >
> > > > 1) A scenario where suppose you have one very large transaction that
> > > > is consuming ~40% of the memory and 5-6 comparatively smaller
> > > > transactions that are just above 10% of the memory limit.  And now for
> > > > coming under the memory limit instead of getting 1 large transaction
> > > > evicted out, we are evicting out multiple times.
> > >
> > > Given the large transaction list will have up to 10 transactions, I
> > > think it's cheap to pick the largest transaction among them. It's O(N)
> > > but N won't be large.
> > >
> > > > 2) Another scenario where all the transactions are under 10% of the
> > > > memory limit but let's say there are some transactions are consuming
> > > > around 8-9% of the memory limit each but those are not very old
> > > > transactions whereas there are certain old transactions which are
> > > > fairly small and consuming under 1% of memory limit and there are many
> > > > such transactions.  So how it would affect if we frequently select
> > > > many of these transactions to come under memory limit instead of
> > > > selecting a couple of large transactions which are consuming 8-9%?
> > >
> > > Yeah, probably we can do something for small transactions (i.e. small
> > > and on-memory transactions). One idea is to pick the largest
> > > transaction among them by iterating over all of them. Given that the
> > > more transactions are evicted, the less transactions the on-memory
> > > transaction list has, unlike the current algorithm, we would still
> > > win. Or we could even split it into several sub-lists in order to
> > > reduce the number of transactions to check. For example, splitting it
> > > into two lists: transactions consuming 5% < and 5% >=  of the memory
> > > limit, and checking the 5% >= list preferably.
> > >
> >
> > Which memory limit are you referring to here? Is it 
> > logical_decoding_work_mem?
>
> logical_decoding_work_mem.
>
> >
> > > The cost for
> > > maintaining these lists could increase, though.
> > >
> >
> > Yeah, can't we maintain a single list of all xacts that are consuming
> > equal to or greater than the memory limit? Considering that the memory
> > limit is logical_decoding_work_mem, then I think just picking one
> > transaction to serialize would be sufficient.
>
> IIUC we serialize a transaction when the sum of all transactions'
> memory usage in the reorderbuffer exceeds logical_decoding_work_mem.
> In what cases are multiple transactions consuming equal to or greater
> than the logical_decoding_work_mem?
>

The individual transactions shouldn't cross
'logical_decoding_work_mem'. I got a bit confused by your proposal to
maintain the lists: "...splitting it into two lists: transactions
consuming 5% < and 5% >=  of the memory limit, and checking the 5% >=
list preferably.". In the previous sentence, what did you mean by
transactions consuming 5% >= of the memory limit? I got the impression
that you are saying to maintain them in a separate transaction list,
which doesn't seem to be the case.

-- 
With Regards,
Amit Kapila.




Re: Improve eviction algorithm in ReorderBuffer

2023-12-16 Thread Masahiko Sawada
On Sat, Dec 16, 2023 at 1:36 AM Euler Taveira  wrote:
>
> On Fri, Dec 15, 2023, at 9:11 AM, Masahiko Sawada wrote:
>
>
> I assume you mean to add ReorderBufferTXN entries to the binaryheap
> and then build it by comparing their sizes (i.e. txn->size). But
> ReorderBufferTXN entries are removed and deallocated once the
> transaction finished. How can we efficiently remove these entries from
> binaryheap? I guess it would be O(n) to find the entry among the
> unordered entries, where n is the number of transactions in the
> reorderbuffer.
>
>
> O(log n) for both functions: binaryheap_remove_first() and
> binaryheap_remove_node().

Right. binaryheap_remove_first() removes the topmost entry in
O(log n), but the ReorderBufferTXN being removed is not necessarily
the topmost entry, since we remove the entry when the transaction
completes (committed or aborted). binaryheap_remove_node() removes the
entry at the given index in O(log n), but I'm not sure how we can know
the index of each entry. I think we can remember the index of a newly
added entry after calling binaryheap_add_unordered(), but once we call
binaryheap_build() the index is out of date. So I think that in the
worst case we would need to check all entries in order to remove an
arbitrary entry from the binaryheap, which is O(n). I might be missing
something, though.

> I didn't read your patch but do you really need to
> free entries one by one? If not, binaryheap_free().

The patch doesn't touch on how to free entries. ReorderBufferTXN
entries are freed one by one as each transaction completes (commits or
aborts).

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Improve eviction algorithm in ReorderBuffer

2023-12-15 Thread Euler Taveira
On Fri, Dec 15, 2023, at 9:11 AM, Masahiko Sawada wrote:
> 
> I assume you mean to add ReorderBufferTXN entries to the binaryheap
> and then build it by comparing their sizes (i.e. txn->size). But
> ReorderBufferTXN entries are removed and deallocated once the
> transaction finished. How can we efficiently remove these entries from
> binaryheap? I guess it would be O(n) to find the entry among the
> unordered entries, where n is the number of transactions in the
> reorderbuffer.

O(log n) for both functions: binaryheap_remove_first() and
binaryheap_remove_node(). I didn't read your patch but do you really need to
free entries one by one? If not, binaryheap_free().


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: Improve eviction algorithm in ReorderBuffer

2023-12-15 Thread Masahiko Sawada
On Fri, Dec 15, 2023 at 2:59 PM Masahiko Sawada  wrote:
>
> On Fri, Dec 15, 2023 at 12:37 PM Amit Kapila  wrote:
> >
> > On Wed, Dec 13, 2023 at 6:01 AM Masahiko Sawada  
> > wrote:
> > >
> > > On Tue, Dec 12, 2023 at 1:33 PM Dilip Kumar  wrote:
> > > >
> > > > On Tue, Dec 12, 2023 at 9:01 AM Masahiko Sawada  
> > > > wrote:
> > > > >
> > > > > I've heard the report internally that replication lag became huge when
> > > > > decoding transactions each consisting of 500k sub transactions. Note
> > > > > that ReorderBufferLargestTXN() is used only in non-streaming mode.
> > > > >
> >
> > Can't you suggest them to use streaming mode to avoid this problem or
> > do you see some problem with that?
>
> Yeah, that's one option. But I can suggest
>

Sorry, it was still in the middle of editing.

Yeah, that's one option. But since there is a trade-off, I cannot
suggest using streaming mode for every user. Also, the logical
replication client (e.g., a third-party tool receiving the logical
change set) might not support streaming mode yet.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Improve eviction algorithm in ReorderBuffer

2023-12-15 Thread Masahiko Sawada
On Fri, Dec 15, 2023 at 7:10 PM Alvaro Herrera  wrote:
>
> On 2023-Dec-12, Masahiko Sawada wrote:
>
> > To deal with this problem, I initially thought of the idea (a)
> > mentioned in the comment; use a binary heap to maintain the
> > transactions sorted by the amount of changes or the size. But it seems
> > not a good idea to try maintaining all transactions by  its size since
> > the size of each transaction could be changed frequently.
>
> Hmm, maybe you can just use binaryheap_add_unordered and just let the
> sizes change, and do binaryheap_build() at the point where the eviction
> is needed.

I assume you mean to add ReorderBufferTXN entries to the binaryheap
and then build it by comparing their sizes (i.e. txn->size). But
ReorderBufferTXN entries are removed and deallocated once the
transaction finished. How can we efficiently remove these entries from
binaryheap? I guess it would be O(n) to find the entry among the
unordered entries, where n is the number of transactions in the
reorderbuffer.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Improve eviction algorithm in ReorderBuffer

2023-12-15 Thread Alvaro Herrera
On 2023-Dec-12, Masahiko Sawada wrote:

> To deal with this problem, I initially thought of the idea (a)
> mentioned in the comment; use a binary heap to maintain the
> transactions sorted by the amount of changes or the size. But it seems
> not a good idea to try maintaining all transactions by  its size since
> the size of each transaction could be changed frequently.

Hmm, maybe you can just use binaryheap_add_unordered and just let the
sizes change, and do binaryheap_build() at the point where the eviction
is needed.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"No necesitamos banderas
 No reconocemos fronteras"  (Jorge González)




Re: Improve eviction algorithm in ReorderBuffer

2023-12-14 Thread Masahiko Sawada
On Fri, Dec 15, 2023 at 12:37 PM Amit Kapila  wrote:
>
> On Wed, Dec 13, 2023 at 6:01 AM Masahiko Sawada  wrote:
> >
> > On Tue, Dec 12, 2023 at 1:33 PM Dilip Kumar  wrote:
> > >
> > > On Tue, Dec 12, 2023 at 9:01 AM Masahiko Sawada  
> > > wrote:
> > > >
> > > > I've heard the report internally that replication lag became huge when
> > > > decoding transactions each consisting of 500k sub transactions. Note
> > > > that ReorderBufferLargestTXN() is used only in non-streaming mode.
> > > >
>
> Can't you suggest them to use streaming mode to avoid this problem or
> do you see some problem with that?

Yeah, that's one option. But I can suggest

>
> > > > Here is a test script for a many subtransactions scenario. In my
> > > > environment, the logical decoding took over 2min to decode one top
> > > > transaction having 100k subtransactions.
> > > >
> > > > -
> > > > create table test (c int);
> > > > create or replace function testfn (cnt int) returns void as $$
> > > > begin
> > > >   for i in 1..cnt loop
> > > > begin
> > > >   insert into test values (i);
> > > > exception when division_by_zero then
> > > >   raise notice 'caught error';
> > > >   return;
> > > > end;
> > > >   end loop;
> > > > end;
> > > > $$
> > > > language plpgsql;
> > > > select testfn(10)
> > > > set logical_decoding_work_mem to '4MB';
> > > > select count(*) from pg_logical_slot_peek_changes('s', null, null)
> > > > 
> > > >
> > > > To deal with this problem, I initially thought of the idea (a)
> > > > mentioned in the comment; use a binary heap to maintain the
> > > > transactions sorted by the amount of changes or the size. But it seems
> > > > not a good idea to try maintaining all transactions by  its size since
> > > > the size of each transaction could be changed frequently.
> > > >
> > > > The attached patch uses a different approach that consists of three
> > > > strategies; (1) maintain the list of transactions whose size is larger
> > > > than 10% of logical_decoding_work_mem, and preferentially evict a
> > > > transaction from this list.
> > > >
>
> IIUC, you are giving preference to multiple list ideas as compared to
> (a) because you don't need to adjust the list each time the
> transaction size changes, is that right?

Right.

>  If so, I think there is a
> cost to keep that data structure up-to-date but it can help in
> reducing the number of times we need to serialize.

Yes, there is a trade-off.

What I don't want to do is to keep all transactions ordered since it's
too costly. The proposed idea uses multiple lists to keep all
transactions roughly ordered. The maintenance cost would be cheap
since each list is unordered.

It might be a good idea to have a threshold to switch how to pick the
largest transaction based on the number of transactions in the
reorderbuffer. If there are many transactions, we can use the proposed
algorithm to find a possibly-largest transaction, otherwise use the
current way.

>
>  If the list is empty, all transactions are
> > > > small enough, (2) so we evict the oldest top-level transaction from
> > > > rb->toplevel_by_lsn list. Evicting older transactions would help in
> > > > freeing memory blocks in GenerationContext. Finally, if this is also
> > > > empty, (3) we evict a transaction that size is > 0. Here, we need to
> > > > note the fact that even if a transaction is evicted the
> > > > ReorderBufferTXN entry is not removed from rb->by_txn but its size is
> > > > 0. In the worst case where all (quite a few) transactions are smaller
> > > > than 10% of the memory limit, we might end up checking many
> > > > transactions to find non-zero size transaction entries to evict. So
> > > > the patch adds a new list to maintain all transactions that have at
> > > > least one change in memory.
> > > >
> > > > Summarizing the algorithm I've implemented in the patch,
> > > >
> > > > 1. pick a transaction from the list of large transactions (larger than
> > > > 10% of memory limit).
> > > > 2. pick a transaction from the top-level transaction list in LSN order.
> > > > 3. pick a transaction from the list of transactions that have at least
> > > > one change in memory.
> > > >
> > > > With the patch, the above test case completed within 3 seconds in my
> > > > environment.
> > >
> > > Thanks for working on this, I think it would be good to test other
> > > scenarios as well where this might have some negative impact and see
> > > where we stand.
> >
> > Agreed.
> >
> > > 1) A scenario where suppose you have one very large transaction that
> > > is consuming ~40% of the memory and 5-6 comparatively smaller
> > > transactions that are just above 10% of the memory limit.  And now for
> > > coming under the memory limit instead of getting 1 large transaction
> > > evicted out, we are evicting out multiple times.
> >
> > Given the large transaction list will have up to 10 transactions, I
> > think it's cheap to pick the largest transaction among them. It's O(N)
> > 

Re: Improve eviction algorithm in ReorderBuffer

2023-12-14 Thread Amit Kapila
On Wed, Dec 13, 2023 at 6:01 AM Masahiko Sawada  wrote:
>
> On Tue, Dec 12, 2023 at 1:33 PM Dilip Kumar  wrote:
> >
> > On Tue, Dec 12, 2023 at 9:01 AM Masahiko Sawada  
> > wrote:
> > >
> > > I've heard the report internally that replication lag became huge when
> > > decoding transactions each consisting of 500k sub transactions. Note
> > > that ReorderBufferLargestTXN() is used only in non-streaming mode.
> > >

Can't you suggest them to use streaming mode to avoid this problem or
do you see some problem with that?

> > > Here is a test script for a many subtransactions scenario. In my
> > > environment, the logical decoding took over 2min to decode one top
> > > transaction having 100k subtransactions.
> > >
> > > -
> > > create table test (c int);
> > > create or replace function testfn (cnt int) returns void as $$
> > > begin
> > >   for i in 1..cnt loop
> > > begin
> > >   insert into test values (i);
> > > exception when division_by_zero then
> > >   raise notice 'caught error';
> > >   return;
> > > end;
> > >   end loop;
> > > end;
> > > $$
> > > language plpgsql;
> > > select testfn(10)
> > > set logical_decoding_work_mem to '4MB';
> > > select count(*) from pg_logical_slot_peek_changes('s', null, null)
> > > 
> > >
> > > To deal with this problem, I initially thought of the idea (a)
> > > mentioned in the comment; use a binary heap to maintain the
> > > transactions sorted by the amount of changes or the size. But it seems
> > > not a good idea to try maintaining all transactions by  its size since
> > > the size of each transaction could be changed frequently.
> > >
> > > The attached patch uses a different approach that consists of three
> > > strategies; (1) maintain the list of transactions whose size is larger
> > > than 10% of logical_decoding_work_mem, and preferentially evict a
> > > transaction from this list.
> > >

IIUC, you are giving preference to multiple list ideas as compared to
(a) because you don't need to adjust the list each time the
transaction size changes, is that right? If so, I think there is a
cost to keep that data structure up-to-date but it can help in
reducing the number of times we need to serialize.

 If the list is empty, all transactions are
> > > small enough, (2) so we evict the oldest top-level transaction from
> > > rb->toplevel_by_lsn list. Evicting older transactions would help in
> > > freeing memory blocks in GenerationContext. Finally, if this is also
> > > empty, (3) we evict a transaction that size is > 0. Here, we need to
> > > note the fact that even if a transaction is evicted the
> > > ReorderBufferTXN entry is not removed from rb->by_txn but its size is
> > > 0. In the worst case where all (quite a few) transactions are smaller
> > > than 10% of the memory limit, we might end up checking many
> > > transactions to find non-zero size transaction entries to evict. So
> > > the patch adds a new list to maintain all transactions that have at
> > > least one change in memory.
> > >
> > > Summarizing the algorithm I've implemented in the patch,
> > >
> > > 1. pick a transaction from the list of large transactions (larger than
> > > 10% of memory limit).
> > > 2. pick a transaction from the top-level transaction list in LSN order.
> > > 3. pick a transaction from the list of transactions that have at least
> > > one change in memory.
> > >
> > > With the patch, the above test case completed within 3 seconds in my
> > > environment.
> >
> > Thanks for working on this, I think it would be good to test other
> > scenarios as well where this might have some negative impact and see
> > where we stand.
>
> Agreed.
>
> > 1) A scenario where suppose you have one very large transaction that
> > is consuming ~40% of the memory and 5-6 comparatively smaller
> > transactions that are just above 10% of the memory limit.  And now for
> > coming under the memory limit instead of getting 1 large transaction
> > evicted out, we are evicting out multiple times.
>
> Given the large transaction list will have up to 10 transactions, I
> think it's cheap to pick the largest transaction among them. It's O(N)
> but N won't be large.
>
> > 2) Another scenario where all the transactions are under 10% of the
> > memory limit but let's say there are some transactions are consuming
> > around 8-9% of the memory limit each but those are not very old
> > transactions whereas there are certain old transactions which are
> > fairly small and consuming under 1% of memory limit and there are many
> > such transactions.  So how it would affect if we frequently select
> > many of these transactions to come under memory limit instead of
> > selecting a couple of large transactions which are consuming 8-9%?
>
> Yeah, probably we can do something for small transactions (i.e. small
> and on-memory transactions). One idea is to pick the largest
> transaction among them by iterating over all of them. Given that the
> more transactions are evicted, the less 

Re: Improve eviction algorithm in ReorderBuffer

2023-12-12 Thread Dilip Kumar
On Wed, Dec 13, 2023 at 6:01 AM Masahiko Sawada  wrote:
>
> > Thanks for working on this, I think it would be good to test other
> > scenarios as well where this might have some negative impact and see
> > where we stand.
>
> Agreed.
>
> > 1) A scenario where suppose you have one very large transaction that
> > is consuming ~40% of the memory and 5-6 comparatively smaller
> > transactions that are just above 10% of the memory limit.  And now for
> > coming under the memory limit instead of getting 1 large transaction
> > evicted out, we are evicting out multiple times.
>
> Given the large transaction list will have up to 10 transactions, I
> think it's cheap to pick the largest transaction among them. It's O(N)
> but N won't be large.

Yeah, that makes sense.

> > 2) Another scenario where all the transactions are under 10% of the
> > memory limit but let's say there are some transactions are consuming
> > around 8-9% of the memory limit each but those are not very old
> > transactions whereas there are certain old transactions which are
> > fairly small and consuming under 1% of memory limit and there are many
> > such transactions.  So how it would affect if we frequently select
> > many of these transactions to come under memory limit instead of
> > selecting a couple of large transactions which are consuming 8-9%?
>
> Yeah, probably we can do something for small transactions (i.e. small
> and on-memory transactions). One idea is to pick the largest
> transaction among them by iterating over all of them. Given that the
> more transactions are evicted, the less transactions the on-memory
> transaction list has, unlike the current algorithm, we would still
> win. Or we could even split it into several sub-lists in order to
> reduce the number of transactions to check. For example, splitting it
> into two lists: transactions consuming 5% < and 5% >=  of the memory
> limit, and checking the 5% >= list preferably. The cost for
> maintaining these lists could increase, though.
>
> Do you have any ideas?

Yeah, something like what you mention might be good: we maintain 3
lists for large, medium, and small transactions.  In the large
transaction list, suppose we allow transactions that consume more than
10%, so there could be at most 10 transactions, and we can do a
sequential search and spill the largest of all.  Whereas in the medium
list, suppose we keep transactions ranging from e.g. 3-10%, then it's
just fine to pick from the head because the size differences between
the largest and smallest transactions in this list are not very
significant.  The rest go in the small transaction list, and from the
small transaction list we can choose to spill multiple transactions at
a time.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Improve eviction algorithm in ReorderBuffer

2023-12-12 Thread Masahiko Sawada
On Tue, Dec 12, 2023 at 1:33 PM Dilip Kumar  wrote:
>
> On Tue, Dec 12, 2023 at 9:01 AM Masahiko Sawada  wrote:
> >
> > Hi all,
> >
> > As the comment of ReorderBufferLargestTXN() says, it's very slow with
> > many subtransactions:
> >
> > /*
> >  * Find the largest transaction (toplevel or subxact) to evict (spill to 
> > disk).
> >  *
> >  * XXX With many subtransactions this might be quite slow, because we'll 
> > have
> >  * to walk through all of them. There are some options how we could improve
> >  * that: (a) maintain some secondary structure with transactions sorted by
> >  * amount of changes, (b) not looking for the entirely largest transaction,
> >  * but e.g. for transaction using at least some fraction of the memory 
> > limit,
> >  * and (c) evicting multiple transactions at once, e.g. to free a given 
> > portion
> >  * of the memory limit (e.g. 50%).
> >  */
> >
> > This is because the reorderbuffer has transaction entries for each
> > top-level and sub transaction, and ReorderBufferLargestTXN() walks
> > through all transaction entries to pick the transaction to evict.
> > I've heard the report internally that replication lag became huge when
> > decoding transactions each consisting of 500k sub transactions. Note
> > that ReorderBufferLargestTXN() is used only in non-streaming mode.
> >
> > Here is a test script for a many subtransactions scenario. In my
> > environment, the logical decoding took over 2min to decode one top
> > transaction having 100k subtransactions.
> >
> > -
> > create table test (c int);
> > create or replace function testfn (cnt int) returns void as $$
> > begin
> >   for i in 1..cnt loop
> > begin
> >   insert into test values (i);
> > exception when division_by_zero then
> >   raise notice 'caught error';
> >   return;
> > end;
> >   end loop;
> > end;
> > $$
> > language plpgsql;
> > select testfn(10)
> > set logical_decoding_work_mem to '4MB';
> > select count(*) from pg_logical_slot_peek_changes('s', null, null)
> > 
> >
> > To deal with this problem, I initially thought of the idea (a)
> > mentioned in the comment; use a binary heap to maintain the
> > transactions sorted by the amount of changes or the size. But it seems
> > not a good idea to try maintaining all transactions by  its size since
> > the size of each transaction could be changed frequently.
> >
> > The attached patch uses a different approach that consists of three
> > strategies; (1) maintain the list of transactions whose size is larger
> > than 10% of logical_decoding_work_mem, and preferentially evict a
> > transaction from this list. If the list is empty, all transactions are
> > small enough, (2) so we evict the oldest top-level transaction from
> > rb->toplevel_by_lsn list. Evicting older transactions would help in
> > freeing memory blocks in GenerationContext. Finally, if this is also
> > empty, (3) we evict a transaction that size is > 0. Here, we need to
> > note the fact that even if a transaction is evicted the
> > ReorderBufferTXN entry is not removed from rb->by_txn but its size is
> > 0. In the worst case where all (quite a few) transactions are smaller
> > than 10% of the memory limit, we might end up checking many
> > transactions to find non-zero size transaction entries to evict. So
> > the patch adds a new list to maintain all transactions that have at
> > least one change in memory.
> >
> > Summarizing the algorithm I've implemented in the patch,
> >
> > 1. pick a transaction from the list of large transactions (larger than
> > 10% of memory limit).
> > 2. pick a transaction from the top-level transaction list in LSN order.
> > 3. pick a transaction from the list of transactions that have at least
> > one change in memory.
> >
> > With the patch, the above test case completed within 3 seconds in my
> > environment.
>
> Thanks for working on this, I think it would be good to test other
> scenarios as well where this might have some negative impact and see
> where we stand.

Agreed.

> 1) A scenario where suppose you have one very large transaction that
> is consuming ~40% of the memory and 5-6 comparatively smaller
> transactions that are just above 10% of the memory limit.  And now for
> coming under the memory limit instead of getting 1 large transaction
> evicted out, we are evicting out multiple times.

Given that the large transaction list will have up to 10 transactions,
I think it's cheap to pick the largest transaction among them. It's O(N)
but N won't be large.

> 2) Another scenario where all the transactions are under 10% of the
> memory limit but let's say there are some transactions are consuming
> around 8-9% of the memory limit each but those are not very old
> transactions whereas there are certain old transactions which are
> fairly small and consuming under 1% of memory limit and there are many
> such transactions.  So how it would affect if we frequently select
> many of these transactions to come under memory limit instead 

Re: Improve eviction algorithm in ReorderBuffer

2023-12-11 Thread Dilip Kumar
On Tue, Dec 12, 2023 at 9:01 AM Masahiko Sawada  wrote:
>
> Hi all,
>
> As the comment of ReorderBufferLargestTXN() says, it's very slow with
> many subtransactions:
>
> /*
>  * Find the largest transaction (toplevel or subxact) to evict (spill to 
> disk).
>  *
>  * XXX With many subtransactions this might be quite slow, because we'll have
>  * to walk through all of them. There are some options how we could improve
>  * that: (a) maintain some secondary structure with transactions sorted by
>  * amount of changes, (b) not looking for the entirely largest transaction,
>  * but e.g. for transaction using at least some fraction of the memory limit,
>  * and (c) evicting multiple transactions at once, e.g. to free a given 
> portion
>  * of the memory limit (e.g. 50%).
>  */
>
> This is because the reorderbuffer has transaction entries for each
> top-level and sub transaction, and ReorderBufferLargestTXN() walks
> through all transaction entries to pick the transaction to evict.
> I've heard the report internally that replication lag became huge when
> decoding transactions each consisting of 500k sub transactions. Note
> that ReorderBufferLargestTXN() is used only in non-streaming mode.
>
> Here is a test script for a many subtransactions scenario. In my
> environment, the logical decoding took over 2min to decode one top
> transaction having 100k subtransactions.
>
> -
> create table test (c int);
> create or replace function testfn (cnt int) returns void as $$
> begin
>   for i in 1..cnt loop
> begin
>   insert into test values (i);
> exception when division_by_zero then
>   raise notice 'caught error';
>   return;
> end;
>   end loop;
> end;
> $$
> language plpgsql;
> select testfn(10)
> set logical_decoding_work_mem to '4MB';
> select count(*) from pg_logical_slot_peek_changes('s', null, null)
> 
>
> To deal with this problem, I initially thought of the idea (a)
> mentioned in the comment; use a binary heap to maintain the
> transactions sorted by the amount of changes or the size. But it seems
> not a good idea to try maintaining all transactions by  its size since
> the size of each transaction could be changed frequently.
>
> The attached patch uses a different approach that consists of three
> strategies; (1) maintain the list of transactions whose size is larger
> than 10% of logical_decoding_work_mem, and preferentially evict a
> transaction from this list. If the list is empty, all transactions are
> small enough, (2) so we evict the oldest top-level transaction from
> rb->toplevel_by_lsn list. Evicting older transactions would help in
> freeing memory blocks in GenerationContext. Finally, if this is also
> empty, (3) we evict a transaction that size is > 0. Here, we need to
> note the fact that even if a transaction is evicted the
> ReorderBufferTXN entry is not removed from rb->by_txn but its size is
> 0. In the worst case where all (quite a few) transactions are smaller
> than 10% of the memory limit, we might end up checking many
> transactions to find non-zero size transaction entries to evict. So
> the patch adds a new list to maintain all transactions that have at
> least one change in memory.
>
> Summarizing the algorithm I've implemented in the patch,
>
> 1. pick a transaction from the list of large transactions (larger than
> 10% of memory limit).
> 2. pick a transaction from the top-level transaction list in LSN order.
> 3. pick a transaction from the list of transactions that have at least
> one change in memory.
>
> With the patch, the above test case completed within 3 seconds in my
> environment.

Thanks for working on this, I think it would be good to test other
scenarios as well where this might have some negative impact and see
where we stand.  I mean
1) A scenario where suppose you have one very large transaction that
is consuming ~40% of the memory and 5-6 comparatively smaller
transactions that are just above 10% of the memory limit.  And now, to
come under the memory limit, instead of evicting the one large
transaction, we end up evicting multiple times.
2) Another scenario where all the transactions are under 10% of the
memory limit, but let's say some transactions are consuming around
8-9% of the memory limit each, though those are not very old
transactions, whereas there are certain old transactions which are
fairly small, consuming under 1% of the memory limit, and there are
many such transactions.  So how would it affect things if we
frequently select many of these transactions to come under the memory
limit instead of selecting a couple of large transactions which are
consuming 8-9%?

>
> As a side note, the idea (c) mentioned in the comment, evicting
> multiple transactions at once to free a given portion of the memory,
> would also help in avoiding back and forth the memory threshold. It's
> also worth considering.

Yes, I think it is worth considering.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Improve eviction algorithm in ReorderBuffer

2023-12-11 Thread Masahiko Sawada
Hi all,

As the comment of ReorderBufferLargestTXN() says, it's very slow with
many subtransactions:

/*
 * Find the largest transaction (toplevel or subxact) to evict (spill to disk).
 *
 * XXX With many subtransactions this might be quite slow, because we'll have
 * to walk through all of them. There are some options how we could improve
 * that: (a) maintain some secondary structure with transactions sorted by
 * amount of changes, (b) not looking for the entirely largest transaction,
 * but e.g. for transaction using at least some fraction of the memory limit,
 * and (c) evicting multiple transactions at once, e.g. to free a given portion
 * of the memory limit (e.g. 50%).
 */

This is because the reorderbuffer has transaction entries for each
top-level and sub transaction, and ReorderBufferLargestTXN() walks
through all transaction entries to pick the transaction to evict.
I've heard a report internally that replication lag became huge when
decoding transactions each consisting of 500k subtransactions. Note
that ReorderBufferLargestTXN() is used only in non-streaming mode.

Here is a test script for a many-subtransactions scenario. In my
environment, the logical decoding took over 2 min to decode one
top-level transaction having 100k subtransactions.

-
create table test (c int);
create or replace function testfn (cnt int) returns void as $$
begin
  for i in 1..cnt loop
    begin
      insert into test values (i);
    exception when division_by_zero then
      raise notice 'caught error';
      return;
    end;
  end loop;
end;
$$
language plpgsql;
select pg_create_logical_replication_slot('s', 'test_decoding');
select testfn(100000);
set logical_decoding_work_mem to '4MB';
select count(*) from pg_logical_slot_peek_changes('s', null, null);
-

To deal with this problem, I initially thought of the idea (a)
mentioned in the comment: use a binary heap to maintain the
transactions sorted by the amount of changes or the size. But it seems
a bad idea to try to keep all transactions ordered by size, since the
size of each transaction can change frequently.
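To illustrate the bookkeeping a plain binary heap would force on us (my sketch, not code from any patch): a classic array-based heap has no cheap decrease-key/increase-key, so every size change needs either an expensive rebuild or lazy invalidation of stale entries, as modeled here with Python's heapq:

```python
import heapq

heap = []    # entries are (-size, seq, txn_id); max-heap via negation
latest = {}  # txn_id -> current size, used to detect stale heap entries
seq = 0

def update_size(txn_id, size):
    """Record a size change by pushing a fresh entry; old entries go stale."""
    global seq
    latest[txn_id] = size
    heapq.heappush(heap, (-size, seq, txn_id))
    seq += 1

def pop_largest():
    """Pop entries until one matches the transaction's current size."""
    while heap:
        neg_size, _, txn_id = heapq.heappop(heap)
        if latest.get(txn_id) == -neg_size:  # skip stale entries
            del latest[txn_id]
            return txn_id, -neg_size
    return None

update_size("t1", 10)
update_size("t2", 30)
update_size("t1", 50)   # t1 grew; its old entry (10) is now stale
print(pop_largest())    # ('t1', 50)
```

With many transactions whose sizes change on every decoded record, the heap accumulates stale entries proportional to the update rate, which is exactly the overhead that made idea (a) look unattractive at first.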

The attached patch uses a different approach consisting of three
strategies: (1) maintain a list of transactions whose size is larger
than 10% of logical_decoding_work_mem, and preferentially evict a
transaction from this list. If the list is empty, all transactions are
small enough, so (2) we evict the oldest top-level transaction from
the rb->toplevel_by_lsn list; evicting older transactions also helps
free memory blocks in the GenerationContext. Finally, if this list is
also empty, (3) we evict a transaction whose size is > 0. Here we need
to note that even after a transaction is evicted, its
ReorderBufferTXN entry is not removed from rb->by_txn; its size is
just 0. In the worst case, where all of (quite a few) transactions are
smaller than 10% of the memory limit, we might end up checking many
transactions just to find entries with a non-zero size to evict. So
the patch adds a new list to maintain all transactions that have at
least one change in memory.

Summarizing the algorithm I've implemented in the patch,

1. pick a transaction from the list of large transactions (larger than
10% of memory limit).
2. pick a transaction from the top-level transaction list in LSN order.
3. pick a transaction from the list of transactions that have at least
one change in memory.
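In rough pseudo-Python, the selection above looks like this (hypothetical names and data layout; the actual patch manipulates ReorderBufferTXN lists in C):

```python
def pick_victim(txns, limit):
    """Pick an eviction victim; txns are dicts with 'size', 'lsn', 'is_toplevel'."""
    # 1. prefer any transaction larger than 10% of logical_decoding_work_mem
    large = [t for t in txns if t["size"] > limit * 0.10]
    if large:
        return large[0]
    # 2. otherwise the oldest top-level transaction (by LSN) with changes
    toplevel = [t for t in txns if t["is_toplevel"] and t["size"] > 0]
    if toplevel:
        return min(toplevel, key=lambda t: t["lsn"])
    # 3. otherwise any transaction that still has at least one change in memory
    nonzero = [t for t in txns if t["size"] > 0]
    return nonzero[0] if nonzero else None
```

The point of the dedicated lists in the patch is that each of these three lookups is O(1) list-head access rather than the linear scans this sketch does.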

With the patch, the above test case completed within 3 seconds in my
environment.

As a side note, the idea (c) mentioned in the comment, evicting
multiple transactions at once to free a given portion of the memory,
would also help avoid bouncing back and forth across the memory
threshold. It's also worth considering.
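As a sketch of idea (c) (my illustration, not code from the comment or any patch), eviction would continue until some target fraction of the limit has been freed, rather than stopping the moment usage dips below it:

```python
def evict_until_freed(txns, limit, target_fraction=0.5):
    """Evict largest-first until target_fraction of the limit has been freed.

    txns is a list of transaction sizes; returns (victims, freed).
    """
    remaining = sorted(txns, reverse=True)
    victims, freed = [], 0
    while remaining and freed < limit * target_fraction:
        size = remaining.pop(0)  # evict the largest remaining transaction
        victims.append(size)
        freed += size
    return victims, freed

victims, freed = evict_until_freed([40, 30, 20, 10], limit=100)
```

With these numbers the batch evicts the 40 and 30 (70 freed, past the 50% target), leaving headroom so the next few size increases don't immediately trigger another eviction pass.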

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


improve_eviction_rb_poc.patch
Description: Binary data