PG15 beta1 sort performance regression due to Generation context change

2022-05-19 Thread David Rowley
Hackers,

Over the past few days I've been gathering some benchmark results
together to show the sort performance improvements in PG15 [1].

One of the test cases I did was to demonstrate Heikki's change to use
a k-way merge (65014000b).

The test I did to try this out was along the lines of:

set max_parallel_workers_per_gather = 0;
create table t (a bigint not null, b bigint not null, c bigint not
null, d bigint not null, e bigint not null, f bigint not null);

insert into t select x,x,x,x,x,x from generate_Series(1,140247142) x; -- 10GB!
vacuum freeze t;

The query I ran was:

select * from t order by a offset 140247142;

I tested various sizes of work_mem starting at 4MB and doubled that
all the way to 16GB. For many of the smaller values of work_mem the
performance is vastly improved by Heikki's change, however for
work_mem = 64MB I detected quite a large slowdown. PG14 took 20.9
seconds and PG15 beta 1 took 29 seconds!

I've been trying to get to the bottom of this today and finally have
discovered this is due to the tuple size allocations in the sort being
exactly 64 bytes. Prior to 40af10b57 (Use Generation memory contexts
to store tuples in sorts) the tuple for the sort would be stored in an
aset context. After 40af10b57 we'll use a generation context.  The
idea behind that change is that the generation context does no
power-of-2 round-ups for allocations, so we save memory in most cases.
However, because this particular test has a tuple size of 64 bytes,
there was no power-of-2 wastage with aset.

The problem is that generation chunks have a larger chunk header than
aset chunks do, since they must store a pointer to the block the chunk
belongs to so that GenerationFree() can increment the nfree count in
the block. aset.c does not require this, as freed chunks simply go onto
a freelist that's global to the entire context.

Basically, for my test query, the slowdown is because instead of being
able to store 620702 tuples per tape over 226 tapes with an aset
context, we can now only store 576845 tuples per tape resulting in
requiring 244 tapes when using the generation context.

If I had added column "g" to make the tuple size 72 bytes causing
aset's code to round allocations up to 128 bytes and generation.c to
maintain the 72 bytes then the sort would have stored 385805 tuples
over 364 batches for aset and 538761 tuples over 261 batches using the
generation context. That would have been a huge win.

So it basically looks like I discovered a very bad case that causes a
significant slowdown.  Yet other cases that are not an exact power of
2 stand to gain significantly from this change.

One thing 40af10b57 does is stop those terrible performance jumps
when the tuple size crosses a power-of-2 boundary. The performance
should now be more closely aligned to the size of the data being sorted...
Unfortunately, that seems to mean regressions for large sorts with
power-of-2 sized tuples.

I'm unsure exactly what I should do about this right now.

David

[1] 
https://techcommunity.microsoft.com/t5/azure-database-for-postgresql/speeding-up-sort-performance-in-postgres-15/ba-p/3396953#change4




Re: Skipping schema changes in publication

2022-05-19 Thread Peter Smith
Below are my review comments for v6-0002.

==

1. Commit message.
The psql \d family of commands to display excluded tables.

SUGGESTION
The psql \d family of commands can now display excluded tables.

~~~

2. doc/src/sgml/ref/alter_publication.sgml

@@ -22,6 +22,7 @@ PostgreSQL documentation
  
 
ALTER PUBLICATION name ADD publication_object [, ...]
+ALTER PUBLICATION name ADD ALL TABLES [ EXCEPT [ TABLE ] exception_object [, ... ] ]

The "exception_object" font is wrong. It should look the same as
"publication_object".

~~~

3. doc/src/sgml/ref/alter_publication.sgml - Examples

@@ -214,6 +220,14 @@ ALTER PUBLICATION sales_publication ADD ALL
TABLES IN SCHEMA marketing, sales;
 
   

+  
+   Alter publication production_publication to publish
+   all tables except users and
+   departments tables:
+
+ALTER PUBLICATION production_publication ADD ALL TABLES EXCEPT TABLE users, departments;
+

Consider using "EXCEPT" instead of "EXCEPT TABLE" because that will
show TABLE keyword is optional.

~~~

4. doc/src/sgml/ref/create_publication.sgml

An SGML tag error caused building the docs to fail. My fix was
previously reported [1].

~~~

5. doc/src/sgml/ref/create_publication.sgml

@@ -22,7 +22,7 @@ PostgreSQL documentation
  
 
 CREATE PUBLICATION name
-[ FOR ALL TABLES
+[ FOR ALL TABLES [ EXCEPT [ TABLE ] exception_object [, ... ] ]

The "exception_object" font is wrong. It should look the same as
"publication_object".

~~~

6. doc/src/sgml/ref/create_publication.sgml - Examples

@@ -351,6 +366,15 @@ CREATE PUBLICATION production_publication FOR
TABLE users, departments, ALL TABL
 CREATE PUBLICATION sales_publication FOR ALL TABLES IN SCHEMA marketing, sales;
 

+  
+   Create a publication that publishes all changes in all the tables except for
+   the changes of users and
+   departments table:
+
+CREATE PUBLICATION mypublication FOR ALL TABLE EXCEPT TABLE users, departments;
+
+  
+

6a.
Typo: "FOR ALL TABLE" -> "FOR ALL TABLES"

6b.
Consider using "EXCEPT" instead of "EXCEPT TABLE" because that will
show TABLE keyword is optional.

~~~

7. src/backend/catalog/pg_publication.c - GetTopMostAncestorInPublication

@@ -316,18 +316,25 @@ GetTopMostAncestorInPublication(Oid puboid, List
*ancestors, int *ancestor_level
  }
  else
  {
- aschemaPubids = GetSchemaPublications(get_rel_namespace(ancestor));
- if (list_member_oid(aschemaPubids, puboid))
+ List*aschemapubids = NIL;
+ List*aexceptpubids = NIL;
+
+ aschemapubids = GetSchemaPublications(get_rel_namespace(ancestor));
+ aexceptpubids = GetRelationPublications(ancestor, true);
+ if (list_member_oid(aschemapubids, puboid) ||
+ (puballtables && !list_member_oid(aexceptpubids, puboid)))
  {

You could re-write this as multiple conditions instead of one. That
could avoid always assigning the 'aexceptpubids', so it might be a
more efficient way to write this logic.

~~~

8. src/backend/catalog/pg_publication.c - CheckPublicationDefValues

+/*
+ * Check if the publication has default values
+ *
+ * Check the following:
+ * Publication is having default options
+ *  Publication is not associated with relations
+ *  Publication is not associated with schemas
+ *  Publication is not set with "FOR ALL TABLES"
+ */
+static bool
+CheckPublicationDefValues(HeapTuple tup)

8a.
Remove the tab. Replace with spaces.

8b.
It might be better if this comment order is the same as the logic order.
e.g.

* Check the following:
*  Publication is not set with "FOR ALL TABLES"
*  Publication is having default options
*  Publication is not associated with schemas
*  Publication is not associated with relations

~~~

9. src/backend/catalog/pg_publication.c - AlterPublicationSetAllTables

+/*
+ * Reset the publication.
+ *
+ * Reset the publication options, publication relations and
publication schemas.
+ */
+static void
+AlterPublicationSetAllTables(Relation rel, HeapTuple tup)

The function comment and the function name do not seem to match here;
something looks like a cut/paste error ??

~~~

10. src/backend/catalog/pg_publication.c - AlterPublicationSetAllTables

+ /* set all tables option */
+ values[Anum_pg_publication_puballtables - 1] = BoolGetDatum(true);
+ replaces[Anum_pg_publication_puballtables - 1] = true;

SUGGEST (comment)
/* set the ALL TABLES flag */

~~~

11. src/backend/catalog/pg_publication.c - AlterPublication

@@ -1501,6 +1579,20 @@ AlterPublication(ParseState *pstate,
AlterPublicationStmt *stmt)
  aclcheck_error(ACLCHECK_NOT_OWNER, OBJECT_PUBLICATION,
 stmt->pubname);

+ if (stmt->for_all_tables)
+ {
+ bool isdefault = CheckPublicationDefValues(tup);
+
+ if (!isdefault)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+ errmsg("Setting ALL TABLES requires publication \"%s\" to have
default values",
+stmt->pubname),
+ errhint("Use ALTER PUBLICATION ... RESET to reset the publication"));

The errmsg should start with a lowercase letter.

~~~

12. src/backend/catalog/pg_publication.c - AlterPublication

@@ -1501,6 +1579,20 @@ 

Re: 15beta1 test failure on mips in isolation/expected/stats

2022-05-19 Thread Tom Lane
Andres Freund  writes:
> On 2022-05-20 00:22:14 -0400, Tom Lane wrote:
>> There's some fallout in the expected-file, of course, but this
>> does seem to fix it (20 consecutive successful runs now at
>> 100/2).  Don't see why though ...

> I think what might be happening is that the transactional stats updates get
> reported by s2 *before* the non-transactional stats updates come in from
> s1. I.e. the pgstat_report_stat() at the end of s2_commit_prepared_a does a
> report, because the machine is slow enough for it to be "time to report stats
> again". Then s1 reports its non-transactional stats.

Sounds plausible.  And I left the test loop running, and it's now past
100 consecutive successes, so I think this change definitely "fixes" it.

> It looks like our stats maintenance around truncation isn't quite "concurrency
> safe". That code hasn't meaningfully changed, but it'd not be surprising if
> it's not 100% precise...

Yeah.  Probably not something to try to improve post-beta, especially
since it's not completely clear how transactional and non-transactional
cases *should* interact.  Maybe non-transactional updates should be
pushed immediately?  But I'm not sure if that's fully correct, and
it definitely sounds expensive.

I'd be good with tweaking this test case as you suggest, and maybe
revisiting the topic later.

Kyotaro-san worried about whether any other places in stats.spec
have the same issue.  I've not seen any evidence of that in my
tests, but perhaps some other machine with different timing
could find it.

regards, tom lane




Re: Tracking notnull attributes inside Var

2022-05-19 Thread Andy Fan
Hi Ashutosh:

   Nice to see you again!

On Tue, May 17, 2022 at 8:50 PM Ashutosh Bapat 
wrote:

> On Sun, May 15, 2022 at 8:41 AM Andy Fan  wrote:
>
> >
> > The var in RelOptInfo->reltarget should have nullable = 0 but the var in
> > RelOptInfo->baserestrictinfo should have nullable = 1;  The beauty of
> this
> > are: a). It can distinguish the two situations perfectly b). Whenever we
> want
> > to know the nullable attribute of a Var for an expression, it is super
> easy to
> > know. In summary, we need to maintain the nullable attribute at 2
> different
> > places. one is the before the filters are executed(baserestrictinfo,
> joininfo,
> > ec_list at least).  one is after the filters are executed
> (RelOptInfo.reltarget
> > only?)
>
> Thanks for identifying this. What you have written makes sense and it
> might open a few optimization opportunities. But let me put down some
> other thoughts here. You might want to take those into consideration
> when designing your solution.
>

Thanks.


>
> Do we want to just track nullable and non-nullable. May be we want
> expand this class to nullable (var may be null), non-nullable (Var is
> definitely non-NULL), null (Var will be always NULL).
>
>
Currently it doesn't support "Var will be always NULL".  Do you have any
use cases for this?  I can't think of many cases where we can get
such information, except something like "SELECT a FROM t WHERE a
IS NULL".

> But the other way to look at this is along the lines of equivalence
> classes. Equivalence classes record the expressions which are equal in
> the final result of the query. The equivalence class members are not
> equal at all the stages of query execution.  But because they are
> equal in the final result, we can impose that restriction on the lower
> levels as well. Can we think of nullable in that fashion? If a Var is
> non-nullable in the final result, we can impose that restriction on
> the intermediate stages since rows with NULL values for that Var will
> be filtered out somewhere. Similarly we could argue for null Var. But
> knowledge that a Var is nullable in the final result does not impose a
> NULL, non-NULL restriction on the intermediate stages. If we follow
> this thought process, we don't need to differentiate Var at different
> stages in query.
>

I agree this is an option.  If so, we need to track it in the PlannerInfo
struct, but it would not be as fine-grained as my previous proposal.
Without the intermediate information, we can't know whether a UniqueKey
contains multiple NULLs. That would not be an issue for the "mark
Distinct as a no-op" case, but I'm not sure it is OK for other UniqueKey
use cases.  So my current idea is that I still prefer to maintain the
intermediate information, unless we find it costs too much or is too
complex to implement, which I don't think is the case, for now at least.
So if you have time to look at the attached patch, that would be super
great as well.

-- 
Best Regards
Andy Fan


Re: Tracking notnull attributes inside Var

2022-05-19 Thread Andy Fan
Hi Tom:

Thanks for your attention!

On Wed, May 18, 2022 at 1:25 AM Tom Lane  wrote:

> Andy Fan  writes:
> > notnulls discussion is forked from UniqueKey stuff, you can see the
> > attachment for the UniqueKey introduction. Tom raised his opinion to
> > track the nullability inside Var [1][2][3]; this thread would start
> > from there based on my understanding.
>
> I'm pretty certain that I never suggested this:
>
> > struct Var
> > {
> > ...;
> >   int nullable;  // -1 unknown,  0 - not nullable.  1 - nullable
> > };
>
> You're free to pursue it if you like, but I think it will be a dead end.
>

OK, there was a big misunderstanding here.  I had my own solution at
the beginning, but I thought you wanted to go in this direction, and
since I found it really hard to understand, I started this thread to
make things clear.  It is great that the gap is filled now.

> The fundamental problem as you note is that equalVar() cannot do anything
> sane with a field defined that way.  Also, we'd have to generate Vars
> initially with nullable = unknown (else, for example, ALTER SET/DROP NOT
> NULL breaks stored views referring to the column).  It'd be on the planner
> to run through the tree and replace that with "nullable" or "not
> nullable".  It's hard to see how that's more advantageous than just
> keeping the info in the associated RelOptInfo.
>

Agreed.


>
> Also, I think you're confusing two related but distinct issues.  For
> certain optimization issues, we'd like to keep track of whether a column
> stored in a table is known NOT NULL.  However, that's not the same thing
> as the question that I've muttered about, which is how to treat a Var
> that's been possibly forced to null due to null-extension of an outer
> join.  That is a different value from the Var as read from the table,
> but we currently represent it the same within the planner, which causes
> various sorts of undesirable complication. We cannot fix that by setting
> Var.nullable = true in above-the-join instances, because it might also
> be true in below-the-join instances.  "Known not null in the table" is
> not the inverse of "potentially nulled by an outer join".  Moreover, we
> probably need to know *which* join is the one potentially nulling the Var,
> so a bool is not likely enough anyway.
>

I read the above paragraph several times, but *I think my code can
probably express this better than my words can*.  It would be great if
you could have a look at it.  Just one point to mention now: you didn't
seem to mention the case where the NULL values are filtered out by a
qual; I'm not sure whether that is negligible or an oversight.

CREATE TABLE t(a int);
SELECT * FROM t WHERE a > 1;

My patch implements my previous solution, not the inside-Var one.


> The schemes I've been toying with tend to look more like putting a
> PlaceHolderVar-ish wrapper around the Var or expression that represents
> the below-the-join value.  The wrapper node could carry any join ID
> info that we find necessary.  The thing that I'm kind of stalled on is
> how to define this struct so that it's not a big headache for join
> strength reduction (which could remove the need for a wrapper altogether)
> or outer-join reordering (which makes it a bit harder to define which
> join we think is the one nulling the value).
>
>
Not sure whether the "NULL values are filtered by qual" case matters in
this solution, and I'm pretty open on the direction.  But to avoid
further misunderstanding on my part, I would like to fill in more of
the gaps first by posting my patch now and continuing the discussion
in this direction.

-- 
Best Regards
Andy Fan


v1-0001-Introduce-notnull_attrs-for-RelOptInfo.patch
Description: Binary data


Re: 15beta1 test failure on mips in isolation/expected/stats

2022-05-19 Thread Andres Freund
Hi,

On 2022-05-20 00:22:14 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > Hm. Does the instability vanish if you switch s2_commit_prepared_a and 
> > s1_ff?
> 
> Like this?

Yea.


> diff --git a/src/test/isolation/specs/stats.spec 
> b/src/test/isolation/specs/stats.spec
> index be4ae1f4ff..70be29b207 100644
> --- a/src/test/isolation/specs/stats.spec
> +++ b/src/test/isolation/specs/stats.spec
> @@ -562,8 +562,9 @@ permutation
>s1_table_insert_k1 # should be counted
>s1_table_update_k1 # dito
>s1_prepare_a
> +  s1_ff
>s2_commit_prepared_a
> -  s1_ff s2_ff
> +  s2_ff
>s1_table_stats
>  
>  # S1 prepares, S1 aborts prepared
> 
> There's some fallout in the expected-file, of course, but this
> does seem to fix it (20 consecutive successful runs now at
> 100/2).  Don't see why though ...

I think what might be happening is that the transactional stats updates get
reported by s2 *before* the non-transactional stats updates come in from
s1. I.e. the pgstat_report_stat() at the end of s2_commit_prepared_a does a
report, because the machine is slow enough for it to be "time to report stats
again". Then s1 reports its non-transactional stats.

It looks like our stats maintenance around truncation isn't quite "concurrency
safe". That code hasn't meaningfully changed, but it'd not be surprising if
it's not 100% precise...

Greetings,

Andres Freund




Re: 15beta1 test failure on mips in isolation/expected/stats

2022-05-19 Thread Tom Lane
Andres Freund  writes:
> Hm. Does the instability vanish if you switch s2_commit_prepared_a and s1_ff?

Like this?

diff --git a/src/test/isolation/specs/stats.spec 
b/src/test/isolation/specs/stats.spec
index be4ae1f4ff..70be29b207 100644
--- a/src/test/isolation/specs/stats.spec
+++ b/src/test/isolation/specs/stats.spec
@@ -562,8 +562,9 @@ permutation
   s1_table_insert_k1 # should be counted
   s1_table_update_k1 # dito
   s1_prepare_a
+  s1_ff
   s2_commit_prepared_a
-  s1_ff s2_ff
+  s2_ff
   s1_table_stats
 
 # S1 prepares, S1 aborts prepared

There's some fallout in the expected-file, of course, but this
does seem to fix it (20 consecutive successful runs now at
100/2).  Don't see why though ...

regards, tom lane




Re: 15beta1 test failure on mips in isolation/expected/stats

2022-05-19 Thread Kyotaro Horiguchi
At Thu, 19 May 2022 22:58:14 -0400, Tom Lane  wrote in 
> Kyotaro Horiguchi  writes:
> Notice that these markers can only delay reporting of the completion
> of a step, not the launch of a step.  The isolationtester will launch
> the next step in a permutation as soon as (A) all prior steps of the
> same session are done, and (B) the immediately preceding step in the
> permutation is done or deemed blocked.  For this purpose, "deemed
> blocked" means that it has been seen to be waiting on a database lock,
> or that it is complete but the report of its completion is delayed by
> one of these markers.
> 
> There's no "waiting..." reports in the test output, nor do we have any
> condition markers in stats.spec right now, so I don't think any steps
> have been "deemed blocked".

Mmm... Thanks. I misunderstood the effect of it.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: 15beta1 test failure on mips in isolation/expected/stats

2022-05-19 Thread Andres Freund
Hi,

On 2022-05-19 22:58:14 -0400, Tom Lane wrote:
> What I am wondering about at this point is whether the effects of
> pg_stat_force_next_flush() could somehow be delayed until after we
> have told the client the command is complete.

It shouldn't - it just forces pg_stat_report_stat() to flush (rather than
doing so based on the time of the last report). And pg_stat_report_stat()
happens before ReadyForQuery().

Hm. Does the instability vanish if you switch s2_commit_prepared_a and s1_ff?

Greetings,

Andres Freund




Re: Build-farm - intermittent error in 031_column_list.pl

2022-05-19 Thread Amit Kapila
On Fri, May 20, 2022 at 6:58 AM Kyotaro Horiguchi
 wrote:
>
> > > the apply worker will use the existing slot and replication origin
> > > corresponding to the subscription. Now, it is possible that before
> > > restart the origin has not been updated and the WAL start location
> > > points to a location prior to where PUBLICATION pub9 exists which can
> > > lead to such an error. Once this error occurs, apply worker will never
> > > be able to proceed and will always return the same error. Does this
> > > make sense?
>
> Wow. I didn't think along that line. That theory explains the silence and
> makes sense even though I don't see LSN transitions that clearly
> support it.  I dimly remember a similar kind of problem..
>
> > > Unless you or others see a different theory, this seems to be the
> > > existing problem in logical replication which is manifested by this
> > > test. If we just want to fix these test failures, we can create a new
> > > subscription instead of altering the existing publication to point to
> > > the new publication.
> > >
> >
> > If the above theory is correct then I think allowing the publisher to
> > catch up with "$node_publisher->wait_for_catchup('sub1');" before
> > ALTER SUBSCRIPTION should fix this problem. Because if before ALTER
> > both publisher and subscriber are in sync then the new publication
> > should be visible to WALSender.
>
> It looks right to me.
>

Let's wait for Tomas or others working in this area to share their thoughts.

>  That time travel seems unintuitive, but it's the
> (current) way it works.
>

I have thought about it but couldn't come up with a good way to change
the way it currently works. Moreover, I think it is easy to hit this
in other ways as well. Say you first create a subscription with a
non-existent publication, and then perform an operation on any
unrelated table on the publisher before creating the required
publication; we will hit exactly this problem of "publication does not
exist". So I think we may need to live with this behavior and write
tests carefully.

-- 
With Regards,
Amit Kapila.




Re: 15beta1 test failure on mips in isolation/expected/stats

2022-05-19 Thread Tom Lane
Andres Freund  writes:
> On 2022-05-19 21:42:31 -0400, Tom Lane wrote:
>> Even more interesting, the repeatability varies with the settings
>> of max_connections and max_prepared_transactions.

> That is indeed quite odd.

>> At low values (resp. 20 and 0) I've not been able to make it happen at all,
>> but at 100 and 2 it happens circa three times out of four.

> Is this the only permutation that fails?

No, but those values definitely seem to affect the probability of
failure.  I just finished a more extensive series of runs, and got:

successes/tries with max_connections/max_prepared_transactions:
5/5 OK with 20/2
5/5 OK with 100/0
3/5 OK with 100/2
4/5 OK with 200/2
5/6 OK with 100/10
5/6 OK with 50/2
6/10 OK with 100/2

It seems like the probability decreases again if I raise either
number further.  So I'm now guessing that this is purely a timing
issue and that somehow 100/2 is near the sweet spot for hitting
the timing window.  Why those settings would affect pgstats at
all is unclear, though.

regards, tom lane




Re: 15beta1 test failure on mips in isolation/expected/stats

2022-05-19 Thread Andres Freund
Hi,

On 2022-05-19 21:42:31 -0400, Tom Lane wrote:
> >  
> > seq_scan|seq_tup_read|n_tup_ins|n_tup_upd|n_tup_del|n_live_tup|n_dead_tup|vacuum_count
> >  
> > ++-+-+-+--+--+
> > -   3|   9|5|1|0| 1| 1| 
> >   0
> > +   3|   9|5|1|0| 4| 1| 
> >   0
> >  (1 row)

> Even more interesting, the repeatability varies with the settings
> of max_connections and max_prepared_transactions.

That is indeed quite odd.


> At low values (resp. 20 and 0) I've not been able to make it happen at all,
> but at 100 and 2 it happens circa three times out of four.

Is this the only permutation that fails?

Greetings,

Andres Freund




Re: bogus: logical replication rows/cols combinations

2022-05-19 Thread Amit Kapila
On Thu, May 19, 2022 at 7:54 PM Tom Lane  wrote:
>
> Justin Pryzby  writes:
> > On Thu, May 19, 2022 at 10:33:13AM +0530, Amit Kapila wrote:
> >> I have committed the first patch after fixing this part. It seems Tom
> >> is not very happy doing this after beta-1 [1]. The reason we get this
> >> information via this view (and underlying function) is that it
> >> simplifies the queries on the subscriber-side as you can see in the
> >> second patch. The query change is as below:
> >> [1] - 
> >> https://www.postgresql.org/message-id/91075.1652929852%40sss.pgh.pa.us
>
> > I think Tom's concern is that adding information to a view seems like 
> > adding a
> > feature that hadn't previously been contemplated.
> > (Catalog changes themselves are not prohibited during the beta period).
>
> It certainly smells like a new feature, but my concern was more around the
> post-beta catalog change.  We do those only if really forced to, and the
> explanation in the commit message didn't satisfy me as to why it was
> necessary.  This explanation isn't much better --- if we're trying to
> prohibit a certain class of publication definitions, what good does it do
> to check that on the subscriber side?
>

It is required on the subscriber side because prohibition is only for
the cases where multiple publications are combined. We disallow the
cases where the column list is different for the same table when
combining publications. For example:

Publisher-side:
Create table tab(c1 int, c2 int);
Create Publication pub1 for table tab(c1);
Create Publication pub2 for table tab(c2);

Subscriber-side:
Create Subscription sub1 Connection 'dbname=postgres' Publication pub1, pub2;

We want to prohibit such cases. So, it would be better to check at the
time of 'Create Subscription' to validate such combinations and
prohibit them. To achieve that we extended the existing function
pg_get_publication_tables() and view pg_publication_tables to expose
the column list and verify such a combination. We primarily need
column list information for this prohibition but it appeared natural
to expose the row filter.

As mentioned in my previous email, we can fetch the required
information directly from system table pg_publication_rel and extend
the query in fetch_table_list to achieve the desired purpose but
extending the existing function/view for this appears to be a simpler
way.

>  Even more to the point, how can we
> have a subscriber do that by relying on view columns that don't exist in
> older versions?
>

We need a version check like (if
(walrcv_server_version(LogRepWorkerWalRcvConn) >= 15)) for that.

>  I'm also quite concerned about anything that involves
> subscribers examining row filter conditions; that sounds like a pretty
> direct route to bugs involving unsolvability and the halting problem.
>

We examine only the column list for the purpose of this prohibition.

-- 
With Regards,
Amit Kapila.




Re: pg15b1: FailedAssertion("val > base", File: "...src/include/utils/relptr.h", Line: 67, PID: 30485)

2022-05-19 Thread Kyotaro Horiguchi
At Thu, 19 May 2022 17:16:03 -0400, Tom Lane  wrote in 
> Justin Pryzby  writes:
> > ./tmp_install/usr/local/pgsql/bin/postgres -D 
> > ./src/test/regress/tmp_check/data -c min_dynamic_shared_memory=1MB
> > TRAP: FailedAssertion("val > base", File: 
> > "../../../../src/include/utils/relptr.h", Line: 67, PID: 21912)
> 
> Yeah, I see it too.
> 
> > It looks like this may be pre-existing problem exposed by
> > commit e07d4ddc55fdcf82082950b3eb0cd8f728284c9d
> 
> Agreed.  Here I see
> 
> #5  FreePageManagerInitialize (fpm=fpm@entry=0x7f34b3ddd300, 
> base=base@entry=0x7f34b3ddd300 "") at freepage.c:187
> #6  0x0082423e in dsm_shmem_init () at dsm.c:473
> 
> so that where we do
> 
>   relptr_store(base, fpm->self, fpm);
> 
> the "relative" pointer value would have to be zero, making the case
> indistinguishable from a NULL pointer.  We can either change the
> caller so that these addresses aren't the same, or give up the
> ability to store NULL in relptrs ... doesn't seem like a hard call.
> 
> One interesting question I didn't look into is why it takes a nondefault
> value of min_dynamic_shared_memory to expose this bug.

That path is taken only when a valid value is given to the parameter.
If we don't use preallocated DSM, the memory is initialized elsewhere.
In those cases the first bytes of the base address (the second
parameter of FreePageManagerInitialize) are used for a
dsa_segment_header, so the relptr won't be zero (i.e., won't look like
NULL).

It can be silenced by wasting the first MAXALIGN bytes of
dsm_main_space_begin..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: 15beta1 test failure on mips in isolation/expected/stats

2022-05-19 Thread Tom Lane
Kyotaro Horiguchi  writes:
> At Fri, 20 May 2022 11:02:21 +0900 (JST), Kyotaro Horiguchi 
>  wrote in 
>> Doesn't the step s1_table_stats needs a blocking condition (s2_ff)?
> s/needs/need/;
> If I removed the step s2_ff, I see the same difference. So I think it
> is that.  Some other permutations seem to need the same.

Hmm ... it does seem like the answer might be somewhere around there,
but it's not there exactly.  This won't fix it:

-  s1_table_stats
+  s1_table_stats(s2_ff)

That sort of marker only stabilizes cases where two different steps might
be reported as completing in either order.  In this case, that's clearly
not the problem.  What we do need is to ensure that s1_table_stats doesn't
*launch* before s2_ff is done.  However, it doesn't look to me like that's
what's happening.  isolation/README explains that

Notice that these markers can only delay reporting of the completion
of a step, not the launch of a step.  The isolationtester will launch
the next step in a permutation as soon as (A) all prior steps of the
same session are done, and (B) the immediately preceding step in the
permutation is done or deemed blocked.  For this purpose, "deemed
blocked" means that it has been seen to be waiting on a database lock,
or that it is complete but the report of its completion is delayed by
one of these markers.

There's no "waiting..." reports in the test output, nor do we have any
condition markers in stats.spec right now, so I don't think any steps
have been "deemed blocked".

What I am wondering about at this point is whether the effects of
pg_stat_force_next_flush() could somehow be delayed until after we
have told the client the command is complete.  I've not poked into
the code in that area, but if that could happen it'd explain this
behavior.

regards, tom lane




Re: 15beta1 test failure on mips in isolation/expected/stats

2022-05-19 Thread Kyotaro Horiguchi
At Fri, 20 May 2022 11:02:21 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> Doesn't the step s1_table_stats needs a blocking condition (s2_ff)?

s/needs/need/;

If I removed the step s2_ff, I see the same difference. So I think it
is that.  Some other permutations seem to need the same.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/test/isolation/specs/stats.spec b/src/test/isolation/specs/stats.spec
index be4ae1f4ff..441dd7d3b8 100644
--- a/src/test/isolation/specs/stats.spec
+++ b/src/test/isolation/specs/stats.spec
@@ -417,7 +417,7 @@ permutation
   s2_table_update_k1
   s1_table_drop
   s2_ff
-  s1_table_stats
+  s1_table_stats(s2_ff)
 
 permutation
   s1_table_select
@@ -461,7 +461,7 @@ permutation
   s2_table_select
   s1_track_counts_on
   s1_ff s2_ff
-  s1_table_stats
+  s1_table_stats(s2_ff)
   # but can count again after
   s1_table_select
   s1_table_update_k1
@@ -478,7 +478,7 @@ permutation
   s2_table_select
   s1_track_counts_on
   s1_ff s2_ff
-  s1_table_stats
+  s1_table_stats(s2_ff)
   s1_table_select
   s1_table_update_k1
   s1_ff
@@ -509,7 +509,7 @@ permutation
   s2_commit_prepared_a
   s1_table_select
   s1_ff s2_ff
-  s1_table_stats
+  s1_table_stats(s2_ff)
 
 # S1 prepares, S2 commits prepared
 permutation
@@ -533,7 +533,7 @@ permutation
   s2_rollback_prepared_a
   s1_table_select
   s1_ff s2_ff
-  s1_table_stats
+  s1_table_stats(s2_ff)
 
 
 ### 2PC: truncate handling
@@ -564,7 +564,7 @@ permutation
   s1_prepare_a
   s2_commit_prepared_a
   s1_ff s2_ff
-  s1_table_stats
+  s1_table_stats(s2_ff)
 
 # S1 prepares, S1 aborts prepared
 permutation
@@ -592,7 +592,7 @@ permutation
   s1_prepare_a
   s2_rollback_prepared_a
   s1_ff s2_ff
-  s1_table_stats
+  s1_table_stats(s2_ff)
 
 
 ### 2PC: rolled back drop maintains live / dead counters
@@ -627,7 +627,7 @@ permutation
   s1_prepare_a
   s2_rollback_prepared_a
   s1_ff s2_ff
-  s1_table_stats
+  s1_table_stats(s2_ff)
 
 
 ##


Re: 15beta1 test failure on mips in isolation/expected/stats

2022-05-19 Thread Kyotaro Horiguchi
At Thu, 19 May 2022 21:42:31 -0400, Tom Lane  wrote in 
> Christoph Berg  writes:
> > Debian unstable mips (the old 32-bit one):
> 
> > --- /<>/src/test/isolation/expected/stats.out  2022-05-16 
> > 21:10:42.0 +
> > +++ /<>/build/src/test/isolation/output_iso/results/stats.out  
> > 2022-05-18 23:26:56.573000536 +
> > @@ -2854,7 +2854,7 @@
> 
> >  
> > seq_scan|seq_tup_read|n_tup_ins|n_tup_upd|n_tup_del|n_live_tup|n_dead_tup|vacuum_count
> >  
> > ++-+-+-+--+--+
> > -   3|   9|5|1|0| 1| 1| 
> >   0
> > +   3|   9|5|1|0| 4| 1| 
> >   0
> >  (1 row)
> 
> I have just discovered that I can reproduce this identical symptom
> fairly repeatably on an experimental lashup that I've been running
> with bleeding-edge NetBSD on my ancient HPPA box.  (You didn't think
> I was just going to walk away from that hardware, did you?)
> 
> Even more interesting, the repeatability varies with the settings
> of max_connections and max_prepared_transactions.  At low values
> (resp. 20 and 0) I've not been able to make it happen at all, but
> at 100 and 2 it happens circa three times out of four.
> 
> I have no idea where to start looking, but this is clearly an issue
> in the new stats code ... or else the hoped-for goal of removing
> flakiness from the stats tests is just as far away as ever.

Doesn't the step s1_table_stats needs a blocking condition (s2_ff)?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: 15beta1 test failure on mips in isolation/expected/stats

2022-05-19 Thread Tom Lane
Christoph Berg  writes:
> Debian unstable mips (the old 32-bit one):

> --- /<>/src/test/isolation/expected/stats.out2022-05-16 
> 21:10:42.0 +
> +++ /<>/build/src/test/isolation/output_iso/results/stats.out
> 2022-05-18 23:26:56.573000536 +
> @@ -2854,7 +2854,7 @@

>  
> seq_scan|seq_tup_read|n_tup_ins|n_tup_upd|n_tup_del|n_live_tup|n_dead_tup|vacuum_count
>  
> ++-+-+-+--+--+
> -   3|   9|5|1|0| 1| 1|   
> 0
> +   3|   9|5|1|0| 4| 1|   
> 0
>  (1 row)

I have just discovered that I can reproduce this identical symptom
fairly repeatably on an experimental lashup that I've been running
with bleeding-edge NetBSD on my ancient HPPA box.  (You didn't think
I was just going to walk away from that hardware, did you?)

Even more interesting, the repeatability varies with the settings
of max_connections and max_prepared_transactions.  At low values
(resp. 20 and 0) I've not been able to make it happen at all, but
at 100 and 2 it happens circa three times out of four.

I have no idea where to start looking, but this is clearly an issue
in the new stats code ... or else the hoped-for goal of removing
flakiness from the stats tests is just as far away as ever.

regards, tom lane




Re: Build-farm - intermittent error in 031_column_list.pl

2022-05-19 Thread Kyotaro Horiguchi
At Thu, 19 May 2022 16:42:31 +0530, Amit Kapila  wrote 
in 
> On Thu, May 19, 2022 at 3:16 PM Amit Kapila  wrote:
> >
> > This happens after "ALTER SUBSCRIPTION sub1 SET PUBLICATION pub9". The
> > probable theory is that ALTER SUBSCRIPTION will lead to restarting of
> > apply worker (which we can see in LOGS as well) and after the restart,

Yes.

> > the apply worker will use the existing slot and replication origin
> > corresponding to the subscription. Now, it is possible that before
> > restart the origin has not been updated and the WAL start location
> > points to a location prior to where PUBLICATION pub9 exists which can
> > lead to such an error. Once this error occurs, apply worker will never
> > be able to proceed and will always return the same error. Does this
> > make sense?

Wow. I didn't think along that line. That theory explains the silence and
makes sense even though I don't see LSN transitions that clearly
support it.  I dimly remember a similar kind of problem..

> > Unless you or others see a different theory, this seems to be the
> > existing problem in logical replication which is manifested by this
> > test. If we just want to fix these test failures, we can create a new
> > subscription instead of altering the existing publication to point to
> > the new publication.
> >
> 
> If the above theory is correct then I think allowing the publisher to
> catch up with "$node_publisher->wait_for_catchup('sub1');" before
> ALTER SUBSCRIPTION should fix this problem. Because if before ALTER
> both publisher and subscriber are in sync then the new publication
> should be visible to WALSender.

It looks right to me.  That time travel seems unintuitive, but it's the
(current) way it works.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Skipping schema changes in publication

2022-05-19 Thread Peter Smith
On Fri, May 20, 2022 at 10:19 AM Peter Smith  wrote:
>
> FYI, although the v6-0002 patch applied cleanly, I found that the SGML
> was malformed and so the pg docs build fails.
>
> ~~~
> e.g.
>
> [postgres@CentOS7-x64 sgml]$ make STYLE=website html
> { \
>   echo ""; \
>   echo ""; \
> } > version.sgml
> '/usr/bin/perl' ./mk_feature_tables.pl YES
> ../../../src/backend/catalog/sql_feature_packages.txt
> ../../../src/backend/catalog/sql_features.txt >
> features-supported.sgml
> '/usr/bin/perl' ./mk_feature_tables.pl NO
> ../../../src/backend/catalog/sql_feature_packages.txt
> ../../../src/backend/catalog/sql_features.txt >
> features-unsupported.sgml
> '/usr/bin/perl' ./generate-errcodes-table.pl
> ../../../src/backend/utils/errcodes.txt > errcodes-table.sgml
> '/usr/bin/perl' ./generate-keywords-table.pl . > keywords-table.sgml
> /usr/bin/xmllint --path . --noout --valid postgres.sgml
> ref/create_publication.sgml:171: parser error : Opening and ending tag
> mismatch: varlistentry line 166 and listitem
> 
>^
> ref/create_publication.sgml:172: parser error : Opening and ending tag
> mismatch: variablelist line 60 and varlistentry
>
>   ^
> ref/create_publication.sgml:226: parser error : Opening and ending tag
> mismatch: refsect1 line 57 and variablelist
>   
>  ^
> ...
>
> I will work around it locally, but for future patches please check the
> SGML builds ok before posting.

FYI, I rewrote the bad SGML fragment like this:

   
EXCEPT TABLE

 
  This clause specifies a list of tables to exclude from the publication. It
  can only be used with FOR ALL TABLES.
 

   

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Addition of PostgreSQL::Test::Cluster::pg_version()

2022-05-19 Thread Michael Paquier
On Thu, May 19, 2022 at 07:28:53AM -0400, Andrew Dunstan wrote:
> Looks ok. PostgreSQL::Version is designed so that the object behaves
> sanely in comparisons and when interpolated into a string.

I saw that, and that's kind of nice when it comes to write
version-specific code paths in the tests ;) 
--
Michael


signature.asc
Description: PGP signature


Re: Skipping schema changes in publication

2022-05-19 Thread Peter Smith
FYI, although the v6-0002 patch applied cleanly, I found that the SGML
was malformed and so the pg docs build fails.

~~~
e.g.

[postgres@CentOS7-x64 sgml]$ make STYLE=website html
{ \
  echo ""; \
  echo ""; \
} > version.sgml
'/usr/bin/perl' ./mk_feature_tables.pl YES
../../../src/backend/catalog/sql_feature_packages.txt
../../../src/backend/catalog/sql_features.txt >
features-supported.sgml
'/usr/bin/perl' ./mk_feature_tables.pl NO
../../../src/backend/catalog/sql_feature_packages.txt
../../../src/backend/catalog/sql_features.txt >
features-unsupported.sgml
'/usr/bin/perl' ./generate-errcodes-table.pl
../../../src/backend/utils/errcodes.txt > errcodes-table.sgml
'/usr/bin/perl' ./generate-keywords-table.pl . > keywords-table.sgml
/usr/bin/xmllint --path . --noout --valid postgres.sgml
ref/create_publication.sgml:171: parser error : Opening and ending tag
mismatch: varlistentry line 166 and listitem

   ^
ref/create_publication.sgml:172: parser error : Opening and ending tag
mismatch: variablelist line 60 and varlistentry
   
  ^
ref/create_publication.sgml:226: parser error : Opening and ending tag
mismatch: refsect1 line 57 and variablelist
  
 ^
...

I will work around it locally, but for future patches please check the
SGML builds ok before posting.

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Avoid unecessary MemSet call (src/backend/utils/cache/relcache.c)

2022-05-19 Thread David Rowley
On Thu, 19 May 2022 at 11:20, Tom Lane  wrote:
> The thing that makes this a bit more difficult than it might be is
> the special cases we have for known-aligned and so on targets, which
> are particularly critical for palloc0 and makeNode etc.  So there's
> more than one case to look into.  But I'd argue that those special
> cases are actually what we want to worry about the most: zeroing
> relatively small, known-aligned node structs is THE use case.

I think the makeNode() trickery would be harder to get rid of, or for
that matter, anything where the size/alignment is unknown at compile
time.  I think the more interesting ones that we might be able to get
rid of are the ones where the alignment and size *are* known at
compile time. Also probably anything that passes a compile-time const
that's not 0 will fallback on memset anyway, so might as well be
removed to tidy things up.

It just all seems a bit untidy when you look at functions like
ExecStoreAllNullTuple() which use a mix of memset and MemSet without
any apparent explanation of why. That particular one is likely that
way due to the first size guaranteed to be multiples of sizeof(Datum)
and the latter not.

Naturally, we'd need to run enough benchmarks to prove to ourselves
that we're not causing any slowdowns.  The intention of memset.c was
to try to put something out there that people could test so we could
get an idea if there are any machines/compilers that we might need to
be concerned about.

David




Re: 15beta1 crash on mips64el in pg_regress/triggers

2022-05-19 Thread Tom Lane
I wrote:
> I see that the gcc farm[1] has another mips64 machine running Debian
> buster, so I've started a build there to see what happens.

Many kilowatt-hours later, I've entirely failed to reproduce this
on gcc230.  Not sure how to investigate further.  Given that your
original build machine is so slow, could it be timing-related?
Hard to see how, given the location of the crash, but ...

regards, tom lane




Re: A qsort template

2022-05-19 Thread Tom Lane
Peter Geoghegan  writes:
> On Thu, May 19, 2022 at 1:12 PM Justin Pryzby  wrote:
>> Should these debug lines be removed ?
>> 
>> elog(DEBUG1, "qsort_tuple");

> I agree -- DEBUG1 seems too chatty for something like this. DEBUG2
> would be more appropriate IMV. Though I don't feel very strongly about
> it.

Given the lack of context identification, I'd put the usefulness of
these in production at close to zero.  +1 for removing them
altogether, or failing that, downgrade to DEBUG5 or so.

regards, tom lane




Re: A qsort template

2022-05-19 Thread Peter Geoghegan
On Thu, May 19, 2022 at 1:12 PM Justin Pryzby  wrote:
> Should these debug lines be removed ?
>
> elog(DEBUG1, "qsort_tuple");

I agree -- DEBUG1 seems too chatty for something like this. DEBUG2
would be more appropriate IMV. Though I don't feel very strongly about
it.

--
Peter Geoghegan




Re: pg15b1: FailedAssertion("val > base", File: "...src/include/utils/relptr.h", Line: 67, PID: 30485)

2022-05-19 Thread Tom Lane
Justin Pryzby  writes:
> ./tmp_install/usr/local/pgsql/bin/postgres -D 
> ./src/test/regress/tmp_check/data -c min_dynamic_shared_memory=1MB
> TRAP: FailedAssertion("val > base", File: 
> "../../../../src/include/utils/relptr.h", Line: 67, PID: 21912)

Yeah, I see it too.

> It looks like this may be pre-existing problem exposed by
> commit e07d4ddc55fdcf82082950b3eb0cd8f728284c9d

Agreed.  Here I see

#5  FreePageManagerInitialize (fpm=fpm@entry=0x7f34b3ddd300, 
base=base@entry=0x7f34b3ddd300 "") at freepage.c:187
#6  0x0082423e in dsm_shmem_init () at dsm.c:473

so that where we do

relptr_store(base, fpm->self, fpm);

the "relative" pointer value would have to be zero, making the case
indistinguishable from a NULL pointer.  We can either change the
caller so that these addresses aren't the same, or give up the
ability to store NULL in relptrs ... doesn't seem like a hard call.

One interesting question I didn't look into is why it takes a nondefault
value of min_dynamic_shared_memory to expose this bug.

regards, tom lane




Re: parallel not working

2022-05-19 Thread Robert Haas
On Thu, May 19, 2022 at 8:05 AM huangning...@yahoo.com <
huangning...@yahoo.com> wrote:

> I would like to know how can I get it to work properly?
>

I suppose you have a bug in your C code. Try hooking up a debugger to one
of the sessions that is hung and see what it's doing e.g.

gdb -p 26130
bt

-- 
Robert Haas
EDB: http://www.enterprisedb.com


Re: A qsort template

2022-05-19 Thread Justin Pryzby
On Fri, Apr 22, 2022 at 11:37:29AM +0700, John Naylor wrote:
> On Fri, Apr 22, 2022 at 11:13 AM David Rowley  wrote:
> >
> > On Thu, 21 Apr 2022 at 19:09, John Naylor  
> > wrote:
> > > I intend to commit David's v2 fix next week, unless there are
> > > objections, or unless he beats me to it.
> >
> > I wasn't sure if you wanted to handle it or not, but I don't mind
> > doing it, so I just pushed it after a small adjustment to a comment.
> 
> Thank you!

Should these debug lines be removed ?

elog(DEBUG1, "qsort_tuple");

Perhaps if I ask for debug output, I shouldn't be surprised if it changes
between major releases - but I still found this surprising.

I'm sure it's useful during development and maybe during beta.  It could even
make sense if it were shown during regression tests (preferably at DEBUG2).
But right now it's not.  Is that intentional?

ts=# \dt
DEBUG:  qsort_tuple
List of relations

-- 
Justin




Re: Zstandard support for toast compression

2022-05-19 Thread Robert Haas
On Thu, May 19, 2022 at 4:20 AM Michael Paquier  wrote:
> Btw, shouldn't we have something a bit more, err, extensible for the
> design of an extensible varlena header?  If we keep it down to some
> bitwise information, we'd be fine for a long time but it would be
> annoying to review again an extended design if we need to extend it
> with more data.

What do you have in mind?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




pg15b1: FailedAssertion("val > base", File: "...src/include/utils/relptr.h", Line: 67, PID: 30485)

2022-05-19 Thread Justin Pryzby
./tmp_install/usr/local/pgsql/bin/postgres -D ./src/test/regress/tmp_check/data 
-c min_dynamic_shared_memory=1MB

TRAP: FailedAssertion("val > base", File: 
"../../../../src/include/utils/relptr.h", Line: 67, PID: 21912)
./tmp_install/usr/local/pgsql/bin/postgres(ExceptionalCondition+0xa0)[0x55af5c9c463e]
./tmp_install/usr/local/pgsql/bin/postgres(FreePageManagerInitialize+0x94)[0x55af5c9f4478]
./tmp_install/usr/local/pgsql/bin/postgres(dsm_shmem_init+0x87)[0x55af5c841532]
./tmp_install/usr/local/pgsql/bin/postgres(CreateSharedMemoryAndSemaphores+0x8d)[0x55af5c843f30]
./tmp_install/usr/local/pgsql/bin/postgres(+0x41805c)[0x55af5c7c605c]
./tmp_install/usr/local/pgsql/bin/postgres(PostmasterMain+0x959)[0x55af5c7ca8e7]
./tmp_install/usr/local/pgsql/bin/postgres(main+0x229)[0x55af5c70af7f]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f736d6e1b97]
./tmp_install/usr/local/pgsql/bin/postgres(_start+0x2a)[0x55af5c4794fa]

It looks like this may be pre-existing problem exposed by

commit e07d4ddc55fdcf82082950b3eb0cd8f728284c9d
Author: Tom Lane 
Date:   Sat Mar 26 14:29:29 2022 -0400

Suppress compiler warning in relptr_store().





Inquiring about my GSoC Proposal.

2022-05-19 Thread Israa Odeh
Greetings!

I was wondering if you could provide some initial feedback on my GSoC 
proposal, along with any comments you may have about it. Also, would it be 
possible to know whether I have been accepted as a contributor?

Best Regards,
Israa Odeh.



Re: Privileges on PUBLICATION

2022-05-19 Thread Antonin Houska
Euler Taveira  wrote:

> On Wed, May 18, 2022, at 6:16 AM, Antonin Houska wrote:
> 
>  The patch is attached to this message.
> 
> Great. Add it to the next CF. I'll review it when I have some spare time.

https://commitfest.postgresql.org/38/3641/

Thanks!

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: [RFC] Improving multi-column filter cardinality estimation using MCVs and HyperLogLog

2022-05-19 Thread Matthias van de Meent
On Mon, 16 May 2022 at 00:09, Tomas Vondra 
wrote:
>
> On 5/15/22 21:55, Matthias van de Meent wrote:
> > Note: I am not (currently) planning on implementing this rough idea,
> > just putting it up to share and document the idea, on request of Tomas
> > (cc-ed).
> >
> > The excellent pgconf.de presentation on PostgreSQL's extended
> > statistics system by Tomas Vondra [0] talked about how the current
> > default statistics assume the MCVs of columns to be fully independent,
> > i.e. values of column A do not imply any value of columns B and C, and
> > that for accurate data on correlated values the user needs to
> > manually create statistics on the combined columns (by either
> > STATISTICS or by INDEX).
> >
> > This is said to be due to limitations in our statistics collector: to
> > determine the fraction of the table that contains the value, we store
> > the N most common values with the fraction of their occurrence in the
> > table. This value is quite exact, but combining these values proves
> > difficult: there is nothing in the stored value that can confidently
> > include or exclude parts of the table from a predicate using that MCV,
> > so we can only assume that the values of two columns are independent.
> >
> > After the presentation it came to me that if we were to add an
> > estimator for the number of rows with that value to the MCV lists in
> > the form of HLL sketches (in addition to or replacing the current
> > most_common_elem_freqs fractions), we would be able to make better
> > estimates for multi-column filters by combining the HLL row
> > cardinality sketches for filters that filter on these MCVs. This would
> > remove the immediate need for manual statistics with a Cartesian
> > product of the MCVs of those columns with their occurrence fractions,
> > which significantly reduces the need for the creation of manual
> > statistics - the need that exists due to planner mis-estimates in
> > correlated columns. Custom statistics will still be required for
> > expression statistics, but column correlation estimations _within
> > MCVs_ is much improved.
> >
> > How I imagine this would work is that for each value in the MCV, an
> > HLL is maintained that estimates the amount of distinct tuples
> > containing that value. This can be h(TID) or h(PK), or anything else
> > that would uniquely identify returned tuples. Because the keyspace of
> > all HLLs that are generated are on the same table, you can apply join
> > and intersection operations on the HLLs of the MCVs (for OR and
> > AND-operations respectively), and provide fairly accurately estimates
> > for the amount of tuples that would be returned by the filter on that
> > table.
> >
> > The required size of the HLL sketches can be determined by the amount
> > of tuples scanned during analyze, potentially reducing the size
> > required to store these HLL sketches from the usual 1.5kB per sketch
> > to something smaller - we'll only ever need to count nTuples distinct
> > values, so low values for default_statistics_target would allow for
> > smaller values for m in the HLL sketches, whilst still providing
> > fairly accurate result estimates.
> >
>
> I think it's an interesting idea. In principle it allows deducing the
> multi-column MCV for arbitrary combination of columns, not determined in
> advance. We'd have the MCV with HLL instead of frequencies for columns
> A, B and C:
>
> (a1, hll(a1))
> (a2, hll(a2))
> (...)
> (aK, hll(aK))
>
>
> (b1, hll(b1))
> (b2, hll(b2))
> (...)
> (bL, hll(bL))
>
> (c1, hll(c1))
> (c2, hll(c2))
> (...)
> (cM, hll(cM))
>
> and from this we'd be able to build MCV for any combination of those
> three columns.
>
> And in some sense it might even be more efficient/accurate, because the
> MCV on (A,B,C) might have up to K*L*M items. if there's 100 items in
> each column, that'd be 1,000,000 combinations, which we can't really
> store (target is up to 10k). And even if we could, it'd be 1M
> combinations with frequencies (so ~8-16B per combination).
>
> While with the MCV/HLL, we'd have 300 items and HLL. Assuming 256-512B
> HLL would be enough, that's still way smaller than the multi-column MCV.

HLLs for statistics_target=100 could use 4 bits per bucket, but any target
above 218 should use 5 bits: nbits = ceil(log2(log2(target * 300))), and
this saves only 20% on storage.

Accuracy increases with root(m), so while we can shrink the amount of
buckets, this is only OK if we're accepting the corresponding decrease in
accuracy.

> Even with target=10k it'd still be cheaper to store the separate MCV
> with HLL values, if I count right, and there'd be no items omitted from
> the MCV.

There are more options, though: count-min sketches have been proposed as a
replacement for MCV lists, and they provide a guaranteed upper bound on the
count of each value. If, instead of an actual counter, each bucket held an
HLL, we'd not only have occurrence estimates (though not actual max-count
limits) for all values, but we'd also be 

Re: Zstandard support for toast compression

2022-05-19 Thread Nikolay Shaplov
In a message of Tuesday, 17 May 2022 23:01:07 MSK, Tom Lane 
wrote:

Hi! I came to this branch looking for a patch to review, but I guess I would 
join the discussion instead of reading the code.

> >> Yeah - I think we had better reserve the fourth bit pattern for
> >> something extensible e.g. another byte or several to specify the
> >> actual method, so that we don't have a hard limit of 4 methods. But
> >> even with such a system, the first 3 methods will always and forever
> >> be privileged over all others, so we'd better not make the mistake of
> >> adding something silly as our third algorithm.
> > 
> > In such a situation, would they really end up being properly distinct
> > when it comes to what our users see..?  I wouldn't really think so.
> 
> It should be transparent to users, sure, but the point is that the
> first three methods will have a storage space advantage over others.
> Plus we'd have to do some actual work to create that extension mechanism.

Postgres is well known for extensibility: one can write one's own 
implementation of almost everything and make it an extension.
One would hardly need more than one (or two) additional compression 
methods, but which method will really be needed is hard to say. 

So I guess it would be much better to create an API for defining and 
registering one's own compression methods, and to provide a built-in Zstd 
compression method that can be used (or optionally not used) via that API.

Moreover, I guess this API (maybe with some modifications) could also be used 
for seamless data encryption, for example. 

So I guess it would be better to make it extensible from the start, use 
this precious bit for a compression method chosen by the user, and maybe 
later extend it with another byte of compression-method bits if needed.

-- 
Nikolay Shaplov aka Nataraj
Fuzzing Engineer at Postgres Professional
Matrix IM: @dhyan:nataraj.su


signature.asc
Description: This is a digitally signed message part.


Re: 15beta1 crash on mips64el in pg_regress/triggers

2022-05-19 Thread Christoph Berg
Re: To Tom Lane
> > Is the crash 100% reproducible for you?
> 
> I have scheduled a rebuild now, we'll know in a few hours...

The build was much faster this time (different machine), and worked.

https://buildd.debian.org/status/logs.php?pkg=postgresql-15=mips64el

I'll also start a test build on the mips64el porterbox I have access to.

Christoph




Re: 15beta1 crash on mips64el in pg_regress/triggers

2022-05-19 Thread Tom Lane
Christoph Berg  writes:
> Re: Tom Lane
>> Hmm, so what's different between this and buildfarm member topminnow?

> That one is running Debian jessie (aka oldoldoldoldstable), uses
> -mabi=32 with gcc 4.9, and runs a kernel from 2015.
> The Debian buildd is this: 
> https://db.debian.org/machines.cgi?host=mipsel-aql-01
> The host should be running Debian buster, with the build done in an
> unstable chroot. I don't know what "LS3A-RS780-1w (Quad Core Loongson 3A)"
> means, but it's probably much newer hardware than the other one.

I see that the gcc farm[1] has another mips64 machine running Debian
buster, so I've started a build there to see what happens.

regards, tom lane

[1] https://cfarm.tetaneutral.net/machines/list/




Re: [PATCH] Add result_types column to pg_prepared_statements view

2022-05-19 Thread Dagfinn Ilmari Mannsåker
Dagfinn Ilmari Mannsåker  writes:

> Hi hackers,
>
> Prompted by a question on IRC, here's a patch to add a result_types
> column to the pg_prepared_statements view, so that one can see the types
> of the columns returned by a prepared statement, not just the parameter
> types.

Added to the 2022-07 commitfest: https://commitfest.postgresql.org/38/3644/

- ilmari




[PATCH] Add result_types column to pg_prepared_statements view

2022-05-19 Thread Dagfinn Ilmari Mannsåker
Hi hackers,

Prompted by a question on IRC, here's a patch to add a result_types
column to the pg_prepared_statements view, so that one can see the types
of the columns returned by a prepared statement, not just the parameter
types.

I'm not quite sure about the column name, suggestions welcome.

- ilmari

>From 5045cd5a173fefb5346ed81d355ba35c1c922105 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Dagfinn=20Ilmari=20Manns=C3=A5ker?= 
Date: Thu, 19 May 2022 15:54:56 +0100
Subject: [PATCH] Add result_types column to pg_prepared_statements view

Containing the types of the columns returned by the prepared
statement.

Prompted by question from IRC user mlvzk.
---
 doc/src/sgml/catalogs.sgml| 12 +
 src/backend/commands/prepare.c| 19 +--
 src/include/catalog/pg_proc.dat   |  6 +--
 src/test/regress/expected/prepare.out | 72 +--
 src/test/regress/expected/rules.out   |  3 +-
 src/test/regress/sql/prepare.sql  | 12 ++---
 6 files changed, 73 insertions(+), 51 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index d96c72e531..ae7627ae48 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -11492,6 +11492,18 @@
   
  
 
+ 
+  
+   result_types regtype[]
+  
+  
+   The types of the columns returned by the prepared statement in the
+   form of an array of regtype. The OID corresponding
+   to an element of this array can be obtained by casting the
+   regtype value to oid.
+  
+ 
+
  
   
from_sql bool
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index 80738547ed..7c8537fbd2 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -683,8 +683,16 @@ pg_prepared_statement(PG_FUNCTION_ARGS)
		hash_seq_init(&hash_seq, prepared_queries);
		while ((prep_stmt = hash_seq_search(&hash_seq)) != NULL)
 		{
-			Datum		values[7];
-			bool		nulls[7];
+			Datum		values[8];
+			bool		nulls[8];
+			TupleDesc	result_desc;
+			Oid		   *result_types;
+
+			result_desc = prep_stmt->plansource->resultDesc;
+			result_types = (Oid *) palloc(result_desc->natts * sizeof(Oid));
+
+			for (int i = 0; i < result_desc->natts; i++)
+result_types[i] = result_desc->attrs[i].atttypid;
 
 			MemSet(nulls, 0, sizeof(nulls));
 
@@ -693,9 +701,10 @@ pg_prepared_statement(PG_FUNCTION_ARGS)
 			values[2] = TimestampTzGetDatum(prep_stmt->prepare_time);
 			values[3] = build_regtype_array(prep_stmt->plansource->param_types,
 			prep_stmt->plansource->num_params);
-			values[4] = BoolGetDatum(prep_stmt->from_sql);
-			values[5] = Int64GetDatumFast(prep_stmt->plansource->num_generic_plans);
-			values[6] = Int64GetDatumFast(prep_stmt->plansource->num_custom_plans);
+			values[4] = build_regtype_array(result_types, result_desc->natts);
+			values[5] = BoolGetDatum(prep_stmt->from_sql);
+			values[6] = Int64GetDatumFast(prep_stmt->plansource->num_generic_plans);
+			values[7] = Int64GetDatumFast(prep_stmt->plansource->num_custom_plans);
 
 			tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
  values, nulls);
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 87aa571a33..cb953c6411 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8025,9 +8025,9 @@
   proname => 'pg_prepared_statement', prorows => '1000', proretset => 't',
   provolatile => 's', proparallel => 'r', prorettype => 'record',
   proargtypes => '',
-  proallargtypes => '{text,text,timestamptz,_regtype,bool,int8,int8}',
-  proargmodes => '{o,o,o,o,o,o,o}',
-  proargnames => '{name,statement,prepare_time,parameter_types,from_sql,generic_plans,custom_plans}',
+  proallargtypes => '{text,text,timestamptz,_regtype,_regtype,bool,int8,int8}',
+  proargmodes => '{o,o,o,o,o,o,o,o}',
+  proargnames => '{name,statement,prepare_time,parameter_types,result_types,from_sql,generic_plans,custom_plans}',
   prosrc => 'pg_prepared_statement' },
 { oid => '2511', descr => 'get the open cursors for this session',
   proname => 'pg_cursor', prorows => '1000', proretset => 't',
diff --git a/src/test/regress/expected/prepare.out b/src/test/regress/expected/prepare.out
index 3306c696b1..faf07f620b 100644
--- a/src/test/regress/expected/prepare.out
+++ b/src/test/regress/expected/prepare.out
@@ -1,9 +1,9 @@
 -- Regression tests for prepareable statements. We query the content
 -- of the pg_prepared_statements view as prepared statements are
 -- created and removed.
-SELECT name, statement, parameter_types FROM pg_prepared_statements;
- name | statement | parameter_types 
---+---+-
+SELECT name, statement, parameter_types, result_types FROM pg_prepared_statements;
+ name | statement | parameter_types | result_types 
+--+---+-+--
 (0 rows)
 
 PREPARE q1 AS SELECT 1 AS a;
@@ -13,10 +13,10 @@ EXECUTE q1;
  1
 (1 row)
 
-SELECT name, 

Re: 15beta1 crash on mips64el in pg_regress/triggers

2022-05-19 Thread Christoph Berg
Re: Tom Lane
> Christoph Berg  writes:
> > Debian unstable mips64el:
> 
> Hmm, so what's different between this and buildfarm member topminnow?

That one is running Debian jessie (aka oldoldoldoldstable), uses
-mabi=32 with gcc 4.9, and runs a kernel from 2015.

The Debian buildd is this: https://db.debian.org/machines.cgi?host=mipsel-aql-01
The host should be running Debian buster, with the build done in an
unstable chroot. I don't know what "LS3A-RS780-1w (Quad Core Loongson 3A)"
means, but it's probably much newer hardware than the other one.

Christoph




15beta1 test failure on mips in isolation/expected/stats

2022-05-19 Thread Christoph Berg
Debian unstable mips (the old 32-bit one):

test vacuum-conflict  ... ok 2170 ms
test vacuum-skip-locked   ... ok 2445 ms
test stats... FAILED38898 ms
test horizons ... ok 4543 ms
test predicate-hash   ... ok22419 ms

 build/src/test/isolation/output_iso/regression.diffs 
diff -U3 /<>/src/test/isolation/expected/stats.out 
/<>/build/src/test/isolation/output_iso/results/stats.out
--- /<>/src/test/isolation/expected/stats.out  2022-05-16 
21:10:42.0 +
+++ /<>/build/src/test/isolation/output_iso/results/stats.out  
2022-05-18 23:26:56.573000536 +
@@ -2854,7 +2854,7 @@

 
seq_scan|seq_tup_read|n_tup_ins|n_tup_upd|n_tup_del|n_live_tup|n_dead_tup|vacuum_count
 
++-+-+-+--+--+
-   3|   9|5|1|0| 1| 1| 
  0
+   3|   9|5|1|0| 4| 1| 
  0
 (1 row)

Full build log:
https://buildd.debian.org/status/fetch.php?pkg=postgresql-15&arch=mipsel&ver=15%7Ebeta1-1&stamp=1652916588&raw=0

(I'll try rescheduling this build as well, the last one took 4h before
it failed.)

Christoph




Re: 15beta1 crash on mips64el in pg_regress/triggers

2022-05-19 Thread Christoph Berg
Re: Tom Lane
> Christoph Berg  writes:
> > Debian unstable mips64el:
> 
> Hmm, so what's different between this and buildfarm member topminnow?
> 
> Is the crash 100% reproducible for you?

I have scheduled a rebuild now, we'll know in a few hours...

Christoph




Re: 15beta1 crash on mips64el in pg_regress/triggers

2022-05-19 Thread Tom Lane
Christoph Berg  writes:
> Debian unstable mips64el:

Hmm, so what's different between this and buildfarm member topminnow?

Is the crash 100% reproducible for you?

regards, tom lane




15beta1 crash on mips64el in pg_regress/triggers

2022-05-19 Thread Christoph Berg
Debian unstable mips64el:

2022-05-18 22:57:34.436 UTC client backend[19222] pg_regress/triggers 
STATEMENT:  drop trigger trg1 on trigpart3;
...
2022-05-18 22:57:39.110 UTC postmaster[7864] LOG:  server process (PID 19222) 
was terminated by signal 11: Segmentation fault
2022-05-18 22:57:39.110 UTC postmaster[7864] DETAIL:  Failed process was 
running: SELECT a.attname,
  pg_catalog.format_type(a.atttypid, a.atttypmod),
  (SELECT pg_catalog.pg_get_expr(d.adbin, d.adrelid, true)
   FROM pg_catalog.pg_attrdef d
   WHERE d.adrelid = a.attrelid AND d.adnum = a.attnum AND a.atthasdef),
  a.attnotnull,
  (SELECT c.collname FROM pg_catalog.pg_collation c, pg_catalog.pg_type 
t
   WHERE c.oid = a.attcollation AND t.oid = a.atttypid AND 
a.attcollation <> t.typcollation) AS attcollation,
  a.attidentity,
  a.attgenerated
FROM pg_catalog.pg_attribute a
WHERE a.attrelid = '21816' AND a.attnum > 0 AND NOT a.attisdropped
ORDER BY a.attnum;

 build/src/test/regress/tmp_check/data/core 

warning: Can't open file /dev/shm/PostgreSQL.387042440 during file-backed 
mapping note processing

warning: Can't open file /dev/shm/PostgreSQL.4014890228 during file-backed 
mapping note processing

warning: Can't open file /dev/zero (deleted) during file-backed mapping note 
processing

warning: Can't open file /SYSV035e8a2e (deleted) during file-backed mapping 
note processing
[New LWP 19222]
[Thread debugging using libthread_db enabled]
Using host libthread_db library 
"/lib/mips64el-linux-gnuabi64/libthread_db.so.1".
Core was generated by `postgres: buildd regression [local] SELECT   
 '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00ffd000565c in ?? ()
#0  0x00ffd000565c in ?? ()
No symbol table info available.
#1  0x00aaad76b730 in ExecEvalExprSwitchContext (isNull=0xfffb9e85e7, 
econtext=0xaab20e9f90, state=0xaab20ea108) at 
./build/../src/include/executor/executor.h:343
retDatum = 
oldContext = 0xaab1fabb10
retDatum = 
oldContext = 
#2  ExecProject (projInfo=0xaab20ea100) at 
./build/../src/include/executor/executor.h:377
econtext = 0xaab20e9f90
state = 0xaab20ea108
slot = 0xaab20ea5b0
isnull = false
#3  ExecScan (node=0xaab20ea100, accessMtd=0xaaad78b6d0 , 
recheckMtd=0xaaad78bf08 ) at 
./build/../src/backend/executor/execScan.c:238
slot = 
econtext = 
qual = 
projInfo = 0xaab20ea100
#4  0x00aaad76b730 in ExecEvalExprSwitchContext (isNull=0xaab20ea450, 
econtext=0xaab20e9f90, state=0x1208) at 
./build/../src/include/executor/executor.h:343
retDatum = 
oldContext = 0xaab20ea6c8
retDatum = 
oldContext = 
#5  ExecProject (projInfo=0x1200) at 
./build/../src/include/executor/executor.h:377
econtext = 0xaab20e9f90
state = 0x1208
slot = 0xaab20ea638
isnull = false
#6  ExecScan (node=0xaab20ea6d0, accessMtd=0xaab20ea6cd, 
recheckMtd=0xaab20ea310) at ./build/../src/backend/executor/execScan.c:238
slot = 
econtext = 
qual = 
projInfo = 0x1200
#7  0x in ?? ()
No symbol table info available.
Backtrace stopped: frame did not save the PC


Full build log:
https://buildd.debian.org/status/fetch.php?pkg=postgresql-15&arch=mips64el&ver=15%7Ebeta1-1&stamp=1652916002&raw=0

Christoph




Re: JSON Functions and Operators Docs for v15

2022-05-19 Thread Andrew Dunstan


On 2022-05-19 Th 08:19, Justin Pryzby wrote:
>> +   same level are considerd to be siblings,
> considered
>
>> +PostgreSQL specific functions detailed in
> postgresql hyphen specific (as in the original)
>
> These would all be easier to read with commas:
>
> +expression is NULL an
> +If WITH UNIQUE is specified the
> +If the input is NULL an SQL NULL is returned. If the input is a 
> number
> +If WITH UNIQUE is specified there must not
> +is strict an error is generated if it yields no 
> items.
>
> There's a few instances of "space space" that could be condensed:
>
> +   json_scalar functions, this needs to be  either 
> json or
> +which  must be a SELECT query returning a single column. If



Thanks, pushed now.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: bogus: logical replication rows/cols combinations

2022-05-19 Thread Tom Lane
Justin Pryzby  writes:
> On Thu, May 19, 2022 at 10:33:13AM +0530, Amit Kapila wrote:
>> I have committed the first patch after fixing this part. It seems Tom
>> is not very happy doing this after beta-1 [1]. The reason we get this
>> information via this view (and underlying function) is that it
>> simplifies the queries on the subscriber-side as you can see in the
>> second patch. The query change is as below:
>> [1] - https://www.postgresql.org/message-id/91075.1652929852%40sss.pgh.pa.us

> I think Tom's concern is that adding information to a view seems like adding a
> feature that hadn't previously been contemplated.
> (Catalog changes themselves are not prohibited during the beta period).

It certainly smells like a new feature, but my concern was more around the
post-beta catalog change.  We do those only if really forced to, and the
explanation in the commit message didn't satisfy me as to why it was
necessary.  This explanation isn't much better --- if we're trying to
prohibit a certain class of publication definitions, what good does it do
to check that on the subscriber side?  Even more to the point, how can we
have a subscriber do that by relying on view columns that don't exist in
older versions?  I'm also quite concerned about anything that involves
subscribers examining row filter conditions; that sounds like a pretty
direct route to bugs involving unsolvability and the halting problem.

(But I've not read very much of this thread ... been a bit under the
weather the last couple weeks.  Maybe this actually is a sane solution.
It just doesn't sound like one at this level of detail.)

regards, tom lane




Re: JSON Functions and Operators Docs for v15

2022-05-19 Thread Justin Pryzby
> +   same level are considerd to be siblings,

considered

> +PostgreSQL specific functions detailed in

postgresql hyphen specific (as in the original)

These would all be easier to read with commas:

+expression is NULL an
+If WITH UNIQUE is specified the
+If the input is NULL an SQL NULL is returned. If the input is a number
+If WITH UNIQUE is specified there must not
+is strict an error is generated if it yields no 
items.

There's a few instances of "space space" that could be condensed:

+   json_scalar functions, this needs to be  either 
json or
+which  must be a SELECT query returning a single column. If




Re: bogus: logical replication rows/cols combinations

2022-05-19 Thread Justin Pryzby
On Thu, May 19, 2022 at 10:33:13AM +0530, Amit Kapila wrote:
> I have committed the first patch after fixing this part. It seems Tom
> is not very happy doing this after beta-1 [1]. The reason we get this
> information via this view (and underlying function) is that it
> simplifies the queries on the subscriber-side as you can see in the
> second patch. The query change is as below:
> [1] - https://www.postgresql.org/message-id/91075.1652929852%40sss.pgh.pa.us

I think Tom's concern is that adding information to a view seems like adding a
feature that hadn't previously been contemplated.
(Catalog changes themselves are not prohibited during the beta period).

> a. Revert the change in view (and underlying function) as done in
> commit 0ff20288e1 and consider the alternate way (using a slightly
> complex query) to fix. Then maybe for PG-16, we can simplify it by
> changing the underlying function and view.

But, ISTM that it makes no sense to do it differently for v15 just to avoid the
appearance of adding a new feature, only to re-do it in 2 weeks for v16...
So (from a passive observer) +0.1 to keep the current patch.

I have some minor language fixes to that patch.

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index d96c72e5310..82aa84e96e1 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -9691,7 +9691,7 @@ SCRAM-SHA-256$iteration 
count:
 
  
   pg_publication_tables
-  publications and information of their associated tables
+  publications and information about their associated tables
  
 
  
@@ -11635,7 +11635,7 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts 
ppx
 
   
The view pg_publication_tables provides
-   information about the mapping between publications and information of
+   information about the mapping between publications and information about
tables they contain.  Unlike the underlying catalog
pg_publication_rel,
this view expands publications defined as FOR ALL TABLES
@@ -11695,7 +11695,7 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts 
ppx
   
   
Names of table columns included in the publication. This contains all
-   the columns of the table when the user didn't specify the column list
+   the columns of the table when the user didn't specify a column list
for the table.
   
  
diff --git a/src/backend/catalog/pg_publication.c 
b/src/backend/catalog/pg_publication.c
index 8c7fca62de3..2f706f638ce 100644
--- a/src/backend/catalog/pg_publication.c
+++ b/src/backend/catalog/pg_publication.c
@@ -1077,7 +1077,7 @@ get_publication_name(Oid pubid, bool missing_ok)
 }
 
 /*
- * Returns information of tables in a publication.
+ * Returns information about tables in a publication.
  */
 Datum
 pg_get_publication_tables(PG_FUNCTION_ARGS)
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 87aa571a331..86f13293090 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11673,7 +11673,7 @@
   prosrc => 'pg_show_replication_origin_status' },
 
 # publications
-{ oid => '6119', descr => 'get information of tables in a publication',
+{ oid => '6119', descr => 'get information about tables in a publication',
   proname => 'pg_get_publication_tables', prorows => '1000', proretset => 't',
   provolatile => 's', prorettype => 'record', proargtypes => 'text',
   proallargtypes => '{text,oid,int2vector,pg_node_tree}', proargmodes => 
'{i,o,o,o}',




parallel not working

2022-05-19 Thread huangning...@yahoo.com
Hi:
I wrote a C function, defined as follows:
create function st_geosotgrid(geom geometry, level integer) returns geosotgrid[]
immutable
strict
parallel safe
language c
as
$$
begin
-- missing source code
end;
$$;




At the same time, I set the relevant parameters:
force_parallel_mode: off
max_parallel_maintenance_workers: 4
max_parallel_workers: 8
max_parallel_workers_per_gather: 2
max_worker_processes: 8
min_parallel_index_scan_size: 64
min_parallel_table_scan_size: 1024
parallel_leader_participation: on

set parallel_setup_cost = 10;
set parallel_tuple_cost = 0.001;

sql:
select st_geosotgrid(geom,20) from t_polygon_gis;

and the explain as follows:

Gather  (cost=10.00..5098.67 rows=20 width=32)
  Workers Planned: 2
  ->  Parallel Seq Scan on t_polygon_gis  (cost=0.00..4888.67 rows=8 width=32)


When I run EXPLAIN ANALYZE, the parallel workers appear to hang.

I would like to know how I can get it to work properly.

Thank You!




Re: Removing unneeded self joins

2022-05-19 Thread Ronan Dunklau
Le jeudi 19 mai 2022, 12:48:18 CEST Andrey Lepikhov a écrit :
> On 5/17/22 19:14, Ronan Dunklau wrote:
> > Le vendredi 13 mai 2022, 07:07:47 CEST Andrey Lepikhov a écrit :
> >> New version of the feature.
> >> Here a minor bug with RowMarks is fixed. A degenerated case is fixed,
> >> when uniqueness of an inner deduced not from join quals, but from a
> >> baserestrictinfo clauses 'x=const', where x - unique field.
> >> Code, dedicated to solve second issue is controversial, so i attached
> >> delta.txt for quick observation.
> >> Maybe we should return to previous version of code, when we didn't split
> >> restriction list into join quals and base quals?
> > 
> > Hello,
> > 
> > I tried to find problematic cases, which would make the planning time grow
> > unacceptably, and couldn't devise it.
> > 
> > The worst case scenario I could think of was as follows:
> >   - a query with many different self joins
> >   - an abundance of unique indexes on combinations of this table columns
> >   to
> > 
> > consider
> > 
> >   - additional predicates on the where clause on columns.
> 
> Looking into the patch I can imagine, that the most difficult case is
> when a set of relations with the same OID is huge, but only small part
> of them (or nothing) can be removed.
> Also, removing a clause from restrictinfo list or from equivalence class
> adds non-linear complexity. So, you can dig this way ).
> 
> > The base table I used for this was a table with 40 integers. 39 unique
> > indexes were defined on every combination of (c1, cX) with cX being
> > columns c2 to c40.
> > 
> > I turned geqo off, set from_collapse_limit and join_collapse_limit to
> > unreasonably high values (30), and tried to run queries of the form:
> > 
> > SELECT * FROM test_table t1
> > JOIN test_table tX ON t1.c1 = tX.c1 AND t1.c[X+2] = tX.cX
> > ...
> > JOIN test_table tX ON t1.c1 = tX.c1 AND t1.c[X+2] = tX.cX.
> > 
> > So no self join can be eliminated in that case.
> 
> I think, you should compare t1.cX with tX.cX to eliminate self-join.
> Cross-unique-index proof isn't supported now.

Yes, that's the point. I wanted to try to introduce as much complexity as I 
could, without actually performing any self join elimination. The idea was to 
try to come up with the worst case scenario.
> 
> > The performance was very similar with or without the GUC enabled. I tested
> > the same thing without the patch, since the test for uniqueness has been
> > slightly altered and incurs a new allocation, but it doesn't seem to
> > change.
> > 
> > One interesting side effect of this patch is that removing any unneeded
> > self join cuts down the planning time very significantly, as we
> > lower the number of combinations to consider.
> 
> Even more - removing a join we improve cardinality estimation.
> 
> > As for the code:
> >   - Comments on relation_has_unique_index_ext and
> >   relation_has_unique_index_for> 
> > should be rewritten, as relation_has_unique_index_for is now just a
> > special
> > case of relation_has_unique_index_ext. By the way, the comment would
> > probably be better read as: "but if extra_clauses isn't NULL".
> > 
> >   - The whole thing about "extra_clauses", ie, baserestrictinfos which
> >   were
> > 
> > used to determine uniqueness, is not very clear. Most functions where the
> > new argument has been added have not seen an update in their comments,
> > and the name itself doesn't really convey the intended meaning: perhaps
> > required_non_join_clauses ?
> > 
> > The way this works should be explained a bit more thoroughly, for example
> > in remove_self_joins_one_group the purpose of uclauses should be
> > explained. The fact that degenerate_case returns true when we don't have
> > any additional base restrict info is also confusing, as well as the
> > degenerate_case name.
> Agree,
> but after this case thoughts wander in my head: should we make one step
> back to pre-[1] approach? It looks like we have quite similar changes,
> but without special function for a 'degenerate case' detection and
> restrictlist splitting.

I'll take a look at that one. 

> 
> > I'll update if I think of more interesting things to add.
> 
> Thank you for your efforts!
> 
> See in attachment next version which fixes mistakes detected by
> z...@yugabyte.com.
> 
> [1]
> https://www.postgresql.org/message-id/raw/CAApHDvpggnFMC4yP-jUO7PKN%3DfXeErW
> 5bOxisvJ0HvkHQEY%3DWw%40mail.gmail.com


-- 
Ronan Dunklau






Re: Addition of PostgreSQL::Test::Cluster::pg_version()

2022-05-19 Thread Andrew Dunstan


On 2022-05-18 We 21:38, Michael Paquier wrote:
> Hi all,
> (Added Andrew in CC.)
>
> While working more on expanding the tests of pg_upgrade for
> cross-version checks, I have noticed that we don't expose a routine
> able to get back _pg_version from a node, which should remain a
> private field of Cluster.pm.  We already do that for install_path, as
> of case added by 87076c4.
>
> Any objections or comments about the addition of a routine to get the
> PostgreSQL::Version, as of the attached?
>


Looks ok. PostgreSQL::Version is designed so that the object behaves
sanely in comparisons and when interpolated into a string.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Build-farm - intermittent error in 031_column_list.pl

2022-05-19 Thread Amit Kapila
On Thu, May 19, 2022 at 3:16 PM Amit Kapila  wrote:
>
> On Thu, May 19, 2022 at 12:28 PM Kyotaro Horiguchi
>  wrote:
> >
> > At Thu, 19 May 2022 14:26:56 +1000, Peter Smith  
> > wrote in
> > > Hi hackers.
> > >
> > > FYI, I saw that there was a recent Build-farm error on the "grison" 
> > > machine [1]
> > > [1] 
> > > https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=grison&br=HEAD
> > >
> > > The error happened during "subscriptionCheck" phase in the TAP test
> > > t/031_column_list.pl
> > > This test file was added by this [2] commit.
> > > [2] 
> > > https://github.com/postgres/postgres/commit/923def9a533a7d986acfb524139d8b9e5466d0a5
> >
> > What is happening for all of them looks like that the name of a
> > publication created by CREATE PUBLICATION without a failure report is
> > missing for a walsender came later. It seems like CREATE PUBLICATION
> > can silently fail to create a publication, or walsender somehow failed
> > to find existing one.
> >
>
> Do you see anything in LOGS which indicates CREATE SUBSCRIPTION has failed?
>
> >
> > > ~~
> > >
> >
> > 2022-04-17 00:16:04.278 CEST [293659][client 
> > backend][4/270:0][031_column_list.pl] LOG:  statement: CREATE PUBLICATION 
> > pub9 FOR TABLE test_part_d (a) WITH (publish_via_partition_root = true);
> > 2022-04-17 00:16:04.279 CEST [293659][client 
> > backend][:0][031_column_list.pl] LOG:  disconnection: session time: 
> > 0:00:00.002 user=bf database=postgres host=[local]
> >
> > "CREATE PUBLICATION pub9" is executed at 00:16:04.278 on 293659 then
> > the session has been disconnected. But the following request for the
> > same publication fails due to the absense of the publication.
> >
> > 2022-04-17 00:16:08.147 CEST [293856][walsender][3/0:0][sub1] STATEMENT:  
> > START_REPLICATION SLOT "sub1" LOGICAL 0/153DB88 (proto_version '3', 
> > publication_names '"pub9"')
> > 2022-04-17 00:16:08.148 CEST [293856][walsender][3/0:0][sub1] ERROR:  
> > publication "pub9" does not exist
> >
>
> This happens after "ALTER SUBSCRIPTION sub1 SET PUBLICATION pub9". The
> probable theory is that ALTER SUBSCRIPTION will lead to restarting of
> apply worker (which we can see in LOGS as well) and after the restart,
> the apply worker will use the existing slot and replication origin
> corresponding to the subscription. Now, it is possible that before
> restart the origin has not been updated and the WAL start location
> points to a location prior to where PUBLICATION pub9 exists which can
> lead to such an error. Once this error occurs, apply worker will never
> be able to proceed and will always return the same error. Does this
> make sense?
>
> Unless you or others see a different theory, this seems to be the
> existing problem in logical replication which is manifested by this
> test. If we just want to fix these test failures, we can create a new
> subscription instead of altering the existing publication to point to
> the new publication.
>

If the above theory is correct then I think allowing the publisher to
catch up with "$node_publisher->wait_for_catchup('sub1');" before
ALTER SUBSCRIPTION should fix this problem. Because if before ALTER
both publisher and subscriber are in sync then the new publication
should be visible to WALSender.

-- 
With Regards,
Amit Kapila.




Re: Removing unneeded self joins

2022-05-19 Thread Andrey Lepikhov

On 5/17/22 19:14, Ronan Dunklau wrote:

Le vendredi 13 mai 2022, 07:07:47 CEST Andrey Lepikhov a écrit :

New version of the feature.
Here a minor bug with RowMarks is fixed. A degenerated case is fixed,
when uniqueness of an inner deduced not from join quals, but from a
baserestrictinfo clauses 'x=const', where x - unique field.
Code, dedicated to solve second issue is controversial, so i attached
delta.txt for quick observation.
Maybe we should return to previous version of code, when we didn't split
restriction list into join quals and base quals?


Hello,

I tried to find problematic cases, which would make the planning time grow
unacceptably, and couldn't devise it.

The worst case scenario I could think of was as follows:

  - a query with many different self joins
  - an abundance of unique indexes on combinations of this table columns to
consider
  - additional predicates on the where clause on columns.
Looking into the patch I can imagine, that the most difficult case is 
when a set of relations with the same OID is huge, but only small part 
of them (or nothing) can be removed.
Also, removing a clause from restrictinfo list or from equivalence class 
adds non-linear complexity. So, you can dig this way ).


The base table I used for this was a table with 40 integers. 39 unique indexes
were defined on every combination of (c1, cX) with cX being columns c2 to c40.

I turned geqo off, set from_collapse_limit and join_collapse_limit to
unreasonably high values (30), and tried to run queries of the form:

SELECT * FROM test_table t1
JOIN test_table tX ON t1.c1 = tX.c1 AND t1.c[X+2] = tX.cX
...
JOIN test_table tX ON t1.c1 = tX.c1 AND t1.c[X+2] = tX.cX.

So no self join can be eliminated in that case.
I think, you should compare t1.cX with tX.cX to eliminate self-join. 
Cross-unique-index proof isn't supported now.

The performance was very similar with or without the GUC enabled. I tested the
same thing without the patch, since the test for uniqueness has been slightly
altered and incurs a new allocation, but it doesn't seem to change.

One interesting side effect of this patch is that removing any unneeded self
join cuts down the planning time very significantly, as we lower the number
of combinations to consider.

Even more - removing a join we improve cardinality estimation.


As for the code:

  - Comments on relation_has_unique_index_ext and relation_has_unique_index_for
should be rewritten, as relation_has_unique_index_for is now just a special
case of relation_has_unique_index_ext. By the way, the comment would probably
be better read as: "but if extra_clauses isn't NULL".
  - The whole thing about "extra_clauses", ie, baserestrictinfos which were
used to determine uniqueness, is not very clear. Most functions where the new
argument has been added have not seen an update in their comments, and the
name itself doesn't really convey the intended meaning: perhaps
required_non_join_clauses ?

The way this works should be explained a bit more thoroughly, for example in
remove_self_joins_one_group the purpose of uclauses should be explained. The
fact that degenerate_case returns true when we don't have any additional base
restrict info is also confusing, as well as the degenerate_case name.

Agree,
but after this case thoughts wander in my head: should we make one step 
back to pre-[1] approach? It looks like we have quite similar changes, 
but without special function for a 'degenerate case' detection and 
restrictlist splitting.


I'll update if I think of more interesting things to add.

Thank you for your efforts!

See in attachment next version which fixes mistakes detected by 
z...@yugabyte.com.


[1] 
https://www.postgresql.org/message-id/raw/CAApHDvpggnFMC4yP-jUO7PKN%3DfXeErW5bOxisvJ0HvkHQEY%3DWw%40mail.gmail.com


--
Regards
Andrey Lepikhov
Postgres Professional

From 1774b4bcfe337b151b76c7f357dc6748755bf1d9 Mon Sep 17 00:00:00 2001
From: "Andrey V. Lepikhov" 
Date: Thu, 15 Jul 2021 15:26:13 +0300
Subject: [PATCH] Remove self-joins.

The Self Join Removal (SJR) feature removes an inner join of a plain table to
itself in a query plan if it can be proved that the join can be replaced with
a scan. The proof is based on the innerrel_is_unique machinery.

We can remove a self-join when for each outer row:
1. At most one inner row matches the join clauses.
2. If the join target list contains any inner vars, an inner row
must be (physically) the same row as the outer one.

In this patch we use Rowley's [1] approach to identify a self-join:
1. Collect all mergejoinable join quals which look like a.x = b.x
2. Check innerrel_is_unique() for the qual list from (1). If it
returns true, then outer row matches only the same row from the inner
relation.
3. If uniqueness of the outer relation is deduced from a baserestrictinfo
clause like 'x=const' on unique field(s), check that the inner relation has the
same clause with the same constant(s).
4. Compare RowMarks of inner and outer relations. They must have the same
strength.

Some 

Re: Zstandard support for toast compression

2022-05-19 Thread Amit Kapila
On Wed, May 18, 2022 at 9:47 PM Robert Haas  wrote:
>
> I do understand that Zstandard is a good compression algorithm, and if
> we had an extensibility mechanism here where one of the four possible
> bit patterns then indicates that the next byte (or two or four) stores
> the real algorithm type, then what about adding Zstandard that way
> instead of consuming one of our four primary bit patterns? That way
> we'd have this option for people who want it, but we'd have more
> options for the future instead of fewer.
>
> i.e. something like:
>
> 00 = PGLZ
> 01 = LZ4
> 10 = reserved for future emergencies
> 11 = extended header with additional type byte (1 of 256 possible
> values reserved for Zstandard)
>

+1 for such an extensible mechanism if we decide to go with Zstandard
compression algorithm. To decide that won't it make sense to see some
numbers as Michael already has a patch for the new algorithm?

-- 
With Regards,
Amit Kapila.




Re: Intermittent buildfarm failures on wrasse

2022-05-19 Thread Alvaro Herrera
I was looking at backpatching this to pg13.  That made me realize that
commit dc7420c2c927 changed things in 14; and before that commit, the
bitmask that is checked is PROCARRAY_FLAGS_VACUUM, which has a
definition independently from whatever proc.h says.  As far as I can
tell, there's no problem with the patches I post here (the backpatched
version for pg13 and p14).  But's it's something to be aware of; and if
we do want to add the additional bits to the bitmask, we should do that
in a separate master-only commit.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
From 6595af9af208de33ab431b502e2f178f7d0f8007 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Sat, 14 May 2022 16:51:23 +0200
Subject: [PATCH] Repurpose PROC_COPYABLE_FLAGS as PROC_XMIN_FLAGS

This is a slight, convenient semantics change from what commit
0f0cfb494004 introduced that lets us simplify the coding in the one
place where it is used.
---
 src/backend/storage/ipc/procarray.c | 11 ++-
 src/include/storage/proc.h  |  7 +++
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 6ff8d8699b..447b6e3de7 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -1916,12 +1916,13 @@ ProcArrayInstallRestoredXmin(TransactionId xmin, PGPROC *proc)
 		TransactionIdIsNormal(xid) &&
 		TransactionIdPrecedesOrEquals(xid, xmin))
 	{
-		/* Install xmin */
+		/*
+		 * Install xmin and propagate the vacuumFlags that affect how the
+		 * value is interpreted by vacuum.
+		 */
 		MyPgXact->xmin = TransactionXmin = xmin;
-
-		/* Flags being copied must be valid copy-able flags. */
-		Assert((pgxact->vacuumFlags & (~PROC_COPYABLE_FLAGS)) == 0);
-		MyPgXact->vacuumFlags = pgxact->vacuumFlags;
+		MyPgXact->vacuumFlags = (MyPgXact->vacuumFlags & ~PROC_XMIN_FLAGS) |
+			(pgxact->vacuumFlags & PROC_XMIN_FLAGS);
 
 		result = true;
 	}
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 9cf9684b41..7c85b5645b 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -63,11 +63,10 @@ struct XidCache
 	(PROC_IN_VACUUM | PROC_IN_ANALYZE | PROC_VACUUM_FOR_WRAPAROUND)
 
 /*
- * Flags that are valid to copy from another proc, the parallel leader
- * process in practice.  Currently, a flag that is set during parallel
- * vacuum is allowed.
+ * Xmin-related flags. Make sure any flags that affect how the process' Xmin
+ * value is interpreted by VACUUM are included here.
  */
-#define		PROC_COPYABLE_FLAGS (PROC_IN_VACUUM)
+#define		PROC_XMIN_FLAGS (PROC_IN_VACUUM)
 
 /*
  * We allow a small number of "weak" relation locks (AccessShareLock,
-- 
2.30.2

From 35197d187fe3e7c9e4bd563396c9c7459e39a7ff Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Sat, 14 May 2022 16:51:23 +0200
Subject: [PATCH] Repurpose PROC_COPYABLE_FLAGS as PROC_XMIN_FLAGS

This is a slight, convenient semantics change from what commit
0f0cfb494004 introduced that lets us simplify the coding in the one
place where it is used.
---
 src/backend/storage/ipc/procarray.c | 11 ++-
 src/include/storage/proc.h  |  7 +++
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 127be9c017..08053a7e8f 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -2686,12 +2686,13 @@ ProcArrayInstallRestoredXmin(TransactionId xmin, PGPROC *proc)
 		TransactionIdIsNormal(xid) &&
 		TransactionIdPrecedesOrEquals(xid, xmin))
 	{
-		/* Install xmin */
+		/*
+		 * Install xmin and propagate the statusFlags that affect how the
+		 * value is interpreted by vacuum.
+		 */
 		MyProc->xmin = TransactionXmin = xmin;
-
-		/* Flags being copied must be valid copy-able flags. */
-		Assert((proc->statusFlags & (~PROC_COPYABLE_FLAGS)) == 0);
-		MyProc->statusFlags = proc->statusFlags;
+		MyProc->statusFlags = (MyProc->statusFlags & ~PROC_XMIN_FLAGS) |
+			(proc->statusFlags & PROC_XMIN_FLAGS);
 		ProcGlobal->statusFlags[MyProc->pgxactoff] = MyProc->statusFlags;
 
 		result = true;
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 11b514c9ae..1464fad9b9 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -66,11 +66,10 @@ struct XidCache
 	(PROC_IN_VACUUM | PROC_IN_SAFE_IC | PROC_VACUUM_FOR_WRAPAROUND)
 
 /*
- * Flags that are valid to copy from another proc, the parallel leader
- * process in practice.  Currently, flags that are set during parallel
- * vacuum and parallel index creation are allowed.
+ * Xmin-related flags. Make sure any flags that affect how the process' Xmin
+ * value is interpreted by VACUUM are included here.
  */
-#define		PROC_COPYABLE_FLAGS (PROC_IN_VACUUM | PROC_IN_SAFE_IC)
+#define		PROC_XMIN_FLAGS (PROC_IN_VACUUM | PROC_IN_SAFE_IC)
 
 /*
  * We allow a small number of "weak" relation locks 

Re: Build-farm - intermittent error in 031_column_list.pl

2022-05-19 Thread Amit Kapila
On Thu, May 19, 2022 at 12:28 PM Kyotaro Horiguchi
 wrote:
>
> At Thu, 19 May 2022 14:26:56 +1000, Peter Smith  wrote 
> in
> > Hi hackers.
> >
> > FYI, I saw that there was a recent Build-farm error on the "grison" machine 
> > [1]
> > [1] 
> > https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=grison=HEAD
> >
> > The error happened during "subscriptionCheck" phase in the TAP test
> > t/031_column_list.pl
> > This test file was added by this [2] commit.
> > [2] 
> > https://github.com/postgres/postgres/commit/923def9a533a7d986acfb524139d8b9e5466d0a5
>
> What seems to be happening in all of these cases is that a publication
> created by CREATE PUBLICATION, with no failure reported, is missing when
> a walsender looks for it later. It seems like CREATE PUBLICATION can
> silently fail to create a publication, or the walsender somehow failed
> to find the existing one.
>

Do you see anything in LOGS which indicates CREATE PUBLICATION has failed?

>
> > ~~
> >
>
> 2022-04-17 00:16:04.278 CEST [293659][client 
> backend][4/270:0][031_column_list.pl] LOG:  statement: CREATE PUBLICATION 
> pub9 FOR TABLE test_part_d (a) WITH (publish_via_partition_root = true);
> 2022-04-17 00:16:04.279 CEST [293659][client backend][:0][031_column_list.pl] 
> LOG:  disconnection: session time: 0:00:00.002 user=bf database=postgres 
> host=[local]
>
> "CREATE PUBLICATION pub9" is executed at 00:16:04.278 in backend 293659,
> and then the session disconnects. But a later request for the same
> publication fails due to the absence of the publication.
>
> 2022-04-17 00:16:08.147 CEST [293856][walsender][3/0:0][sub1] STATEMENT:  
> START_REPLICATION SLOT "sub1" LOGICAL 0/153DB88 (proto_version '3', 
> publication_names '"pub9"')
> 2022-04-17 00:16:08.148 CEST [293856][walsender][3/0:0][sub1] ERROR:  
> publication "pub9" does not exist
>

This happens after "ALTER SUBSCRIPTION sub1 SET PUBLICATION pub9". The
probable theory is that ALTER SUBSCRIPTION will lead to restarting of
apply worker (which we can see in LOGS as well) and after the restart,
the apply worker will use the existing slot and replication origin
corresponding to the subscription. Now, it is possible that before
restart the origin has not been updated and the WAL start location
points to a location prior to where PUBLICATION pub9 exists which can
lead to such an error. Once this error occurs, apply worker will never
be able to proceed and will always return the same error. Does this
make sense?

Unless you or others see a different theory, this seems to be an
existing problem in logical replication which is manifested by this
test. If we just want to fix these test failures, we can create a new
subscription instead of altering the existing subscription to point to
the new publication.

Note: Added Tomas to get his views, as he committed this test.

-- 
With Regards,
Amit Kapila.




Re: First draft of the PG 15 release notes

2022-05-19 Thread Amit Langote
On Thu, May 19, 2022 at 2:56 PM David Rowley  wrote:
> On Thu, 19 May 2022 at 14:41, Amit Langote  wrote:
> > Though a bit late given beta is now wrapped, I have another partition
> > item wording improvement suggestion:
> >
> > -Previously, a partitioned table with any LIST partition containing
> > multiple values could not be used for ordered partition scans.  Now
> > only non-pruned LIST partitions are checked.  This also helps with
> > -partitioned tables with DEFAULT partitions.
> >
> > +Previously, an ordered partition scan would not be considered for a
> > LIST-partitioned table with any partition containing multiple values,
> > nor for partitioned tables with DEFAULT partition.
>
> I think your proposed wording does not really improve things.  The
> "Now only non-pruned LIST partitions are checked" is important and I
> think Bruce did the right thing to mention that. Prior to this change,
> ordered scans were not possible if there was a DEFAULT or if any LIST
> partition allowed >1 value. Now, if the default partition is pruned
> and there are no non-pruned partitions that allow Datum values that
> are inter-mixed with ones from another non-pruned partition, then an
> ordered scan can be performed.
>
> For example, non-pruned partition a allows IN(1,3), and non-pruned
> partition b allows IN(2,4), we cannot do the ordered scan. With
> IN(1,2), IN(3,4), we can.

I think that's what I understood this change to be about.  Before this
change, partitions_are_ordered() only returned true if *all*
partitions of a parent are known to be ordered, which they're not in
the presence of the default partition and of a list partition
containing out-of-order values.  It didn't matter to
partitions_are_ordered() that the caller might not care about those
partitions being present in the PartitionDesc because of having been
pruned by the query, but that information was not readily available.
So, you added PartitionBoundInfo.interleaved_parts to record indexes
of partitions containing out-of-order values and RelOptInfo.live_parts
to record non-pruned partitions, which made it more feasible for
partitions_are_ordered() to address those cases.  I suppose you think
it's better to be verbose by mentioning that partitions_are_ordered()
now considers only non-pruned partitions which allows supporting more
cases, but I see that as mentioning implementation details
unnecessarily.

Or maybe we could mention that but use a wording that doesn't make it
sound like an implementation detail, like:

+Previously, an ordered partition scan could not be used for a
LIST-partitioned table with any partition containing multiple values,
nor for partitioned tables with DEFAULT partition.  Now it can be used
in those cases at least for queries in which such partitions are
pruned.


--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com


reword-ordered-partition-scan-item_v2.diff
Description: Binary data


RE: Handle infinite recursion in logical replication setup

2022-05-19 Thread shiy.f...@fujitsu.com
On Wed, May 4, 2022 2:47 PM vignesh C  wrote:
> 
> Thanks for the comments, the attached v13 patch has the changes for the same.
>

Here are some comments on v13-0002 patch.

1)
+* Throw an error so that the user can take care of the initial 
data
+* copying and then create subscription with copy_data off.

Should "with copy_data off" be changed to "with copy_data off or force"?

2)
case ALTER_SUBSCRIPTION_ADD_PUBLICATION:
case ALTER_SUBSCRIPTION_DROP_PUBLICATION:
...
/*
 * See ALTER_SUBSCRIPTION_REFRESH for 
details why this is
 * not allowed.
 */
if (sub->twophasestate == 
LOGICALREP_TWOPHASE_STATE_ENABLED && opts.copy_data)

I think we need some changes here, too. Should it be modified to:
if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && 
IS_COPY_DATA_ON_OR_FORCE(opts.copy_data))

Regards,
Shi yu


Re: Zstandard support for toast compression

2022-05-19 Thread Michael Paquier
On Wed, May 18, 2022 at 12:17:16PM -0400, Robert Haas wrote:
> i.e. something like:
> 
> 00 = PGLZ
> 01 = LZ4
> 10 = reserved for future emergencies
> 11 = extended header with additional type byte (1 of 256 possible
> values reserved for Zstandard)

Btw, shouldn't we have something a bit more, err, extensible for the
design of an extensible varlena header?  If we keep it down to some
bitwise information, we'd be fine for a long time, but it would be
annoying to have to revisit the design again if we need to extend it
with more data.
--
Michael


signature.asc
Description: PGP signature


Re: Skipping schema changes in publication

2022-05-19 Thread Peter Smith
Below are my review comments for v6-0001.

==

1. General.

The patch failed 'publication' tests in the make check phase.

Please add this work to the commit-fest so that the 'cfbot' can report
such errors sooner.

~~~

2. src/backend/commands/publicationcmds.c - AlterPublicationReset

+/*
+ * Reset the publication.
+ *
+ * Reset the publication options, publication relations and
publication schemas.
+ */
+static void
+AlterPublicationReset(ParseState *pstate, AlterPublicationStmt *stmt,
+ Relation rel, HeapTuple tup)

SUGGESTION (Make the comment similar to the sgml text instead of
repeating "publication" 4x !)
/*
 * Reset the publication options, set the ALL TABLES flag to false, and
 * drop all relations and schemas that are associated with the publication.
 */

~~~

3. src/test/regress/expected/publication.out

make check failed. The diff is below:

@@ -1716,7 +1716,7 @@
 -- Verify that only superuser can reset a publication
 ALTER PUBLICATION testpub_reset OWNER TO regress_publication_user2;
 SET ROLE regress_publication_user2;
-ALTER PUBLICATION testpub_reset RESET; -- fail
+ALTER PUBLICATION testpub_reset RESET; -- fail - must be superuser
 ERROR:  must be superuser to RESET publication
 SET ROLE regress_publication_user;
 DROP PUBLICATION testpub_reset;


--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Limiting memory allocation

2022-05-19 Thread Dmitry Dolgov
> On Wed, May 18, 2022 at 04:49:24PM -0400, Joe Conway wrote:
> On 5/18/22 16:20, Alvaro Herrera wrote:
> > On 2022-May-18, Joe Conway wrote:
> >
> > > On 5/18/22 11:11, Alvaro Herrera wrote:
> >
> > > > Apparently, if the cgroup goes over the "high" limit, the processes are
> > > > *throttled*.  Then if the group goes over the "max" limit, OOM-killer is
> > > > invoked.
> >
> > > You may be misinterpreting "throttle" in this context. From [1]:
> > >
> > >   The memory.high boundary on the other hand can be set
> > >   much more conservatively. When hit, it throttles
> > >   allocations by forcing them into direct reclaim to
> > >   work off the excess, but it never invokes the OOM
> > >   killer.
> >
> > Well, that means the backend processes don't do their expected task
> > (process some query) but instead they have to do "direct reclaim".  I
> > don't know what that is, but it sounds like we'd need to add
> > Linux-specific code in order for this to fix anything.
>
> Postgres does not need to do anything. The kernel just does its thing (e.g.
> clearing page cache or swapping out anon memory) more aggressively than
> normal to clear up some space for the impending allocation.
>
> > And what would we do in such a situation anyway?  Seems like our
> > best hope would be to> get malloc() to return NULL and have the
> > resulting transaction abort free enough memory that things in other
> > backends can continue to run.
>
> With the right hooks an extension could detect the memory pressure in an OS
> specific way and return null.
>
> > *If* there is a way to have cgroups make Postgres do that, then that
> > would be useful enough.
>
> Memory accounting under cgroups (particularly v2) can provide the signal
> needed for a Linux specific extension to do that.

To elaborate a bit on this, the Linux PSI feature (for cgroups v2 only,
in the context of containers) [1] would allow a userspace application to
register a trigger that fires when memory pressure exceeds some
threshold. The pressure here is not exactly how much memory is
allocated, but rather memory stall time, and the whole machinery would
involve polling -- but it still sounds interesting in the context of
this thread.

[1]: https://www.kernel.org/doc/Documentation/accounting/psi.rst




Re: Addition of PostgreSQL::Test::Cluster::pg_version()

2022-05-19 Thread Daniel Gustafsson
> On 19 May 2022, at 03:38, Michael Paquier  wrote:

> Any objections or comments about the addition of a routine to get the
> PostgreSQL::Version, as of the attached?

I haven't tested the patch, but +1 on the idea.

--
Daniel Gustafsson   https://vmware.com/





Re: Build-farm - intermittent error in 031_column_list.pl

2022-05-19 Thread Kyotaro Horiguchi
At Thu, 19 May 2022 14:26:56 +1000, Peter Smith  wrote 
in 
> Hi hackers.
> 
> FYI, I saw that there was a recent Build-farm error on the "grison" machine 
> [1]
> [1] https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=grison=HEAD
> 
> The error happened during "subscriptionCheck" phase in the TAP test
> t/031_column_list.pl
> This test file was added by this [2] commit.
> [2] 
> https://github.com/postgres/postgres/commit/923def9a533a7d986acfb524139d8b9e5466d0a5

What seems to be happening in all of these cases is that a publication
created by CREATE PUBLICATION, with no failure reported, is missing when
a walsender looks for it later. It seems like CREATE PUBLICATION can
silently fail to create a publication, or the walsender somehow failed
to find the existing one.


> ~~
> 
> I checked the history of fails for that TAP test t/031_column_list.pl
> and found that this same error seems to have been happening
> intermittently for at least the last 50 days.
> 
> Details of similar previous errors from the BF are listed below.
> 
> ~~~
> 
> 1. Details for system "grison" failure at stage subscriptionCheck,
> snapshot taken 2022-05-18 18:11:45
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grison=2022-05-18%2018%3A11%3A45
> 
> [22:02:08] t/029_on_error.pl .. ok    25475 ms ( 0.01
> usr  0.00 sys + 15.39 cusr  5.59 csys = 20.99 CPU)
> # poll_query_until timed out executing this query:
> # SELECT '0/1530588' <= replay_lsn AND state = 'streaming'
> #  FROM pg_catalog.pg_stat_replication
> #  WHERE application_name IN ('sub1', 'walreceiver')
> # expecting this output:
> # t
> # last actual query output:
> #
> # with stderr:
> # Tests were run but no plan was declared and done_testing() was not seen.
> # Looks like your test exited with 29 just after 22.
> [22:09:25] t/031_column_list.pl ...
> ...
> [22:02:47.887](1.829s) ok 22 - partitions with different replica
> identities not replicated correctly Waiting for replication conn
> sub1's replay_lsn to pass 0/1530588 on publisher
> [22:09:25.395](397.508s) # poll_query_until timed out executing this query:
> # SELECT '0/1530588' <= replay_lsn AND state = 'streaming'
> #  FROM pg_catalog.pg_stat_replication
> #  WHERE application_name IN ('sub1', 'walreceiver')
> # expecting this output:
> # t
> # last actual query output:
> #
> # with stderr:
> timed out waiting for catchup at t/031_column_list.pl line 728.
> ### Stopping node "publisher" using mode immediate

2022-04-17 00:16:04.278 CEST [293659][client 
backend][4/270:0][031_column_list.pl] LOG:  statement: CREATE PUBLICATION pub9 
FOR TABLE test_part_d (a) WITH (publish_via_partition_root = true);
2022-04-17 00:16:04.279 CEST [293659][client backend][:0][031_column_list.pl] 
LOG:  disconnection: session time: 0:00:00.002 user=bf database=postgres 
host=[local]

"CREATE PUBLICATION pub9" is executed at 00:16:04.278 in backend 293659,
and then the session disconnects. But a later request for the same
publication fails due to the absence of the publication.

2022-04-17 00:16:08.147 CEST [293856][walsender][3/0:0][sub1] STATEMENT:  
START_REPLICATION SLOT "sub1" LOGICAL 0/153DB88 (proto_version '3', 
publication_names '"pub9"')
2022-04-17 00:16:08.148 CEST [293856][walsender][3/0:0][sub1] ERROR:  
publication "pub9" does not exist


> ~~~
> 
> 2. Details for system "xenodermus" failure at stage subscriptionCheck,
> snapshot taken 2022-04-16 21:00:04
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=xenodermus=2022-04-16%2021%3A00%3A04

The same. pub9 is missing after creation.

> ~~~
> 
> 3. Details for system "phycodurus" failure at stage subscriptionCheck,
> snapshot taken 2022-04-05 17:30:04
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=phycodurus=2022-04-05%2017%3A30%3A04

The same happens for pub7.

> 4. Details for system "phycodurus" failure at stage subscriptionCheck,
> snapshot taken 2022-04-05 17:30:04
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=phycodurus=2022-04-05%2017%3A30%3A04

Same. pub7 is missing.

> 5. Details for system "grison" failure at stage subscriptionCheck,
> snapshot taken 2022-04-03 18:11:39
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grison=2022-04-03%2018%3A11%3A39

Same. pub7 is missing.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Perform streaming logical transactions by background workers and parallel apply

2022-05-19 Thread Peter Smith
Here are my review comments for v6-0002.

==

1. src/test/subscription/t/015_stream.pl

+
+# Test using streaming mode 'on'
+
 $node_subscriber->safe_psql('postgres',
  "CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr
application_name=$appname' PUBLICATION tap_pub WITH (streaming = on)"
 );
-
 $node_publisher->wait_for_catchup($appname);
-
 # Also wait for initial table sync to finish
 my $synced_query =
   "SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT
IN ('r', 's');";
 $node_subscriber->poll_query_until('postgres', $synced_query)
   or die "Timed out while waiting for subscriber to synchronize data";
-
 my $result =
   $node_subscriber->safe_psql('postgres',
  "SELECT count(*), count(c), count(d = 999) FROM test_tab");
 is($result, qq(2|2|2), 'check initial data was copied to subscriber');

1a.
Several blank lines were removed by the patch. IMO it was
better (i.e. less squished) how it looked originally.

1b.
Maybe some more blank lines should be added to the 'apply' test part
too, to match 1a.

~~~

2. src/test/subscription/t/015_stream.pl

+$node_publisher->poll_query_until('postgres',
+ "SELECT pid != $oldpid FROM pg_stat_replication WHERE
application_name = '$appname' AND state = 'streaming';"
+) or die "Timed out while waiting for apply to restart after changing
PUBLICATION";

Should that say "... after changing SUBSCRIPTION"?

~~~

3. src/test/subscription/t/016_stream_subxact.pl

+$node_publisher->poll_query_until('postgres',
+ "SELECT pid != $oldpid FROM pg_stat_replication WHERE
application_name = '$appname' AND state = 'streaming';"
+) or die "Timed out while waiting for apply to restart after changing
PUBLICATION";
+

Should that say "... after changing SUBSCRIPTION"?

~~~

4. src/test/subscription/t/017_stream_ddl.pl

+$node_publisher->poll_query_until('postgres',
+ "SELECT pid != $oldpid FROM pg_stat_replication WHERE
application_name = '$appname' AND state = 'streaming';"
+) or die "Timed out while waiting for apply to restart after changing
PUBLICATION";
+

Should that say "... after changing SUBSCRIPTION"?

~~~

5. .../t/018_stream_subxact_abort.pl

+$node_publisher->poll_query_until('postgres',
+ "SELECT pid != $oldpid FROM pg_stat_replication WHERE
application_name = '$appname' AND state = 'streaming';"
+) or die "Timed out while waiting for apply to restart after changing
PUBLICATION";

Should that say "... after changing SUBSCRIPTION" ?

~~~

6. .../t/019_stream_subxact_ddl_abort.pl

+$node_publisher->poll_query_until('postgres',
+ "SELECT pid != $oldpid FROM pg_stat_replication WHERE
application_name = '$appname' AND state = 'streaming';"
+) or die "Timed out while waiting for apply to restart after changing
PUBLICATION";
+

Should that say "... after changing SUBSCRIPTION"?

~~~

7. .../subscription/t/023_twophase_stream.pl

###
# Check initial data was copied to subscriber
###

Perhaps the above comment now looks a bit out-of-place with the extra #.

Looks better now as just:
# Check initial data was copied to the subscriber

~~~

8. .../subscription/t/023_twophase_stream.pl

+$node_publisher->poll_query_until('postgres',
+ "SELECT pid != $oldpid FROM pg_stat_replication WHERE
application_name = '$appname' AND state = 'streaming';"
+) or die "Timed out while waiting for apply to restart after changing
PUBLICATION";

Should that say "... after changing SUBSCRIPTION"?

--
Kind Regards,
Peter Smith.
Fujitsu Australia