Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-10-13 Thread Andres Freund
Hi,

On 2023-10-13 11:30:35 -0700, Andres Freund wrote:
> On 2023-10-13 10:39:10 -0700, Andres Freund wrote:
> > On 2023-10-12 09:24:19 -0700, Andres Freund wrote:
> > > I kind of had hoped somebody would comment on the approach.  Given that
> > > nobody has, I'll push the minimal fix of resetting the state in
> > > ReleaseBulkInsertStatePin(), even though I think architecturally that's
> > > not great.
> > 
> > I spent some time working on a test that shows the problem more cheaply than
> > the case upthread. I think it'd be desirable to have a test that's likely to
> > catch an issue like this fairly quickly. We've had other problems in this
> > realm before - there's only a single test that fails if I remove the
> > ReleaseBulkInsertStatePin() call, and I don't think that's guaranteed at all.
> > 
> > I'm a bit on the fence on how large to make the relation. For me the bug
> > triggers when filling both relations up to the 3rd block, but how many rows
> > that takes is somewhat dependent on space utilization on the page and stuff.
> > 
> > Right now the test uses data/desc.data and ends up with 328kB and 312kB in
> > two partitions. Alternatively I could make the test create a new file to
> > load with copy that has fewer rows than data/desc.data - I didn't see
> > another data file that works conveniently and has fewer rows.  The copy is
> > reasonably fast, even under something as expensive as rr (~60ms). So I'm
> > inclined to just go with that?
> 
> Patch with fix and test attached (0001).

Pushed that.

Greetings,

Andres




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-10-13 Thread Andres Freund
Hi,

On 2023-10-13 10:39:10 -0700, Andres Freund wrote:
> On 2023-10-12 09:24:19 -0700, Andres Freund wrote:
> > I kind of had hoped somebody would comment on the approach.  Given that
> > nobody has, I'll push the minimal fix of resetting the state in
> > ReleaseBulkInsertStatePin(), even though I think architecturally that's not
> > great.
> 
> I spent some time working on a test that shows the problem more cheaply than
> the case upthread. I think it'd be desirable to have a test that's likely to
> catch an issue like this fairly quickly. We've had other problems in this
> realm before - there's only a single test that fails if I remove the
> ReleaseBulkInsertStatePin() call, and I don't think that's guaranteed at all.
> 
> I'm a bit on the fence on how large to make the relation. For me the bug
> triggers when filling both relations up to the 3rd block, but how many rows
> that takes is somewhat dependent on space utilization on the page and stuff.
> 
> Right now the test uses data/desc.data and ends up with 328kB and 312kB in two
> partitions. Alternatively I could make the test create a new file to load with
> copy that has fewer rows than data/desc.data - I didn't see another data file
> that works conveniently and has fewer rows.  The copy is reasonably fast, even
> under something as expensive as rr (~60ms). So I'm inclined to just go with
> that?

Patch with fix and test attached (0001).

Given how easily a missing ReleaseBulkInsertStatePin() can cause corruption (not
due to this bug, but the issue fixed in b1ecb9b3fcf), I think we should
consider adding an assertion along the lines of 0002 to HEAD. Perhaps adding a
new bufmgr.c function to avoid having to get the fields in the buffer tag we
don't care about. Or perhaps we should just promote the check to an elog, we
already call BufferGetBlockNumber(), using BufferGetTag() instead doesn't cost
much more.
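
For illustration, a check along those lines could look roughly like the sketch
below. This is only a sketch of the idea, not the actual 0002 patch, and the
helper name is made up.

/*
 * Hypothetical helper: verify that the buffer pinned in a BulkInsertState
 * still belongs to the relation we are about to insert into.  BufferGetTag()
 * also returns the fork and block number, which we don't need here - hence
 * the thought about a dedicated bufmgr.c function (or promoting this to elog).
 */
static inline void
AssertBulkInsertStateMatchesRelation(Relation relation, BulkInsertState bistate)
{
	if (bistate->current_buf != InvalidBuffer)
	{
		RelFileLocator rlocator;
		ForkNumber	forknum;
		BlockNumber blknum;

		BufferGetTag(bistate->current_buf, &rlocator, &forknum, &blknum);
		Assert(RelFileLocatorEquals(rlocator, relation->rd_locator));
	}
}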

Greetings,

Andres Freund
>From e202cf5c3cae0551ae8fabd2629b4f9c6a0734d5 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Fri, 13 Oct 2023 10:54:16 -0700
Subject: [PATCH v1 1/2] Fix bulk table extension when copying into multiple
 partitions

When COPYing into a partitioned table that does not permit the use of
table_multi_insert(), we could error out with
  ERROR: could not read block NN in file "base/...": read only 0 of 8192 bytes

because BulkInsertState->next_free was not reset between partitions. This
problem occurred only when not able to use table_multi_insert(), as a dedicated
BulkInsertState for each partition is used in that case.

The bug was introduced in 00d1e02be24, but it was hard to hit at that point as
commonly bulk relation extension is not used when not using
table_multi_insert(). It became more likely after 82a4edabd27, which expanded
the use of bulk extension.

To fix the bug, reset the bulk relation extension state in BulkInsertState in
ReleaseBulkInsertStatePin(). That was added (in b1ecb9b3fcf) to tackle a very
similar issue.  Obviously the name is not quite correct, but there might be
external callers, and bulk insert state needs to be reset in precisely the
situations in which ReleaseBulkInsertStatePin() already needed to be called.

Medium term the better fix likely is to disallow reusing BulkInsertState
across relations.

Add a test that, without the fix, reproduces #18130 in most
configurations. The test also catches the problem fixed in b1ecb9b3fcf when
run with small shared_buffers.

Reported-by: Ivan Kolombet 
Analyzed-by: Tom Lane 
Analyzed-by: Andres Freund 
Bug: #18130
Discussion: https://postgr.es/m/18130-7a86a7356a75209d%40postgresql.org
Discussion: https://postgr.es/m/257696.1695670946%40sss.pgh.pa.us
Backpatch: 16-
---
 src/backend/access/heap/heapam.c   | 11 +
 src/test/regress/expected/copy.out | 37 ++
 src/test/regress/sql/copy.sql  | 37 ++
 3 files changed, 85 insertions(+)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 88a123d38a6..51eceaca6cf 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1792,6 +1792,17 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
 	if (bistate->current_buf != InvalidBuffer)
 		ReleaseBuffer(bistate->current_buf);
 	bistate->current_buf = InvalidBuffer;
+
+	/*
+	 * Despite the name, we also release information about bulk
+	 * inserts. Otherwise we can end up erroring out due to looking for free
+	 * space in ->next_free of one partition, even though ->next_free was set
+	 * when extending another partition. It obviously also could be bad for
+	 * efficiency to look at existing blocks at offsets from another
+	 * partition, even if we don't error out.
+	 */
+	bistate->next_free = InvalidBlockNumber;
+	bistate->last_free = InvalidBlockNumber;
 }
 
 
diff --git a/src/test/regress/expected/copy.out b/src/test/regress/expected/copy.out
index a5912c13a8c..b48365ec981 100644
--- a/src/test/regress/expected/copy.out
+++ 

Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-10-13 Thread Andres Freund
Hi,

On 2023-10-12 09:24:19 -0700, Andres Freund wrote:
> On 2023-10-12 11:44:09 -0400, Tom Lane wrote:
> > Andres Freund  writes:
> > >> On 2023-09-25 15:42:26 -0400, Tom Lane wrote:
> > >>> I just did a git bisect run to discover when the failure documented
> > >>> in bug #18130 [1] started.  And the answer is commit 82a4edabd.
> > 
> > > Uh, huh.  The problem is that COPY uses a single BulkInsertState for
> > > multiple partitions. Which to me seems to run counter to the following
> > > comment:
> > >  *  The caller can also provide a BulkInsertState object to optimize many
> > >  *  insertions into the same relation.  This keeps a pin on the current
> > >  *  insertion target page (to save pin/unpin cycles) and also passes a
> > >  *  BULKWRITE buffer selection strategy object to the buffer manager.
> > >  *  Passing NULL for bistate selects the default behavior.
> >
> > > The reason this doesn't cause straight up corruption due to reusing a pin
> > > from another relation is that b1ecb9b3fcfb added ReleaseBulkInsertStatePin()
> > > and a call to it. But I didn't make ReleaseBulkInsertStatePin() reset the
> > > bulk insertion state, which is what leads to the errors from the bug report.
> >
> > > Resetting the relevant BulkInsertState fields fixes the problem. But I'm not
> > > sure that's the right fix. ISTM that independent of whether we fix this via
> > > ReleaseBulkInsertStatePin() resetting the fields or via not reusing
> > > BulkInsertState, we should add assertions defending against future issues
> > > like this (e.g. by adding a relation field to BulkInsertState in cassert
> > > builds, and asserting that the relation is the same as in prior calls unless
> > > ReleaseBulkInsertStatePin() has been called).
> > 
> > Ping?  We really ought to have a fix for this committed in time for
> > 16.1.
> 
> I kind of had hoped somebody would comment on the approach.  Given that nobody
> has, I'll push the minimal fix of resetting the state in
> ReleaseBulkInsertStatePin(), even though I think architecturally that's not
> great.

I spent some time working on a test that shows the problem more cheaply than
the case upthread. I think it'd be desirable to have a test that's likely to
catch an issue like this fairly quickly. We've had other problems in this
realm before - there's only a single test that fails if I remove the
ReleaseBulkInsertStatePin() call, and I don't think that's guaranteed at all.

I'm a bit on the fence on how large to make the relation. For me the bug
triggers when filling both relations up to the 3rd block, but how many rows
that takes is somewhat dependent on space utilization on the page and stuff.

Right now the test uses data/desc.data and ends up with 328kB and 312kB in two
partitions. Alternatively I could make the test create a new file to load with
copy that has fewer rows than data/desc.data - I didn't see another data file
that works conveniently and has fewer rows.  The copy is reasonably fast, even
under something as expensive as rr (~60ms). So I'm inclined to just go with
that?

Greetings,

Andres Freund




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-10-12 Thread Andres Freund
Hi,

On 2023-10-12 11:44:09 -0400, Tom Lane wrote:
> Andres Freund  writes:
> >> On 2023-09-25 15:42:26 -0400, Tom Lane wrote:
> >>> I just did a git bisect run to discover when the failure documented
> >>> in bug #18130 [1] started.  And the answer is commit 82a4edabd.
> 
> > Uh, huh.  The problem is that COPY uses a single BulkInsertState for
> > multiple partitions. Which to me seems to run counter to the following
> > comment:
> >  *  The caller can also provide a BulkInsertState object to optimize many
> >  *  insertions into the same relation.  This keeps a pin on the current
> >  *  insertion target page (to save pin/unpin cycles) and also passes a
> >  *  BULKWRITE buffer selection strategy object to the buffer manager.
> >  *  Passing NULL for bistate selects the default behavior.
>
> > The reason this doesn't cause straight up corruption due to reusing a pin
> > from another relation is that b1ecb9b3fcfb added ReleaseBulkInsertStatePin()
> > and a call to it. But I didn't make ReleaseBulkInsertStatePin() reset the
> > bulk insertion state, which is what leads to the errors from the bug report.
>
> > Resetting the relevant BulkInsertState fields fixes the problem. But I'm not
> > sure that's the right fix. ISTM that independent of whether we fix this via
> > ReleaseBulkInsertStatePin() resetting the fields or via not reusing
> > BulkInsertState, we should add assertions defending against future issues
> > like this (e.g. by adding a relation field to BulkInsertState in cassert
> > builds, and asserting that the relation is the same as in prior calls unless
> > ReleaseBulkInsertStatePin() has been called).
> 
> Ping?  We really ought to have a fix for this committed in time for
> 16.1.

I kind of had hoped somebody would comment on the approach.  Given that nobody
has, I'll push the minimal fix of resetting the state in
ReleaseBulkInsertStatePin(), even though I think architecturally that's not
great.

Greetings,

Andres Freund




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-10-12 Thread Tom Lane
Andres Freund  writes:
>> On 2023-09-25 15:42:26 -0400, Tom Lane wrote:
>>> I just did a git bisect run to discover when the failure documented
>>> in bug #18130 [1] started.  And the answer is commit 82a4edabd.

> Uh, huh.  The problem is that COPY uses a single BulkInsertState for multiple
> partitions. Which to me seems to run counter to the following comment:
>  *  The caller can also provide a BulkInsertState object to optimize many
>  *  insertions into the same relation.  This keeps a pin on the current
>  *  insertion target page (to save pin/unpin cycles) and also passes a
>  *  BULKWRITE buffer selection strategy object to the buffer manager.
>  *  Passing NULL for bistate selects the default behavior.

> The reason this doesn't cause straight up corruption due to reusing a pin from
> another relation is that b1ecb9b3fcfb added ReleaseBulkInsertStatePin() and a
> call to it. But I didn't make ReleaseBulkInsertStatePin() reset the bulk
> insertion state, which is what leads to the errors from the bug report.

> Resetting the relevant BulkInsertState fields fixes the problem. But I'm not
> sure that's the right fix. ISTM that independent of whether we fix this via
> ReleaseBulkInsertStatePin() resetting the fields or via not reusing
> BulkInsertState, we should add assertions defending against future issues like
> this (e.g. by adding a relation field to BulkInsertState in cassert builds,
> and asserting that the relation is the same as in prior calls unless
> ReleaseBulkInsertStatePin() has been called).

Ping?  We really ought to have a fix for this committed in time for
16.1.

regards, tom lane




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-09-25 Thread Andres Freund
Hi,

On 2023-09-25 12:48:30 -0700, Andres Freund wrote:
> On 2023-09-25 15:42:26 -0400, Tom Lane wrote:
> > I just did a git bisect run to discover when the failure documented
> > in bug #18130 [1] started.  And the answer is commit 82a4edabd.
> > Now, it's pretty obvious that that commit didn't in itself cause
> > problems like this:
> > 
> > postgres=# \copy test from 'bug18130.csv' csv
> > ERROR:  could not read block 5 in file "base/5/17005": read only 0 of 8192 bytes
> > CONTEXT:  COPY test, line 472:
> > "0,185647715,222655,489637,2,2023-07-31,9100.000,302110385,2023-07-30
> > 14:16:36.750981+00,14026347..."
> 
> Ugh.
> 
> 
> > IMO there must be some very nasty bug lurking in the new
> > multiple-block extension logic, that happens to be exposed by this
> > test case with 82a4edabd's adjustments to the when-to-extend choices
> > but wasn't before that.
> 
> > To save other people the trouble of extracting the in-line data
> > in the bug submission, I've attached the test files I was using.
> 
> Thanks, looking at this now.

(had to switch locations in between)

Uh, huh.  The problem is that COPY uses a single BulkInsertState for multiple
partitions. Which to me seems to run counter to the following comment:
 *  The caller can also provide a BulkInsertState object to optimize many
 *  insertions into the same relation.  This keeps a pin on the current
 *  insertion target page (to save pin/unpin cycles) and also passes a
 *  BULKWRITE buffer selection strategy object to the buffer manager.
 *  Passing NULL for bistate selects the default behavior.

The reason this doesn't cause straight up corruption due to reusing a pin from
another relation is that b1ecb9b3fcfb added ReleaseBulkInsertStatePin() and a
call to it. But I didn't make ReleaseBulkInsertStatePin() reset the bulk
insertion state, which is what leads to the errors from the bug report.

Resetting the relevant BulkInsertState fields fixes the problem. But I'm not
sure that's the right fix. ISTM that independent of whether we fix this via
ReleaseBulkInsertStatePin() resetting the fields or via not reusing
BulkInsertState, we should add assertions defending against future issues like
this (e.g. by adding a relation field to BulkInsertState in cassert builds,
and asserting that the relation is the same as in prior calls unless
ReleaseBulkInsertStatePin() has been called).

Greetings,

Andres Freund




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-09-25 Thread Andres Freund
Hi,

On 2023-09-25 15:42:26 -0400, Tom Lane wrote:
> I just did a git bisect run to discover when the failure documented
> in bug #18130 [1] started.  And the answer is commit 82a4edabd.
> Now, it's pretty obvious that that commit didn't in itself cause
> problems like this:
> 
> postgres=# \copy test from 'bug18130.csv' csv
> ERROR:  could not read block 5 in file "base/5/17005": read only 0 of 8192 bytes
> CONTEXT:  COPY test, line 472: 
> "0,185647715,222655,489637,2,2023-07-31,9100.000,302110385,2023-07-30 
> 14:16:36.750981+00,14026347..."

Ugh.


> IMO there must be some very nasty bug lurking in the new
> multiple-block extension logic, that happens to be exposed by this
> test case with 82a4edabd's adjustments to the when-to-extend choices
> but wasn't before that.

> To save other people the trouble of extracting the in-line data
> in the bug submission, I've attached the test files I was using.

Thanks, looking at this now.


> The DDL is simplified slightly from what was submitted.  I'm not
> entirely sure why a no-op trigger is needed to provoke the bug...

A trigger might prevent using the multi-insert API, which would lead to
different execution paths...
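
For illustration, the shape of the check that rules out the multi-insert path
is roughly the following sketch (paraphrased, not the exact copyfrom.c code;
the helper name is made up):

/*
 * Hypothetical sketch: row-level INSERT triggers (among other reasons) make
 * batching via table_multi_insert() unsafe, so COPY falls back to inserting
 * one row at a time - and that path is where the problem shows up.
 */
static bool
CanUseMultiInsert(ResultRelInfo *resultRelInfo)
{
	TriggerDesc *trigdesc = resultRelInfo->ri_TrigDesc;

	if (trigdesc != NULL &&
		(trigdesc->trig_insert_before_row ||
		 trigdesc->trig_insert_instead_row))
		return false;

	return true;
}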


Greetings,

Andres Freund




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-09-25 Thread Tom Lane
Andres Freund  writes:
> On 2023-09-06 18:01:53 -0400, Tom Lane wrote:
>> It turns out that this patch is what's making buildfarm member
>> chipmunk fail in contrib/pg_visibility [1].  That's easily reproduced
>> by running the test with shared_buffers = 10MB.  I didn't dig further
>> than the "git bisect" result:

> At first I was a bit confounded by not being able to reproduce this. My test
> environment had max_connections=110 for some other reason - and the problem
> doesn't reproduce with that setting...

I just did a git bisect run to discover when the failure documented
in bug #18130 [1] started.  And the answer is commit 82a4edabd.
Now, it's pretty obvious that that commit didn't in itself cause
problems like this:

postgres=# \copy test from 'bug18130.csv' csv
ERROR:  could not read block 5 in file "base/5/17005": read only 0 of 8192 bytes
CONTEXT:  COPY test, line 472: 
"0,185647715,222655,489637,2,2023-07-31,9100.000,302110385,2023-07-30 
14:16:36.750981+00,14026347..."

IMO there must be some very nasty bug lurking in the new
multiple-block extension logic, that happens to be exposed by this
test case with 82a4edabd's adjustments to the when-to-extend choices
but wasn't before that.

To save other people the trouble of extracting the in-line data
in the bug submission, I've attached the test files I was using.
The DDL is simplified slightly from what was submitted.  I'm not
entirely sure why a no-op trigger is needed to provoke the bug...

regards, tom lane

[1] 
https://www.postgresql.org/message-id/18130-7a86a7356a75209d%40postgresql.org

-- run this, then \copy test from 'bug18130.csv' csv

create table if not exists test (
a smallint,
b bigint,
c bigint,
d bigint,
e smallint,
plan_date date,
g numeric(18,7),
h bigint,
i timestamptz,
j bigint,
k integer,
l smallint,
primary key (a, b, c, d, e, plan_date)
) partition by list(a);

/* this is just to generate partition structure automatically */
create or replace function test_add_partitions(a integer, year integer)
returns void
volatile strict
as
$$
declare
parent text;
root text;
begin
root := 'test_' || a::text;
execute 'create table if not exists ' || root || ' partition of test for
values in (' || a::text || ') partition by hash (b);';
for b in 0..7 loop
parent := root || '_' || b::text;

execute 'create table if not exists ' || parent ||
' partition of ' || root || ' for values with (modulus
8, remainder ' || b::text || ')' ||
' partition by range (plan_date);';

execute 'create table if not exists ' ||
parent || '_y' || year::text || 'm1 partition of ' ||
parent ||
' for values from (''' || year::text || '-01-01'') to
(''' || year::text ||
'-02-01'');';
execute 'create table if not exists ' ||
parent || '_y' || year::text || 'm2 partition of ' ||
parent ||
' for values from (''' || year::text || '-02-01'') to
(''' || year::text ||
'-03-01'');';
execute 'create table if not exists ' ||
parent || '_y' || year::text || 'm3 partition of ' ||
parent ||
' for values from (''' || year::text || '-03-01'') to
(''' || year::text ||
'-04-01'');';
execute 'create table if not exists ' ||
parent || '_y' || year::text || 'm4 partition of ' ||
parent ||
' for values from (''' || year::text || '-04-01'') to
(''' || year::text ||
'-05-01'');';
execute 'create table if not exists ' ||
parent || '_y' || year::text || 'm5 partition of ' ||
parent ||
' for values from (''' || year::text || '-05-01'') to
(''' || year::text ||
'-06-01'');';
execute 'create table if not exists ' ||
parent || '_y' || year::text || 'm6 partition of ' ||
parent ||
' for values from (''' || year::text || '-06-01'') to
(''' || year::text ||
'-07-01'');';
execute 'create table if not exists ' ||
parent || '_y' || year::text || 'm7 partition of ' ||
parent ||
' for values from (''' || year::text || '-07-01'') to
(''' || year::text ||
'-08-01'');';
execute 'create table if not exists ' ||
parent || '_y' || year::text || 'm8 partition of ' ||
parent ||
' for values from (''' || year::text || '-08-01'') to
(''' || year::text ||
'-09-01'');';
execute 'create table if not exists ' ||
parent || '_y' || year::text || 'm9 partition of ' ||
parent ||
' for values from (''' || year::text || '-09-01'') to
(''' 

Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-09-15 Thread Andres Freund
Hi,

On 2023-09-06 18:01:53 -0400, Tom Lane wrote:
> It turns out that this patch is what's making buildfarm member
> chipmunk fail in contrib/pg_visibility [1].  That's easily reproduced
> by running the test with shared_buffers = 10MB.  I didn't dig further
> than the "git bisect" result:

At first I was a bit confounded by not being able to reproduce this. My test
environment had max_connections=110 for some other reason - and the problem
doesn't reproduce with that setting...


> $ git bisect bad
> 82a4edabd272f70d044faec8cf7fd1eab92d9991 is the first bad commit
> commit 82a4edabd272f70d044faec8cf7fd1eab92d9991
> Author: Andres Freund 
> Date:   Mon Aug 14 09:54:03 2023 -0700
>
> hio: Take number of prior relation extensions into account
>
> but I imagine the problem is that the patch's more aggressive
> relation-extension heuristic is causing the table to have more
> pages than the test case expects.  Is it really a good idea
> for such a heuristic to depend on shared_buffers?

The heuristic doesn't directly depend on shared buffers. However, the amount
we extend by is limited by needing to pin shared buffers covering all the
newly extended buffers.

That's what ends up limiting things here - shared_buffers = 10MB and
max_connections = 10 doesn't allow for a lot of buffers to be pinned
concurrently in each backend.  Although perhaps LimitAdditionalPins() is a bit
too conservative, due to not checking the private refcount array and just
assuming REFCOUNT_ARRAY_ENTRIES.


> If you don't want to change the heuristic then we'll have to find a way to
> tweak this test to avoid it.

We could tweak LimitAdditionalPins() by checking PrivateRefCountArray instead
of assuming the worst-case REFCOUNT_ARRAY_ENTRIES.
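
A rough sketch of that idea, assuming it lives in bufmgr.c next to
PrivateRefCountArray and PrivateRefCountOverflowed (the function name is made
up and this is untested):

static uint32
CountLocalPinnedBuffers(void)
{
	uint32		pinned = PrivateRefCountOverflowed;

	/* count only the fast-path refcount entries that are actually in use */
	for (int i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
	{
		if (PrivateRefCountArray[i].buffer != InvalidBuffer)
			pinned++;
	}

	return pinned;
}

LimitAdditionalPins() could then subtract that count instead of the constant
REFCOUNT_ARRAY_ENTRIES.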

However, it seems that the logic in the test is pretty fragile independent of
this issue? Different alignment, block size or an optimization of the page
layout could also break the test?

Unfortunately a query that doesn't falsely alert in this case is a bit ugly,
due to needing to deal with the corner case of an empty page at the end:

select *
from pg_visibility_map('copyfreeze')
where
  (not all_visible or not all_frozen)
  -- deal with trailing empty pages due to potentially bulk-extending too aggressively
  and exists(SELECT * FROM copyfreeze WHERE ctid >= ('('||blkno||', 0)')::tid)
;

Before 82a4edabd27 this situation was rare - you'd have needed contended
extensions. But after it has become more common. I worry that that might cause
other issues :(. OTOH, I think we'll need to extend way more aggressively at
some point...

Greetings,

Andres Freund




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-09-06 Thread Tom Lane
Andres Freund  writes:
> On 2023-08-16 13:15:46 +0200, Alvaro Herrera wrote:
>> Since the wins from this patch were replicated and it has been pushed, I
>> understand that this open item can be marked as closed, so I've done
>> that.

> Thanks!

It turns out that this patch is what's making buildfarm member
chipmunk fail in contrib/pg_visibility [1].  That's easily reproduced
by running the test with shared_buffers = 10MB.  I didn't dig further
than the "git bisect" result:

$ git bisect bad
82a4edabd272f70d044faec8cf7fd1eab92d9991 is the first bad commit
commit 82a4edabd272f70d044faec8cf7fd1eab92d9991
Author: Andres Freund 
Date:   Mon Aug 14 09:54:03 2023 -0700

hio: Take number of prior relation extensions into account

but I imagine the problem is that the patch's more aggressive
relation-extension heuristic is causing the table to have more
pages than the test case expects.  Is it really a good idea
for such a heuristic to depend on shared_buffers?  If you don't
want to change the heuristic then we'll have to find a way to
tweak this test to avoid it.

regards, tom lane

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2023-09-06%2014%3A14%3A51




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-08-16 Thread Andres Freund
On 2023-08-16 13:15:46 +0200, Alvaro Herrera wrote:
> Since the wins from this patch were replicated and it has been pushed, I
> understand that this open item can be marked as closed, so I've done
> that.

Thanks!




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-08-16 Thread Alvaro Herrera
Hello,

On 2023-Aug-12, Andres Freund wrote:

> On 2023-08-08 12:45:05 +0900, Masahiko Sawada wrote:

> > > Any chance you could run your benchmark? I don't see as much of a
> > > regression vs 16 as you...
> > 
> > Sure. The results are promising for me too:
> >
> >  nclients = 1, execution time = 13.743
> >  nclients = 2, execution time = 7.552
> >  nclients = 4, execution time = 4.758
> >  nclients = 8, execution time = 3.035
> >  nclients = 16, execution time = 2.172
> >  nclients = 32, execution time = 1.959
> > nclients = 64, execution time = 1.819
> > nclients = 128, execution time = 1.583
> > nclients = 256, execution time = 1.631
> 
> Nice. We are consistently better than both 15 and "post integer parsing 16".

Since the wins from this patch were replicated and it has been pushed, I
understand that this open item can be marked as closed, so I've done
that.

Thanks,

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"There are two moments in a man's life in which he should not speculate:
when he can afford it and when he cannot" (Mark Twain)




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-08-12 Thread Andres Freund
Hi,

On 2023-08-08 12:45:05 +0900, Masahiko Sawada wrote:
> > I think there could be a quite simple fix: Track by how much we've extended
> > the relation previously in the same bistate. If we already extended by many
> > blocks, it's very likely that we'll do so further.
> >
> > A simple prototype patch attached. The results for me are promising. I
> > copied a smaller file [1], to have more accurate throughput results at
> > shorter runs (15s).
> 
> Thank you for the patch!

Attached is a mildly updated version of that patch. No functional changes,
just polished comments and added a commit message.


> > Any chance you could run your benchmark? I don't see as much of a regression
> > vs 16 as you...
> 
> Sure. The results are promising for me too:
>
>  nclients = 1, execution time = 13.743
>  nclients = 2, execution time = 7.552
>  nclients = 4, execution time = 4.758
>  nclients = 8, execution time = 3.035
>  nclients = 16, execution time = 2.172
>  nclients = 32, execution time = 1.959
> nclients = 64, execution time = 1.819
> nclients = 128, execution time = 1.583
> nclients = 256, execution time = 1.631

Nice. We are consistently better than both 15 and "post integer parsing 16".


I'm really a bit baffled at myself for not using this approach from the get go.

This also would make it much more beneficial to use a BulkInsertState in
nodeModifyTable.c, even without converting to table_multi_insert().


I was tempted to optimize RelationAddBlocks() a bit, by not calling
RelationExtensionLockWaiterCount() if we are already extending by
MAX_BUFFERS_TO_EXTEND_BY. Before this patch, it was pretty much impossible to
reach that case, because of the MAX_BUFFERED_* limits in copyfrom.c, but now
it's more common. But that probably ought to be done only on HEAD, not 16.
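
Roughly what I have in mind, as a sketch rather than a tested patch:

		/*
		 * Sketch: only pay for the lock-waiter lookup if it could still
		 * increase extend_by_pages; when we already plan to extend by
		 * MAX_BUFFERS_TO_EXTEND_BY, the answer cannot change anything.
		 */
		if (extend_by_pages < MAX_BUFFERS_TO_EXTEND_BY)
		{
			uint32		waitcount = RelationExtensionLockWaiterCount(relation);

			extend_by_pages += extend_by_pages * waitcount;
		}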

A related oddity: RelationExtensionLockWaiterCount()->LockWaiterCount() uses
an exclusive lock on the lock partition - which seems not at all necessary?


Unless somebody sees a reason not to, I'm planning to push this?

Greetings,

Andres Freund
>From 26e7faad65273beefeb7b40a169c1b631e204501 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Sat, 12 Aug 2023 12:31:43 -0700
Subject: [PATCH v2] hio: Take number of prior relation extensions into account

The new relation extension logic, introduced in 00d1e02be24, could lead to
slowdowns in some scenarios. E.g., when loading narrow rows into a table using
COPY, the caller of RelationGetBufferForTuple() will only request a small
number of pages. Without concurrency, we just extended using pwritev() in that
case. However, if there is *some* concurrency, we switched between extending
by a small number of pages and a larger number of pages, depending on the
number of waiters for the relation extension lock.  However, some
filesystems, XFS in particular, do not perform well when switching between
extending files using fallocate() and pwritev().

To avoid that issue, remember the number of prior relation extensions in
BulkInsertState and extend more aggressively if there were prior relation
extensions. That not just avoids the aforementioned slowdown, but also leads
to noticeable performance gains in other situations, primarily due to
extending more aggressively when there is no concurrency. I should have done
it this way from the get go.

Reported-by: Masahiko Sawada 
Author: Andres Freund 
Discussion: https://postgr.es/m/CAD21AoDvDmUQeJtZrau1ovnT_smN940=kp6mszngk3bq9yr...@mail.gmail.com
Backpatch: 16-, where the new relation extension code was added
---
 src/include/access/hio.h | 13 ++---
 src/backend/access/heap/heapam.c |  1 +
 src/backend/access/heap/hio.c| 19 +++
 3 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/src/include/access/hio.h b/src/include/access/hio.h
index 228433ee4a2..9bc563b7628 100644
--- a/src/include/access/hio.h
+++ b/src/include/access/hio.h
@@ -32,15 +32,22 @@ typedef struct BulkInsertStateData
 	Buffer		current_buf;	/* current insertion target page */
 
 	/*
-	 * State for bulk extensions. Further pages that were unused at the time
-	 * of the extension. They might be in use by the time we use them though,
-	 * so rechecks are needed.
+	 * State for bulk extensions.
+	 *
+	 * last_free..next_free are further pages that were unused at the time of
+	 * the last extension. They might be in use by the time we use them
+	 * though, so rechecks are needed.
 	 *
 	 * XXX: Eventually these should probably live in RelationData instead,
 	 * alongside targetblock.
+	 *
+	 * already_extended_by is the number of pages that this bulk insert
+	 * extended by. If we already extended by a significant number of pages,
+	 * we can be more aggressive about extending going forward.
 	 */
 	BlockNumber next_free;
 	BlockNumber last_free;
+	uint32		already_extended_by;
 } BulkInsertStateData;
 
 
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 7ed72abe597..6a66214a580 100644
--- 

Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-08-07 Thread Masahiko Sawada
On Tue, Aug 8, 2023 at 3:10 AM Andres Freund  wrote:
>
> Hi,
>
> On 2023-08-07 23:05:39 +0900, Masahiko Sawada wrote:
> > On Mon, Aug 7, 2023 at 3:16 PM David Rowley  wrote:
> > >
> > > On Wed, 2 Aug 2023 at 13:35, David Rowley  wrote:
> > > > So, it looks like this item can be closed off.  I'll hold off from
> > > > doing that for a few days just in case anyone else wants to give
> > > > feedback or test themselves.
> > >
> > > Alright, closed.
> >
> > IIUC the problem with multiple concurrent COPY is not resolved yet.
>
> Yea - it was just hard to analyze until the other regressions were fixed.
>
>
> > The result of nclients = 1 became better thanks to recent fixes, but
> > there still seems to be the performance regression at nclient = 2~16
> > (on RHEL 8 and 9). Andres reported[1] that after changing
> > MAX_BUFFERED_TUPLES to 5000 the numbers became a lot better but it
> > would not be the solution, as he mentioned.
>
> I think there could be a quite simple fix: Track by how much we've extended
> the relation previously in the same bistate. If we already extended by many
> blocks, it's very likely that we'll do so further.
>
> A simple prototype patch attached. The results for me are promising. I copied
> a smaller file [1], to have more accurate throughput results at shorter runs
> (15s).

Thank you for the patch!

>
> HEAD before:
> clients  tps
> 1 41
> 2 76
> 4136
> 8248
> 16   360
> 32   375
> 64   317
>
>
> HEAD after:
> clients  tps
> 1 43
> 2 80
> 4155
> 8280
> 16   369
> 32   405
> 64   344
>
> Any chance you could run your benchmark? I don't see as much of a regression
> vs 16 as you...

Sure. The results are promising for me too:

 nclients = 1, execution time = 13.743
 nclients = 2, execution time = 7.552
 nclients = 4, execution time = 4.758
 nclients = 8, execution time = 3.035
 nclients = 16, execution time = 2.172
 nclients = 32, execution time = 1.959
nclients = 64, execution time = 1.819
nclients = 128, execution time = 1.583
nclients = 256, execution time = 1.631

Here are results of the same benchmark test you used:

w/o patch:
clients    tps
1   66.702
2   59.696
4   73.783
8   168.115
16  400.134
32  574.098
64  565.373
128 526.303
256 591.751

w/ patch:
clients   tps
1   67.735
2   122.534
4   240.707
8   398.944
16  541.097
32  643.083
64  614.775
128 616.007
256 577.885

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-08-07 Thread Andres Freund
Hi,

On 2023-08-07 23:05:39 +0900, Masahiko Sawada wrote:
> On Mon, Aug 7, 2023 at 3:16 PM David Rowley  wrote:
> >
> > On Wed, 2 Aug 2023 at 13:35, David Rowley  wrote:
> > > So, it looks like this item can be closed off.  I'll hold off from
> > > doing that for a few days just in case anyone else wants to give
> > > feedback or test themselves.
> >
> > Alright, closed.
>
> IIUC the problem with multiple concurrent COPY is not resolved yet.

Yea - it was just hard to analyze until the other regressions were fixed.


> The result of nclients = 1 became better thanks to recent fixes, but
> there still seems to be the performance regression at nclient = 2~16
> (on RHEL 8 and 9). Andres reported[1] that after changing
> MAX_BUFFERED_TUPLES to 5000 the numbers became a lot better but it
> would not be the solution, as he mentioned.

I think there could be a quite simple fix: Track by how much we've extended
the relation previously in the same bistate. If we already extended by many
blocks, it's very likely that we'll do so further.

A simple prototype patch attached. The results for me are promising. I copied
a smaller file [1], to have more accurate throughput results at shorter runs
(15s).

HEAD before:
clients  tps
1 41
2 76
4136
8248
16   360
32   375
64   317


HEAD after:
clients  tps
1 43
2 80
4155
8280
16   369
32   405
64   344

Any chance you could run your benchmark? I don't see as much of a regression vs
16 as you...

Greetings,

Andres Freund

[1] COPY (SELECT generate_series(1, 10)) TO '/tmp/data.copy';
diff --git i/src/include/access/hio.h w/src/include/access/hio.h
index 228433ee4a2..5ae39aec7f8 100644
--- i/src/include/access/hio.h
+++ w/src/include/access/hio.h
@@ -41,6 +41,7 @@ typedef struct BulkInsertStateData
 	 */
 	BlockNumber next_free;
 	BlockNumber last_free;
+	uint32		already_extended_by;
 } BulkInsertStateData;
 
 
diff --git i/src/backend/access/heap/heapam.c w/src/backend/access/heap/heapam.c
index 7ed72abe597..6a66214a580 100644
--- i/src/backend/access/heap/heapam.c
+++ w/src/backend/access/heap/heapam.c
@@ -1776,6 +1776,7 @@ GetBulkInsertState(void)
 	bistate->current_buf = InvalidBuffer;
 	bistate->next_free = InvalidBlockNumber;
 	bistate->last_free = InvalidBlockNumber;
+	bistate->already_extended_by = 0;
 	return bistate;
 }
 
diff --git i/src/backend/access/heap/hio.c w/src/backend/access/heap/hio.c
index c275b08494d..a2be4273df1 100644
--- i/src/backend/access/heap/hio.c
+++ w/src/backend/access/heap/hio.c
@@ -283,6 +283,13 @@ RelationAddBlocks(Relation relation, BulkInsertState bistate,
 		 */
 		extend_by_pages += extend_by_pages * waitcount;
 
+		/*
+		 * If we previously extended using the same bistate, it's very likely
+		 * we'll extend some more. Try to extend by as many pages.
+		 */
+		if (bistate)
+			extend_by_pages = Max(extend_by_pages, bistate->already_extended_by);
+
 		/*
 		 * Can't extend by more than MAX_BUFFERS_TO_EXTEND_BY, we need to pin
 		 * them all concurrently.
@@ -409,6 +416,7 @@ RelationAddBlocks(Relation relation, BulkInsertState bistate,
 		/* maintain bistate->current_buf */
 		IncrBufferRefCount(buffer);
 		bistate->current_buf = buffer;
+		bistate->already_extended_by += extend_by_pages;
 	}
 
 	return buffer;


Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-08-07 Thread Masahiko Sawada
On Mon, Aug 7, 2023 at 3:16 PM David Rowley  wrote:
>
> On Wed, 2 Aug 2023 at 13:35, David Rowley  wrote:
> > So, it looks like this item can be closed off.  I'll hold off from
> > doing that for a few days just in case anyone else wants to give
> > feedback or test themselves.
>
> Alright, closed.

IIUC the problem with multiple concurrent COPY is not resolved yet.
I've run the same benchmark that I used for the first report:

* PG15 (cb2ae5741f)
 nclients = 1, execution time = 15.213
 nclients = 2, execution time = 9.470
 nclients = 4, execution time = 6.508
 nclients = 8, execution time = 4.526
 nclients = 16, execution time = 3.788
 nclients = 32, execution time = 3.837
 nclients = 64, execution time = 4.286
 nclients = 128, execution time = 4.388
 nclients = 256, execution time = 4.333

* PG16 (67a007dc0c)
 nclients = 1, execution time = 14.494
 nclients = 2, execution time = 12.962
 nclients = 4, execution time = 17.757
 nclients = 8, execution time = 10.865
 nclients = 16, execution time = 7.371
 nclients = 32, execution time = 4.929
 nclients = 64, execution time = 2.212
 nclients = 128, execution time = 2.020
 nclients = 256, execution time = 2.196

The result of nclients = 1 became better thanks to recent fixes, but
there still seems to be the performance regression at nclient = 2~16
(on RHEL 8 and 9). Andres reported[1] that after changing
MAX_BUFFERED_TUPLES to 5000 the numbers became a lot better but it
would not be the solution, as he mentioned.

Regards,

[1] 
https://www.postgresql.org/message-id/20230711185159.v2j5vnyrtodnwhgz%40awork3.anarazel.de

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-08-07 Thread David Rowley
On Wed, 2 Aug 2023 at 13:35, David Rowley  wrote:
> So, it looks like this item can be closed off.  I'll hold off from
> doing that for a few days just in case anyone else wants to give
> feedback or test themselves.

Alright, closed.

David




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-08-01 Thread David Rowley
On Wed, 2 Aug 2023 at 12:25, David Rowley  wrote:
> master @ 3845577cb
> latency average = 1575.879 ms
>
>6.79%  postgres  [.] pg_strtoint32_safe
>
> master~1
> latency average = 1968.004 ms
>
>   14.28%  postgres  [.] pg_strtoint32_safe
>
> REL_16_STABLE
> latency average = 1735.163 ms
>
>6.04%  postgres  [.] pg_strtoint32_safe
>
> REL_16_STABLE~1
> latency average = 2188.186 ms
>
>   13.83%  postgres  [.] pg_strtoint32_safe

And just to complete that set, here's the REL_15_STABLE performance
using the same test:

latency average = 1829.108 ms

15.46%  postgres  [.] pg_strtoint32

So, it looks like this item can be closed off.  I'll hold off from
doing that for a few days just in case anyone else wants to give
feedback or test themselves.

David




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-08-01 Thread David Rowley
On Wed, 2 Aug 2023 at 07:38, Dean Rasheed  wrote:
> Running the new test on slightly older Intel hardware (i9-9900K, gcc
> 11.3), I get the following:

Thanks for running those tests. I've now pushed the fastpath4.patch
after making a few adjustments to the header comments to mention the
new stuff that was added in v16.

> I don't know why HEAD and v16 aren't consistent, but it's seems to be
> quite reproducible, even though the numutils source is the same in
> both branches, and using gdb to dump the disassembly for
> pg_strtoint32_safe() shows that it's also the same.

I also see it's inconsistent, but the other way around. Here are some
fresh tests with master and REL_16_STABLE with the committed code and
the version directly prior to the commit:

master @ 3845577cb
latency average = 1575.879 ms

   6.79%  postgres  [.] pg_strtoint32_safe

master~1
latency average = 1968.004 ms

  14.28%  postgres  [.] pg_strtoint32_safe

REL_16_STABLE
latency average = 1735.163 ms

   6.04%  postgres  [.] pg_strtoint32_safe

REL_16_STABLE~1
latency average = 2188.186 ms

  13.83%  postgres  [.] pg_strtoint32_safe

David




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-08-01 Thread Dean Rasheed
On Tue, 1 Aug 2023 at 15:01, David Rowley  wrote:
>
> Here's a patch with an else condition when the first digit check fails.
>
> master + fastpath4.patch:
> latency average = 1579.576 ms
> latency average = 1572.716 ms
> latency average = 1563.398 ms
>
> (appears slightly faster than fastpath3.patch)
>

Running the new test on slightly older Intel hardware (i9-9900K, gcc
11.3), I get the following:

REL_15_STABLE
latency average = 1687.695 ms
latency stddev = 3.068 ms

REL_16_STABLE
latency average = 1931.756 ms
latency stddev = 2.065 ms

REL_16_STABLE + pg_strtoint_fastpath1.patch
latency average = 1635.731 ms
latency stddev = 3.074 ms

REL_16_STABLE + pg_strtoint_fastpath2.patch
latency average = 1687.303 ms
latency stddev = 4.243 ms

REL_16_STABLE + pg_strtoint_fastpath3.patch
latency average = 1610.307 ms
latency stddev = 2.193 ms

REL_16_STABLE + pg_strtoint_fastpath4.patch
latency average = 1577.604 ms
latency stddev = 4.060 ms

HEAD
latency average = 1868.737 ms
latency stddev = 6.114 ms

HEAD + pg_strtoint_fastpath1.patch
latency average = 1683.215 ms
latency stddev = 2.322 ms

HEAD + pg_strtoint_fastpath2.patch
latency average = 1650.014 ms
latency stddev = 3.920 ms

HEAD + pg_strtoint_fastpath3.patch
latency average = 1670.337 ms
latency stddev = 5.074 ms

HEAD + pg_strtoint_fastpath4.patch
latency average = 1653.568 ms
latency stddev = 8.224 ms

I don't know why HEAD and v16 aren't consistent, but it's seems to be
quite reproducible, even though the numutils source is the same in
both branches, and using gdb to dump the disassembly for
pg_strtoint32_safe() shows that it's also the same.

Anyway, insofar as these results can be trusted, fastpath4.patch looks good.

Regards,
Dean




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-08-01 Thread David Rowley
On Wed, 2 Aug 2023 at 01:26, Dean Rasheed  wrote:
>
> On Tue, 1 Aug 2023 at 13:55, David Rowley  wrote:
> >
> > I tried adding the "at least 1 digit check" by adding an else { goto
> > slow; } in the above code, but it seems to generate slower code than
> > just checking if (unlikely(ptr == s)) { goto slow; } after the loop.
> >
>
> That check isn't quite right, because "ptr" will not equal "s" if
> there is a sign character, so it won't detect an input with no digits
> in that case.

Ah, yeah. Thanks.

Here's a patch with an else condition when the first digit check fails.

master + fastpath4.patch:
latency average = 1579.576 ms
latency average = 1572.716 ms
latency average = 1563.398 ms

(appears slightly faster than fastpath3.patch)

David


pg_strtoint_fastpath4.patch
Description: Binary data


Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-08-01 Thread Dean Rasheed
On Tue, 1 Aug 2023 at 13:55, David Rowley  wrote:
>
> I tried adding the "at least 1 digit check" by adding an else { goto
> slow; } in the above code, but it seems to generate slower code than
> just checking if (unlikely(ptr == s)) { goto slow; } after the loop.
>

That check isn't quite right, because "ptr" will not equal "s" if
there is a sign character, so it won't detect an input with no digits
in that case.
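
To spell that out, here is a tiny self-contained illustration (not PostgreSQL
code) of why the comparison needs to be against the start of the digits rather
than against "s":

#include <stdbool.h>

/* Returns true if the input has at least one digit after an optional sign. */
static bool
has_digits(const char *s)
{
	const char *ptr = s;
	const char *firstdigit;

	if (*ptr == '+' || *ptr == '-')
		ptr++;
	firstdigit = ptr;			/* digits, if any, must start here */

	while (*ptr >= '0' && *ptr <= '9')
		ptr++;

	/* "ptr != s" would wrongly accept "-" or "+"; this comparison does not */
	return ptr != firstdigit;
}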

Regards,
Dean




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-08-01 Thread David Rowley
On Tue, 1 Aug 2023 at 13:25, Andres Freund  wrote:
> There's a lot of larger numbers in the file, which likely reduces the impact
> some. And there's the overhead of actually inserting the rows into the table,
> making the difference appear smaller than it is.

It might be worth special casing the first digit as we can save doing
the multiplication by 10 and the overflow checks on the first digit.
That should make it slightly faster to parse smaller numbers.

> COPY (SELECT generate_series(1, 1000) a, 10 b, 20 c, 30 d, 40 e, 50 f FROM 
> generate_series(1, 1)) TO '/tmp/lotsaints_wide.copy';
>
> psql -c 'DROP TABLE IF EXISTS lotsaints_wide; CREATE UNLOGGED TABLE 
> lotsaints_wide(a int, b int, c int, d int, e int, f int);' && \
>   pgbench -n -P1 -f <( echo "COPY lotsaints_wide FROM 
> '/tmp/lotsaints_wide.copy' WHERE false") -t 5
>
> 15: 2992.605
> HEAD:   3325.201
> fastpath1.patch 2932.606
> fastpath2.patch 2783.915
>
> Interestingly fastpath1 is slower now, even though it wasn't with earlier
> patches (which still is repeatable). I do not have the foggiest as to why.

I'm glad to see that.

I've adjusted the patch to add the fast path for the 16 and 64-bit
versions of the function.  I also added the special case for
processing the first digit, which looks like:

/* process the first digit */
digit = (*ptr - '0');

if (likely(digit < 10))
{
ptr++;
tmp = digit;
}

/* process remaining digits */
for (;;)

I tried adding the "at least 1 digit check" by adding an else { goto
slow; } in the above code, but it seems to generate slower code than
just checking if (unlikely(ptr == s)) { goto slow; } after the loop.

I also noticed that I wasn't getting the same performance after
adjusting the 16 and 64 bit versions. I assume that's down to code
alignment, but unsure of that.  I ended up adjusting all the "while
(*ptr)" loops into "for (;;)" loops
since the NUL char check is handled by the "else break;".  I also
removed the needless NUL char check in the isspace loops. It can't be
isspace and '\0'. I also removed the isdigit() function call and
replaced it with a manual check of the digit range.  I see my compiler
(gcc12.2) effectively generates the same code as the unsigned char
fast path version checking if (digit < 10). Once I did that, I got the
performance back again.

With your new test with the small-sized ints, I get:

REL_15_STABLE:
latency average = 1696.390 ms

master @ d3a38318a
latency average = 1928.803 ms

master + fastpath1.patch:
latency average = 1634.736 ms

master + fastpath2.patch:
latency average = 1628.704 ms

master + fastpath3.patch
latency average = 1590.417 ms

I see no particular reason not to go ahead with the attached patch and
get this thread closed off. Any objections?

David


pg_strtoint_fastpath3.patch
Description: Binary data


Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-08-01 Thread David Rowley
On Mon, 31 Jul 2023 at 21:39, John Naylor  wrote:
> master + pg_strtoint_fastpath1.patch
> latency average = 938.146 ms
> latency stddev = 9.354 ms
>
> master + pg_strtoint_fastpath2.patch
> latency average = 902.808 ms
> latency stddev = 3.957 ms

Thanks for checking those two on your machine. I'm glad to see
fastpath2 faster on yours too.

David




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-31 Thread Andres Freund
Hi,

On 2023-07-27 20:53:16 +1200, David Rowley wrote:
> To summarise, REL_15_STABLE can run this benchmark in 526.014 ms on my
> AMD 3990x machine.  Today's REL_16_STABLE takes 530.344 ms. We're
> talking about another patch to speed up the pg_strtoint functions
> which gets this down to 483.790 ms. Do we need to do this for v16 or
> can we just leave this as it is already?  The slowdown does not seem
> to be much above what we'd ordinarily classify as noise using this
> test on my machine.

I think we need to do something for 16 - it appears on recent-ish AMD the
regression is quite a bit smaller than on intel.  You see something < 1%, I
see more like 4%.  I think there are also other cases where the slowdown is more
substantial.

Besides intel vs amd, it also looks like the gcc version might make a
difference. The code generated by 13 is noticeably slower than 12 for me...

> Benchmark setup:
> 
> COPY (SELECT generate_series(1, 200) a, (random() * 10 -
> 5)::int b, 3243423 c) TO '/tmp/lotsaints.copy';
> DROP TABLE lotsaints; CREATE UNLOGGED TABLE lotsaints(a int, b int, c int);

There's a lot of larger numbers in the file, which likely reduces the impact
some. And there's the overhead of actually inserting the rows into the table,
making the difference appear smaller than it is.

If I avoid the actual insert into the table and use more columns, I see about a
10% regression here.

COPY (SELECT generate_series(1, 1000) a, 10 b, 20 c, 30 d, 40 e, 50 f FROM 
generate_series(1, 1)) TO '/tmp/lotsaints_wide.copy';

psql -c 'DROP TABLE IF EXISTS lotsaints_wide; CREATE UNLOGGED TABLE 
lotsaints_wide(a int, b int, c int, d int, e int, f int);' && \
  pgbench -n -P1 -f <( echo "COPY lotsaints_wide FROM 
'/tmp/lotsaints_wide.copy' WHERE false") -t 5

15: 2992.605
HEAD:   3325.201
fastpath1.patch 2932.606
fastpath2.patch 2783.915


Interestingly fastpath1 is slower now, even though it wasn't with earlier
patches (which still is repeatable). I do not have the foggiest as to why.

Greetings,

Andres Freund




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-31 Thread John Naylor
On Thu, Jul 27, 2023 at 7:17 AM David Rowley  wrote:
>
> It would be really good if someone with another a newish intel CPU
> could test this too.

I ran the lotsaints test from the last email on an i7-10750H (~3 years old) and
got these results (gcc 13.1, turbo off):

REL_15_STABLE:
latency average = 956.453 ms
latency stddev = 4.854 ms

REL_16_STABLE @ 695f5deb7902 (28-JUL-2023):
latency average = 999.354 ms
latency stddev = 3.611 ms

master @ 39055cb4cc (31-JUL-2023):
latency average = 995.310 ms
latency stddev = 5.176 ms

master + revert c1308ce2d (the replace-palloc0 fix)
latency average = 1080.909 ms
latency stddev = 8.707 ms

master + pg_strtoint_fastpath1.patch
latency average = 938.146 ms
latency stddev = 9.354 ms

master + pg_strtoint_fastpath2.patch
latency average = 902.808 ms
latency stddev = 3.957 ms

For me, PG16 seems to regress from PG15, and the second patch seems faster
than the first.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-27 Thread David Rowley
On Thu, 27 Jul 2023 at 14:51, David Rowley  wrote:
> Just to keep this moving and to make it easier for people to test the
> pg_strtoint patches, I've pushed the fix_COPY_DEFAULT.patch patch.
> The only thing I changed was to move the line that was allocating the
> array to a location more aligned with the order that the fields are
> defined in the struct.

I just did another round of benchmarking to see where we're at after
fix_COPY_DEFAULT.patch has been committed.

Below I've benchmarked REL_15_STABLE up to REL_16_STABLE with some
hand-selected key commits, many of which have been mentioned on this
thread already because they seem to affect the performance of COPY.

To summarise, REL_15_STABLE can run this benchmark in 526.014 ms on my
AMD 3990x machine.  Today's REL_16_STABLE takes 530.344 ms. We're
talking about another patch to speed up the pg_strtoint functions
which gets this down to 483.790 ms. Do we need to do this for v16 or
can we just leave this as it is already?  The slowdown does not seem
to be much above what we'd ordinarily classify as noise using this
test on my machine.

Benchmark setup:

COPY (SELECT generate_series(1, 200) a, (random() * 10 -
5)::int b, 3243423 c) TO '/tmp/lotsaints.copy';
DROP TABLE lotsaints; CREATE UNLOGGED TABLE lotsaints(a int, b int, c int);

Benchmark:
psql -qX -c 'truncate lotsaints' && pgbench -n -P1 -f <( echo "COPY
lotsaints FROM '/tmp/lotsaints.copy';") -t 15

2864eb977 REL_15_STABLE
latency average = 526.014 ms

8.84%  postgres  [.] pg_strtoint32

29452de73 Sat Dec  3 10:50:39 2022 -0500 The commit before "Improve
performance of pg_strtointNN functions"
latency average = 508.453 ms

10.21%  postgres  [.] pg_strtoint32

6b423ec67 Sun Dec  4 16:18:18 2022 +1300 Improve performance of
pg_strtointNN functions
latency average = 492.943 ms

7.73%  postgres  [.] pg_strtoint32

1939d2628 Fri Dec  9 10:08:44 2022 -0500 The commit before "Convert a
few datatype input functions to use "soft" error reporting."
latency average = 485.016 ms

8.43%  postgres  [.] pg_strtoint32

ccff2d20e Fri Dec  9 10:14:53 2022 -0500 Convert a few datatype input
functions to use "soft" error reporting.
latency average = 501.325 ms

6.90%  postgres  [.] pg_strtoint32_safe

60684dd83 Tue Dec 13 17:33:28 2022 -0800 The commit before
"Non-decimal integer literals"
latency average = 500.889 ms

8.27%  postgres  [.] pg_strtoint32_safe

6fcda9aba Wed Dec 14 05:40:38 2022 +0100 Non-decimal integer literals
latency average = 521.904 ms

9.02%  postgres  [.] pg_strtoint32_safe

1b6f632a3 Sat Feb 4 07:56:09 2023 +0100 The commit before "Allow
underscores in integer and numeric constants."
latency average = 523.195 ms

9.21%  postgres  [.] pg_strtoint32_safe

faff8f8e4 Sat Feb  4 09:48:51 2023 + Allow underscores in integer
and numeric constants.
latency average = 493.064 ms

10.25%  postgres  [.] pg_strtoint32_safe

9f8377f7a Mon Mar 13 10:01:56 2023 -0400 Add a DEFAULT option to COPY  FROM
latency average = 597.617 ms

  12.91%  postgres  [.] CopyReadLine
  10.62%  postgres  [.] CopyReadAttributesText
  10.51%  postgres  [.] pg_strtoint32_safe
   7.97%  postgres  [.] NextCopyFrom

REL_16_STABLE @ c1308ce2d Thu Jul 27 14:48:44 2023 +1200 Fix
performance problem with new COPY DEFAULT code
latency average = 530.344 ms

  13.51%  postgres  [.] CopyReadLine
   9.62%  postgres  [.] pg_strtoint32_safe
   8.97%  postgres  [.] CopyReadAttributesText
   8.43%  postgres  [.] NextCopyFrom

REL_16_STABLE + pg_strtoint_fastpath1.patch
latency average = 493.136 ms

  13.79%  postgres  [.] CopyReadLine
  11.82%  postgres  [.] CopyReadAttributesText
   7.07%  postgres  [.] NextCopyFrom
   6.81%  postgres  [.] pg_strtoint32_safe

REL_16_STABLE + pg_strtoint_fastpath2.patch
latency average = 483.790 ms

  13.87%  postgres  [.] CopyReadLine
  10.40%  postgres  [.] CopyReadAttributesText
   8.22%  postgres  [.] NextCopyFrom
   5.52%  postgres  [.] pg_strtoint32_safe

David




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-26 Thread David Rowley
On Wed, 26 Jul 2023 at 03:50, Andres Freund  wrote:
> On 2023-07-25 23:37:08 +1200, David Rowley wrote:
> > On Tue, 25 Jul 2023 at 17:34, Andres Freund  wrote:
> > I've not really studied the fix_COPY_DEFAULT.patch patch.  Is there a
> > reason to delay committing that?  It would be good to eliminate that
> > as a variable for the current performance regression.
>
> Yea, I don't think there's a reason to hold off on that. Sawada-san, do you
> want to commit it? Or Andrew?

Just to keep this moving and to make it easier for people to test the
pg_strtoint patches, I've pushed the fix_COPY_DEFAULT.patch patch.
The only thing I changed was to move the line that was allocating the
array to a location more aligned with the order that the fields are
defined in the struct.

David




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-26 Thread David Rowley
> On 2023-07-25 23:37:08 +1200, David Rowley wrote:
> > On Tue, 25 Jul 2023 at 17:34, Andres Freund  wrote:
> > > HEAD:   812.690
> > >
> > > your patch: 821.354
> > >
> > > strtoint from 8692f6644e7:  824.543
> > >
> > > strtoint from 6b423ec677d^: 806.678
> >
> > I'm surprised to see the imul version is faster. It's certainly not
> > what we found when working on 6b423ec67.
>
> What CPUs did you test it on? I'd not be surprised if this were heavily
> dependent on the microarchitecture.

This was on AMD 3990x.

> One idea I had was to add a fastpath that won't parse all strings, but will
> parse the strings that we would generate, and fall back to the more general
> variant if it fails. See the attached, rough, prototype:

There were a couple of problems with fastpath.patch.  You need to
reset the position of ptr at the start of the slow path and also you
were using tmp in the if (neg) part instead of tmp_s in the fast path
section.

I fixed that up and made two versions of the patch, one using the
overflow functions (pg_strtoint_fastpath1.patch) and one testing if
the number is going to overflow (same as current master)
(pg_strtoint_fastpath2.patch)
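
For reference, the difference between the two patches is only in how the
per-digit accumulation step guards against overflow: fastpath1 uses
overflow-reporting arithmetic helpers (PostgreSQL's pg_mul_s32_overflow() /
pg_sub_s32_overflow()), fastpath2 tests the range before doing the arithmetic.
A rough standalone sketch of the two styles, with illustrative names, compiler
builtins standing in for the PostgreSQL helpers, and details (such as how the
value is accumulated) differing from the actual patches:

#include <stdbool.h>
#include <stdint.h>

/* fastpath1 style: let the arithmetic itself report overflow via a flag */
static bool
step_overflow_flag(int32_t *acc, int digit)
{
    int32_t     tmp = *acc;

    if (__builtin_mul_overflow(tmp, 10, &tmp) ||
        __builtin_sub_overflow(tmp, digit, &tmp))
        return false;           /* out of range */
    *acc = tmp;
    return true;
}

/* fastpath2 style: check the range up front, then do the arithmetic */
static bool
step_range_check(int32_t *acc, int digit)
{
    /* assumes negative accumulation, i.e. *acc <= 0 and 0 <= digit <= 9 */
    if (*acc < (INT32_MIN + digit) / 10)
        return false;           /* out of range */
    *acc = *acc * 10 - digit;
    return true;
}

The benchmark numbers below are essentially measuring which of these two
shapes the different CPUs prefer; the surrounding parsing logic is the same.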

AMD 3990x:

master + fix_COPY_DEFAULT.patch:
latency average = 525.226 ms

master + fix_COPY_DEFAULT.patch + pg_strtoint_fastpath1.patch:
latency average = 488.171 ms

master + fix_COPY_DEFAULT.patch + pg_strtoint_fastpath2.patch:
latency average = 481.827 ms


Apple M2 Pro:

master + fix_COPY_DEFAULT.patch:
latency average = 348.433 ms

master + fix_COPY_DEFAULT.patch + pg_strtoint_fastpath1.patch:
latency average = 336.778 ms

master + fix_COPY_DEFAULT.patch + pg_strtoint_fastpath2.patch:
latency average = 335.992 ms

Zen 4 7945HX CPU:

master + fix_COPY_DEFAULT.patch:
latency average = 296.881 ms

master + fix_COPY_DEFAULT.patch + pg_strtoint_fastpath1.patch:
latency average = 287.052 ms

master + fix_COPY_DEFAULT.patch + pg_strtoint_fastpath2.patch:
latency average = 280.742 ms

The M2 chip does not seem to be clearly faster with the fastpath2
method of overflow checking, but both AMD CPUs seem pretty set on
fastpath2 being faster.

It would be really good if someone with a newish Intel CPU
could test this too.

David
diff --git a/src/backend/utils/adt/numutils.c b/src/backend/utils/adt/numutils.c
index 471fbb7ee6..a50c5ccd4a 100644
--- a/src/backend/utils/adt/numutils.c
+++ b/src/backend/utils/adt/numutils.c
@@ -298,9 +298,70 @@ pg_strtoint32_safe(const char *s, Node *escontext)
 {
const char *ptr = s;
const char *firstdigit;
-   uint32  tmp = 0;
+   uint32  tmp;
+   int32   tmp_s = 0;
boolneg = false;
 
+   /*
+* The majority of cases are likely to be base-10 digits without any
+* underscore separator characters.  We'll first try to parse the 
string with
+* the assumption that's the case and only fallback on a slower
+* implementation which handles hex, octal and binary strings and
+* underscores.
+*/
+
+   /* leave it up to the slow path to look for leading spaces */
+
+   if (*ptr == '-')
+   {
+   ptr++;
+   neg = true;
+   }
+
+   /* a leading '+' is uncommon so leave that for the slow path */
+
+   /* process digits */
+   for (;;)
+   {
+   unsigned char   digit = (*ptr - '0');
+
+   /*
+* Exploit unsigned arithmetic to save having to check both the 
upper
+* and lower bounds of the digit
+*/
+   if (digit >= 10)
+   break;
+
+   ptr++;
+
+   if (unlikely(pg_mul_s32_overflow(tmp_s, 10, &tmp_s)) ||
+   unlikely(pg_sub_s32_overflow(tmp_s, digit, &tmp_s)))
+   goto out_of_range;
+   }
+
+   /* we need at least one digit */
+   if (unlikely(ptr == s))
+   goto slow;
+
+   /* when the string does not end in a digit, let the slow path handle it 
*/
+   if (unlikely(*ptr != '\0'))
+   goto slow;
+
+   if (!neg)
+   {
+   /* could fail if input is most negative number */
+   if (unlikely(tmp_s == PG_INT32_MIN))
+   goto out_of_range;
+   tmp_s = -tmp_s;
+   }
+
+   return tmp_s;
+
+slow:
+   tmp = 0;
+   ptr = s;
+   /* no need to reset neg */
+
/* skip leading spaces */
while (likely(*ptr) && isspace((unsigned char) *ptr))
ptr++;
diff --git a/src/backend/utils/adt/numutils.c b/src/backend/utils/adt/numutils.c
index 471fbb7ee6..bb48e5157e 100644
--- a/src/backend/utils/adt/numutils.c
+++ b/src/backend/utils/adt/numutils.c
@@ -301,6 +301,70 @@ pg_strtoint32_safe(const char *s, Node *escontext)
uint32  tmp = 0;
boolneg = false;
 
+   /*
+* The 

Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-25 Thread Andres Freund
Hi,

On 2023-07-25 08:50:19 -0700, Andres Freund wrote:
> One idea I had was to add a fastpath that won't parse all strings, but will
> parse the strings that we would generate, and fall back to the more general
> variant if it fails. See the attached, rough, prototype:
> 
> fix_COPY_DEFAULT.patch + fastpath.patch:
> 746.971
> 
> fix_COPY_DEFAULT.patch + fastpath.patch + isdigit.patch:
> 715.570
> 
> Now, the precise contents of this fastpath are not yet clear (wrt imul or
> not), but I think the idea has promise.

Btw, I strongly suspect that fastpath wants to be branchless SSE when it grows
up.

Greetings,

Andres Freund




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-25 Thread Andres Freund
Hi,

On 2023-07-25 23:37:08 +1200, David Rowley wrote:
> On Tue, 25 Jul 2023 at 17:34, Andres Freund  wrote:
> I've not really studied the fix_COPY_DEFAULT.patch patch.  Is there a
> reason to delay committing that?  It would be good to eliminate that
> as a variable for the current performance regression.

Yea, I don't think there's a reason to hold off on that. Sawada-san, do you
want to commit it? Or Andrew?


> > prep:
> > COPY (SELECT generate_series(1, 200) a, (random() * 10 - 
> > 5)::int b, 3243423 c) TO '/tmp/lotsaints.copy';
> > DROP TABLE lotsaints; CREATE UNLOGGED TABLE lotsaints(a int, b int, c int);
> >
> > benchmark:
> > psql -qX -c 'truncate lotsaints' && pgbench -n -P1 -f <( echo "COPY 
> > lotsaints FROM '/tmp/lotsaints.copy';") -t 15
> >
> > I disabled turbo mode, pinned the server to a single core of my Xeon Gold 
> > 5215:
> >
> > HEAD:   812.690
> >
> > your patch: 821.354
> >
> > strtoint from 8692f6644e7:  824.543
> >
> > strtoint from 6b423ec677d^: 806.678
> 
> I'm surprised to see the imul version is faster. It's certainly not
> what we found when working on 6b423ec67.

What CPUs did you test it on? I'd not be surprised if this were heavily
dependent on the microarchitecture.


> I've been fooling around a bit to try to learn what might be going on.
> I wrote 2 patches:
> 
> 1) pg_strtoint_imul.patch:  Effectively reverts 6b423ec67 and puts the
> code how it likely would have looked had I not done that work at all.
> (I've assumed that the hex, octal, binary parsing would have been
> added using the overflow multiplication functions just as the base 10
> parsing).


> 2) pg_strtoint_optimize.patch: I've made another pass over the
> functions with the current overflow checks and optimized the digit
> checking code so that it can be done in a single check like: if (digit
> < 10).  This can be done by assigning the result of *ptr - '0' to an
> unsigned char.  Anything less than '0' will wrap around and anything
> above '9' will remain so.  I've also removed a few inefficiencies with
> the isspace checking. We didn't need to do "while (*ptr &&
> isspace(*ptr)).  It's fine just to do while (isspace(*ptr)) since '\0'
> isn't a space char.  I also got rid of the isxdigit call.  The
> hexlookup array contains -1 for non-hex chars. We may as well check
> the digit value is >= 0.
> 
> Here are the results I get using your test as quoted above:
> 
> master @ 62e9af4c63f + fix_COPY_DEFAULT.patch
> 
> latency average = 568.102 ms
> 
> master @ 62e9af4c63f + fix_COPY_DEFAULT.patch + pg_strtoint_optimize.patch
> 
> latency average = 531.238 ms
> 
> master @ 62e9af4c63f + fix_COPY_DEFAULT.patch + pg_strtoint_imul.patch
> 
> latency average = 559.333 ms (surprisingly also faster on my machine)
> 
> The optimized version of the pg_strtoint functions wins over the imul
> patch.  Could you test to see if this is also the case in your Xeon
> machine?

(these are the numbers with turbo disabled, if I enable it they're all in the
6xx ms range, but the variance is higher)


fix_COPY_DEFAULT.patch
774.344

fix_COPY_DEFAULT.patch + pg_strtoint32_base_10_first.v2.patch
777.673

fix_COPY_DEFAULT.patch + pg_strtoint_optimize.patch
777.545

fix_COPY_DEFAULT.patch + pg_strtoint_imul.patch
795.298

fix_COPY_DEFAULT.patch + pg_strtoint_imul.patch + likely(isdigit())
773.477

fix_COPY_DEFAULT.patch + pg_strtoint32_base_10_first.v2.patch + 
pg_strtoint_imul.patch
774.443

fix_COPY_DEFAULT.patch + pg_strtoint32_base_10_first.v2.patch + 
pg_strtoint_imul.patch + likely(isdigit())
774.513

fix_COPY_DEFAULT.patch + pg_strtoint32_base_10_first.v2.patch + 
pg_strtoint_imul.patch + likely(isdigit()) + unlikely(*ptr == '_')
763.879


One idea I had was to add a fastpath that won't parse all strings, but will
parse the strings that we would generate, and fall back to the more general
variant if it fails. See the attached, rough, prototype:

fix_COPY_DEFAULT.patch + fastpath.patch:
746.971

fix_COPY_DEFAULT.patch + fastpath.patch + isdigit.patch:
715.570

Now, the precise contents of this fastpath are not yet clear (wrt imul or
not), but I think the idea has promise.



> > (when I say strtoint from, I did not replace the goto labels, so that part 
> > is
> > unchanged and unrelated)
> 
> I didn't quite follow this.

I meant that I didn't undo ereport()->ereturn().

Greetings,

Andres Freund
diff --git c/src/backend/utils/adt/numutils.c i/src/backend/utils/adt/numutils.c
index 471fbb7ee63..52bbff801d0 100644
--- c/src/backend/utils/adt/numutils.c
+++ i/src/backend/utils/adt/numutils.c
@@ -301,6 +301,53 @@ pg_strtoint32_safe(const char *s, Node *escontext)
 	uint32		tmp = 0;
 	bool		neg = false;
 
+	/*
+	 * Fast path recognizing the most common output.
+	 */
+	if (1)
+	{
+		int32		tmp_s = 0;
+
+		/* leading spaces are uncommon => slow path */
+
+		/* handle - sign, + is uncommon => slow path */
+		if (*ptr == '-')
+		{
+			ptr++;
+			neg = true;
+		}
+
+		/* 

Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-25 Thread David Rowley
On Tue, 25 Jul 2023 at 17:34, Andres Freund  wrote:
> prep:
> COPY (SELECT generate_series(1, 200) a, (random() * 10 - 5)::int 
> b, 3243423 c) TO '/tmp/lotsaints.copy';
> DROP TABLE lotsaints; CREATE UNLOGGED TABLE lotsaints(a int, b int, c int);
>
> benchmark:
> psql -qX -c 'truncate lotsaints' && pgbench -n -P1 -f <( echo "COPY lotsaints 
> FROM '/tmp/lotsaints.copy';") -t 15
>
> I disabled turbo mode, pinned the server to a single core of my Xeon Gold 
> 5215:
>
> HEAD:   812.690
>
> your patch: 821.354
>
> strtoint from 8692f6644e7:  824.543
>
> strtoint from 6b423ec677d^: 806.678

I'm surprised to see the imul version is faster. It's certainly not
what we found when working on 6b423ec67.

I've been fooling around a bit to try to learn what might be going on.
I wrote 2 patches:

1) pg_strtoint_imul.patch:  Effectively reverts 6b423ec67 and puts the
code how it likely would have looked had I not done that work at all.
(I've assumed that the hex, octal, binary parsing would have been
added using the overflow multiplication functions just as the base 10
parsing).

2) pg_strtoint_optimize.patch: I've made another pass over the
functions with the current overflow checks and optimized the digit
checking code so that it can be done in a single check like: if (digit
< 10).  This can be done by assigning the result of *ptr - '0' to an
unsigned char.  Anything less than '0' will wrap around and anything
above '9' will remain so.  I've also removed a few inefficiencies with
the isspace checking. We didn't need to do "while (*ptr &&
isspace(*ptr)).  It's fine just to do while (isspace(*ptr)) since '\0'
isn't a space char.  I also got rid of the isxdigit call.  The
hexlookup array contains -1 for non-hex chars. We may as well check
the digit value is >= 0.
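
(As a side note for anyone following along, the single-comparison digit check
described above can be demonstrated in isolation. This is just an illustrative
snippet, not code from either patch.)

#include <stdio.h>

/*
 * '0'..'9' detected with one comparison: anything below '0' wraps around to a
 * large unsigned value, anything above '9' stays >= 10.
 */
static int
is_decimal_digit(char c)
{
    unsigned char digit = (unsigned char) (c - '0');

    return digit < 10;
}

int
main(void)
{
    const char *samples = "5a /:";

    for (const char *p = samples; *p; p++)
        printf("'%c' -> %d\n", *p, is_decimal_digit(*p));
    return 0;
}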

Here are the results I get using your test as quoted above:

master @ 62e9af4c63f + fix_COPY_DEFAULT.patch

latency average = 568.102 ms

master @ 62e9af4c63f + fix_COPY_DEFAULT.patch + pg_strtoint_optimize.patch

latency average = 531.238 ms

master @ 62e9af4c63f + fix_COPY_DEFAULT.patch + pg_strtoint_imul.patch

latency average = 559.333 ms (surprisingly also faster on my machine)

The optimized version of the pg_strtoint functions wins over the imul
patch.  Could you test to see if this is also the case in your Xeon
machine?

> (when I say strtoint from, I did not replace the goto labels, so that part is
> unchanged and unrelated)

I didn't quite follow this.

I've not really studied the fix_COPY_DEFAULT.patch patch.  Is there a
reason to delay committing that?  It would be good to eliminate that
as a variable for the current performance regression.

David
diff --git a/src/backend/utils/adt/numutils.c b/src/backend/utils/adt/numutils.c
index 471fbb7ee6..a6e9d00727 100644
--- a/src/backend/utils/adt/numutils.c
+++ b/src/backend/utils/adt/numutils.c
@@ -104,10 +104,10 @@ static const int8 hexlookup[128] = {
  * pg_strtoint16() will throw ereport() upon bad input format or overflow;
  * while pg_strtoint16_safe() instead returns such complaints in *escontext,
  * if it's an ErrorSaveContext.
-*
- * NB: Accumulate input as an unsigned number, to deal with two's complement
+ *
+ * NB: Accumulate input as a negative number, to deal with two's complement
  * representation of the most negative number, which can't be represented as a
- * signed positive number.
+ * positive number.
  */
 int16
 pg_strtoint16(const char *s)
@@ -120,7 +120,7 @@ pg_strtoint16_safe(const char *s, Node *escontext)
 {
const char *ptr = s;
const char *firstdigit;
-   uint16  tmp = 0;
+   int16   tmp = 0;
boolneg = false;
 
/* skip leading spaces */
@@ -145,10 +145,11 @@ pg_strtoint16_safe(const char *s, Node *escontext)
{
if (isxdigit((unsigned char) *ptr))
{
-   if (unlikely(tmp > -(PG_INT16_MIN / 16)))
-   goto out_of_range;
+   int8 digit = hexlookup[(unsigned char) *ptr++];
 
-   tmp = tmp * 16 + hexlookup[(unsigned char) 
*ptr++];
+   if (unlikely(pg_mul_s16_overflow(tmp, 16, &tmp)) ||
+   unlikely(pg_sub_s16_overflow(tmp, digit, &tmp)))
+   goto out_of_range;
}
else if (*ptr == '_')
{
@@ -169,10 +170,11 @@ pg_strtoint16_safe(const char *s, Node *escontext)
{
if (*ptr >= '0' && *ptr <= '7')
{
-   if (unlikely(tmp > -(PG_INT16_MIN / 8)))
-   goto out_of_range;
+   int8 digit = (*ptr++ - '0');
 
-   tmp = tmp * 8 + (*ptr++ - '0');
+ 
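
To illustrate why the NB comment in the attached patch switches to accumulating
the input as a negative number: the most negative value has no positive
counterpart of the same width, so building up +32768 and negating it at the end
cannot work for int16, while building up -32768 directly can. A small
self-contained sketch of that idea (hypothetical helper, not the patch itself):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Parse a decimal int16, accumulating negatively so that "-32768" works. */
static bool
parse_int16(const char *s, int16_t *result)
{
    bool        neg = false;
    int32_t     acc = 0;        /* wide enough to catch int16 overflow */

    if (*s == '-')
    {
        neg = true;
        s++;
    }
    if (*s < '0' || *s > '9')
        return false;           /* need at least one digit */
    for (; *s >= '0' && *s <= '9'; s++)
    {
        acc = acc * 10 - (*s - '0');
        if (acc < INT16_MIN)
            return false;       /* out of range */
    }
    if (*s != '\0')
        return false;
    if (!neg)
    {
        if (acc == INT16_MIN)
            return false;       /* "32768" overflows int16 */
        acc = -acc;
    }
    *result = (int16_t) acc;
    return true;
}

int
main(void)
{
    int16_t     v;

    printf("%d\n", parse_int16("-32768", &v) ? v : -1);    /* prints -32768 */
    printf("%d\n", parse_int16("32768", &v) ? v : -1);     /* prints -1 */
    return 0;
}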

Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-24 Thread Andres Freund
Hi,

Hm, in some cases your patch is better, but in others both the old code
(8692f6644e7) and HEAD beat yours on my machine. TBH, not entirely sure why.

prep:
COPY (SELECT generate_series(1, 200) a, (random() * 10 - 5)::int b, 
3243423 c) TO '/tmp/lotsaints.copy';
DROP TABLE lotsaints; CREATE UNLOGGED TABLE lotsaints(a int, b int, c int);

benchmark:
psql -qX -c 'truncate lotsaints' && pgbench -n -P1 -f <( echo "COPY lotsaints 
FROM '/tmp/lotsaints.copy';") -t 15

I disabled turbo mode, pinned the server to a single core of my Xeon Gold 5215:

HEAD:   812.690

your patch: 821.354

strtoint from 8692f6644e7:  824.543

strtoint from 6b423ec677d^: 806.678

(when I say strtoint from, I did not replace the goto labels, so that part is
unchanged and unrelated)


IOW, for me the code from 15 is the fastest by a good bit... There's an imul,
sure, but the fact that it sets a flag makes it faster than having to add more
tests and branches.


Looking at a profile reminded me of how silly it is that we are wasting a good
chunk of the time in these isdigit() checks, even though we already rely on
the ASCII values (via *ptr++ - '0'). I think that's done in the headers
for some platforms, but not others (glibc). And we've even done that already
for octal and binary!

Open coding isdigit() gives us:


HEAD:   797.434

your patch: 803.570

strtoint from 8692f6644e7:  778.943

strtoint from 6b423ec677d^: 777.741


It's somewhat odd that HEAD and your patch switch position here...


> - else if (ptr[0] == '0' && (ptr[1] == 'o' || ptr[1] == 'O'))
> + /* process hex digits */
> + else if (ptr[1] == 'x' || ptr[1] == 'X')
>   {
>
>   firstdigit = ptr += 2;

I find this unnecessarily hard to read. I realize it's been added in
6fcda9aba83, but I don't see a reason to use such a construct here.


I find it somewhat grating how much duplication there now is in this
code due to being repeated for all the bases...


Greetings,

Andres Freund




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-20 Thread David Rowley
On Thu, 20 Jul 2023 at 20:37, Dean Rasheed  wrote:
>
> On Thu, 20 Jul 2023 at 00:56, David Rowley  wrote:
> I agree with the principal though. In the attached updated patch, I
> replaced that test with a simpler one:
>
> +/*
> + * Process the number's digits. We optimize for decimal input (expected 
> to
> + * be the most common case) first. Anything that doesn't start with a 
> base
> + * prefix indicator must be decimal.
> + */
> +
> +   /* process decimal digits */
> +   if (likely(ptr[0] != '0' || !isalpha((unsigned char) ptr[1])))
>
> So hopefully any compiler should only need to do the one comparison
> against '0' for any non-zero decimal input.
>
> Doing that didn't give any measurable performance improvement for me,
> but it did at least not make it noticeably worse, and seems more
> likely to generate better code with not-so-smart compilers. I'd be
> interested to know how that performs for you (and if your compiler
> really does generate 3 "first digit is '0'" tests for the unpatched
> code).

That seems better.  I benchmarked it on two machines:

gcc12.2/linux/amd3990x
create table a (a int);
insert into a select x from generate_series(1,1000)x;
copy a to '/tmp/a.dump';

master @ ab29a7a9c
postgres=# truncate a; copy a from '/tmp/a.dump';
Time: 2226.912 ms (00:02.227)
Time: 2214.168 ms (00:02.214)
Time: 2206.481 ms (00:02.206)
Time: 2219.640 ms (00:02.220)
Time: 2205.004 ms (00:02.205)

master + pg_strtoint32_base_10_first.v2.patch
postgres=# truncate a; copy a from '/tmp/a.dump';
Time: 2067.108 ms (00:02.067)
Time: 2070.401 ms (00:02.070)
Time: 2073.423 ms (00:02.073)
Time: 2065.407 ms (00:02.065)
Time: 2066.536 ms (00:02.067) (~7% faster)

apple m2 pro/clang

master @ 9089287a
postgres=# truncate a; copy a from '/tmp/a.dump';
Time: 1286.369 ms (00:01.286)
Time: 1300.534 ms (00:01.301)
Time: 1295.661 ms (00:01.296)
Time: 1296.404 ms (00:01.296)
Time: 1268.361 ms (00:01.268)
Time: 1259.321 ms (00:01.259)

master + pg_strtoint32_base_10_first.v2.patch
postgres=# truncate a; copy a from '/tmp/a.dump';
Time: 1253.519 ms (00:01.254)
Time: 1235.256 ms (00:01.235)
Time: 1269.501 ms (00:01.270)
Time: 1267.801 ms (00:01.268)
Time: 1275.758 ms (00:01.276)
Time: 1261.478 ms (00:01.261) (a bit noisy but avg of ~1.8% faster)

David




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-20 Thread Dean Rasheed
On Thu, 20 Jul 2023 at 00:56, David Rowley  wrote:
>
> I noticed that 6fcda9aba added quite a lot of conditions that need to
> be checked before we process a normal decimal integer string. I think
> we could likely do better and code it to assume that most strings will
> be decimal and put that case first rather than last.

That sounds sensible, but ...

> In the attached, I've changed that for the 32-bit version only.  A
> more complete patch would need to do the 16 and 64-bit versions too.
>
> Without that, we need to check if the first digit is '0' a total of 3
> times and also check if the 2nd digit is any of 'x', 'X', 'o', 'O',
> 'b' or 'B'.

That's not what I see. For me, the compiler (gcc 11, using either -O2
or -O3) is smart enough to spot that the first part of each of the 3
checks is the same, and it only tests whether the first digit is '0'
once, before immediately jumping to the decimal parsing code. I didn't
test other compilers though, so I can believe that different compilers
might not be so smart.

OTOH, this test in your patch:

+/* process decimal digits */
+if (isdigit((unsigned char) ptr[0]) &&
+(isdigit((unsigned char) ptr[1]) || ptr[1] == '_' || ptr[1]
== '\0' || isspace(ptr[1])))

is doing more work than it needs to, and actually makes things
noticeably worse for me. It needs to do a minimum of 2 isdigit()
checks before it will parse the input as a decimal, whereas before
(for me at least) it just did one simple comparison of ptr[0] against
'0'.

I agree with the principal though. In the attached updated patch, I
replaced that test with a simpler one:

+/*
+ * Process the number's digits. We optimize for decimal input (expected to
+ * be the most common case) first. Anything that doesn't start with a base
+ * prefix indicator must be decimal.
+ */
+
+   /* process decimal digits */
+   if (likely(ptr[0] != '0' || !isalpha((unsigned char) ptr[1])))

So hopefully any compiler should only need to do the one comparison
against '0' for any non-zero decimal input.
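
As a quick illustration (not from the patch) of how that test routes a few
representative inputs, using a hypothetical wrapper name:

#include <ctype.h>
#include <stdio.h>

/*
 * Decimal-first test: anything that does not start with '0' followed by a
 * letter is handled by the decimal branch; strings that do start with '0'
 * and a letter (e.g. "0x...", "0o...", "0b...") fall through to the
 * base-prefix branches.
 */
static int
takes_decimal_path(const char *ptr)
{
    return ptr[0] != '0' || !isalpha((unsigned char) ptr[1]);
}

int
main(void)
{
    const char *inputs[] = {"123", "0", "007", "0x1A", "0b101", "0_5"};

    for (int i = 0; i < 6; i++)
        printf("%-6s -> %s\n", inputs[i],
               takes_decimal_path(inputs[i]) ? "decimal" : "base prefix");
    return 0;
}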

Doing that didn't give any measurable performance improvement for me,
but it did at least not make it noticeably worse, and seems more
likely to generate better code with not-so-smart compilers. I'd be
interested to know how that performs for you (and if your compiler
really does generate 3 "first digit is '0'" tests for the unpatched
code).

Regards,
Dean

---

Here were my test results (where P1 is the "fix_COPY_DEFAULT.patch"),
and I tested COPY loading 50M rows:

HEAD + P1
=

7137.966 ms
7193.190 ms
7094.491 ms
7123.520 ms

HEAD + P1 + pg_strtoint32_base_10_first.patch
=

7561.658 ms
7548.282 ms
7551.360 ms
7560.671 ms

HEAD + P1 + pg_strtoint32_base_10_first.v2.patch


7238.775 ms
7184.937 ms
7123.257 ms
7159.618 ms
diff --git a/src/backend/utils/adt/numutils.c b/src/backend/utils/adt/numutils.c
new file mode 100644
index 471fbb7..e2320e6
--- a/src/backend/utils/adt/numutils.c
+++ b/src/backend/utils/adt/numutils.c
@@ -314,113 +314,124 @@ pg_strtoint32_safe(const char *s, Node *
 	else if (*ptr == '+')
 		ptr++;
 
-	/* process digits */
-	if (ptr[0] == '0' && (ptr[1] == 'x' || ptr[1] == 'X'))
+	/*
+	 * Process the number's digits. We optimize for decimal input (expected to
+	 * be the most common case) first. Anything that doesn't start with a base
+	 * prefix indicator must be decimal.
+	 */
+
+	/* process decimal digits */
+	if (likely(ptr[0] != '0' || !isalpha((unsigned char) ptr[1])))
 	{
-		firstdigit = ptr += 2;
+		firstdigit = ptr;
 
 		while (*ptr)
 		{
-			if (isxdigit((unsigned char) *ptr))
+			if (isdigit((unsigned char) *ptr))
 			{
-if (unlikely(tmp > -(PG_INT32_MIN / 16)))
+if (unlikely(tmp > -(PG_INT32_MIN / 10)))
 	goto out_of_range;
 
-tmp = tmp * 16 + hexlookup[(unsigned char) *ptr++];
+tmp = tmp * 10 + (*ptr++ - '0');
 			}
 			else if (*ptr == '_')
 			{
-/* underscore must be followed by more digits */
+/* underscore may not be first */
+if (unlikely(ptr == firstdigit))
+	goto invalid_syntax;
+/* and it must be followed by more digits */
 ptr++;
-if (*ptr == '\0' || !isxdigit((unsigned char) *ptr))
+if (*ptr == '\0' || !isdigit((unsigned char) *ptr))
 	goto invalid_syntax;
 			}
 			else
 break;
 		}
 	}
-	else if (ptr[0] == '0' && (ptr[1] == 'o' || ptr[1] == 'O'))
+	/* process hex digits */
+	else if (ptr[1] == 'x' || ptr[1] == 'X')
 	{
 		firstdigit = ptr += 2;
 
 		while (*ptr)
 		{
-			if (*ptr >= '0' && *ptr <= '7')
+			if (isxdigit((unsigned char) *ptr))
 			{
-if (unlikely(tmp > -(PG_INT32_MIN / 8)))
+if (unlikely(tmp > -(PG_INT32_MIN / 16)))
 	goto out_of_range;
 
-tmp = tmp * 8 + (*ptr++ - '0');
+tmp = tmp * 16 + hexlookup[(unsigned char) *ptr++];
 			}
 			else if (*ptr == '_')
 			{
 /* underscore must be followed by more digits */
 ptr++;
-if (*ptr == '\0' || *ptr < 

Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-19 Thread David Rowley
On Wed, 19 Jul 2023 at 23:14, Dean Rasheed  wrote:
> Hmm, I'm somewhat sceptical about this second patch. It's not obvious
> why adding such tests would speed it up, and indeed, testing on my
> machine with 50M rows, I see a noticeable speed-up from patch 1, and a
> slow-down from patch 2:

I noticed that 6fcda9aba added quite a lot of conditions that need to
be checked before we process a normal decimal integer string. I think
we could likely do better and code it to assume that most strings will
be decimal and put that case first rather than last.

In the attached, I've changed that for the 32-bit version only.  A
more complete patch would need to do the 16 and 64-bit versions too.

-- setup
create table a (a int);
insert into a select x from generate_series(1,1000)x;
copy a to '~/a.dump';

-- test
truncate a; copy a from '/tmp/a.dump';

master @ ab29a7a9c
Time: 2152.633 ms (00:02.153)
Time: 2121.091 ms (00:02.121)
Time: 2100.995 ms (00:02.101)
Time: 2101.724 ms (00:02.102)
Time: 2103.949 ms (00:02.104)

master + pg_strtoint32_base_10_first.patch
Time: 2061.464 ms (00:02.061)
Time: 2035.513 ms (00:02.036)
Time: 2028.356 ms (00:02.028)
Time: 2043.218 ms (00:02.043)
Time: 2037.035 ms (00:02.037) (~3.6% faster)

Without that, we need to check if the first digit is '0' a total of 3
times and also check if the 2nd digit is any of 'x', 'X', 'o', 'O',
'b' or 'B'. It seems to be coded with the assumption that hex strings
are the most likely. I think decimals are the most likely by far.

David
diff --git a/src/backend/utils/adt/numutils.c b/src/backend/utils/adt/numutils.c
index 471fbb7ee6..60c90f7252 100644
--- a/src/backend/utils/adt/numutils.c
+++ b/src/backend/utils/adt/numutils.c
@@ -314,113 +314,119 @@ pg_strtoint32_safe(const char *s, Node *escontext)
else if (*ptr == '+')
ptr++;
 
-   /* process digits */
-   if (ptr[0] == '0' && (ptr[1] == 'x' || ptr[1] == 'X'))
+   /* process decimal digits */
+   if (isdigit((unsigned char) ptr[0]) &&
+   (isdigit((unsigned char) ptr[1]) || ptr[1] == '_' || ptr[1] == 
'\0' || isspace(ptr[1])))
{
-   firstdigit = ptr += 2;
+   firstdigit = ptr;
 
while (*ptr)
{
-   if (isxdigit((unsigned char) *ptr))
+   if (isdigit((unsigned char) *ptr))
{
-   if (unlikely(tmp > -(PG_INT32_MIN / 16)))
+   if (unlikely(tmp > -(PG_INT32_MIN / 10)))
goto out_of_range;
 
-   tmp = tmp * 16 + hexlookup[(unsigned char) 
*ptr++];
+   tmp = tmp * 10 + (*ptr++ - '0');
}
else if (*ptr == '_')
{
-   /* underscore must be followed by more digits */
+   /* underscore may not be first */
+   if (unlikely(ptr == firstdigit))
+   goto invalid_syntax;
+   /* and it must be followed by more digits */
ptr++;
-   if (*ptr == '\0' || !isxdigit((unsigned char) 
*ptr))
+   if (*ptr == '\0' || !isdigit((unsigned char) 
*ptr))
goto invalid_syntax;
}
else
break;
}
}
-   else if (ptr[0] == '0' && (ptr[1] == 'o' || ptr[1] == 'O'))
+   /* process hex digits */
+   else if (ptr[0] == '0' && (ptr[1] == 'x' || ptr[1] == 'X'))
{
firstdigit = ptr += 2;
 
while (*ptr)
{
-   if (*ptr >= '0' && *ptr <= '7')
+   if (isxdigit((unsigned char) *ptr))
{
-   if (unlikely(tmp > -(PG_INT32_MIN / 8)))
+   if (unlikely(tmp > -(PG_INT32_MIN / 16)))
goto out_of_range;
 
-   tmp = tmp * 8 + (*ptr++ - '0');
+   tmp = tmp * 16 + hexlookup[(unsigned char) 
*ptr++];
}
else if (*ptr == '_')
{
/* underscore must be followed by more digits */
ptr++;
-   if (*ptr == '\0' || *ptr < '0' || *ptr > '7')
+   if (*ptr == '\0' || !isxdigit((unsigned char) 
*ptr))
goto invalid_syntax;
}
else
break;
}
}
-   else if (ptr[0] == '0' && (ptr[1] == 'b' || ptr[1] == 'B'))

Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-19 Thread Dean Rasheed
On Wed, 19 Jul 2023 at 09:24, Masahiko Sawada  wrote:
>
> > > 2) pg_strtoint32_safe() got substantially slower, mainly due
> > >to
> > > faff8f8e47f Allow underscores in integer and numeric constants.
> > > 6fcda9aba83 Non-decimal integer literals
> >
> > Agreed.
> >
> I have made some progress on dealing with performance regression on
> single client COPY. I've attached a patch to fix 2). With the patch I
> shared[1] to deal with 1), single client COPY performance seems to be
> now as good as (or slightly better than) PG15 . Here are the results
> (averages of 5 times) of loading 50M rows via COPY:
>

Hmm, I'm somewhat sceptical about this second patch. It's not obvious
why adding such tests would speed it up, and indeed, testing on my
machine with 50M rows, I see a noticeable speed-up from patch 1, and a
slow-down from patch 2:


PG15


7390.461 ms
7497.655 ms
7485.850 ms
7406.336 ms

HEAD


8388.707 ms
8283.484 ms
8391.638 ms
8363.306 ms

HEAD + P1
=

7255.128 ms
7185.319 ms
7197.822 ms
7191.176 ms

HEAD + P2
=

8687.164 ms
8654.907 ms
8641.493 ms
8668.865 ms

HEAD + P1 + P2
==

7780.126 ms
7786.427 ms
7775.047 ms
7785.938 ms


So for me at least, just applying patch 1 gives the best results, and
makes it slightly faster than PG15 (possibly due to 6b423ec677).

Regards,
Dean




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-19 Thread Masahiko Sawada
On Wed, Jul 12, 2023 at 5:40 PM Masahiko Sawada  wrote:
>
> On Wed, Jul 12, 2023 at 3:52 AM Andres Freund  wrote:
> >
> > Hi,
> >
> > On 2023-07-03 11:55:13 +0900, Masahiko Sawada wrote:
> > > While testing PG16, I observed that in PG16 there is a big performance
> > > degradation in concurrent COPY into a single relation with 2 - 16
> > > clients in my environment. I've attached a test script that measures
> > > the execution time of COPYing 5GB data in total to the single relation
> > > while changing the number of concurrent insertions, in PG16 and PG15.
> > > Here are the results on my environment (EC2 instance, RHEL 8.6, 128
> > > vCPUs, 512GB RAM):
> > >
> > > * PG15 (4b15868b69)
> > > PG15: nclients = 1, execution time = 14.181
> > >
> > > * PG16 (c24e9ef330)
> > > PG16: nclients = 1, execution time = 17.112
> >
> > > The relevant commit is 00d1e02be2 "hio: Use ExtendBufferedRelBy() to
> > > extend tables more efficiently". With commit 1cbbee0338 (the previous
> > > commit of 00d1e02be2), I got a better numbers, it didn't have a better
> > > scalability, though:
> > >
> > > PG16: nclients = 1, execution time = 17.444
> >
> > I think the single client case is indicative of an independent regression, 
> > or
> > rather regressions - it can't have anything to do with the fallocate() issue
> > and reproduces before that too in your numbers.
>
> Right.
>
> >
> > 1) COPY got slower, due to:
> > 9f8377f7a27 Add a DEFAULT option to COPY  FROM
> >
> > This added a new palloc()/free() to every call to NextCopyFrom(). It's not 
> > at
> > all clear to me why this needs to happen in NextCopyFrom(), particularly
> > because it's already stored in CopyFromState?
>
> Yeah, it seems to me that we can palloc the bool array once and use it
> for the entire COPY FROM. With the attached small patch, the
> performance becomes much better:
>
> 15:
> 14.70500 sec
>
> 16:
> 17.42900 sec
>
> 16 w/ patch:
> 14.85600 sec
>
> >
> >
> > 2) pg_strtoint32_safe() got substantially slower, mainly due
> >to
> > faff8f8e47f Allow underscores in integer and numeric constants.
> > 6fcda9aba83 Non-decimal integer literals
>
> Agreed.
>
> >
> > pinned to one cpu, turbo mode disabled, I get the following best-of-three 
> > times for
> >   copy test from '/tmp/tmp_4.data'
> > (too impatient to use the larger file every time)
> >
> > 15:
> > 6281.107 ms
> >
> > HEAD:
> > 7000.469 ms
> >
> > backing out 9f8377f7a27:
> > 6433.516 ms
> >
> > also backing out faff8f8e47f, 6fcda9aba83:
> > 6235.453 ms
> >
> >
> > I suspect 1) can relatively easily be fixed properly. But 2) seems much
> > harder. The changes increased the number of branches substantially, that's
> > gonna cost in something as (previously) tight as pg_strtoint32().
>
> I'll look at how to fix 2).

I have made some progress on dealing with the performance regression on
single client COPY. I've attached a patch to fix 2). With the patch I
shared[1] to deal with 1), single client COPY performance seems to be
now as good as (or slightly better than) PG15. Here are the results
(averages of 5 times) of loading 50M rows via COPY:

15:
7.609 sec

16:
8.637 sec

16 w/ two patches:
7.179 sec

Regards,

[1] 
https://www.postgresql.org/message-id/CAD21AoBb9Sbddh%2BnQk1BxUFaRHaa%2BfL8fCzO-Lvxqb6ZcpAHqw%40mail.gmail.com

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


fix_numeric.patch
Description: Binary data


Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-12 Thread Masahiko Sawada
On Wed, Jul 12, 2023 at 3:52 AM Andres Freund  wrote:
>
> Hi,
>
> On 2023-07-03 11:55:13 +0900, Masahiko Sawada wrote:
> > While testing PG16, I observed that in PG16 there is a big performance
> > degradation in concurrent COPY into a single relation with 2 - 16
> > clients in my environment. I've attached a test script that measures
> > the execution time of COPYing 5GB data in total to the single relation
> > while changing the number of concurrent insertions, in PG16 and PG15.
> > Here are the results on my environment (EC2 instance, RHEL 8.6, 128
> > vCPUs, 512GB RAM):
> >
> > * PG15 (4b15868b69)
> > PG15: nclients = 1, execution time = 14.181
> >
> > * PG16 (c24e9ef330)
> > PG16: nclients = 1, execution time = 17.112
>
> > The relevant commit is 00d1e02be2 "hio: Use ExtendBufferedRelBy() to
> > extend tables more efficiently". With commit 1cbbee0338 (the previous
> > commit of 00d1e02be2), I got a better numbers, it didn't have a better
> > scalability, though:
> >
> > PG16: nclients = 1, execution time = 17.444
>
> I think the single client case is indicative of an independent regression, or
> rather regressions - it can't have anything to do with the fallocate() issue
> and reproduces before that too in your numbers.

Right.

>
> 1) COPY got slower, due to:
> 9f8377f7a27 Add a DEFAULT option to COPY  FROM
>
> This added a new palloc()/free() to every call to NextCopyFrom(). It's not at
> all clear to me why this needs to happen in NextCopyFrom(), particularly
> because it's already stored in CopyFromState?

Yeah, it seems to me that we can palloc the bool array once and use it
for the entire COPY FROM. With the attached small patch, the
performance becomes much better:

15:
14.70500 sec

16:
17.42900 sec

16 w/ patch:
14.85600 sec

>
>
> 2) pg_strtoint32_safe() got substantially slower, mainly due
>to
> faff8f8e47f Allow underscores in integer and numeric constants.
> 6fcda9aba83 Non-decimal integer literals

Agreed.

>
> pinned to one cpu, turbo mode disabled, I get the following best-of-three 
> times for
>   copy test from '/tmp/tmp_4.data'
> (too impatient to use the larger file every time)
>
> 15:
> 6281.107 ms
>
> HEAD:
> 7000.469 ms
>
> backing out 9f8377f7a27:
> 6433.516 ms
>
> also backing out faff8f8e47f, 6fcda9aba83:
> 6235.453 ms
>
>
> I suspect 1) can relatively easily be fixed properly. But 2) seems much
> harder. The changes increased the number of branches substantially, that's
> gonna cost in something as (previously) tight as pg_strtoint32().

I'll look at how to fix 2).

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


fix_COPY_DEFAULT.patch
Description: Binary data
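
Since the attached fix_COPY_DEFAULT.patch is stored here as binary data, the
general shape of the change described above can be sketched as follows; the
struct and field names are simplified stand-ins, not the actual
CopyFromStateData layout:

#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

typedef struct CopyStateSketch
{
    int         natts;
    bool       *defaults;       /* allocated once, reused for every row */
} CopyStateSketch;

static void
begin_copy_from(CopyStateSketch *cstate, int natts)
{
    cstate->natts = natts;
    /* previously an array like this was allocated and freed for every row */
    cstate->defaults = calloc(natts, sizeof(bool));
}

static void
next_copy_from(CopyStateSketch *cstate)
{
    /* the per-row work now only clears the preallocated array */
    memset(cstate->defaults, 0, cstate->natts * sizeof(bool));
    /* ... read one row, setting defaults[i] where DEFAULT applies ... */
}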


Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-11 Thread Andres Freund
Hi,

On 2023-07-03 11:55:13 +0900, Masahiko Sawada wrote:
> While testing PG16, I observed that in PG16 there is a big performance
> degradation in concurrent COPY into a single relation with 2 - 16
> clients in my environment. I've attached a test script that measures
> the execution time of COPYing 5GB data in total to the single relation
> while changing the number of concurrent insertions, in PG16 and PG15.
> Here are the results on my environment (EC2 instance, RHEL 8.6, 128
> vCPUs, 512GB RAM):
>
> * PG15 (4b15868b69)
> PG15: nclients = 1, execution time = 14.181
>
> * PG16 (c24e9ef330)
> PG16: nclients = 1, execution time = 17.112

> The relevant commit is 00d1e02be2 "hio: Use ExtendBufferedRelBy() to
> extend tables more efficiently". With commit 1cbbee0338 (the previous
> commit of 00d1e02be2), I got a better numbers, it didn't have a better
> scalability, though:
>
> PG16: nclients = 1, execution time = 17.444

I think the single client case is indicative of an independent regression, or
rather regressions - it can't have anything to do with the fallocate() issue
and reproduces before that too in your numbers.

1) COPY got slower, due to:
9f8377f7a27 Add a DEFAULT option to COPY  FROM

This added a new palloc()/free() to every call to NextCopyFrom(). It's not at
all clear to me why this needs to happen in NextCopyFrom(), particularly
because it's already stored in CopyFromState?


2) pg_strtoint32_safe() got substantially slower, mainly due to
faff8f8e47f Allow underscores in integer and numeric constants.
6fcda9aba83 Non-decimal integer literals

pinned to one cpu, turbo mode disabled, I get the following best-of-three times 
for
  copy test from '/tmp/tmp_4.data'
(too impatient to use the larger file every time)

15:
6281.107 ms

HEAD:
7000.469 ms

backing out 9f8377f7a27:
6433.516 ms

also backing out faff8f8e47f, 6fcda9aba83:
6235.453 ms


I suspect 1) can relatively easily be fixed properly. But 2) seems much
harder. The changes increased the number of branches substantially, that's
gonna cost in something as (previously) tight as pg_strtoint32().



For higher concurrency numbers, I was now able to reproduce the regression, to
a smaller degree. Much smaller after fixing the above. The reason we run into
the issue here is basically that the rows in the test are very narrow and reach

#define MAX_BUFFERED_TUPLES 1000

at a small number of pages, so we go back and forth between extending with
fallocate() and not.

I'm *not* saying that that is the solution, but after changing that to 5000,
the numbers look a lot better (with the other regressions "worked around"):

(this is again with turboboost disabled, to get more reproducible numbers)

clients                      1       2       4       8      16      32

15,buffered=1000         25725   13211    9232    5639    4862    4700
15,buffered=5000         26107   14550    8644    6050    4943    4766
HEAD+fixes,buffered=1000 25875   14505    8200    4900    3565    3433
HEAD+fixes,buffered=5000 25830   12975    6527    3594    2739    2642

Greetings,

Andres Freund

[1] 
https://postgr.es/m/CAD21AoAEwHTLYhuQ6PaBRPXKWN-CgW9iw%2B4hm%3D2EOFXbJQ3tOg%40mail.gmail.com




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-11 Thread Andres Freund
Hi,

On 2023-07-11 09:09:43 +0200, Jakub Wartak wrote:
> On Mon, Jul 10, 2023 at 6:24 PM Andres Freund  wrote:
> >
> > Hi,
> >
> > On 2023-07-03 11:53:56 +0200, Jakub Wartak wrote:
> > > Out of curiosity I've tried and it is reproducible as you have stated : 
> > > XFS
> > > @ 4.18.0-425.10.1.el8_7.x86_64:
> > >...
> > > According to iostat and blktrace -d /dev/sda -o - | blkparse -i - output ,
> > > the XFS issues sync writes while ext4 does not, xfs looks like constant
> > > loop of sync writes (D) by kworker/2:1H-kblockd:
> >
> > That clearly won't go well.  It's not reproducible on newer systems,
> > unfortunately :(. Or well, fortunately maybe.
> >
> >
> > I wonder if a trick to avoid this could be to memorialize the fact that we
> > bulk extended before and extend by that much going forward? That'd avoid the
> > swapping back and forth.
>
> I haven't seen this thread [1] "Question on slow fallocate", from XFS
> mailing list being mentioned here (it was started by Masahiko), but I
> do feel it contains very important hints even challenging the whole
> idea of zeroing out files (or posix_fallocate()). Please especially
> see Dave's reply.

I think that's just due to the reproducer being a bit too minimal and the
actual problem being addressed not being explained.


> He also argues that posix_fallocate() != fallocate().  What's interesting is
> that it's by design and newer kernel versions should not prevent such
> behaviour, see my testing result below.

I think the problem there was that I was not targeting a different file
between the different runs, somehow assuming the test program would be taking
care of that.

I don't think the test program actually tests things in a particularly useful
way - it does fallocate()s in 8k chunks - which postgres never does.



> All I can add is that this those kernel versions (4.18.0) seem to very
> popular across customers (RHEL, Rocky) right now and that I've tested
> on most recent available one (4.18.0-477.15.1.el8_8.x86_64) using
> Masahiko test.c and still got 6-7x slower time when using XFS on that
> kernel. After installing kernel-ml (6.4.2) the test.c result seems to
> be the same (note it it occurs only when 1st allocating space, but of
> course it doesnt if the same file is rewritten/"reallocated"):

test.c really doesn't reproduce postgres behaviour in any meaningful way,
using it as a benchmark is a bad idea.

Greetings,

Andres Freund




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-11 Thread Jakub Wartak
On Mon, Jul 10, 2023 at 6:24 PM Andres Freund  wrote:
>
> Hi,
>
> On 2023-07-03 11:53:56 +0200, Jakub Wartak wrote:
> > Out of curiosity I've tried and it is reproducible as you have stated : XFS
> > @ 4.18.0-425.10.1.el8_7.x86_64:
> >...
> > According to iostat and blktrace -d /dev/sda -o - | blkparse -i - output ,
> > the XFS issues sync writes while ext4 does not, xfs looks like constant
> > loop of sync writes (D) by kworker/2:1H-kblockd:
>
> That clearly won't go well.  It's not reproducible on newer systems,
> unfortunately :(. Or well, fortunately maybe.
>
>
> I wonder if a trick to avoid this could be to memorialize the fact that we
> bulk extended before and extend by that much going forward? That'd avoid the
> swapping back and forth.

I haven't seen this thread [1] "Question on slow fallocate", from XFS
mailing list being mentioned here (it was started by Masahiko), but I
do feel it contains very important hints even challenging the whole
idea of zeroing out files (or posix_fallocate()). Please especially
see Dave's reply. He also argues that posix_fallocate() !=
fallocate().  What's interesting is that it's by design and newer
kernel versions should not prevent such behaviour, see my testing
result below.

All I can add is that those kernel versions (4.18.0) seem to be very
popular across customers (RHEL, Rocky) right now and that I've tested
on the most recent available one (4.18.0-477.15.1.el8_8.x86_64) using
Masahiko's test.c and still got a 6-7x slower time when using XFS on that
kernel. After installing kernel-ml (6.4.2) the test.c result seems to
be the same (note that it occurs only when first allocating space, but of
course it doesn't if the same file is rewritten/"reallocated"):

[root@rockyora ~]# uname -r
6.4.2-1.el8.elrepo.x86_64
[root@rockyora ~]# time ./test test.0 0
total   20
fallocate   0
filewrite   20

real0m0.405s
user0m0.006s
sys 0m0.391s
[root@rockyora ~]# time ./test test.0 1
total   20
fallocate   20
filewrite   0

real0m0.137s
user0m0.005s
sys 0m0.132s
[root@rockyora ~]# time ./test test.1 1
total   20
fallocate   20
filewrite   0

real0m0.968s
user0m0.020s
sys 0m0.928s
[root@rockyora ~]# time ./test test.2 2
total   20
fallocate   10
filewrite   10

real0m6.059s
user0m0.000s
sys 0m0.788s
[root@rockyora ~]# time ./test test.2 2
total   20
fallocate   10
filewrite   10

real0m0.598s
user0m0.003s
sys 0m0.225s
[root@rockyora ~]#

iostat -x reports during first "time ./test test.2 2" (as you can see
w_awiat is not that high but it accumulates):
Devicer/s w/s rMB/s wMB/s   rrqm/s   wrqm/s
%rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sda  0.00 15394.00  0.00122.02 0.0013.00
0.00   0.080.000.05   0.75 0.00 8.12   0.06 100.00
dm-0 0.00 15407.00  0.00122.02 0.00 0.00
0.00   0.000.000.06   0.98 0.00 8.11   0.06 100.00

So maybe that's just a hint that you should try on slower storage
instead? (I think that on NVMe this issue would be hardly noticeable
due to low IO latency, not like here)

-J.

[1] - https://www.spinics.net/lists/linux-xfs/msg73035.html




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-10 Thread Masahiko Sawada
On Tue, Jul 11, 2023 at 1:24 AM Andres Freund  wrote:
>
> Hi,
>
> On 2023-07-03 11:53:56 +0200, Jakub Wartak wrote:
> > Out of curiosity I've tried and it is reproducible as you have stated : XFS
> > @ 4.18.0-425.10.1.el8_7.x86_64:
> >...
> > According to iostat and blktrace -d /dev/sda -o - | blkparse -i - output ,
> > the XFS issues sync writes while ext4 does not, xfs looks like constant
> > loop of sync writes (D) by kworker/2:1H-kblockd:
>
> That clearly won't go well.  It's not reproducible on newer systems,
> unfortunately :(. Or well, fortunately maybe.
>
>
> I wonder if a trick to avoid this could be to memorialize the fact that we
> bulk extended before and extend by that much going forward? That'd avoid the
> swapping back and forth.
>
>
> One thing that confuses me is that Sawada-san observed a regression at a
> single client - yet from what I can tell, there's actually not a single
> fallocate() being generated in that case, because the table is so narrow that
> we never end up extending by a sufficient number of blocks in
> RelationAddBlocks() to reach that path. Yet there's a regression at 1 client.
>
> I don't yet have a RHEL8 system at hand, could either of you send the result
> of a
>   strace -c -p $pid_of_backend_doing_copy
>

Here are the results:

* PG16: nclients = 1, execution time = 23.758
% time seconds  usecs/call callserrors syscall
-- --- --- - - 
 53.180.308675   0358898   pwrite64
 33.920.196900   2 81202   pwritev
  7.780.045148   0 81378   lseek
  5.060.029371   2 11141   read
  0.040.000250   291   openat
  0.020.94   189   close
  0.000.00   0 1   munmap
  0.000.00   084   brk
  0.000.00   0 1   sendto
  0.000.00   0 2 1 recvfrom
  0.000.00   0 2   kill
  0.000.00   0 1   futex
  0.000.00   0 1   epoll_wait
-- --- --- - - 
100.000.580438   1532891 1 total

* PG16: nclients = 2, execution time = 18.045
% time seconds  usecs/call callserrors syscall
-- --- --- - - 
 54.190.218721   1187803   pwrite64
 29.240.118002   2 40242   pwritev
  6.230.025128   0 41239   lseek
  5.030.020285   2  9133   read
  2.040.008229   9   855   fallocate
  1.280.005151   0  5598  1000 futex
  1.120.004516   1  3965   kill
  0.780.003141   1  3058 1 epoll_wait
  0.060.000224   2   100   openat
  0.030.000136   198   close
  0.010.50   084   brk
  0.000.13   022   setitimer
  0.000.06   015 1 rt_sigreturn
  0.000.02   2 1   munmap
  0.000.02   2 1   sendto
  0.000.02   0 3 2 recvfrom
-- --- --- - - 
100.000.403608   1292217  1004 total

* PG16: nclients = 4, execution time = 18.807
% time seconds  usecs/call callserrors syscall
-- --- --- - - 
 64.610.240171   2 93868   pwrite64
 15.110.056173   4 11337   pwritev
  7.290.027100   7  3465   fallocate
  4.050.015048   2  5179   read
  3.550.013188   0 14941   lseek
  2.650.009858   1  8675  2527 futex
  1.760.006536   1  4190   kill
  0.880.003268   1  2199   epoll_wait
  0.060.000213   2   101   openat
  0.030.000130   199   close
  0.010.31   118   rt_sigreturn
  0.010.27   117   setitimer
  0.000.00   0 1   munmap
  0.000.00   084   brk
  0.000.00   0 1   sendto
  0.000.00   0 1   recvfrom
-- --- --- - - 
100.000.371743   2144176  2527 total

* PG16: nclients = 8, execution time = 11.982
% time seconds  usecs/call callserrors syscall
-- --- --- - - 
 73.16  

Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-10 Thread Masahiko Sawada
On Tue, Jul 11, 2023 at 12:34 AM Andres Freund  wrote:
>
> Hi,
>
> On 2023-07-03 11:55:13 +0900, Masahiko Sawada wrote:
> > While testing PG16, I observed that in PG16 there is a big performance
> > degradation in concurrent COPY into a single relation with 2 - 16
> > clients in my environment. I've attached a test script that measures
> > the execution time of COPYing 5GB data in total to the single relation
> > while changing the number of concurrent insertions, in PG16 and PG15.
> > Here are the results on my environment (EC2 instance, RHEL 8.6, 128
> > vCPUs, 512GB RAM):
>
> Gah, RHEL with its frankenkernels, the bane of my existance.
>
> FWIW, I had extensively tested this with XFS, just with a newer kernel. Have
> you tested this on RHEL9 as well by any chance?

I've tested this on RHEL9 (m5.24xlarge; 96vCPUs, 384GB RAM), and it
seems to be reproducible on RHEL9 too.

$ uname -rms
Linux 5.14.0-284.11.1.el9_2.x86_64 x86_64
$ cat /etc/redhat-release
Red Hat Enterprise Linux release 9.2 (Plow)

PG15: nclients = 1, execution time = 22.193
PG15: nclients = 2, execution time = 12.430
PG15: nclients = 4, execution time = 8.677
PG15: nclients = 8, execution time = 6.144
PG15: nclients = 16, execution time = 5.938
PG15: nclients = 32, execution time = 5.775
PG15: nclients = 64, execution time = 5.928
PG15: nclients = 128, execution time = 6.346
PG15: nclients = 256, execution time = 6.641

PG16: nclients = 1, execution time = 24.601
PG16: nclients = 2, execution time = 27.845
PG16: nclients = 4, execution time = 40.575
PG16: nclients = 8, execution time = 24.379
PG16: nclients = 16, execution time = 15.835
PG16: nclients = 32, execution time = 8.370
PG16: nclients = 64, execution time = 4.042
PG16: nclients = 128, execution time = 2.956
PG16: nclients = 256, execution time = 2.591

Tests with test.c program:

$ rm -f test.data; time ./test test.data 0
total   20
fallocate   0
filewrite   20

real0m0.709s
user0m0.057s
sys 0m0.649s

$ rm -f test.data; time ./test test.data 1
total   20
fallocate   20
filewrite   0

real0m0.853s
user0m0.058s
sys 0m0.791s

$ rm -f test.data; time ./test test.data 2
total   20
fallocate   10
filewrite   10

real2m10.002s
user0m0.366s
sys 0m6.649s

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
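
The test.c program used above is not included in this archive; a hypothetical
reconstruction of the kind of test described (extend a file in 8kB chunks with
pwrite() only, posix_fallocate() only, or alternating between the two) might
look like this. The block count and the exact alternation pattern are guesses,
since the counters printed above are garbled in this archive:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ  8192
#define NBLOCKS 200000          /* guess at the original block count */

int
main(int argc, char **argv)
{
    char        block[BLCKSZ];
    long        n_fallocate = 0;
    long        n_filewrite = 0;
    int         mode;
    int         fd;

    if (argc < 3)
    {
        fprintf(stderr, "usage: %s <file> <mode 0|1|2>\n", argv[0]);
        return 1;
    }
    mode = atoi(argv[2]);
    memset(block, 0, sizeof(block));
    fd = open(argv[1], O_RDWR | O_CREAT, 0644);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }
    for (long i = 0; i < NBLOCKS; i++)
    {
        off_t       offset = (off_t) i * BLCKSZ;
        /* mode 0: always pwrite, mode 1: always fallocate, mode 2: alternate */
        int         use_fallocate = (mode == 1 || (mode == 2 && i % 2 == 0));

        if (use_fallocate)
        {
            int         rc = posix_fallocate(fd, offset, BLCKSZ);

            if (rc != 0)
                fprintf(stderr, "posix_fallocate: %s\n", strerror(rc));
            n_fallocate++;
        }
        else
        {
            if (pwrite(fd, block, BLCKSZ, offset) != BLCKSZ)
                perror("pwrite");
            n_filewrite++;
        }
    }
    printf("total\t\t%ld\nfallocate\t%ld\nfilewrite\t%ld\n",
           n_fallocate + n_filewrite, n_fallocate, n_filewrite);
    close(fd);
    return 0;
}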




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-10 Thread Andres Freund
Hi,

On 2023-07-03 11:53:56 +0200, Jakub Wartak wrote:
> Out of curiosity I've tried and it is reproducible as you have stated : XFS
> @ 4.18.0-425.10.1.el8_7.x86_64:
>...
> According to iostat and blktrace -d /dev/sda -o - | blkparse -i - output ,
> the XFS issues sync writes while ext4 does not, xfs looks like constant
> loop of sync writes (D) by kworker/2:1H-kblockd:

That clearly won't go well.  It's not reproducible on newer systems,
unfortunately :(. Or well, fortunately maybe.


I wonder if a trick to avoid this could be to memorialize the fact that we
bulk extended before and extend by that much going forward? That'd avoid the
swapping back and forth.


One thing that confuses me is that Sawada-san observed a regression at a
single client - yet from what I can tell, there's actually not a single
fallocate() being generated in that case, because the table is so narrow that
we never end up extending by a sufficient number of blocks in
RelationAddBlocks() to reach that path. Yet there's a regression at 1 client.

I don't yet have a RHEL8 system at hand, could either of you send the result
of a
  strace -c -p $pid_of_backend_doing_copy

Greetings,

Andres Freund




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-10 Thread Andres Freund
Hi,

On 2023-07-03 11:59:38 +0900, Masahiko Sawada wrote:
> On Mon, Jul 3, 2023 at 11:55 AM Masahiko Sawada  wrote:
> >
> > After further investigation, the performance degradation comes from
> > calling posix_fallocate() (called via FileFallocate()) and pwritev()
> > (called via FileZero) alternatively depending on how many blocks we
> > extend by. And it happens only on the xfs filesystem.
>
> FYI, the attached simple C program proves the fact that calling
> alternatively posix_fallocate() and pwrite() causes slow performance
> on posix_fallocate():
>
> $ gcc -o test test.c
> $ time ./test test.1 1
> total   20
> fallocate   20
> filewrite   0
>
> real0m1.305s
> user0m0.050s
> sys 0m1.255s
>
> $ time ./test test.2 2
> total   20
> fallocate   10
> filewrite   10
>
> real1m29.222s
> user0m0.139s
> sys 0m3.139s

On an xfs filesystem, with a very recent kernel:

time /tmp/msw_test /srv/dev/fio/msw 0
total   20
fallocate   0
filewrite   20

real0m0.456s
user0m0.017s
sys 0m0.439s


time /tmp/msw_test /srv/dev/fio/msw 1
total   20
fallocate   20
filewrite   0

real0m0.141s
user0m0.010s
sys 0m0.131s


time /tmp/msw_test /srv/dev/fio/msw 2
total   20
fallocate   10
filewrite   10

real0m0.297s
user0m0.017s
sys 0m0.280s


So I don't think I can reproduce your problem on that system...

I also tried adding a fdatasync() into the loop, but that just made things
uniformly slow.


I guess I'll try to dig up whether this is a problem in older upstream
kernels, or whether it's been introduced in RHEL.

Greetings,

Andres Freund




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-10 Thread Andres Freund
Hi,

On 2023-07-03 11:55:13 +0900, Masahiko Sawada wrote:
> While testing PG16, I observed that in PG16 there is a big performance
> degradation in concurrent COPY into a single relation with 2 - 16
> clients in my environment. I've attached a test script that measures
> the execution time of COPYing 5GB data in total to the single relation
> while changing the number of concurrent insertions, in PG16 and PG15.
> Here are the results on my environment (EC2 instance, RHEL 8.6, 128
> vCPUs, 512GB RAM):

Gah, RHEL with its frankenkernels, the bane of my existence.

FWIW, I had extensively tested this with XFS, just with a newer kernel. Have
you tested this on RHEL9 as well by any chance?

Greetings,

Andres Freund




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-10 Thread Andres Freund
Hi,

On 2023-07-10 15:25:41 +0200, Alvaro Herrera wrote:
> On 2023-Jul-03, Masahiko Sawada wrote:
> 
> > While testing PG16, I observed that in PG16 there is a big performance
> > degradation in concurrent COPY into a single relation with 2 - 16
> > clients in my environment. I've attached a test script that measures
> > the execution time of COPYing 5GB data in total to the single relation
> > while changing the number of concurrent insertions, in PG16 and PG15.
> 
> This item came up in the RMT meeting.  Andres, I think this item belongs
> to you, because of commit 00d1e02be2.

I'll take a look - I wasn't even aware of this thread until now.

Greetings,

Andres Freund




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-10 Thread Alvaro Herrera
Hello,

On 2023-Jul-03, Masahiko Sawada wrote:

> While testing PG16, I observed that in PG16 there is a big performance
> degradation in concurrent COPY into a single relation with 2 - 16
> clients in my environment. I've attached a test script that measures
> the execution time of COPYing 5GB data in total to the single relation
> while changing the number of concurrent insertions, in PG16 and PG15.

This item came up in the RMT meeting.  Andres, I think this item belongs
to you, because of commit 00d1e02be2.

The regression seems serious enough at low client counts:

> * PG15 (4b15868b69)
> PG15: nclients = 1, execution time = 14.181
> PG15: nclients = 2, execution time = 9.319
> PG15: nclients = 4, execution time = 5.872
> PG15: nclients = 8, execution time = 3.773
> PG15: nclients = 16, execution time = 3.202

> * PG16 (c24e9ef330)
> PG16: nclients = 1, execution time = 17.112
> PG16: nclients = 2, execution time = 14.084
> PG16: nclients = 4, execution time = 27.997
> PG16: nclients = 8, execution time = 10.554
> PG16: nclients = 16, execution time = 7.074

So the fact that the speed has clearly gone up at larger client counts
is not an excuse for not getting it fixed, XFS-specificity
notwithstanding.

> The relevant commit is 00d1e02be2 "hio: Use ExtendBufferedRelBy() to
> extend tables more efficiently". With commit 1cbbee0338 (the commit
> preceding 00d1e02be2), I got better numbers, though it still didn't
> have better scalability:
> 
> PG16: nclients = 1, execution time = 17.444
> PG16: nclients = 2, execution time = 10.690
> PG16: nclients = 4, execution time = 7.010
> PG16: nclients = 8, execution time = 4.282
> PG16: nclients = 16, execution time = 3.373

Well, these numbers are better, but they still look worse than PG15.
I suppose there are other commits that share blame.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"La virtud es el justo medio entre dos defectos" (Aristóteles)




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-03 Thread Jakub Wartak
Hi Masahiko,

Out of curiosity I've tried it, and it is reproducible as you have
stated: XFS @ 4.18.0-425.10.1.el8_7.x86_64:

[root@rockyora ~]# time ./test test.1 1
total   20
fallocate   20
filewrite   0

real    0m5.868s
user    0m0.035s
sys     0m5.716s
[root@rockyora ~]# time ./test test.2 2
total   20
fallocate   10
filewrite   10

real    0m25.858s
user    0m0.108s
sys     0m3.596s
[root@rockyora ~]# time ./test test.3 2
total   20
fallocate   10
filewrite   10

real    0m25.927s
user    0m0.091s
sys     0m3.621s
[root@rockyora ~]# time ./test test.4 1
total   20
fallocate   20
filewrite   0

real    0m3.044s
user    0m0.043s
sys     0m2.934s

According to iostat and the output of blktrace -d /dev/sda -o - |
blkparse -i -, XFS issues sync writes while ext4 does not; xfs looks
like a constant loop of sync writes (D) by kworker/2:1H-kblockd:
[..]
  8,0    2    34172    24.115364875   312  D  WS 44624928 + 16 [kworker/2:1H]
  8,0    2    34173    24.115482679     0  C  WS 44624928 + 16 [0]
  8,0    2    34174    24.115548251  6501  A  WS 42525760 + 16 <- (253,0) 34225216
  8,0    2    34175    24.115548660  6501  A  WS 44624960 + 16 <- (8,2) 42525760
  8,0    2    34176    24.115549111  6501  Q  WS 44624960 + 16 [test]
  8,0    2    34177    24.115551351  6501  G  WS 44624960 + 16 [test]
  8,0    2    34178    24.115552111  6501  I  WS 44624960 + 16 [test]
  8,0    2    34179    24.115559713   312  D  WS 44624960 + 16 [kworker/2:1H]
  8,0    2    34180    24.115677217     0  C  WS 44624960 + 16 [0]
  8,0    2    34181    24.115743150  6501  A  WS 42525792 + 16 <- (253,0) 34225248
  8,0    2    34182    24.115743502  6501  A  WS 44624992 + 16 <- (8,2) 42525792
  8,0    2    34183    24.115743949  6501  Q  WS 44624992 + 16 [test]
  8,0    2    34184    24.115746175  6501  G  WS 44624992 + 16 [test]
  8,0    2    34185    24.115746918  6501  I  WS 44624992 + 16 [test]
  8,0    2    34186    24.115754492   312  D  WS 44624992 + 16 [kworker/2:1H]

So it looks like you are onto something.
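
(A quick way to watch just those dispatches, if anyone wants to repeat
this, is to filter the same trace, for example:

    blktrace -d /dev/sda -o - | blkparse -i - | grep -E ' D +WS '

the device name is of course specific to this box.)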

Regards,
-J.


Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-03 Thread Masahiko Sawada
On Mon, Jul 3, 2023 at 4:36 PM Heikki Linnakangas  wrote:
>
> On 03/07/2023 05:59, Masahiko Sawada wrote:
> > On Mon, Jul 3, 2023 at 11:55 AM Masahiko Sawada  
> > wrote:
> >>
> >> After further investigation, the performance degradation comes from
> >> calling posix_fallocate() (called via FileFallocate()) and pwritev()
> >> (called via FileZero()) alternately, depending on how many blocks we
> >> extend by. And it happens only on the xfs filesystem.
> >
> > FYI, the attached simple C program demonstrates that alternately
> > calling posix_fallocate() and pwrite() makes posix_fallocate()
> > slow:
> >
> > $ gcc -o test test.c
> > $ time ./test test.1 1
> > total   20
> > fallocate   20
> > filewrite   0
> >
> > real    0m1.305s
> > user    0m0.050s
> > sys     0m1.255s
> >
> > $ time ./test test.2 2
> > total   20
> > fallocate   10
> > filewrite   10
> >
> > real    1m29.222s
> > user    0m0.139s
> > sys     0m3.139s
>
> This must be highly dependent on the underlying OS and filesystem.

Right. The above results are from a file created on the xfs
filesystem. The kernel version and the xfs filesystem version are:

% uname -rms
Linux 4.18.0-372.9.1.el8.x86_64 x86_64

% sudo xfs_db -r /dev/nvme4n1p2
xfs_db> version
versionnum [0xb4b5+0x18a] =
V5,NLINK,DIRV2,ATTR,ALIGN,LOGV2,EXTFLG,MOREBITS,ATTR2,LAZYSBCOUNT,PROJID32BIT,CRC,FTYPE,FINOBT,SPARSE_INODES,REFLINK

As far as I tested, it happens only on the xfs filesystem (at least
the above version) and doesn't happen on ext4 and ext3 filesystems.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-03 Thread Heikki Linnakangas

On 03/07/2023 05:59, Masahiko Sawada wrote:

On Mon, Jul 3, 2023 at 11:55 AM Masahiko Sawada  wrote:


After further investigation, the performance degradation comes from
calling posix_fallocate() (called via FileFallocate()) and pwritev()
(called via FileZero()) alternately, depending on how many blocks we
extend by. And it happens only on the xfs filesystem.


FYI, the attached simple C program demonstrates that alternately
calling posix_fallocate() and pwrite() makes posix_fallocate() slow:

$ gcc -o test test.c
$ time ./test test.1 1
total   20
fallocate   20
filewrite   0

real    0m1.305s
user    0m0.050s
sys     0m1.255s

$ time ./test test.2 2
total   20
fallocate   10
filewrite   10

real    1m29.222s
user    0m0.139s
sys     0m3.139s


This must be highly dependent on the underlying OS and filesystem. I'm 
not seeing that effect on my laptop:


/data$ time /tmp/test test.0 0
total   20
fallocate   0
filewrite   20

real    0m1.856s
user    0m0.140s
sys     0m1.688s
/data$ time /tmp/test test.1 1
total   20
fallocate   20
filewrite   0

real    0m1.335s
user    0m0.156s
sys     0m1.179s
/data$ time /tmp/test test.2 2
total   20
fallocate   10
filewrite   10

real    0m2.159s
user    0m0.165s
sys     0m1.880s

/data$ uname -a
Linux heikkilaptop 6.0.0-6-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.0.12-1 
(2022-12-09) x86_64 GNU/Linux


/data is an nvme drive with ext4 filesystem.

--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-02 Thread Masahiko Sawada
On Mon, Jul 3, 2023 at 11:55 AM Masahiko Sawada  wrote:
>
> After further investigation, the performance degradation comes from
> calling posix_fallocate() (called via FileFallocate()) and pwritev()
> (called via FileZero()) alternately, depending on how many blocks we
> extend by. And it happens only on the xfs filesystem.

FYI, the attached simple C program demonstrates that alternately
calling posix_fallocate() and pwrite() makes posix_fallocate() slow:

$ gcc -o test test.c
$ time ./test test.1 1
total   20
fallocate   20
filewrite   0

real    0m1.305s
user    0m0.050s
sys     0m1.255s

$ time ./test test.2 2
total   20
fallocate   10
filewrite   10

real    1m29.222s
user    0m0.139s
sys     0m3.139s

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    char *filename = argv[1];
    int ratio = atoi(argv[2]);
    char block[8192] = {0};
    int fd;
    int total_len = 0;
    int n_fallocate = 0;
    int n_filewrite = 0;
    int i;

    fd = open(filename, O_RDWR | O_CREAT, S_IRWXU);
    if (fd < 0)
    {
        fprintf(stderr, "could not open file %s: %m\n", filename);
        return 1;
    }

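    /*
     * Extend the file by 8kB per iteration.  With ratio N > 0, every
     * N-th extension uses posix_fallocate() and the rest use pwrite()
     * of zero-filled blocks; ratio 0 means pwrite() only.
     */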
    for (i = 0; i < 20; i++)
    {
        int ret;

        if (ratio != 0 && i % ratio == 0)
        {
            posix_fallocate(fd, total_len, 8192);
            n_fallocate++;
        }
        else
        {
            pwrite(fd, block, 8192, total_len);
            n_filewrite++;
        }
        total_len += 8192;
    }

    printf("total\t%d\n", i);
    printf("fallocate\t%d\n", n_fallocate);
    printf("filewrite\t%d\n", n_filewrite);

    close(fd);
    return 0;
}

Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-02 Thread Masahiko Sawada
Hi all,

While testing PG16, I observed that in PG16 there is a big performance
degradation in concurrent COPY into a single relation with 2 - 16
clients in my environment. I've attached a test script that measures
the execution time of COPYing 5GB data in total to the single relation
while changing the number of concurrent insertions, in PG16 and PG15.
Here are the results on my environment (EC2 instance, RHEL 8.6, 128
vCPUs, 512GB RAM):

* PG15 (4b15868b69)
PG15: nclients = 1, execution time = 14.181
PG15: nclients = 2, execution time = 9.319
PG15: nclients = 4, execution time = 5.872
PG15: nclients = 8, execution time = 3.773
PG15: nclients = 16, execution time = 3.202
PG15: nclients = 32, execution time = 3.023
PG15: nclients = 64, execution time = 3.829
PG15: nclients = 128, execution time = 4.111
PG15: nclients = 256, execution time = 4.158

* PG16 (c24e9ef330)
PG16: nclients = 1, execution time = 17.112
PG16: nclients = 2, execution time = 14.084
PG16: nclients = 4, execution time = 27.997
PG16: nclients = 8, execution time = 10.554
PG16: nclients = 16, execution time = 7.074
PG16: nclients = 32, execution time = 4.607
PG16: nclients = 64, execution time = 2.093
PG16: nclients = 128, execution time = 2.141
PG16: nclients = 256, execution time = 2.202

PG16 has better scalability (at 64 clients and more) but takes much
more time than PG15, especially at 1 - 16 clients.

The relevant commit is 00d1e02be2 "hio: Use ExtendBufferedRelBy() to
extend tables more efficiently". With commit 1cbbee0338 (the commit
preceding 00d1e02be2), I got better numbers, though it still didn't
have better scalability:

PG16: nclients = 1, execution time = 17.444
PG16: nclients = 2, execution time = 10.690
PG16: nclients = 4, execution time = 7.010
PG16: nclients = 8, execution time = 4.282
PG16: nclients = 16, execution time = 3.373
PG16: nclients = 32, execution time = 3.205
PG16: nclients = 64, execution time = 3.705
PG16: nclients = 128, execution time = 4.196
PG16: nclients = 256, execution time = 4.201

While investigating the cause, I found an interesting fact: in
mdzeroextend(), if we use only either FileFallocate() or FileZero(),
we get better numbers. For example, if I always use FileZero() with
the following change:

@@ -574,7 +574,7 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
         * that decision should be made though? For now just use a cutoff of
         * 8, anything between 4 and 8 worked OK in some local testing.
         */
-       if (numblocks > 8)
+       if (false)
        {
                int             ret;

I got:

PG16: nclients = 1, execution time = 16.898
PG16: nclients = 2, execution time = 8.740
PG16: nclients = 4, execution time = 4.656
PG16: nclients = 8, execution time = 2.733
PG16: nclients = 16, execution time = 2.021
PG16: nclients = 32, execution time = 1.693
PG16: nclients = 64, execution time = 1.742
PG16: nclients = 128, execution time = 2.180
PG16: nclients = 256, execution time = 2.296

After further investigation, the performance degradation comes from
calling posix_fallocate() (called via FileFallocate()) and pwritev()
(called via FileZero()) alternately, depending on how many blocks we
extend by. And it happens only on the xfs filesystem. Does anyone
observe a similar performance issue with the attached benchmark
script?
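
(If you run the script, it may be worth first checking which filesystem
backs the data directory, for example (adjust the path):

    df -T /path/to/pgdata
    uname -r

since so far I see the slowdown only on xfs.)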

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
PG15_PGDATA="/home/masahiko/pgsql/15.s/data"
PG16_PGDATA="/home/masahiko/pgsql/16.s/data"
PG15_BIN="/home/masahiko/pgsql/15.s/bin"
PG16_BIN="/home/masahiko/pgsql/16.s/bin"

ROWS=$((100 * 1000 * 1000))
CLIENTS="1 2 4 8 16 32 64 128 256"
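# ROWS is the total number of rows copied per run; for a run with c
# clients, each client copies a pre-generated file of ROWS / c rows, so
# the overall data volume stays constant.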

${PG15_BIN}/pg_ctl stop -D ${PG15_PGDATA} -mi
${PG16_BIN}/pg_ctl stop -D ${PG16_PGDATA} -mi
rm -rf $PG15_PGDATA $PG16_PGDATA

echo "initializing clusters ..."
${PG15_BIN}/initdb -D $PG15_PGDATA -E UTF8 --no-locale
${PG16_BIN}/initdb -D $PG16_PGDATA -E UTF8 --no-locale

cat <<EOF >> ${PG15_PGDATA}/postgresql.conf
port = 5515
max_wal_size = 50GB
shared_buffers = 20GB
max_connections = 500
EOF
cat <<EOF >> ${PG16_PGDATA}/postgresql.conf
port = 5516
max_wal_size = 50GB
shared_buffers = 20GB
max_connections = 500
EOF
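
# Pre-generate one load file per client count (pass "skip_file_init" as
# the first argument to reuse files from a previous run).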

if [ "$1" != "skip_file_init" ]; then
echo "prepare load files..."
${PG16_BIN}/pg_ctl start -D ${PG16_PGDATA}
for c in $CLIENTS
do
	rm -f /tmp/tmp_${c}.data
	${PG16_BIN}/psql -d postgres -p 5516 -X -c "copy (select generate_series(1, $ROWS / $c)) to '/tmp/tmp_${c}.data'"
done
${PG16_BIN}/pg_ctl stop -D ${PG16_PGDATA}
fi

echo "start benchmark ..."
#for version in PG15 PG16
for version in PG16
do
PSQL=""
if [ "$version" == "PG15" ]; then
	PSQL="${PG15_BIN}/psql -p 5515 -d postgres"
	${PG15_BIN}/pg_ctl start -D ${PG15_PGDATA}
else
	PSQL="${PG16_BIN}/psql -p 5516 -d postgres"
	${PG16_BIN}/pg_ctl start -D ${PG16_PGDATA}
fi

${PSQL} -c "create unlogged table test (c int) with (autovacuum_enabled = off)"

for c in $CLIENTS
do
	${PSQL} -c "truncate test" > /dev/null 2>&1

	chileren=()
	start=`date +%s.%3N`