Re: Doc limitation update proposal: include out-of-line OID usage per TOAST-ed columns

2024-05-20 Thread Jakub Wartak
On Tue, May 14, 2024 at 8:19 PM Robert Haas  wrote:
>
> I looked at your version and wrote something that is shorter and
> doesn't touch any existing text. Here it is.

Hi Robert, you are a real tactician here - thanks, whatever references
the original problem works for me! :) Maybe just add a slight hint near
"expensive" (on its own it suggests to me just a CPU problem?):

finding an OID that is still free can become expensive ->
finding an OID that is still free can become expensive, thus
significantly increasing INSERT/UPDATE response time.

? (this potentially makes it easier in the future to pinpoint a user's
problem to this exact limitation; but feel free to ignore that too, as
I'm not attached to any of those versions)

-J.




Re: elog/ereport VS misleading backtrace_function function address

2024-05-14 Thread Jakub Wartak
Hi Peter!

On Sun, May 12, 2024 at 10:33 PM Peter Eisentraut  wrote:
>
> On 07.05.24 09:43, Jakub Wartak wrote:
> > NOTE: in case one will be testing this: one cannot ./configure with
> > --enable-debug as it prevents the compiler optimizations that actually
> > end up with the ".cold" branch optimizations that cause backtrace() to
> > return the wrong address.
>
> Is that configuration useful?  If you're interested in backtraces,
> wouldn't you also want debug symbols?

The use case here is that backtrace_functions is unreliable when we
ask customers to use it on production builds (so it's useful not just
for local tests).

> Don't production builds use debug
> symbols nowadays as well?

Reality is apparently mixed, at least from what I have checked:
- all RHEL 7.x/8.x builds (PGDG and our forks) do NOT come with
--enable-debug according to pg_config;
- on Debian 11/12, PGDG does come with --enable-debug.

> >> I recall speculating about whether we could somehow invoke gdb
> >> to get a more comprehensive and accurate backtrace, but I don't
> >> really have a concrete idea how that could be made to work.
> > Oh no, I'm more about micro-fix rather than doing some bigger
> > revolution. The patch simply adds this one instruction in macro, so
> > that now backtrace_functions behaves better:
> >
> > 0x00773d28 <+105>:   call   0x79243f 
> > 0x00773d2d <+110>:   movzbl -0x12(%rbp),%eax  << this ends
> > up being added by the patch
> > 0x00773d31 <+114>:   call   0xdc1a0 
> >
> > I'll put that as for PG18 in CF, but maybe it could be backpatched too
> > - no hard opinion on that (I don't see how that ERROR/FATAL path could
> > cause any performance regressions)
>
> I'm missing a principled explanation of what this does.  I just see that
> it sprinkles some no-op code to make this particular thing work a bit
> differently, but this might just behave differently with the next
> compiler next year.  I'd like to see a bit more of an abstract
> explanation of what is happening here.

OK, I'll try to explain using assembly, but I'm not an expert on this.
Let's go back to the first post and assume we run with backtrace_functions set:

get_collation_isdeterministic() spans 0x5c7680 to 0x5c76c1, but it
might jump(!) to 0x14c686 - note the two completely different address
ranges (a hot one and a separate cold one). That's because of commit
913ec71d682, which added the cold-branch optimization via
pg_attribute_cold on errstart_cold.

Dump of assembler code for function get_collation_isdeterministic:
Address range 0x5c7680 to 0x5c76c1:
   0x005c7680 <+0>:     push   %rbp
[..]
   0x005c7694 <+20>:    call   0x5d63b0
   0x005c7699 <+25>:    test   %rax,%rax
   0x005c769c <+28>:    je     0x14c686

[..]
   0x005c76b3 <+51>:    call   0x5d6430
   0x005c76b8 <+56>:    mov    %r12d,%eax
   0x005c76bb <+59>:    mov    -0x8(%rbp),%r12
   0x005c76bf <+63>:    leave
   0x005c76c0 <+64>:    ret
Address range 0x14c686 to 0x14c6bb:
   // note here that the last instruction is 0x14c6b6, not 0x14c6bb(!)
   // note this cold path also has no frame pointer setup
   0x0014c686 <-4698106>:   xor    %esi,%esi
   0x0014c688 <-4698104>:   mov    $0x15,%edi
   0x0014c68d <-4698099>:   call   0x14ec86
   0x0014c692 <-4698094>:   mov    %r12d,%esi
   0x0014c695 <-4698091>:   lea    0x5028dc(%rip),%rdi        # 0x64ef78
   0x0014c69c <-4698084>:   xor    %eax,%eax
   0x0014c69e <-4698082>:   call   0x5df320
   0x0014c6a3 <-4698077>:   lea    0x6311a6(%rip),%rdx        # 0x77d850 <__func__.34>
   0x0014c6aa <-4698070>:   mov    $0x440,%esi
   0x0014c6af <-4698065>:   lea    0x630c8a(%rip),%rdi        # 0x77d340
   0x0014c6b6 <-4698058>:   call   0x5df100
End of assembler dump.

so the call chain is:
get_collation_isdeterministic() ->
get_collation_isdeterministic.cold() [but not a real function ->
no proper frame pointer/stack setup?] ->
.. ->
errfinish() ->
set_backtrace()  // just builds and appends a string using
backtrace()/backtrace_symbols() (filtered by the backtrace_functions
GUC), prints/logs it, and finishes
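
For anyone unfamiliar with those glibc calls, here is a tiny standalone
sketch - not PostgreSQL code, just an assumed minimal use of
backtrace()/backtrace_symbols() - whose key detail is that the captured
frames are *return* addresses, i.e. one instruction past each call; if a
call is the very last instruction of a function (or of its .cold part),
that address already belongs to the next symbol:

#include <execinfo.h>
#include <stdio.h>
#include <stdlib.h>

static void
show_backtrace(void)
{
    void   *frames[32];
    int     nframes = backtrace(frames, 32);
    /* build with gcc -rdynamic so non-static function names resolve */
    char  **symbols = backtrace_symbols(frames, nframes);

    if (symbols == NULL)
        return;

    /*
     * Each frames[i] (except the innermost) is the return address, i.e.
     * the instruction *after* the call - which is why resolving it can
     * land in the following symbol when the call ends a code region.
     */
    for (int i = 0; i < nframes; i++)
        printf("%s\n", symbols[i]);

    free(symbols);
}

int
main(void)
{
    show_backtrace();
    return 0;
}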

It seems that the problem is that the limited GNU libc
backtrace_symbols() won't resolve the 0x14c6b6..0x14c6bb range to the
"get_collation_isdeterministic[.cold]" symbol name, and it will just
put the raw 0x14c6bb into the output text. It's wrong and it is
confusing:

2024-03-27 14:39:13.065 CET [12899] postgres(a

Re: elog/ereport VS misleading backtrace_function function address

2024-05-07 Thread Jakub Wartak
Hi Tom and -hackers!

On Thu, Mar 28, 2024 at 7:36 PM Tom Lane  wrote:
>
> Jakub Wartak  writes:
> > While chasing some other bug I've learned that backtrace_functions
> > might be misleading with top elog/ereport() address.
>
> That was understood from the beginning: this type of backtrace is
> inherently pretty imprecise, and I doubt there is much that can
> be done to make it better.  IIRC the fundamental problem is it only
> looks at global symbols, so static functions inherently defeat it.
> It was argued that this is better than nothing, which is true, but
> you have to take the info with a mountain of salt.

OK, point taken, thanks for describing the limitations. Still, I find
backtrace_functions often the best thing we have, primarily due to its
simplicity and ease of use (for customers). I found the simplest
portable way to generate a NOP (or any other instruction that makes the
problem go away):

with the reproducer, psql returns:

psql:ri-collation-bug-example.sql:48: error: ERROR:  XX000: cache
lookup failed for collation 0
LOCATION:  get_collation_isdeterministic, lsyscache.c:1062

logfile without patch:

2024-05-07 09:05:43.043 CEST [44720] ERROR:  cache lookup failed for collation 0
2024-05-07 09:05:43.043 CEST [44720] BACKTRACE:
postgres: postgres postgres [local] DELETE(+0x155883) [0x55ce5a86a883]
postgres: postgres postgres [local] DELETE(RI_FKey_cascade_del+0x2b0) [0x55ce5ac6dfd0]
postgres: postgres postgres [local] DELETE(+0x2d488b) [0x55ce5a9e988b]
[..]

$ addr2line -e /usr/pgsql18/bin/postgres  -a -f 0x155883
0x00155883
get_constraint_type.cold <<<<<< so it's wrong as the real function
should be get_collation_isdeterministic

logfile with attached patch:

2024-05-07 09:11:06.596 CEST [51185] ERROR:  cache lookup failed for collation 0
2024-05-07 09:11:06.596 CEST [51185] BACKTRACE:
postgres: postgres postgres [local] DELETE(+0x168bf0) [0x560e1cdfabf0]
postgres: postgres postgres [local] DELETE(RI_FKey_cascade_del+0x2b0) [0x560e1d200c00]
postgres: postgres postgres [local] DELETE(+0x2e90fb) [0x560e1cf7b0fb]
[..]

$ addr2line -e /usr/pgsql18/bin/postgres  -a -f 0x168bf0
0x00168bf0
get_collation_isdeterministic.cold <<< It's ok with the patch

NOTE: in case one will be testing this: one cannot ./configure with
--enable-debug as it prevents the compiler optimizations that actually
end up with the ".cold" branch optimizations that cause backtrace() to
return the wrong address.

> I recall speculating about whether we could somehow invoke gdb
> to get a more comprehensive and accurate backtrace, but I don't
> really have a concrete idea how that could be made to work.

Oh no, I'm more for a micro-fix rather than doing some bigger
revolution. The patch simply adds one instruction in the macro, so
that backtrace_functions now behaves better:

   0x00773d28 <+105>:   call   0x79243f
   0x00773d2d <+110>:   movzbl -0x12(%rbp),%eax   << this ends up being added by the patch
   0x00773d31 <+114>:   call   0xdc1a0

I'll put that into the CF for PG18, but maybe it could be backpatched
too - no hard opinion on that (I don't see how that ERROR/FATAL path
could cause any performance regressions).

-J.


v1-0001-Tweak-ereport-so-that-it-generates-proper-address.patch
Description: Binary data


Re: apply_scanjoin_target_to_paths and partitionwise join

2024-05-06 Thread Jakub Wartak
Hi Ashutosh & hackers,

On Mon, Apr 15, 2024 at 9:00 AM Ashutosh Bapat
 wrote:
>
> Here's patch with
>
[..]
> Adding to the next commitfest but better to consider this for the next set of 
> minor releases.

1. The patch does not pass cfbot -
https://cirrus-ci.com/task/5486258451906560 on master due to test
failure "not ok 206   + partition_join"

2. Without the patch applied, the result of the meson test run on master
was clean (no failures, so master is fine). After applying the patch
there were some expected hunk offsets (as the patch was created for
15_STABLE):

patching file src/backend/optimizer/plan/planner.c
Hunk #1 succeeded at 7567 (offset 468 lines).
Hunk #2 succeeded at 7593 (offset 468 lines).
patching file src/test/regress/expected/partition_join.out
Hunk #1 succeeded at 4777 (offset 56 lines).
Hunk #2 succeeded at 4867 (offset 56 lines).
patching file src/test/regress/sql/partition_join.sql
Hunk #1 succeeded at 1136 (offset 1 line).

3. Without the patch there is a performance regression/bug on master (the
cost is higher with enable_partitionwise_join=on than without it):

data preparation:
-- Test the process_outer_partition() code path
CREATE TABLE plt1_adv (a int, b int, c text) PARTITION BY LIST (c);
CREATE TABLE plt1_adv_p1 PARTITION OF plt1_adv FOR VALUES IN ('0000', '0001', '0002');
CREATE TABLE plt1_adv_p2 PARTITION OF plt1_adv FOR VALUES IN ('0003', '0004');
INSERT INTO plt1_adv SELECT i, i, to_char(i % 5, 'FM0000') FROM generate_series(0, 24) i;
ANALYZE plt1_adv;

CREATE TABLE plt2_adv (a int, b int, c text) PARTITION BY LIST (c);
CREATE TABLE plt2_adv_p1 PARTITION OF plt2_adv FOR VALUES IN ('0002');
CREATE TABLE plt2_adv_p2 PARTITION OF plt2_adv FOR VALUES IN ('0003', '0004');
INSERT INTO plt2_adv SELECT i, i, to_char(i % 5, 'FM0000') FROM generate_series(0, 24) i WHERE i % 5 IN (2, 3, 4);
ANALYZE plt2_adv;

CREATE TABLE plt3_adv (a int, b int, c text) PARTITION BY LIST (c);
CREATE TABLE plt3_adv_p1 PARTITION OF plt3_adv FOR VALUES IN ('0001');
CREATE TABLE plt3_adv_p2 PARTITION OF plt3_adv FOR VALUES IN ('0003', '0004');
INSERT INTO plt3_adv SELECT i, i, to_char(i % 5, 'FM0000') FROM generate_series(0, 24) i WHERE i % 5 IN (1, 3, 4);
ANALYZE plt3_adv;

off:
EXPLAIN SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM (plt1_adv t1
LEFT JOIN plt2_adv t2 ON (t1.c = t2.c)) FULL JOIN plt3_adv t3 ON (t1.c
= t3.c) WHERE coalesce(t1.a, 0) % 5 != 3 AND coalesce(t1.a, 0) % 5 !=
4 ORDER BY t1.c, t1.a, t2.a, t3.a;
  QUERY PLAN
---
 Sort  (cost=22.02..22.58 rows=223 width=27)
   Sort Key: t1.c, t1.a, t2.a, t3.a
   ->  Hash Full Join  (cost=4.83..13.33 rows=223 width=27)
[..]


with enable_partitionwise_join=ON (see the jump from cost 22.02 -> 27.65):
EXPLAIN SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM (plt1_adv t1
LEFT JOIN plt2_adv t2 ON (t1.c = t2.c)) FULL JOIN plt3_adv t3 ON (t1.c
= t3.c) WHERE coalesce(t1.a, 0) % 5 != 3 AND coalesce(t1.a, 0) % 5 !=
4 ORDER BY t1.c, t1.a, t2.a, t3.a;
  QUERY PLAN
---
 Sort  (cost=27.65..28.37 rows=289 width=27)
   Sort Key: t1.c, t1.a, t2.a, t3.a
   ->  Append  (cost=2.23..15.83 rows=289 width=27)
 ->  Hash Full Join  (cost=2.23..4.81 rows=41 width=27)
[..]
 ->  Hash Full Join  (cost=2.45..9.57 rows=248 width=27)
[..]


However with the patch applied the plan with minimal cost is always
chosen ("22"):

explain SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM (plt1_adv t1 LEFT JOIN
plt2_adv t2 ON (t1.c = t2.c)) FULL JOIN plt3_adv t3 ON (t1.c = t3.c) WHERE
coalesce(t1.a, 0) % 5 != 3 AND coalesce(t1.a, 0) % 5 != 4 ORDER BY
t1.c, t1.a, t2.a, t3.a;
  QUERY PLAN
---
 Sort  (cost=22.02..22.58 rows=223 width=27)
   Sort Key: t1.c, t1.a, t2.a, t3.a
   ->  Hash Full Join  (cost=4.83..13.33 rows=223 width=27)
[..]


set enable_partitionwise_join to on;
explain SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM (plt1_adv t1 LEFT JOIN
plt2_adv t2 ON (t1.c = t2.c)) FULL JOIN plt3_adv t3 ON (t1.c = t3.c) WHERE
coalesce(t1.a, 0) % 5 != 3 AND coalesce(t1.a, 0) % 5 != 4 ORDER BY
t1.c, t1.a, t2.a, t3.a;
  QUERY PLAN
---
 Sort  (cost=22.02..22.58 rows=223 width=27)
   Sort Key: t1.c, t1.a, t2.a, t3.a
   ->  Hash Full Join  (cost=4.83..13.33 rows=223 width=27)
[..]

With the patch applied, whether the toggle is on or off, the chosen
cost always stays the minimal of the available ones. We cannot
provide a reproducer for the real performance regression, but for the
affected customer it took 530+s (with enable_partitionwise_join=on)
and without that GUC it was 

Re: GUC-ify walsender MAX_SEND_SIZE constant

2024-04-24 Thread Jakub Wartak
Hi,

> My understanding of Majid's use-case for tuning MAX_SEND_SIZE is that the
> bottleneck is storage, not network. The reason MAX_SEND_SIZE affects that is
> that it determines the max size passed to WALRead(), which in turn determines
> how much we read from the OS at once.  If the storage has high latency but
> also high throughput, and readahead is disabled or just not aggressive enough
> after crossing segment boundaries, larger reads reduce the number of times
> you're likely to be blocked waiting for read IO.
>
> Which is also why I think that making MAX_SEND_SIZE configurable is a really
> poor proxy for improving the situation.
>
> We're imo much better off working on read_stream.[ch] support for reading WAL.

Well, then that would be a consistent message at least, because earlier,
in [1], a patch to prefetch WAL segments (but on the standby side) was
rejected, where it was only helping configurations that had readahead
*disabled* for some reason.

Now, Majid stated that he uses "RBD" - Majid, any chance you could
specify what that RBD really is? What's the tech? What fs? Any ioping or
fio results? What's the blockdev --report /dev/XXX output? (You stated
"high" latency and "high" bandwidth, but that is relative; for me 15ms+
is high latency and >1000MB/s is high sequential bandwidth. It would
help others in the future if you could give the exact numbers, please.)
Maybe it's just a matter of enabling readahead (like in [1]) there
and/or using a larger WAL segment size during initdb.

[1] - 
https://www.postgresql.org/message-id/flat/CADVKa1WsQMBYK_02_Ji%3DpbOFnms%2BCT7TV6qJxDdHsFCiC9V_hw%40mail.gmail.com




Re: GUC-ify walsender MAX_SEND_SIZE constant

2024-04-23 Thread Jakub Wartak
On Tue, Apr 23, 2024 at 2:24 AM Michael Paquier  wrote:
>
> On Mon, Apr 22, 2024 at 03:40:01PM +0200, Majid Garoosi wrote:
> > Any news, comments, etc. about this thread?
>
> FWIW, I'd still be in favor of doing a GUC-ification of this part, but
> at this stage I'd need more time to do a proper study of a case where
> this shows benefits to prove your point, or somebody else could come
> in and show it.
>
> Andres has objected to this change, on the ground that this was not
> worth it, though you are telling the contrary.  I would be curious to
> hear from others, first, so as we gather more opinions to reach a
> consensus.

I'm more with Andres on this one. I vaguely remember researching the
impact of MAX_SEND_SIZE on two independent occasions (an earlier async
case and a more recent sync case where I investigated a variety of ways
to keep latency down), and my summary would be:

First: it's very hard to get a *reliable* replication setup for
benchmarking, where one could demonstrate a correlation between e.g.
increasing MAX_SEND_SIZE and observing benefits (in sync rep it is
easier, as you are simply stalled in pgbench). Part of the problem is
the following:

a) the workload can be tricky; for this purpose it needs to be trivial but bulky
b) it needs to be on an isolated network and with guaranteed bandwidth
c) the wal_init_zero impact should be ruled out
d) the OS should be properly tuned: TCP autotuning max (the 3rd value) and
rmem_max/wmem_max set up properly
e) a more serious TCP congestion control algorithm should be used than the
default one in the OS
f) one should prevent any I/O stalls on walreceiver writeback during
huge WAL activity and restart points on the standby (dirty_bytes and so
on, isolated pg_xlog, BDI max_ratio)

Second: once you perform all of the above and ensure that there are no
network or I/O stalls, then - back then at least - I *think* I couldn't
see any impact of playing with MAX_SEND_SIZE, from what I remember,
probably because something else gets saturated first.

I can offer help with trying to test this with artificial tests and
even with injecting proper (WAN-like) latency, but I think the OP
(Majid) first needs to describe his environment much better (exact
latency, bandwidth, workload, TCP sysctl values, duration of the tests,
number of attempts tried, exact commands used, iperf3 TCP results
demonstrating the hardware used, and so on). So in short: the patch is
easy, but demonstrating the effect and persuading others here would be
hard.

-J.




Re: incremental backup breakage in BlockRefTableEntryGetBlocks

2024-04-05 Thread Jakub Wartak
On Thu, Apr 4, 2024 at 9:11 PM Tomas Vondra
 wrote:
>
> On 4/4/24 19:38, Robert Haas wrote:
> > Hi,
> >
> > Yesterday, Tomas Vondra reported to me off-list that he was seeing
> > what appeared to be data corruption after taking and restoring an
> > incremental backup. Overnight, Jakub Wartak further experimented with
> > Tomas's test case, did some initial analysis, and made it very easy to
> > reproduce. I spent this morning tracking down the problem, for which I
> > attach a patch.
> >
>
> Thanks, I can confirm this fixes the issue I've observed/reported. On
> master 10 out of 10 runs failed, with the patch no failures.

Same here, the patch fixes it on recent master. I've also run pgbench
for ~30 minutes and compared the original and the incrementally-restored
cluster and got 0 differences, so it should be good.

> The test is very simple:
>
> 1) init pgbench

Tomas had magic fingers here - he used pgbench -i -s 100, which creates
bigger relations (it wouldn't trigger for smaller -s values, as Robert
explained - now it makes full sense; in earlier tests I was using a much
smaller -s, then transitioned to other workloads (mostly append-only),
and the final 100GB+/24h+ tests used mostly INSERTs rather than UPDATEs,
AFAIR). The other interesting thing is that one of the animals runs with
configure --with-relsegsize= (so new relations fill up much earlier) and
it was not caught there either - wouldn't it be a good idea to add a
test like that in src/test/recovery/?

And of course I'm attaching the reproducer with some braindump notes, in
case in the future someone hits a similar issue and wonders where to
even start looking (it's very primitive, though, but might help).

-J.
#!/bin/bash
rm -rf tmp
mkdir -p /tmp/wals
pg_ctl -D tmp/data init
echo "summarize_wal = on" >> tmp/data/postgresql.conf
#optional
echo "autovacuum = off" >> tmp/data/postgresql.conf
#optional (gather WAL for analysis)
echo "archive_mode = on" >> tmp/data/postgresql.conf
echo "archive_command = 'cp %p /tmp/wals/%f'" >> tmp/data/postgresql.conf
#optional (no FPW in pg_waldump)
#echo "full_page_writes = off" >> tmp/data/postgresql.conf

#below steps by Tomas:
pg_ctl -D tmp/data -l tmp/pg.log start
createdb test
pgbench -i -s 100 test
pg_basebackup -D tmp/full -c fast
pgbench -t 1000 test
pg_basebackup -D tmp/increment-1 --increment=tmp/full/backup_manifest -c fast
pg_dumpall > tmp/dumpall-source.log
pg_ctl -D tmp/data -l tmp/pg.log stop
pg_verifybackup tmp/full
pg_verifybackup tmp/increment-1/
pg_combinebackup -d -o tmp/restored tmp/full tmp/increment-1/ > tmp/combinebackup 2>&1
pg_ctl -D tmp/restored/ -l tmp/pg.log start
pg_dumpall > tmp/dumpall-restored.log
#pg_ctl -D tmp/restored/ -l tmp/pg.log stop
diff -u tmp/dumpall-source.log tmp/dumpall-restored.log | wc -l
exit 0

-8<-8<-8<-8<-8<-

it gives:

postgres@test:5432 : 54750 # select ctid,aid,bid,abalance from pgbench_accounts 
where aid in (select aid from pgbench_accounts group by aid, bid having 
count(*) > 1 order by aid, bid limit 1);
ctid|   aid   | bid | abalance
+-+-+--
 (65690,55) | 4007145 |  41 |0
(1 row)

postgres@test:5432 : 54750 # set enable_bitmapscan to off;
SET
postgres@test:5432 : 54750 # set enable_indexscan to off;
SET
postgres@test:5432 : 54750 # select ctid,aid,bid,abalance from pgbench_accounts 
where aid in (select aid from pgbench_accounts group by aid, bid having 
count(*) > 1 order by aid, bid limit 1);
ctid |   aid   | bid | abalance
-+-+-+--
 (65690,55)  | 4007145 |  41 |0
 (163943,49) | 4007145 |  41 |  638
(2 rows)

postgres@test:5432 : 54750 # select relfilenode from pg_class where relname  
='pgbench_accounts';
  relfilenode
-
   16400
(1 row)

postgres@test:5432 : 54750 #

   
   
$ for f in tmp/data/pg_wal/summaries/* ; do >&2 echo "$f"; pg_walsummary -i $f ; done | grep -e 65690 -e 163943
tmp/data/pg_wal/summaries/00010128010B1AA0.summary
tmp/data/pg_wal/summaries/0001010B1AA0014E6538.summary
tmp/data/pg_wal/summaries/0001014E6538014E6638.summary
tmp/data/pg_wal/summaries/0001014E6638014ED020.summary
tmp/data/pg_wal/summaries/0001014ED020014ED120.summary
tmp/data/pg_wal/summaries/0001014ED120014ED458.summary
tmp/data/pg_wal/summaries/0001014ED45822403570.summary
TS 1663, DB 16384, REL 16400, FORK main: block 65690
tmp/data/pg_wal/summaries/00012240357043431F50.summary
tmp/data/pg_wal/summaries/000143431F504FD8.summary

Re: pg_combinebackup --copy-file-range

2024-04-04 Thread Jakub Wartak
On Thu, Apr 4, 2024 at 12:56 AM Tomas Vondra
 wrote:
>
> Hi,
>
> Here's a much more polished and cleaned up version of the patches,
> fixing all the issues I've been aware of, and with various parts merged
> into much more cohesive parts (instead of keeping them separate to make
> the changes/evolution more obvious).

OK, so three runs of the incremental backup tests - as stated earlier -
also passed OK for v20240403 (this time even with --enable-cassert).

pg_combinebackup flags tested were:
1) --copy-file-range --manifest-checksums=CRC32C
2) --copy-file-range --manifest-checksums=NONE
3) default, no flags (no copy-file-range)

> I changed how I think about this a bit - I don't really see the CoW copy
> methods as necessary faster than the regular copy (even though it can be
> with (5)). I think the main benefits are the space savings, enabled by
> patches (1)-(3). If (4) and (5) get it, that's a bonus, but even without
> that I don't think the performance is an issue - everything has a cost.

I take it differently: incremental backups without a CoW fs would clearly be:
- inefficient in terms of RTO (SANs are often a key thing due to the
ability to quickly "restore" a clone rather than copying the data from
somewhere else)
- unusable with pg_basebackup without the space savings (e.g. imagine
daily backups @ 10+TB DWHs)

> On 4/3/24 15:39, Jakub Wartak wrote:
> > On Mon, Apr 1, 2024 at 9:46 PM Tomas Vondra
> >  wrote:
[..]
> Thanks. Would be great if you could run this on the attached version of
> the patches, ideally for each of them independently, so make sure it
> doesn't get broken+fixed somewhere on the way.

Those are semi-manual test runs (~30 min? per run), and the above results
are for all of the patches applied at once. So my take is that if all of
them work applied together, each one does individually too.

FWIW, I'm also testing your other off-list incremental backup
corruption issue, but that doesn't seem to be related in any way to the
copy_file_range() patches here.

> >> 2) prefetch
> >> ---
> > [..]
> >> I think this means we may need a "--prefetch" option, that'd force
> >> prefetching, probably both before pread and copy_file_range. Otherwise
> >> people on ZFS are doomed and will have poor performance.
> >
> > Right, we could optionally cover in the docs later-on various options
> > to get the performance (on XFS use $this, but without $that and so
> > on). It's kind of madness dealing with all those performance
> > variations.
> >
>
> Yeah, something like that. I'm not sure we want to talk too much about
> individual filesystems in our docs, because those things evolve over
> time too.

Sounds like Wiki then.

BTW, after a quick review: could we have something like a common value
in 05 then (to keep those together via some .h?)

#define BATCH_SIZE PREFETCH_TARGET ?

-J.




Re: pg_combinebackup --copy-file-range

2024-04-03 Thread Jakub Wartak
On Mon, Apr 1, 2024 at 9:46 PM Tomas Vondra
 wrote:
>
> Hi,
>
> I've been running some benchmarks and experimenting with various stuff,
> trying to improve the poor performance on ZFS, and the regression on XFS
> when using copy_file_range. And oh boy, did I find interesting stuff ...

[..]

Congratulations on great results!

> 4) after each pg_combinebackup run to pg_verifybackup, start the cluster
> to finish recovery, run pg_checksums --check (to check the patches don't
> produce something broken)

I've performed some small follow-up testing of all the patches mentioned
here (1..7), with the earlier-developed nano-incremental-backup tests
that helped detect some issues for Robert during the original
development. They all went fine in both cases:

Those were:
test_across_wallevelminimal.sh
test_full_pri__incr_stby__restore_on_pri.sh
test_full_pri__incr_stby__restore_on_stby.sh
test_full_stby__incr_stby__restore_on_pri.sh
test_full_stby__incr_stby__restore_on_stby.sh
test_incr_after_timelineincrease.sh
test_incr_on_standby_after_promote.sh
test_many_incrementals_dbcreate_duplicateOID.sh
test_many_incrementals_dbcreate_filecopy_NOINCR.sh
test_many_incrementals_dbcreate_filecopy.sh
test_many_incrementals_dbcreate.sh
test_many_incrementals.sh
test_multixact.sh
test_pending_2pc.sh
test_reindex_and_vacuum_full.sh
test_repro_assert_RP.sh
test_repro_assert.sh
test_standby_incr_just_backup.sh
test_stuck_walsum.sh
test_truncaterollback.sh
test_unlogged_table.sh

> Now to the findings 
>
>
> 1) block alignment

[..]

> And I think we probably want to do this now, because this affects all
> tools dealing with incremental backups - even if someone writes a custom
> version of pg_combinebackup, it will have to deal with misaligned data.
> Perhaps there might be something like pg_basebackup that "transforms"
> the data received from the server (and also the backup manifest), but
> that does not seem like a great direction.

If anything is on the table, then I think in the far future
pg_refresh_standby_using_incremental_backup_from_primary would be the
only other tool using the format ?

> 2) prefetch
> ---
[..]
> I think this means we may need a "--prefetch" option, that'd force
> prefetching, probably both before pread and copy_file_range. Otherwise
> people on ZFS are doomed and will have poor performance.

Right, we could optionally cover the various options to get the
performance in the docs later on (on XFS use $this, but without $that,
and so on). It's kind of madness dealing with all those performance
variations.

Another idea: remove that hardcoded 128 in the posix_fadvise() call in
0002 and add a getopt variant like --prefetch[=HOWMANY], with 128 as the
default?
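
Something like this rough standalone sketch (not against the patch; the
option name is just the one suggested above) is what I have in mind for
the parsing side:

#include <getopt.h>
#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char **argv)
{
    int     prefetch_target = 0;    /* 0 = prefetching disabled */
    static struct option long_opts[] = {
        {"prefetch", optional_argument, NULL, 1},
        {NULL, 0, NULL, 0}
    };
    int     c;

    while ((c = getopt_long(argc, argv, "", long_opts, NULL)) != -1)
    {
        if (c == 1)
            prefetch_target = optarg ? atoi(optarg) : 128;  /* default 128 */
    }
    printf("prefetch target: %d\n", prefetch_target);
    return 0;
}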

-J.




Re: Doc limitation update proposal: include out-of-line OID usage per TOAST-ed columns

2024-04-03 Thread Jakub Wartak
Hi Andrey,

On Thu, Mar 28, 2024 at 1:09 PM Andrey M. Borodin  wrote:
>
>
>
> > On 8 Aug 2023, at 12:31, John Naylor  wrote:
> >
> > > > Also the shared counter is the cause of the slowdown, but not the 
> > > > reason for the numeric limit.
> > >
> > > Isn't it both? typedef Oid is unsigned int = 2^32, and according to 
> > > GetNewOidWithIndex() logic if we exhaust the whole OID space it will hang 
> > > indefinitely which has the same semantics as "being impossible"/permanent 
> > > hang (?)
> >
> > Looking again, I'm thinking the OID type size is more relevant for the 
> > first paragraph, and the shared/global aspect is more relevant for the 
> > second.
> >
> > The last issue is how to separate the notes at the bottom, since there are 
> > now two topics.
>
> Jakub, do you have plans to address this feedback? Is the CF entry still 
> relevant?

Yes; I had forgotten about this one, and clearly I had problems
formulating it in a proper shape to be accepted. I've moved it to the
next CF now, as this is not critical and I would prefer to help with the
current time-sensitive CF. Anyone is welcome to help amend the patch...

-J.




Re: Use streaming read API in ANALYZE

2024-04-03 Thread Jakub Wartak
On Tue, Apr 2, 2024 at 9:24 AM Nazir Bilal Yavuz  wrote:
[..]
> v4 is rebased on top of v14 streaming read API changes.

Hi Nazir, so with the streaming read API committed, I gave this patch a
try. With autovacuum=off and a 30GB table on NVMe (standard readahead of
256kB, ext4, Debian 12, kernel 6.1.0, default shared_buffers = 128MB),
created using: create table t as select repeat('a', 100) || i
|| repeat('b', 500) as filler from generate_series(1, 4500) as i;

on master, the effect of maintenance_io_concurrency [default 10] looks
like this (when resetting the fs cache after each ANALYZE):

m_io_c = 0:
Time: 3137.914 ms (00:03.138)
Time: 3094.540 ms (00:03.095)
Time: 3452.513 ms (00:03.453)

m_io_c = 1:
Time: 2972.751 ms (00:02.973)
Time: 2939.551 ms (00:02.940)
Time: 2904.428 ms (00:02.904)

m_io_c = 2:
Time: 1580.260 ms (00:01.580)
Time: 1572.132 ms (00:01.572)
Time: 1558.334 ms (00:01.558)

m_io_c = 4:
Time: 938.304 ms
Time: 931.772 ms
Time: 920.044 ms

m_io_c = 8:
Time: 666.025 ms
Time: 660.241 ms
Time: 648.848 ms

m_io_c = 16:
Time: 542.450 ms
Time: 561.155 ms
Time: 539.683 ms

m_io_c = 32:
Time: 538.487 ms
Time: 541.705 ms
Time: 538.101 ms

with patch applied:

m_io_c = 0:
Time: 3106.469 ms (00:03.106)
Time: 3140.343 ms (00:03.140)
Time: 3044.133 ms (00:03.044)

m_io_c = 1:
Time: 2959.817 ms (00:02.960)
Time: 2920.265 ms (00:02.920)
Time: 2911.745 ms (00:02.912)

m_io_c = 2:
Time: 1581.912 ms (00:01.582)
Time: 1561.444 ms (00:01.561)
Time: 1558.251 ms (00:01.558)

m_io_c = 4:
Time: 908.116 ms
Time: 901.245 ms
Time: 901.071 ms

m_io_c = 8:
Time: 619.870 ms
Time: 620.327 ms
Time: 614.266 ms

m_io_c = 16:
Time: 529.885 ms
Time: 526.958 ms
Time: 528.474 ms

m_io_c = 32:
Time: 521.185 ms
Time: 520.713 ms
Time: 517.729 ms

No difference to me, which seems to be good. I've double-checked and
the patch used the new way:

acquire_sample_rows -> heapam_scan_analyze_next_block ->
ReadBufferExtended -> ReadBuffer_common (inlined) -> WaitReadBuffers
-> mdreadv -> FileReadV -> pg_preadv (inlined)
acquire_sample_rows -> heapam_scan_analyze_next_block ->
ReadBufferExtended -> ReadBuffer_common (inlined) -> StartReadBuffer
-> ...

I also gave io_combine_limit = 32 (max, 256kB) a try and got these
slightly better results:

[..]
m_io_c = 16:
Time: 494.599 ms
Time: 496.345 ms
Time: 973.500 ms

m_io_c = 32:
Time: 461.031 ms
Time: 449.037 ms
Time: 443.375 ms

and that (the last one) apparently was able to push it into the ~50-60k
random IOPS range; rareq-sz was still ~8 (9.9) kB as ANALYZE was still
reading randomly, so I assume no merging was done:

Device            r/s   rMB/s  rrqm/s  %rrqm r_await rareq-sz   w/s  wMB/s  wrqm/s  %wrqm w_await wareq-sz   d/s  dMB/s  drqm/s  %drqm d_await dareq-sz   f/s f_await  aqu-sz  %util
nvme0n1      61212.00  591.82    0.00   0.00    0.10     9.90  2.00   0.02    0.00   0.00    0.00    12.00  0.00   0.00    0.00   0.00    0.00     0.00  0.00    0.00    6.28  85.20

So in short it looks good to me.

-Jakub Wartak.




elog/ereport VS misleading backtrace_function function address

2024-03-28 Thread Jakub Wartak
Hi -hackers,

While chasing some other bug I've learned that backtrace_functions
might be misleading with top elog/ereport() address.

Reproducer:

# using Tom's reproducer on master:
wget 
https://www.postgresql.org/message-id/attachment/112394/ri-collation-bug-example.sql
echo '' >> ri-collation-bug-example.sql
echo '\errverbose' >> ri-collation-bug-example.sql
-- based on grepping the source code locations many others could go in here:
psql -p 5437 -c "alter system set backtrace_functions =
'RI_FKey_cascade_del,get_collation_isdeterministic';"
psql -p 5437 -c "select pg_reload_conf();"
dropdb -p 5437 test1 ; createdb -p 5437 test1 ; psql -p 5437 -d test1
-f ri-collation-bug-example.sql

gives (note "get_collation_isdeterministic"):
psql:ri-collation-bug-example.sql:42: ERROR:  cache lookup failed
for collation 0
psql:ri-collation-bug-example.sql:44: error: ERROR:  XX000: cache
lookup failed for collation 0
LOCATION:  get_collation_isdeterministic, lsyscache.c:1088

and in log note the 0x14c6bb:
2024-03-27 14:39:13.065 CET [12899] postgres@test1 ERROR:  cache
lookup failed for collation 0
2024-03-27 14:39:13.065 CET [12899] postgres@test1 BACKTRACE:
postgres: 16/main: postgres test1 [local] DELETE(+0x14c6bb) [0x5565c5a9c6bb]
postgres: 16/main: postgres test1 [local] DELETE(RI_FKey_cascade_del+0x323) [0x5565c5ec0973]
postgres: 16/main: postgres test1 [local] DELETE(+0x2d401f) [0x5565c5c2401f]
postgres: 16/main: postgres test1 [local] DELETE(+0x2d5a04) [0x5565c5c25a04]
postgres: 16/main: postgres test1 [local] DELETE(AfterTriggerEndQuery+0x78) [0x5565c5c2a918]
[..]
2024-03-27 14:39:13.065 CET [12899] postgres@test1 STATEMENT:  delete
from revisions where id='5c617ce7-688d-4bea-9d66-c0f0ebc635da';

based on \errverbose it is OK (error matches the code, thanks to
Alvaro for this hint):

 9   bool
 8   get_collation_isdeterministic(Oid colloid)
 7   {
 6   │   HeapTuple>  tp;
 5   │   Form_pg_collation colltup;
 4   │   bool>   >   result;
 3   │
 2   │   tp = SearchSysCache1(COLLOID, ObjectIdGetDatum(colloid));
 1   │   if (!HeapTupleIsValid(tp))
  1088   │   │   elog(ERROR, "cache lookup failed for collation %u", colloid);
[..]


but based on the backtrace address 0x14c6bb (!) it resolves
differently, which seems to be highly misleading (note the
"get_language_name.cold"):

$ addr2line -e /usr/lib/postgresql/16/bin/postgres -a -f 0x14c6bb
0x0014c6bb
get_language_name.cold
./build/src/backend/utils/cache/./build/../src/backend/utils/cache/lsyscache.c:1181

When I disassemble get_collation_isdeterministic() (and I know the
name from \errverbose):

Dump of assembler code for function get_collation_isdeterministic:
Address range 0x5c7680 to 0x5c76c1:
   0x005c7680 <+0>:     push   %rbp
[..]
   0x005c7694 <+20>:    call   0x5d63b0
   0x005c7699 <+25>:    test   %rax,%rax
   0x005c769c <+28>:    je     0x14c686

   0x005c76a2 <+34>:    mov    %rax,%rdi
[..]
   0x005c76bf <+63>:    leave
   0x005c76c0 <+64>:    ret
Address range 0x14c686 to 0x14c6bb:
   0x0014c686 <-4698106>:   xor    %esi,%esi
   0x0014c688 <-4698104>:   mov    $0x15,%edi
   0x0014c68d <-4698099>:   call   0x14ec86
   0x0014c692 <-4698094>:   mov    %r12d,%esi
   0x0014c695 <-4698091>:   lea    0x5028dc(%rip),%rdi        # 0x64ef78
   0x0014c69c <-4698084>:   xor    %eax,%eax
   0x0014c69e <-4698082>:   call   0x5df320
   0x0014c6a3 <-4698077>:   lea    0x6311a6(%rip),%rdx        # 0x77d850 <__func__.34>
   0x0014c6aa <-4698070>:   mov    $0x440,%esi
   0x0014c6af <-4698065>:   lea    0x630c8a(%rip),%rdi        # 0x77d340
   0x0014c6b6 <-4698058>:   call   0x5df100
   << NOTE the exact 0x14c6bb is even missing here(!)

Notice the interesting thing here: according to GDB,
get_collation_isdeterministic() is at 0x5c7680 plus a jump to the
0x14c686 .. 0x14c6bb range (exclusive of the end), without any stack
setup for that .cold part. But backtrace() just captured the
elog/ereport (cold) address at the end, 0x14c6bb, instead, so if I take
that exact address from the backtrace_functions output as it is
("DELETE(+0x14c6bb)"), it also shows the WRONG function (just like
addr2line):

(gdb) disassemble 0x14c6bb
Dump of assembler code for function get_language_name:
Address range 0x5c7780 to 0x5c77ee:
[..]
   0x005c77ed <+109>:   ret
Address range 0x14c6bb to 0x14c6f0:
   0x0014c6bb <-4698309>:   xor    %esi,%esi
[..]
   0x0014c6eb <-4698261>:   call   0x5df100
End of assembler dump.

so this is wrong (as we are failing in "get_collation_isdeterministic",
not in "get_language_name").

I was very confused at the beginning with the main question being: why
are we compiling elog/ereport() so that it is incompatible with
backtrace? When looking for it I've found two threads [1] by David
that were 

Re: pg_combinebackup --copy-file-range

2024-03-27 Thread Jakub Wartak
On Tue, Mar 26, 2024 at 7:03 PM Tomas Vondra
 wrote:
[..]
>
> That's really strange.

Hi Tomas, but it looks like it's fixed now :)

> > --manifest-checksums=NONE --copy-file-range without v20240323-2-0002: 
> > 27m23.887s
> > --manifest-checksums=NONE --copy-file-range with v20240323-2-0002 and
> > loop-fix: 5m1.986s but it creates corruption as it stands
> >
>
> Thanks. I plan to do more similar tests, once my machines get done with
> some other stuff.

Please do so as I do not trust my fingers :-)

> > Issues:
> >
> > 1. https://cirrus-ci.com/task/5937513653600256?logs=mingw_cross_warning#L327
> > compains about win32/mingw:
> >
[..]
> >
>
> Yup, missing break.

Now it's 
https://cirrus-ci.com/task/4997185324974080?logs=headers_headerscheck#L10
, reproducible with "make -s headerscheck
EXTRAFLAGS='-fmax-errors=10'":
/tmp/postgres/src/bin/pg_combinebackup/reconstruct.h:34:91: error:
unknown type name ‘CopyMode’ / CopyMode copy_mode);
to me it looks like reconstruct.h needs to include the definition of
CopyMode, which is in "copy_file.h".

> > 2. I do not know what's the consensus between --clone and
> > --copy-file-range , but if we have #ifdef FICLONE clone_works() #elif
> > HAVE_COPY_FILE_RANGE copy_file_range_only_works() then we should also
> > apply the same logic to the --help so that --clone is not visible
> > there (for consistency?). Also the "error: file cloning not supported
> > on this platform " is technically incorrect, Linux does support
> > ioctl(FICLONE) and copy_file_range(), but we are just not choosing one
> > over another (so technically it is "available"). Nitpicking I know.
> >
>
> That's a good question, I'm not sure. But whatever we do, we should do
> the same thing in pg_upgrade. Maybe there's some sort of precedent?

Sigh, you are right... It's consistency hell.

> > 3. [v20240323-2-0002-write_reconstructed_file.patch]: The mentioned
> > ENOSPACE spiral-of-death-bug symptoms are like that:
[..]
>
> Yeah, that retry logic is wrong. I ended up copying the check from the
> "regular" copy branch, which simply bails out if copy_file_range returns
> anything but the expected 8192.
>
> I wonder if this should deal with partial writes, though. I mean, it's
> allowed copy_file_range() copies only some of the bytes - I don't know
> how often / in what situations that happens, though ... And if we want
> to handle that for copy_file_range(), pwrite() needs the same treatment.

Maybe this helps:
https://github.com/coreutils/coreutils/blob/606f54d157c3d9d558bdbe41da8d108993d86aeb/src/copy.c#L1427
- it's harder than I anticipated (we can ignore the sparse-file logic
though, I think).
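
For illustration, here's a rough standalone sketch of such a
partial-copy-tolerant loop (not the pg_combinebackup code; roughly the
shape of that coreutils loop, with the sparse-file handling deliberately
omitted and all names invented):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <errno.h>
#include <sys/stat.h>

/* Copy exactly "len" bytes at the current file offsets; 0 on success. */
static int
copy_range_all(int src_fd, int dst_fd, size_t len)
{
    while (len > 0)
    {
        ssize_t nc = copy_file_range(src_fd, NULL, dst_fd, NULL, len, 0);

        if (nc < 0)
        {
            if (errno == EINTR)
                continue;       /* interrupted by a signal: just retry */
            return -1;          /* ENOSPC, EXDEV, ...: report to caller */
        }
        if (nc == 0)
        {
            errno = EIO;        /* unexpected EOF on the source file */
            return -1;
        }
        len -= (size_t) nc;     /* short copy is fine: loop for the rest */
    }
    return 0;
}

int
main(int argc, char **argv)
{
    struct stat st;
    int     src, dst;

    if (argc != 3)
        return 1;
    src = open(argv[1], O_RDONLY);
    dst = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (src < 0 || dst < 0 || fstat(src, &st) < 0)
    {
        perror("open/fstat");
        return 1;
    }
    if (copy_range_all(src, dst, (size_t) st.st_size) < 0)
    {
        perror("copy_file_range");
        return 1;
    }
    return 0;
}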

> > 3. .. but onn startup I've got this after trying psql login: invalid
> > page in block 0 of relation base/5/1259  .
[..]
>
> Can you see if you can still reproduce this with the attached version?

Looks like it's fixed now and it works great (~3min, multiple times)!

BTW: I've also tried it over NFSv4 over loopback with XFS, as
copy_file_range() probably does server-side optimization, but somehow it
was so slow there that it is close to being unusable (~9GB out of 104GB
reconstructed after 45 minutes) - maybe it's due to the NFS mount
options; I don't think we should worry too much. I think it's related to
missing the below optimization, if that matters. I think it's too early
to warn users about NFS (I've spent just 10 minutes on it), but on the
other hand people might complain that it's broken...

> > Apparently there's no merging of adjacent IO/s, so pg_combinebackup
> > wastes lots of time on issuing instead small syscalls but it could
> > let's say do single pread/write (or even copy_file_range()). I think
> > it was not evident in my earlier testing (200GB; 39min vs ~40s) as I
> > had much smaller modifications in my incremental (think of 99% of
> > static data).
> >
>
> Yes, I've been thinking about exactly this optimization, but I think
> we're way past proposing this for PG17. The changes that would require
> in reconstruct_from_incremental_file are way too significant. Has to
> wait for PG18 ;-)

Sure thing!

> I do think there's more on the table, as mentioned by Thomas a couple
> days ago - maybe we shouldn't approach clone/copy_file_range merely as
> an optimization to save time, it might be entirely reasonable to do this
> simply to allow the filesystem to do CoW magic and save space (even if
> we need to read the data and recalculate the checksum, which now
> disables these copy methods).

Sure! I think time will still be a priority though, as
pg_combinebackup duration impacts RTO, while disk space is relatively
cheap.

One could argue that reconstructing 50TB will be a challenge, though.
Now, my tests indicate that the space saving is already happening with
the 0002 patch - the 100GB DB / full backup stats look like this (so I
think we are good when using CoW - not so without CoW) -- or have I
misunderstood something?:

root@jw-test-1:/backups# du -sm /backups/
214612  /backups/
root@jw-test-1:/backups# du -sm *
106831  

Re: pg_upgrade --copy-file-range

2024-03-26 Thread Jakub Wartak
On Sat, Mar 23, 2024 at 6:57 PM Tomas Vondra
 wrote:

> On 3/23/24 14:47, Tomas Vondra wrote:
> > On 3/23/24 13:38, Robert Haas wrote:
> >> On Fri, Mar 22, 2024 at 8:26 PM Thomas Munro  
> >> wrote:
[..]
> > Yeah, that's in write_reconstructed_file() and the patch does not touch
> > that at all. I agree it would be nice to use copy_file_range() in this
> > part too, and it doesn't seem it'd be that hard to do, I think.
> >
> > It seems we'd just need a "fork" that either calls pread/pwrite or
> > copy_file_range, depending on checksums and what was requested.
> >
>
> Here's a patch to use copy_file_range in write_reconstructed_file too,
> when requested/possible. One thing that I'm not sure about is whether to
> do pg_fatal() if --copy-file-range but the platform does not support it.
[..]

Hi Tomas, so I gave a go to the below patches today:
- v20240323-2-0001-pg_combinebackup-allow-using-clone-copy_.patch
- v20240323-2-0002-write_reconstructed_file.patch

My assessment:

v20240323-2-0001-pg_combinebackup-allow-using-clone-copy_.patch -
looks more or less good to go
v20240323-2-0002-write_reconstructed_file.patch - needs work, and
without it the good effects of clone/copy_file_range() are unlikely

Given Debian 12, a ~100GB DB (pgbench -i -s 7000, plus some additional
tables with GiST and GIN indexes, just to see more WAL record types),
and with backup sizes in MB like this:

106831  full
2823incr.1 # captured after some time with pgbench -R 100
165 incr.2 # captured after some time with pgbench -R 100

Test cmd: rm -rf outtest; sync; sync ; sync; echo 3 | sudo tee
/proc/sys/vm/drop_caches ; time /usr/pgsql17/bin/pg_combinebackup -o
outtest full incr.1 incr.2

Test results of various copies on small I/O constrained XFS device:
normal copy: 31m47.407s
--clone copy: error: file cloning not supported on this platform (it's
due #ifdef of having COPY_FILE_RANGE available)
--copy-file-range: aborted, as it was taking too long; I was
expecting it to accelerate things, but it did not... obviously this is
the transparent fallback to normal copying when checksums have to be
calculated...
--manifest-checksums=NONE --copy-file-range: BUG, it kept on appending
to just one file, e.g. outtest/base/5/16427.29, with 200GB+ ?? and ended
up with ENOSPC [more on this later]
--manifest-checksums=NONE --copy-file-range without v20240323-2-0002: 27m23.887s
--manifest-checksums=NONE --copy-file-range with v20240323-2-0002 and
loop-fix: 5m1.986s but it creates corruption as it stands

Issues:

1. https://cirrus-ci.com/task/5937513653600256?logs=mingw_cross_warning#L327
compains about win32/mingw:

[15:47:27.184] In file included from copy_file.c:22:
[15:47:27.184] copy_file.c: In function ‘copy_file’:
[15:47:27.184] ../../../src/include/common/logging.h:134:6: error:
this statement may fall through [-Werror=implicit-fallthrough=]
[15:47:27.184]   134 |   if (unlikely(__pg_log_level <= PG_LOG_DEBUG)) \
[15:47:27.184]   |  ^
[15:47:27.184] copy_file.c:96:5: note: in expansion of macro ‘pg_log_debug’
[15:47:27.184]96 | pg_log_debug("would copy \"%s\" to \"%s\"
(copy_file_range)",
[15:47:27.184]   | ^~~~
[15:47:27.184] copy_file.c:99:4: note: here
[15:47:27.184]99 |case COPY_MODE_COPYFILE:
[15:47:27.184]   |^~~~
[15:47:27.184] cc1: all warnings being treated as errors

2. I do not know what's the consensus between --clone and
--copy-file-range , but if we have #ifdef FICLONE clone_works() #elif
HAVE_COPY_FILE_RANGE copy_file_range_only_works() then we should also
apply the same logic to the --help so that --clone is not visible
there (for consistency?). Also the "error: file cloning not supported
on this platform " is technically incorrect, Linux does support
ioctl(FICLONE) and copy_file_range(), but we are just not choosing one
over another (so technically it is "available"). Nitpicking I know.

3. [v20240323-2-0002-write_reconstructed_file.patch]: The mentioned
ENOSPC spiral-of-death bug symptoms look like this:

strace:
copy_file_range(8, [697671680], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697679872], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697688064], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697696256], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697704448], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697712640], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697720832], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697729024], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697737216], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697745408], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697753600], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697761792], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697769984], 9, NULL, 8192, 0) = 8192

Notice that dest_off_t (poutoff) is NULL.

(gdb) where
#0  0x7f2cd56f6733 in copy_file_range (infd=8,
pinoff=pinoff@entry=0x7f2cd53f54e8, outfd=outfd@entry=9,
poutoff=poutoff@entry=0x0,
length=length@entry=8192, flags=flags@entry=0) at

Re: pg_upgrade --copy-file-range

2024-03-20 Thread Jakub Wartak
Hi Tomas,

> I took a quick look at the remaining part adding copy_file_range to
> pg_combinebackup. The patch no longer applies, so I had to rebase it.
> Most of the issues were trivial, but I had to fix a couple missing
> prototypes - I added them to copy_file.h/c, mostly.
>
> 0001 is the minimal rebase + those fixes
>
> 0002 has a couple review comments in copy_file, and it also undoes a lot
> of unnecessary formatting changes (already pointed out by Peter a couple
> days ago).
>

Thank you very much for this! As discussed privately, I'm not in a
position right now to pursue this further at this late stage (at least
for v17, which would require an aggressive schedule). My plan was
more for v18, after Peter's email, due to other obligations. But if you
have cycles and want to continue, please do so without hesitation -
I'll try to chime in along the way to test and review, for sure.

> A couple review comments:
>
> 1) AFAIK opt_errinfo() returns pointer to the local "buf" variable.
>
> 2) I wonder if we even need opt_errinfo(). I'm not sure it actually
> makes anything simpler.

Yes, as it stands it's broken (somehow I missed the gcc warning); it
should be pg_malloc(). I hardly remember, but I wanted to avoid code
duplication. No strong opinion, maybe that's just a different style;
I'll adapt as necessary.

> 3) I think it'd be nice to make CopyFileMethod more consistent with
> transferMode in pg_upgrade.h (I mean, it seems wise to make the naming
> more consistent, it's probably not worth unifying this somehow).
>
> 4) I wonder how we came up with copying the files by 50 blocks, but I
> now realize it's been like this before this patch. I only noticed
> because the patch adds a comment before buffer_size calculation.

It looks like it was like that even before pg_upgrade was moved into
core. 400kB is indeed a bit of a strange value, so we can leave it as it
is, or make COPY_BUF_SIZ 128kB - see [1] (I've double-checked that cp(1)
still uses 128kB today) - or maybe just stick to something like 256 or
512 kB.

> 5) I dislike the renaming of copy_file_blocks to pg_copyfile. The new
> name is way more generic / less descriptive - it's clear it copies the
> file block by block (well, in chunks). pg_copyfile is pretty vague.
>
> 6) This leaves behind copy_file_copyfile, which is now unused.
>
> 7) The patch reworks how combinebackup deals with alternative copy
> implementations - instead of setting strategy_implementation and calling
> that, the decisions now happen in pg_copyfile_offload with a lot of
> conditions / ifdef / defined ... I find it pretty hard to understand and
> reason about. I liked the strategy_implementation approach, as it forces
> us to keep each method in a separate function.

Well, some context (maybe it was my mistake to continue in this
thread rather than starting a new one): my plan was 3-in-1: in the
original proposal (from Jan) to provide CoW as a generic facility for
others to use - in src/common/file_utils.c, as per
v3-0002-Confine-various-OS-copy-on-write-and-other-copy-a.patch - to
unify & confine CoW methods and their quirkiness between
pg_combinebackup and pg_upgrade and other potential CoW uses too. That
was before Thomas M. pushed CoW just for pg_upgrade as
d93627bcbe5001750e7611f0e637200e2d81dcff. I had this idea back then to
have pg_copyfile() [normal blocks copy] and
pg_copyfile_offload_supported(),
pg_copyfile_offload(PG_COPYFILE_IOCTL_FICLONE ,
PG_COPYFILE_COPY_FILE_RANGE,
PG_COPYFILE_who_has_idea_what_they_come_up_with_in_future). In your
version of the patch it's local to pg_combinebackup, so it might make
no sense after all. If you look at pg_upgrade and pg_combinebackup,
they both have code duplication with lots of those ifs/IFs (assuming the
user wants to have it as a drop-in [--clone/--copy/--copyfile] and the
platform may or may not have it). I've even considered
--cow=ficlone|copy_file_range to sync both tools from the CLI-arguments
point of view, but that would break backwards compatibility, so I did
not do that.
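
For what it's worth, here's a hedged sketch of the kind of interface I
had in mind - the names come straight from the paragraph above and were
never committed in this form, and the FICLONE/HAVE_COPY_FILE_RANGE
guards just mirror the #ifdefs mentioned earlier in the thread
(HAVE_COPY_FILE_RANGE is assumed to be a configure-provided macro):

#include <stdbool.h>
#ifdef __linux__
#include <linux/fs.h>           /* FICLONE */
#endif

typedef enum pg_copyfile_method
{
    PG_COPYFILE_FALLBACK,       /* plain block-by-block read()/write() */
    PG_COPYFILE_IOCTL_FICLONE,  /* Linux reflink via ioctl(FICLONE) */
    PG_COPYFILE_COPY_FILE_RANGE /* copy_file_range() */
} pg_copyfile_method;

/* Could the given offload method work on this build/platform at all? */
static bool
pg_copyfile_offload_supported(pg_copyfile_method method)
{
#ifdef FICLONE
    if (method == PG_COPYFILE_IOCTL_FICLONE)
        return true;
#endif
#ifdef HAVE_COPY_FILE_RANGE
    if (method == PG_COPYFILE_COPY_FILE_RANGE)
        return true;
#endif
    return method == PG_COPYFILE_FALLBACK;
}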

Also, there's a problem with pg_combinebackup's strategy_implementation
in that it actually cannot decide on its own (I think) whether to use
CoW or not. There were some longer discussions that settled on one thing
(for pg_upgrade): it's the user who controls HOW the copy gets
done (due to potential issues in OS CoW implementations, e.g. if NFS
is involved on one side). See the pg_upgrade
--clone/--copy/--copy-file-range/--sync-method options. I wanted to
stick to that, so pg_combinebackup also needs to give the same options
to the user.

That was for the historical context. Now, you wrote "it's probably
not worth unifying this somehow" a few sentences earlier, so my take is
the following: we can just concentrate on getting copy_file_range()
and ioctl(FICLONE) into pg_combinebackup at the price
of duplicating some complexity for now (in short, to start with a clean
plate; it doesn't necessarily need to be my patch as the base if we
think it's worthwhile for v17 - or stick to your 

Re: index prefetching

2024-03-05 Thread Jakub Wartak
On Fri, Mar 1, 2024 at 3:58 PM Tomas Vondra
 wrote:
[..]
> TBH I don't have a clear idea what to do. It'd be cool to have at least
> some benefits in v17, but I don't know how to do that in a way that
> would be useful in the future.
>
> For example, the v20240124 patch implements this in the executor, but
> based on the recent discussions it seems that's not the right layer -
> the index AM needs to have some control, and I'm not convinced it's
> possible to improve it in that direction (even ignoring the various
> issues we identified in the executor-based approach).
>
> I think it might be more practical to do this from the index AM, even if
> it has various limitations. Ironically, that's what I proposed at pgcon,
> but mostly because it was the quick way to do this.

... that's a pity! :( Well, then let's just finish that subthread, I
gave some explanations, but I'll try to take a look in future
revisions.

> > 4. Wouldn't it be better to leave PREFETCH_LRU_SIZE at static of 8,
> > but base PREFETCH_LRU_COUNT on effective_io_concurrency instead?
> > (allowing it to follow dynamically; the more prefetches the user wants
> > to perform, the more you spread them across shared LRUs and the more
> > memory for history is required?)
> >
> > + * XXX Maybe we could consider effective_cache_size when sizing the 
> > cache?
> > + * Not to size the cache for that, ofc, but maybe as a guidance of how 
> > many
> > + * heap pages it might keep. Maybe just a fraction fraction of the 
> > value,
> > + * say Max(8MB, effective_cache_size / max_connections) or something.
> > + */
> > +#define PREFETCH_LRU_SIZE    8     /* slots in one LRU */
> > +#define PREFETCH_LRU_COUNT   128   /* number of LRUs */
> > +#define PREFETCH_CACHE_SIZE  (PREFETCH_LRU_SIZE * PREFETCH_LRU_COUNT)
> >
>
> I don't see why would this be related to effective_io_concurrency? It's
> merely about how many recently accessed pages we expect to find in the
> page cache. It's entirely separate from the prefetch distance.

Well, my thought was: the higher eic is, the more I/O parallelism we
are introducing - and in such a case, the more requests we need to
remember from the past to avoid prefetching the same pages again
(N * eic, where N would be some multiplier).

> > 7. in IndexPrefetchComputeTarget()
> >
> > + * XXX We cap the target to plan_rows, becausse it's pointless to 
> > prefetch
> > + * more than we expect to use.
> >
> > That's a nice fact that's already in patch, so XXX isn't needed?
> >
>
> Right, which is why it's not a TODO/FIXME.

OH! That explains it. I had taken all of the XXXs literally as FIXMEs
that you wanted to go away (things to be resolved before the patch is
considered mature).

> But I think it's good to
> point this out - I'm not 100% convinced we should be using plan_rows
> like this (because what happens if the estimate happens to be wrong?).

Well, a somewhat similar problematic pattern was present in a different
codepath - get_actual_variable_endpoint() - see [1], 9c6ad5eaa95. The
final fix there got away without adding a new GUC (which is always an
option...) and just introduced a sensible hard limit (fence), sticking
to a limit of 100 visited heap pages. Here we could have a similar
heuristic from the start: if (plan_rows <
we_have_already_visited_pages * avgRowsPerBlock) --> ignore plan_rows
and ramp prefetches back up to the full eic value.
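
Roughly, as a standalone sketch (every name here is invented, nothing of
this form exists in the patch):

#include <stdio.h>

/* Hedged sketch of the fence heuristic suggested above. */
static int
prefetch_target(double plan_rows, double avg_rows_per_block,
                long pages_visited, int eff_io_concurrency,
                int capped_target)
{
    /* The row estimate proved too low: stop trusting it, ramp back up. */
    if (plan_rows < pages_visited * avg_rows_per_block)
        return eff_io_concurrency;
    return capped_target;
}

int
main(void)
{
    /* estimate said 100 rows, but we already visited 1000 pages of ~50 rows */
    printf("%d\n", prefetch_target(100.0, 50.0, 1000, 16, 4));  /* -> 16 */
    return 0;
}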

> > Some further tests, given data:
> >
> > CREATE TABLE test (id bigint, val bigint, str text);
> > ALTER TABLE test ALTER COLUMN str SET STORAGE EXTERNAL;
> > INSERT INTO test SELECT g, g, repeat(chr(65 + (10*random())::int),
> > 3000) FROM generate_series(1, 1) g;
> > -- or INSERT INTO test SELECT x.r, x.r, repeat(chr(65 +
> > (10*random())::int), 3000) from (select 1 * random() as r from
> > generate_series(1, 1)) x;
> > VACUUM ANALYZE test;
> > CREATE INDEX on test (id) ;
> >
>
> It's not clear to me what's the purpose of this test? Can you explain?

It's just schema preparation for the tests below:

> >
> > 2. Prefetching for TOASTed heap seems to be not implemented at all,
> > correct? (Is my assumption that we should go like this:
> > t_index->t->toast_idx->toast_heap)?, but I'm too newbie to actually
> > see the code path where it could be added - certainly it's not blocker
> > -- but maybe in commit message a list of improvements for future could
> > be listed?):
> >
>
> Yes, that's true. I haven't thought about TOAST very much, but with
> prefetching happening in executor, that does not work. There'd need to
> be some extra code for TOAST prefetching. I'm not sure how beneficial
> that would be, considering most TOAST values tend to be stored on
> consecutive heap pages.

Assuming that in the above I've generated the data using the
cyclic/random version, and I run:

SELECT md5(string_agg(md5(str),',')) FROM test WHERE id BETWEEN 10 AND 2000;

(btw: I wanted to use octet_length() at first instead of 

Re: index prefetching

2024-03-01 Thread Jakub Wartak
\300\237t\0\200\237t\0"..., 8192, 417792) = 8192
fadvise64(45, 671744, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(45, 237568, 8192, POSIX_FADV_WILLNEED) = 0
pread64(45, "\0\0\0\08`]\5\0\0\4\0\370\1\0\2\0 \4
\0\0\0\0\300\237t\0\200\237t\0"..., 8192, 671744) = 8192
fadvise64(45, 491520, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(45, 360448, 8192, POSIX_FADV_WILLNEED) = 0
pread64(45, "\0\0\0\0\200\357\25\4\0\0\4\0\370\1\0\2\0 \4
\0\0\0\0\300\237t\0\200\237t\0"..., 8192, 237568) = 8192
fadvise64(45, 557056, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(45, 106496, 8192, POSIX_FADV_WILLNEED) = 0
pread64(45, "\0\0\0\0\240s\325\4\0\0\4\0\370\1\0\2\0 \4
\0\0\0\0\300\237t\0\200\237t\0"..., 8192, 491520) = 8192
fadvise64(45, 401408, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(45, 335872, 8192, POSIX_FADV_WILLNEED) = 0
pread64(45, "\0\0\0\0\250\233r\4\0\0\4\0\370\1\0\2\0 \4
\0\0\0\0\300\237t\0\200\237t\0"..., 8192, 360448) = 8192
fadvise64(45, 524288, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(45, 352256, 8192, POSIX_FADV_WILLNEED) = 0
pread64(45, "\0\0\0\0\240\342\6\5\0\0\4\0\370\1\0\2\0 \4
\0\0\0\0\300\237t\0\200\237t\0"..., 8192, 557056) = 8192

-Jakub Wartak.

[1] - 
https://www.postgresql.org/message-id/20240215201337.7amzw3hpvng7wphb%40awork3.anarazel.de
[2] - 
https://www.postgresql.org/message-id/777e981c-bf0c-4eb9-a9e0-42d677e94327%40enterprisedb.com




Re: psql's FETCH_COUNT (cursor) is not being respected for CTEs

2024-02-08 Thread Jakub Wartak
Hi Daniel,


On Tue, Jan 30, 2024 at 3:29 PM Daniel Verite  wrote:

> PFA a rebased version.

Thanks for the patch! I've tested it using my original reproducer and
it works great now against the original problem description. I've
taken a quick look at the patch, it looks good for me. I've tested
using -Werror for both gcc 10.2 and clang 11.0 and it was clean. I
have one slight doubt:

when I run with default pager (more or less):
\set FETCH_COUNT 1000
WITH data AS (SELECT  generate_series(1, 2000) as Total) select
repeat('a',100) || data.Total || repeat('b', 800) as total_pat from
data;
-- it enters the pager, I skip a couple of pages and then press "q"

.. then both backend and psql go to 100% CPU as if they were still
receiving data (that doesn't happen e.g. with export PAGER=cat).  So
I'm not sure, but maybe ExecQueryAndProcessResults() should abort
somewhat faster when the $PAGER exits normally(?).

And oh, btw, in v6-0001 (so if you are going to send v7 for any other
reason -- other reviewers -- it may be worth realigning this detail):

+  int PQsetChunkedRowsMode(PGconn *conn,
+   int maxRows);

but the code has (so "maxRows" != "chunkSize"):

+PQsetChunkedRowsMode(PGconn *conn, int chunkSize)

-J.




Re: Make NUM_XLOGINSERT_LOCKS configurable

2024-01-15 Thread Jakub Wartak
On Fri, Jan 12, 2024 at 7:33 AM Bharath Rupireddy
 wrote:
>
> On Wed, Jan 10, 2024 at 11:43 AM Tom Lane  wrote:
> >
> > Bharath Rupireddy  writes:
> > > On Wed, Jan 10, 2024 at 10:00 AM Tom Lane  wrote:
> > >> Maybe.  I bet just bumping up the constant by 2X or 4X or so would get
> > >> most of the win for far less work; it's not like adding a few more
> > >> LWLocks is expensive.  But we need some evidence about what to set it to.
> >
> > > I previously made an attempt to improve WAL insertion performance with
> > > varying NUM_XLOGINSERT_LOCKS. IIRC, we will lose what we get by
> > > increasing insertion locks (reduction in WAL insertion lock
> > > acquisition time) to the CPU overhead of flushing the WAL in
> > > WaitXLogInsertionsToFinish as referred to by the following comment.
> >
> > Very interesting --- this is at variance with what the OP said, so
> > we definitely need details about the test conditions in both cases.
> >
> > > Unfortunately, I've lost the test results, I'll run them up again and
> > > come back.
> >
> > Please.
>
> Okay, I'm back with some testing

[..]

> Results with varying NUM_XLOGINSERT_LOCKS (note that we can't allow it
> be more than MAX_SIMUL_LWLOCKS):
> Locks  TPS    WAL Insert Lock Acquire Time (ms)  WAL Wait for In-progress Inserts to Finish Time (ms)
> 8      18669  12532                              8775
> 16     18076  10641                              13491
> 32     18034  6635                               13997
> 64     17582  3937                               14718
> 128    17782  4563                               20145
>
> Also, check the attached graph. Clearly there's an increase in the
> time spent in waiting for in-progress insertions to finish in
> WaitXLogInsertionsToFinish from 8.7 seconds to 20 seconds. Whereas,
> the time spent to acquire WAL insertion locks decreased from 12.5
> seconds to 4.5 seconds. Overall, this hasn't resulted any improvement
> in TPS, in fact observed slight reduction.

Hi, I've hastily tested Bharath's patches too as I was thinking it
would be a fast win due to contention, however it seems that (at
least on fast NVMes?) increasing NUM_XLOGINSERT_LOCKS doesn't
help.

With pgbench -P 5 -c 32 -j 32 -T 30 and
- 64vCPU Lsv2 (AMD EPYC), on single NVMe device (with ext4) that can
do 100k RW IOPS@8kB (with fio/libaio, 4jobs)
- shared_buffers = '8GB', max_wal_size = '32GB', track_wal_io_timing = on
- maxed out wal_buffers = '256MB'

tpcb-like with synchronous_commit=off
   Locks  TPS    wal_insert_lock_acquire_time  wal_wait_for_insert_to_finish_time
   8      30393  24087                         128
   32     31205  968                           93

tpcb-like with synchronous_commit=on
   Locks  TPS    wal_insert_lock_acquire_time  wal_wait_for_insert_to_finish_time
   8      12031  8472                          10722
   32     11957  1188                          12563

tpcb-like with synchronous_commit=on and pgbench -c 64 -j 64
   Locks  TPS    wal_insert_lock_acquire_time  wal_wait_for_insert_to_finish_time
   8      25010  90620                         68318
   32     25976  18569                         85319
// same as Bharath observed: the time shifted from acquiring the insert
lock to waiting for in-progress inserts to finish

insertonly (largeinserts) with synchronous_commit=off (still -c 32 -j 32)
  Locks  TPS  wal_insert_lock_acquire_time  wal_wait_for_insert_to_finish_time
  8      367  19142                         83
  32     393  875                           68

insertonly (largeinserts) with synchronous_commit=on (still -c 32 -j 32)
  Locks  TPS  wal_insert_lock_acquire_time  wal_wait_for_insert_to_finish_time
  8      329  15950                         125
  32     310  2177                          316

insertonly was := {
create sequence s1;
create table t (id bigint, t text) partition by hash (id);
create table t_h0 partition of t FOR VALUES WITH (modulus 8, remainder 0);
create table t_h1 partition of t FOR VALUES WITH (modulus 8, remainder 1);
create table t_h2 partition of t FOR VALUES WITH (modulus 8, remainder 2);
create table t_h3 partition of t FOR VALUES WITH (modulus 8, remainder 3);
create table t_h4 partition of t FOR VALUES WITH (modulus 8, remainder 4);
create table t_h5 partition of t FOR VALUES WITH (modulus 8, remainder 5);
create table t_h6 partition of t FOR VALUES WITH (modulus 8, remainder 6);
create table t_h7 partition of t FOR VALUES WITH (modulus 8, remainder 7);

and runtime pgb:
insert into t select nextval('s1'), repeat('A', 1000) from
generate_series(1, 1000);
}

The table was truncated every time, the DB was checkpointed, and of
course it was on master. Without more details from Qingsong it is
going to be hard to explain the boost he witnessed.

-J.




Re: trying again to get incremental backup

2023-12-20 Thread Jakub Wartak
Hi Robert,

On Tue, Dec 19, 2023 at 9:36 PM Robert Haas  wrote:
>
> On Fri, Dec 15, 2023 at 5:36 AM Jakub Wartak
>  wrote:
> > I've played with with initdb/pg_upgrade (17->17) and i don't get DBID
> > mismatch (of course they do differ after initdb), but i get this
> > instead:
> >
> >  $ pg_basebackup -c fast -D /tmp/incr2.after.upgrade -p 5432
> > --incremental /tmp/incr1.before.upgrade/backup_manifest
> > WARNING:  aborting backup due to backend exiting before pg_backup_stop
> > was called
> > pg_basebackup: error: could not initiate base backup: ERROR:  timeline
> > 2 found in manifest, but not in this server's history
> > pg_basebackup: removing data directory "/tmp/incr2.after.upgrade"
> >
> > Also in the manifest I don't see DBID ?
> > Maybe it's a nuisance and all I'm trying to see is that if an
> > automated cronjob with pg_basebackup --incremental hits a freshly
> > upgraded cluster, that error message without errhint() is going to
> > scare some Junior DBAs.
>
> Yeah. I think we should add the system identifier to the manifest, but
> I think that should be left for a future project, as I don't think the
> lack of it is a good reason to stop all progress here. When we have
> that, we can give more reliable error messages about system mismatches
> at an earlier stage. Unfortunately, I don't think that the timeline
> messages you're seeing here are going to apply in every case: suppose
> you have two unrelated servers that are both on timeline 1. I think
> you could use a base backup from one of those servers and use it as
> the basis for the incremental from the other, and I think that if you
> did it right you might fail to hit any sanity check that would block
> that. pg_combinebackup will realize there's a problem, because it has
> the whole cluster to work with, not just the manifest, and will notice
> the mismatching system identifiers, but that's kind of late to find
> out that you made a big mistake. However, right now, it's the best we
> can do.
>

OK, understood.

> > The incrementals are being generated , but just for the first (0)
> > segment of the relation?
>
> I committed the first two patches from the series I posted yesterday.
> The first should fix this, and the second relocates parse_manifest.c.
> That patch hasn't changed in a while and seems unlikely to attract
> major objections. There's no real reason to commit it until we're
> ready to move forward with the main patches, but I think we're very
> close to that now, so I did.
>
> Here's a rebase for cfbot.

the v15 patchset (posted yesterday) test results are GOOD:

1. make check-world - GOOD
2. cfbot was GOOD
3. the devel/master bug present in
parse_filename_for_nontemp_relation() seems to be gone (in local
testing)
4. some further tests:
test_across_wallevelminimal.sh - GOOD
test_incr_after_timelineincrease.sh - GOOD
test_incr_on_standby_after_promote.sh - GOOD
test_many_incrementals_dbcreate.sh - GOOD
test_many_incrementals.sh - GOOD
test_multixact.sh - GOOD
test_pending_2pc.sh - GOOD
test_reindex_and_vacuum_full.sh - GOOD
test_repro_assert.sh
test_standby_incr_just_backup.sh - GOOD
test_stuck_walsum.sh - GOOD
test_truncaterollback.sh - GOOD
test_unlogged_table.sh - GOOD
test_full_pri__incr_stby__restore_on_pri.sh - GOOD
test_full_pri__incr_stby__restore_on_stby.sh - GOOD
test_full_stby__incr_stby__restore_on_pri.sh - GOOD
test_full_stby__incr_stby__restore_on_stby.sh - GOOD

5. the more real-world pgbench test with localized segment writes
using `\set aid random_exponential...` [1] indicates much greater
efficiency in terms of backup space use now, du -sm shows:

210229  /backups/backups/full
250 /backups/backups/incr.1
255 /backups/backups/incr.2
[..]
348 /backups/backups/incr.13
408 /backups/backups/incr.14 // latest(20th of Dec on 10:40)
6673/backups/archive/

The DB size, as reported by \l+, was 205GB.
That pgbench was running for ~27h (19th Dec 08:39 -> 20th Dec 11:30)
with a slow 100 TPS (-R), so no insane amounts of WAL.
Time to reconstruct 14 chained incremental backups was 45mins
(pg_combinebackup -o /var/lib/postgres/17/data /backups/backups/full
/backups/backups/incr.1 (..)  /backups/backups/incr.14).
DB after recovering was OK and working fine.

-J.




Re: trying again to get incremental backup

2023-12-15 Thread Jakub Wartak
Hi Robert,

On Wed, Dec 13, 2023 at 2:16 PM Robert Haas  wrote:
>
 >
> > > not even in case of an intervening
> > > timeline switch. So, all of the errors in this function are warning
> > > you that you've done something that you really should not have done.
> > > In this particular case, you've either (1) manually removed the
> > > timeline history file, and not just any timeline history file but the
> > > one for a timeline for a backup that you still intend to use as the
> > > basis for taking an incremental backup or (2) tried to use a full
> > > backup taken from one server as the basis for an incremental backup on
> > > a completely different server that happens to share the same system
> > > identifier, e.g. because you promoted two standbys derived from the
> > > same original primary and then tried to use a full backup taken on one
> > > as the basis for an incremental backup taken on the other.
> > >
> >
> > Okay, but please consider two other possibilities:
> >
> > (3) I had a corrupted DB where I've fixed it by running pg_resetwal
> > and some cronjob just a day later attempted to take incremental and
> > failed with that error.
> >
> > (4) I had pg_upgraded (which calls pg_resetwal on fresh initdb
> > directory) the DB where I had cronjob that just failed with this error
> >
> > I bet that (4) is going to happen more often than (1), (2) , which
> > might trigger users to complain on forums, support tickets.
>
> Hmm. In case (4), I was thinking that you'd get a complaint about the
> database system identifier not matching. I'm not actually sure that's
> what would happen, though, now that you mention it.
>

I've played with initdb/pg_upgrade (17->17) and I don't get a DBID
mismatch (of course they do differ after initdb), but I get this
instead:

 $ pg_basebackup -c fast -D /tmp/incr2.after.upgrade -p 5432
--incremental /tmp/incr1.before.upgrade/backup_manifest
WARNING:  aborting backup due to backend exiting before pg_backup_stop
was called
pg_basebackup: error: could not initiate base backup: ERROR:  timeline
2 found in manifest, but not in this server's history
pg_basebackup: removing data directory "/tmp/incr2.after.upgrade"

Also in the manifest I don't see DBID ?
Maybe it's a nuisance and all I'm trying to see is that if an
automated cronjob with pg_basebackup --incremental hits a freshly
upgraded cluster, that error message without errhint() is going to
scare some Junior DBAs.

> > LGTM, all quick tests work from my end too. BTW: I have also scheduled
> > the long/large pgbench -s 14000 (~200GB?) - multiple day incremental
> > test. I'll let you know how it went.
>
> Awesome, thank you so much.

OK, so pgbench -i -s 14440 and pgbench -P 1 -R 100 -c 8 -T 259200 did
generate pretty large incrementals (so I had to abort it due to lack
of space; I was expecting to see smaller incrementals, so it took too
much space). I initially suspected that the problem lies in the
uniform distribution of `\set aid random(1, 10 * :scale)` in the
tpcb-like script, which UPDATEs the big pgbench_accounts.

$ du -sm /backups/backups/* /backups/archive/
216205  /backups/backups/full
215207  /backups/backups/incr.1
216706  /backups/backups/incr.2
102273  /backups/archive/

So I verified the recoverability yesterday anyway - the
pg_combinebackup "full incr.1 incr.2" took 44 minutes and the later
archive WAL recovery and promotion SUCCEEDED. The 8-way parallel seqscan
for sum(abalance) on pgbench_accounts and the other tables worked
fine. The pg_combinebackup was using 15-20% CPU (mostly on %sys),
while performing mostly 60-80MB/s separately for both reads and writes
(it's slow, but that's due to the maxed out sequential I/O of the
Premium tier on a small SSD on Azure).

So I've launched another improved test (to force more localized
UPDATEs) to see the more real-world space-effectiveness of the
incremental backup:

\set aid random_exponential(1, 10 * :scale, 8)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
INSERT INTO pgbench_history (tid
, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;

But then... (and I have verified the low IDs for :aid above).. the
same has happened:

backups/backups$ du -sm /backups/backups/*
210229  /backups/backups/full
208299  /backups/backups/incr.1
208351  /backups/backups/incr.2

# pgbench_accounts has relfilenodeid 16486
postgres@jw-test-1:/backups/backups$ for L in 5 10 15 30 100 161 173
174 175  ; do md5sum full/base/5/16486.$L ./incr.1/base/5/16486.$L
./incr.2/base/5/16486.$L /var/lib/postgres/17/data/base/5/16486.$L ;
echo; done
005c6bbb40fca3c1a0a819376ef0e793  full/base/5/16486.5
005c6bbb40fca3c1a0a819376ef0e793  ./incr.1/base/5/16486.5
005c6bbb40fca3c1a0a819376ef0e793  ./incr.2/base/5/16486.5
005c6bbb40fca3c1a0a819376ef0e793  /var/lib/postgres/17/data/base/5/16486.5

[.. all the checksums match (!) for the above $L..]


Re: trying again to get incremental backup

2023-12-13 Thread Jakub Wartak
Hi Robert,

On Mon, Dec 11, 2023 at 6:08 PM Robert Haas  wrote:
>
> On Fri, Dec 8, 2023 at 5:02 AM Jakub Wartak
>  wrote:
> > While we are at it, maybe around the below in PrepareForIncrementalBackup()
> >
> > if (tlep[i] == NULL)
> > ereport(ERROR,
> >
> > (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> >  errmsg("timeline %u found in
> > manifest, but not in this server's history",
> > range->tli)));
> >
> > we could add
> >
> > errhint("You might need to start a new full backup instead of
> > incremental one")
> >
> > ?
>
> I can't exactly say that such a hint would be inaccurate, but I think
> the impulse to add it here is misguided. One of my design goals for
> this system is to make it so that you never have to take a new
> incremental backup "just because,"

Did you mean take a new full backup here?

> not even in case of an intervening
> timeline switch. So, all of the errors in this function are warning
> you that you've done something that you really should not have done.
> In this particular case, you've either (1) manually removed the
> timeline history file, and not just any timeline history file but the
> one for a timeline for a backup that you still intend to use as the
> basis for taking an incremental backup or (2) tried to use a full
> backup taken from one server as the basis for an incremental backup on
> a completely different server that happens to share the same system
> identifier, e.g. because you promoted two standbys derived from the
> same original primary and then tried to use a full backup taken on one
> as the basis for an incremental backup taken on the other.
>

Okay, but please consider two other possibilities:

(3) I had a corrupted DB that I fixed by running pg_resetwal, and
some cronjob just a day later attempted to take an incremental backup
and failed with that error.

(4) I had pg_upgraded the DB (pg_upgrade calls pg_resetwal on the fresh
initdb directory) and a cronjob then just failed with this error.

I bet that (4) is going to happen more often than (1), (2), which
might trigger users to complain on forums, support tickets.

> > > I have a fix for this locally, but I'm going to hold off on publishing
> > > a new version until either there's a few more things I can address all
> > > at once, or until Thomas commits the ubsan fix.
> > >
> >
> > Great, I cannot get it to fail again today, it had to be some dirty
> > state of the testing env. BTW: Thomas has pushed that ubsan fix.
>
> Huzzah, the cfbot likes the patch set now. Here's a new version with
> the promised fix for your non-reproducible issue. Let's see whether
> you and cfbot still like this version.

LGTM, all quick tests work from my end too. BTW: I have also scheduled
the long/large pgbench -s 14000 (~200GB?) - multiple day incremental
test. I'll let you know how it went.

-J.




Re: trying again to get incremental backup

2023-12-08 Thread Jakub Wartak
On Thu, Dec 7, 2023 at 4:15 PM Robert Haas  wrote:

Hi Robert,

> On Thu, Dec 7, 2023 at 9:42 AM Jakub Wartak
>  wrote:
> > Comment: I was wondering if it wouldn't make some sense to teach
> > pg_resetwal to actually delete all WAL summaries after any
> > WAL/controlfile alteration?
>
> I thought that this was a good idea so I decided to go implement it,
> only to discover that it was already part of the patch set ... did you
> find some case where it doesn't work as expected? The code looks like
> this:

Ah, my bad, with a fresh mind and coffee the error message makes it
clear and of course it did reset the summaries properly.

While we are at it, maybe around the below in PrepareForIncrementalBackup()

if (tlep[i] == NULL)
ereport(ERROR,

(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 errmsg("timeline %u found in
manifest, but not in this server's history",
range->tli)));

we could add

errhint("You might need to start a new full backup instead of
incremental one")

?

> > test_pending_2pc.sh - getting GOOD on most recent runs, but several
> > times during early testing (probably due to my own mishaps), I've been
> > hit by Abort/TRAP. I'm still investigating and trying to reproduce
> > those ones. TRAP: failed Assert("summary_end_lsn >=
> > WalSummarizerCtl->pending_lsn"), File: "walsummarizer.c", Line: 940
>
> I have a fix for this locally, but I'm going to hold off on publishing
> a new version until either there's a few more things I can address all
> at once, or until Thomas commits the ubsan fix.
>

Great, I cannot get it to fail again today, it had to be some dirty
state of the testing env. BTW: Thomas has pushed that ubsan fix.

-J.




Re: trying again to get incremental backup

2023-12-07 Thread Jakub Wartak
On Tue, Dec 5, 2023 at 7:11 PM Robert Haas  wrote:

[..v13 patchset]

The results with v13 patchset are following:

* - requires a checkpoint on the primary when doing an incremental on a
standby that is too idle; this was explained by Robert in [1],
something AKA a too-fast-incremental-backup due to the testing
scenario:
test_across_wallevelminimal.sh - GOOD
test_many_incrementals_dbcreate.sh - GOOD
test_many_incrementals.sh - GOOD
test_multixact.sh - GOOD
test_reindex_and_vacuum_full.sh - GOOD
test_standby_incr_just_backup.sh - GOOD*
test_truncaterollback.sh - GOOD
test_unlogged_table.sh - GOOD
test_full_pri__incr_stby__restore_on_pri.sh - GOOD
test_full_pri__incr_stby__restore_on_stby.sh - GOOD
test_full_stby__incr_stby__restore_on_pri.sh - GOOD*
test_full_stby__incr_stby__restore_on_stby.sh - GOOD*
test_incr_on_standby_after_promote.sh - GOOD*
test_incr_after_timelineincrease.sh (pg_ctl stop, pg_resetwal -l
0002000E ..., pg_ctl start, pg_basebackup
--incremental) - GOOD, I've got:
pg_basebackup: error: could not initiate base backup: ERROR:
timeline 1 found in manifest, but not in this server's history
Comment: I was wondering if it wouldn't make some sense to teach
pg_resetwal to actually delete all WAL summaries after any
WAL/controlfile alteration?

test_stuck_walsummary.sh (pkill -STOP walsumm) - GOOD:

> This version also improves (at least, IMHO) the way that we wait for
> WAL summarization to finish. Previously, you either caught up fully
> within 60 seconds or you died. I didn't like that, because it seemed
> like some people would get timeouts when the operation was slowly
> progressing and would eventually succeed. So what this version does
> is:

WARNING:  still waiting for WAL summarization through 0/AD8
after 10 seconds
DETAIL:  Summarization has reached 0/828 on disk and 0/8F8
in memory.
[..]
pg_basebackup: error: could not initiate base backup: ERROR:  WAL
summarization is not progressing
DETAIL:  Summarization is needed through 0/AD8, but is stuck
at 0/828 on disk and 0/8F8 in memory.
Comment2: looks good to me!

test_pending_2pc.sh - getting GOOD on most recent runs, but several
times during early testing (probably due to my own mishaps), I've been
hit by Abort/TRAP. I'm still investigating and trying to reproduce
those ones. TRAP: failed Assert("summary_end_lsn >=
WalSummarizerCtl->pending_lsn"), File: "walsummarizer.c", Line: 940

Regards,
-J.

[1] -  
https://www.postgresql.org/message-id/CA%2BTgmoYuC27_ToGtTTNyHgpn_eJmdqrmhJ93bAbinkBtXsWHaA%40mail.gmail.com




Re: trying again to get incremental backup

2023-11-21 Thread Jakub Wartak
On Mon, Nov 20, 2023 at 4:43 PM Robert Haas  wrote:
>
> On Fri, Nov 17, 2023 at 5:01 AM Alvaro Herrera  
> wrote:
> > I made a pass over pg_combinebackup for NLS.  I propose the attached
> > patch.
>
> This doesn't quite compile for me so I changed a few things and
> incorporated it. Hopefully I didn't mess anything up.
>
> Here's v11.
[..]

> I wish I had better ideas about how to thoroughly test this. [..]

Hopefully the below adds some confidence; I've done some further
quick(?) checks today and the results are good:

make check-world #GOOD
test_full_pri__incr_stby__restore_on_pri.sh #GOOD
test_full_pri__incr_stby__restore_on_stby.sh #GOOD*
test_full_stby__incr_stby__restore_on_pri.sh #GOOD
test_full_stby__incr_stby__restore_on_stby.sh #GOOD*
test_many_incrementals_dbcreate.sh #GOOD
test_many_incrementals.sh #GOOD
test_multixact.sh #GOOD
test_pending_2pc.sh #GOOD
test_reindex_and_vacuum_full.sh #GOOD
test_truncaterollback.sh #GOOD
test_unlogged_table.sh #GOOD
test_across_wallevelminimal.sh # GOOD (expected failure: WAL
summaries are off during wal_level=minimal and an incremental cannot
be taken --> a full backup needs to be taken after wal_level=minimal)

CFbot failed on two hosts this time, I haven't looked at the details
yet (https://cirrus-ci.com/task/6425149646307328 -> end of EOL? ->
LOG:  WAL summarizer process (PID 71511) was terminated by signal 6:
Aborted?)

The remaining test idea is to have a longer running DB under stress
test in more real-world conditions and try to recover using chained
incremental backups (one such test was carried out on patchset v6 and
the result was good back then).

-J.




Re: trying again to get incremental backup

2023-11-15 Thread Jakub Wartak
Hi Robert,

[..spotted the v9 patchset..]

so I've spent some more time playing with patchset v8 (without the
6/6 testing patch related to wal_level=minimal); the exceptions tested
against patchset v9 are marked as such.

1. On compile time there were 2 warnings to shadowing variable (at
least with gcc version 10.2.1), but on v9 that is fixed:

blkreftable.c: In function ‘WriteBlockRefTable’:
blkreftable.c:520:24: warning: declaration of ‘brtentry’ shadows a
previous local [-Wshadow=compatible-local]
walsummarizer.c: In function ‘SummarizeWAL’:
walsummarizer.c:833:36: warning: declaration of ‘private_data’ shadows
a previous local [-Wshadow=compatible-local]

2. Usability thing: I hit the timeout hard: "This backup requires WAL
to be summarized up to 0/9D8, but summarizer has only reached
0/0." with summarize_wal=off (default), but apparently this is in the
TODO. Looks like an important usability thing.

3. I've verified that if the DB was in wal_level=minimal even
temporarily (and thus summarization was disabled), it is impossible to
take an incremental backup:

pg_basebackup: error: could not initiate base backup: ERROR:  WAL
summaries are required on timeline 1 from 0/7D8 to 0/1028, but
the summaries for that timeline and LSN range are incomplete
DETAIL:  The first unsummarized LSN is this range is 0/D04AE88.

4. As we have discussed off list, there is (was) this
pg_combinebackup bug in v8's reconstruct_from_incremental_file() where
it was unable to realize that - in case of combining multiple
incremental backups - it should stop looking for the previous instance
of the full file (while it was fine with v6 of the patchset). I've
checked it on v9 - it is good now.

5. On v8 I've finally played a little bit with standby(s) and this
patchset with a couple of basic scenarios while mixing the source of
the backups:

a. full on standby, incr1 on standby, full db restore (incl. incr1) on standby
# sometimes I'm getting spurious errors like these when doing
incrementals on standby with -c fast:
2023-11-15 13:49:05.721 CET [10573] LOG:  recovery restart point
at 0/A28
2023-11-15 13:49:07.591 CET [10597] WARNING:  aborting backup due
to backend exiting before pg_backup_stop was called
2023-11-15 13:49:07.591 CET [10597] ERROR:  manifest requires WAL
from final timeline 1 ending at 0/AF8, but this backup starts at
0/A28
2023-11-15 13:49:07.591 CET [10597] STATEMENT:  BASE_BACKUP (
INCREMENTAL,  LABEL 'pg_basebackup base backup',  PROGRESS,
CHECKPOINT 'fast',  WAIT 0,  MANIFEST 'yes',  TARGET 'client')
# when you retry the same pg_basebackup it goes fine (looks like a
CHECKPOINT on standby/restartpoint <-> summarizer disconnect, I'll dig
deeper tomorrow. It seems that issuing "CHECKPOINT; pg_sleep(1);"
against the primary just before pg_basebackup --incr on the standby
works around it)

b. full on primary, incr1 on standby, full db restore (incl. incr1) on
standby # WORKS
c. full on standby, incr1 on standby, full db restore (incl. incr1) on
primary # WORKS*
d. full on primary, incr1 on standby, full db restore (incl. incr1) on
primary # WORKS*

* - needs pg_promote() due to the controlfile having the standby bit
set + potential fiddling with postgresql.auto.conf as it contains the
primary_conninfo GUC.

6. Sci-fi-mode-on: I was wondering about the dangers of e.g. having a
more recent pg_basebackup (e.g. from pg18 one day) running against
pg17 in the scope of this incremental backup possibility. Is
it going to be safe? (currently there seem to be no safeguards
against such use) Or should those things (core, pg_basebackup)
be running in version lockstep?

Regards,
-J.




Re: trying again to get incremental backup

2023-11-01 Thread Jakub Wartak
On Mon, Oct 30, 2023 at 6:46 PM Robert Haas  wrote:
>
> On Thu, Sep 28, 2023 at 6:22 AM Jakub Wartak
>  wrote:
> > If that is still an area open for discussion: wouldn't it be better to
> > just specify LSN as it would allow resyncing standby across major lag
> > where the WAL to replay would be enormous? Given that we had
> > primary->standby where standby would be stuck on some LSN, right now
> > it would be:
> > 1) calculate backup manifest of desynced 10TB standby (how? using
> > which tool?)  - even if possible, that means reading 10TB of data
> > instead of just putting a number, isn't it?
> > 2) backup primary with such incremental backup >= LSN
> > 3) copy the incremental backup to standby
> > 4) apply it to the impaired standby
> > 5) restart the WAL replay
>
> As you may be able to tell from the flurry of posts and new patch
> sets, I'm trying hard to sort out the remaining open items that
> pertain to this patch set, and I'm now back to thinking about this
> one.
>
> TL;DR: I think the idea has some potential, but there are some
> pitfalls that I'm not sure how to address.
>
> I spent some time looking at how we currently use the data from the
> backup manifest. Currently, we do two things with it. First, when
> we're backing up each file, we check whether it's present in the
> backup manifest and, if not, we back it up in full. This actually
> feels fairly poor. If it makes any difference at all, then presumably
> the underlying algorithm is buggy and needs to be fixed. Maybe that
> should be ripped out altogether or turned into some kind of sanity
> check that causes a big explosion if it fails. Second, we check
> whether the WAL ranges reported by the client match up with the
> timeline history of the server (see PrepareForIncrementalBackup). This
> set of sanity checks seems fairly important to me, and I'd regret
> discarding them. I think there's some possibility that they might
> catch user error, like where somebody promotes multiple standbys and
> maybe they even get the same timeline on more than one of them, and
> then confusion might ensue.
[..]

> Another practical problem here is that, right now, pg_combinebackup
> doesn't have an in-place mode. It knows how to take a bunch of input
> backups and write out an output backup, but that output backup needs
> to go into a new, fresh directory (or directories plural, if there are
> user-defined tablespaces). I had previously considered adding such a
> mode, but the idea I had at the time wouldn't have worked for this
> case. I imagined that someone might want to run "pg_combinebackup
> --in-place full incr" and clobber the contents of the incr directory
> with the output, basically discarding the incremental backup you took
> in favor of a full backup that could have been taken at the same point
> in time.
[..]

Thanks for answering! It all sounds like this
resync-standby-using-primary-incrbackup idea isn't fit for the current
pg_combinebackup, but rather for a new tool, hopefully in the future. It
could take the current LSN from the stuck standby, calculate a manifest
on the lagged and offline standby (do we need to calculate the manifest
checksum in that case? I cannot find code for it), deliver it via
"UPLOAD_MANIFEST" to the primary and start fetching and applying the
differences, while doing some form of copy-on-write from the old &
incoming incrbackup data to "$relfilenodeid.new" and then
durable_unlink() the old one and durable_rename("$relfilenodeid.new",
"$relfilenodeid"). Would it still be possible in theory? (it could use
additional safeguards, like renaming the controlfile when starting and
renaming it back just before ending, to additionally block startup if it
hasn't finished). Also, judging by the comment near struct
IncrementalBackupInfo.manifest_files, it looks like even the checksums
are more for safeguarding rather than for the core implementation (?)

What I meant with the initial idea is not to hinder current efforts,
but to ask whether the current design would stand in the way of such a
cool new addition in the future?

> One thing I also realized when thinking about this is that you could
> probably hose yourself with the patch set as it stands today by taking
> a full backup, downgrading to wal_level=minimal for a while, doing
> some WAL-skipping operations, upgrading to a higher WAL-level again,
> and then taking an incremental backup. I think the solution to that is
> probably for the WAL summarizer to refuse to run if wal_level=minimal.
> Then there would be a gap in the summary files which an incremental
> backup attempt would detect.

As per earlier test [1], I've already tried to simulate that in
incrbackuptests-0.1.tgz/test_across_wallevelminimal.sh , but that
worked (but that was with CTAS-wal-minimal-optimizati

Re: trying again to get incremental backup

2023-10-20 Thread Jakub Wartak
Hi Robert,

On Wed, Oct 4, 2023 at 10:09 PM Robert Haas  wrote:
>
> On Tue, Oct 3, 2023 at 2:21 PM Robert Haas  wrote:
> > Here's a new patch set, also addressing Jakub's observation that
> > MINIMUM_VERSION_FOR_WAL_SUMMARIES needed updating.
>
> Here's yet another new version.[..]

Okay, so another piece of good news - related to the patch version #4.
A not-so-tiny stress test consisting of a pgbench run for 24h straight
(with incremental backups every 2h, based on the initial full backup),
followed by two PITRs (one not using the incremental backups and one
using them, to illustrate the performance point - and potentially spot
any errors in between). In both cases it worked fine. Pgbench has this
behaviour that it doesn't cause space growth over time - it produces
lots of WAL instead. Some stats:

START DBSIZE: ~3.3GB (pgbench -i -s 200 --partitions=8)
END DBSIZE: ~4.3GB
RUN DURATION: 24h (pgbench -P 1 -R 100 -T 86400)
WALARCHIVES-24h: 77GB
FULL-DB-BACKUP-SIZE: 3.4GB
INCREMENTAL-BACKUP-11-SIZE: 3.5GB
Env: Azure VM D4s (4VCPU), Debian 11, gcc 10.2, normal build (asserts
and debug disabled)
The increments were taken every 2h just to see if they would fail for
any reason - they did not.

PITR RTO RESULTS (copy/pg_combinebackup time + recovery time):
1. time to restore from fullbackup (+ recovery of 24h WAL[77GB]): 53s
+ 4640s =~ 78min
2. time to restore from fullbackup+incremental backup from 2h ago (+
recovery of 2h WAL [5.4GB]): 68s + 190s =~ 4min18s

I could probably pre-populate the DB with 1TB of cold data (not
touched by pgbench at all), just for the sake of argument, and that
would demonstrate how space efficient the incremental backup can be,
but most of the time would be wasted on copying the 1TB here...

> - I would like some feedback on the generation of WAL summary files.
> Right now, I have it enabled by default, and summaries are kept for a
> week. That means that, with no additional setup, you can take an
> incremental backup as long as the reference backup was taken in the
> last week.

I've just noticed one thing when recovery is in progress: is
summarization working during recovery - in the background - an
expected behaviour? I'm wondering about that because, after a freshly
restored and recovered DB, one would need to create a *new* full
backup and only from that point would new summaries have any use?
Sample log:

2023-10-20 11:10:02.288 UTC [64434] LOG:  restored log file
"000100020022" from archive
2023-10-20 11:10:02.599 UTC [64434] LOG:  restored log file
"000100020023" from archive
2023-10-20 11:10:02.769 UTC [64446] LOG:  summarized WAL on TLI 1 from
2/139B1130 to 2/239B1518
2023-10-20 11:10:02.923 UTC [64434] LOG:  restored log file
"000100020024" from archive
2023-10-20 11:10:03.193 UTC [64434] LOG:  restored log file
"000100020025" from archive
2023-10-20 11:10:03.345 UTC [64432] LOG:  restartpoint starting: wal
2023-10-20 11:10:03.407 UTC [64446] LOG:  summarized WAL on TLI 1 from
2/239B1518 to 2/25B609D0
2023-10-20 11:10:03.521 UTC [64434] LOG:  restored log file
"000100020026" from archive
2023-10-20 11:10:04.429 UTC [64434] LOG:  restored log file
"000100020027" from archive

>
> - On a related note, I haven't yet tested this on a standby, which is
> a thing that I definitely need to do. I don't know of a reason why it
> shouldn't be possible for all of this machinery to work on a standby
> just as it does on a primary, but then we need the WAL summarizer to
> run there too, which could end up being a waste if nobody ever tries
> to take an incremental backup. I wonder how that should be reflected
> in the configuration. We could do something like what we've done for
> archive_mode, where on means "only on if this is a primary" and you
> have to say always if you want it to run on standbys as well ... but
> I'm not sure if that's a design pattern that we really want to
> replicate into more places. I'd be somewhat inclined to just make
> whatever configuration parameters we need to configure this thing on
> the primary also work on standbys, and you can set each server up as
> you please. But I'm open to other suggestions.

I'll try to play with some standby restores in future, stay tuned.

Regards,
-J.




Re: pg_stat_get_activity(): integer overflow due to (int) * (int) for MemoryContextAllocHuge()

2023-10-02 Thread Jakub Wartak
On Fri, Sep 29, 2023 at 4:00 AM Michael Paquier  wrote:
>
> On Thu, Sep 28, 2023 at 11:01:14AM +0200, Jakub Wartak wrote:
> > v3 attached. I had a problem coming out with a better error message,
> > so suggestions are welcome.  The cast still needs to be present as per
> > above suggestion as 3GB is still valid buf size and still was causing
> > integer overflow. We just throw an error on >= 4GB with v3.
>
> +/* Safety net to prevent requesting huge memory by each query to 
> pg_stat_activity */
> +#define PGSTAT_MAX_ACTIVITY_BUF_SIZE 4 * 1024 * 1024 * 1024L
>
> -   size = add_size(size,
> -   
> mul_size(pgstat_track_activity_query_size, NumBackendStatSlots));
> +   pgstat_track_size = mul_size(pgstat_track_activity_query_size,
> +   NumBackendStatSlots);
> +   if(pgstat_track_size >= PGSTAT_MAX_ACTIVITY_BUF_SIZE)
> +   elog(FATAL, "too big Backend Activity Buffer allocation of 
> %zu bytes", pgstat_track_size);
> +   size = add_size(size, pgstat_track_size);
>
> That should be enough to put in a hardcoded 4GB safety limit, while
> mul_size() detects it at a higher range.  Note, however, that elog()
> is only used for internal errors that users should never face, but
> this one can come from a misconfiguration.  This would be better as an
> ereport(), with ERRCODE_CONFIG_FILE_ERROR as errcode, I guess.
>
> "Backend Activity Buffer" is the name of the internal struct.  Sure,
> it shows up on the system views for shmem, but I wouldn't use this
> term in a user-facing error message.  Perhaps something like "size
> requested for backend status is out of range" would be cleaner.  Other
> ideas are welcome.

Hi Michael,

I've attached v4 that covers your suggestions.

-J.


v4-0001-Introduce-memory-limit-for-BackendActivityBuffer-.patch
Description: Binary data


Re: trying again to get incremental backup

2023-09-28 Thread Jakub Wartak
On Wed, Aug 30, 2023 at 4:50 PM Robert Haas  wrote:
[..]

I've played a little bit more with this second batch of patches on
e8d74ad625f7344f6b715254d3869663c1569a51 @ 31Aug (days before the wait
events refactor):

test_across_wallevelminimal.sh
test_many_incrementals_dbcreate.sh
test_many_incrementals.sh
test_multixact.sh
test_pending_2pc.sh
test_reindex_and_vacuum_full.sh
test_truncaterollback.sh
test_unlogged_table.sh

all those basic tests had GOOD results. Please find them attached. I'll
try to schedule some more realistic (in terms of workload and sizes)
tests in a couple of days + maybe have some fun with cross backups and
restores across standbys. As per the earlier doubt: the raw wal_level =
minimal situation shouldn't be a concern, because it requires
max_wal_senders==0, while pg_basebackup requires it to be above 0 (due
to "FATAL:  number of requested standby connections exceeds
max_wal_senders (currently 0)").

I wanted to also introduce corruption into the pg_walsummaries files,
but later saw in the code that they are already covered with CRC32, cool.

In v07:
> +#define MINIMUM_VERSION_FOR_WAL_SUMMARIES 16

17 ?

> A related design question is whether we should really be sending the
> whole backup manifest to the server at all. If it turns out that we
> don't really need anything except for the LSN of the previous backup,
> we could send that one piece of information instead of everything. On
> the other hand, if we need the list of files from the previous backup,
> then sending the whole manifest makes sense.

If that is still an area open for discussion: wouldn't it be better to
just specify an LSN, as it would allow resyncing a standby across a
major lag where the WAL to replay would be enormous? Given that we had
a primary->standby setup where the standby would be stuck on some LSN,
right now it would be:
1) calculate the backup manifest of the desynced 10TB standby (how? using
which tool?)  - even if possible, that means reading 10TB of data
instead of just putting in a number, doesn't it?
2) backup the primary with such an incremental backup >= LSN
3) copy the incremental backup to the standby
4) apply it to the impaired standby
5) restart the WAL replay

> - We only know how to operate on directories, not tar files. I thought
> about that when working on pg_verifybackup as well, but I didn't do
> anything about it. It would be nice to go back and make that tool work
> on tar-format backups, and this one, too. I don't think there would be
> a whole lot of point trying to operate on compressed tar files because
> you need random access and that seems hard on a compressed file, but
> on uncompressed files it seems at least theoretically doable. I'm not
> sure whether anyone would care that much about this, though, even
> though it does sound pretty cool.

Also maybe it's too early to ask, but wouldn't it be nice if we could
have a future option in pg_combinebackup to avoid double writes when
used from restore hosts (right now we need to first reconstruct the
original datadir from the full and incremental backups on the host
hosting the backups and then TRANSFER it again to the target host)? So
something like this could work well from the restore host:
pg_combinebackup /tmp/backup1 /tmp/incbackup2 /tmp/incbackup3 -O tar -o - | ssh
dbserver 'tar xvf - -C /path/to/restored/cluster'. The bad thing is
that such a pipe prevents parallelism from day 1 and I'm afraid I do
not have a better easy idea on how to have both at the same time in
the long term.

-J.


incrbackuptests-0.1.tgz
Description: application/compressed


Re: pg_stat_get_activity(): integer overflow due to (int) * (int) for MemoryContextAllocHuge()

2023-09-28 Thread Jakub Wartak
On Thu, Sep 28, 2023 at 12:53 AM Michael Paquier  wrote:
>
> On Wed, Sep 27, 2023 at 10:29:25AM -0700, Andres Freund wrote:
> > I don't think going for size_t is a viable path for fixing this. I'm pretty
> > sure the initial patch would trigger a type mismatch from guc_tables.c - we
> > don't have infrastructure for size_t GUCs.
>
> Nothing marked as PGDLLIMPORT uses size_t in the tree currently, FWIW.
>
> > Perhaps we ought to error out (in BackendStatusShmemSize() or such) if
> > pgstat_track_activity_query_size * MaxBackends >= 4GB?
>
> Yeah, agreed that putting a check like that could catch errors more
> quickly.

Hi,

v3 attached. I had a problem coming up with a better error message,
so suggestions are welcome.  The cast still needs to be present as per
the above suggestion, as 3GB is still a valid buffer size and was still
causing integer overflow. We just throw an error on >= 4GB with v3.

-J.


v3-0001-Introduce-memory-limit-for-BackendActivityBuffer-.patch
Description: Binary data


Re: pg_stat_get_activity(): integer overflow due to (int) * (int) for MemoryContextAllocHuge()

2023-09-27 Thread Jakub Wartak
On Wed, Sep 27, 2023 at 10:08 AM Michael Paquier  wrote:
>
> On Wed, Sep 27, 2023 at 08:41:55AM +0200, Jakub Wartak wrote:
> > Attached patch adjusts pgstat_track_activity_query_size to be of
> > size_t from int and fixes the issue.
>
> This cannot be backpatched, and using size_t is not really needed as
> track_activity_query_size is capped at 1MB.  Why don't you just tweak
> the calculation done in pgstat_read_current_status() instead, say with
> a cast?

Thanks Michael, sure, that is probably a better alternative. I've attached v2.

BTW: CF entry is https://commitfest.postgresql.org/45/4592/

Regards,
-Jakub Wartak.


v2-0001-Cast-MemoryContextAllocHuge-calculations-in-pg_st.patch
Description: Binary data


pg_stat_get_activity(): integer overflow due to (int) * (int) for MemoryContextAllocHuge()

2023-09-27 Thread Jakub Wartak
Hi -hackers,

We've got a customer report that high max_connections (3k) with a high
pgstat_track_activity_query_size (1MB) ends up with:

postgres=# select * from pg_stat_get_activity(NULL);
ERROR:  invalid memory alloc request size 18446744072590721024
postgres=# select version();
   version
--
 PostgreSQL 17devel on x86_64-pc-linux-gnu, compiled by gcc (Debian
10.2.1-6) 10.2.1 20210110, 64-bit
(1 row)

it's in:
#0  errstart (elevel=elevel@entry=21, domain=domain@entry=0x0) at elog.c:358
#1  0x55971cafc9a8 in errstart_cold (elevel=elevel@entry=21,
domain=domain@entry=0x0) at elog.c:333
#2  0x55971cb018b7 in MemoryContextAllocHuge (context=, size=18446744072590721024) at mcxt.c:1594
#3  0x55971ce2fd59 in pgstat_read_current_status () at backend_status.c:767
#4  0x55971ce30ab1 in pgstat_read_current_status () at backend_status.c:1167
#5  pgstat_fetch_stat_numbackends () at backend_status.c:1168
#6  0x55971ceccc7f in pg_stat_get_activity (fcinfo=0x55971e239ff8)
at pgstatfuncs.c:308

It seems to be an integer overflow due to (int) * (int), while
MemoryContextAllocHuge() accepts Size (size_t) as its parameter. I
get similar behaviour with:
size_t val = (int)1048576 * (int)3022;
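
For illustration, a standalone snippet (not PostgreSQL code; the 3022
stands in for NumBackendStatSlots) showing the wrap-around and the
cast-based fix:

#include <stdio.h>
#include <stddef.h>

int
main(void)
{
    int query_size = 1048576;   /* pgstat_track_activity_query_size = 1MB */
    int slots = 3022;           /* roughly max_connections + aux processes */

    /* (int) * (int) overflows first (strictly UB; in practice it wraps
     * negative) and only then gets widened to size_t -> a huge value */
    size_t bad = query_size * slots;

    /* widening one operand first keeps the multiplication in 64 bits */
    size_t good = (size_t) query_size * slots;

    printf("bad=%zu good=%zu\n", bad, good);
    return 0;
}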

Attached patch adjusts pgstat_track_activity_query_size to be of
size_t from int and fixes the issue.

Regards,
-Jakub Wartak.


0001-Adjust-pgstat_track_activity_query_size-to-be-of-siz.patch
Description: Binary data


Re: pg_stat_statements and "IN" conditions

2023-09-21 Thread Jakub Wartak
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   not tested
Documentation:tested, passed

I've tested the patch on 17devel/master and it is my feeling - especially 
given the proliferation of ORMs - that we need such a thing in pgss. The thread 
has already taken almost 3 years, so it would be a pity to waste so much of your 
development time. Cfbot is green, and the patch works very well for me. IMVHO 
the commitfest status should even be set to ready-for-committer.

Given the:
SET query_id_const_merge = on;
SELECT pg_stat_statements_reset();
SELECT * FROM test WHERE a IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 11);
SELECT * FROM test WHERE a IN (1, 2, 3);
SELECT * FROM test WHERE a = ALL('{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 
12}');
SELECT * FROM test WHERE a = ANY (ARRAY[11,10,9,8,7,6,5,4,3,2,1]);

The patch results in:
  q  | calls
-+---
 SELECT * FROM test WHERE a = ALL($1)| 1
 SELECT pg_stat_statements_reset()   | 1
 SELECT * FROM test WHERE a IN ($1, $2, $3)  | 1
 SELECT * FROM test WHERE a IN (... [10-99 entries]) | 2

Of course it's a pity it doesn't collapse the ones below:

SELECT * FROM (VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10), (11)) 
AS t (num);
INSERT INTO dummy VALUES(1, 'text 1'),(2, 'text 2'),(3, 'text 3'),(4, 'text 
3'),(5, 'text 3'),(6, 'text 3'),(7, 'text 3'),(8, 'text 3'),(9, 'text 3'),(10, 
'text 3') ON CONFLICT (id) DO NOTHING;
PREPARE s3(int[], int[], int[], int[], int[], int[], int[], int[], int[], 
int[], int[]) AS SELECT * FROM test WHERE 
a = ANY ($1::int[]) OR 
a = ANY ($2::int[]) OR
[..]
a = ANY ($11::int[]) ;

but given the convoluted thread history, it's understandable and as you stated 
- maybe in future.

There's one additional benefit to this patch: the pg_hint_plan extension seems 
to borrow pgss's generate_normalized_query(). So if that's changed in the next 
major release, the pg_hint_plan hint table (transparent plan rewrite using a 
table) will automatically benefit from the generalization of the query string 
here (imagine fixing plans for an ORM that generates N {1,1024} IN() array 
elements; today that would be N entries in the "hint_plan.hints" 
table).

The new status of this patch is: Needs review


Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-11 Thread Jakub Wartak
On Mon, Jul 10, 2023 at 6:24 PM Andres Freund  wrote:
>
> Hi,
>
> On 2023-07-03 11:53:56 +0200, Jakub Wartak wrote:
> > Out of curiosity I've tried and it is reproducible as you have stated : XFS
> > @ 4.18.0-425.10.1.el8_7.x86_64:
> >...
> > According to iostat and blktrace -d /dev/sda -o - | blkparse -i - output ,
> > the XFS issues sync writes while ext4 does not, xfs looks like constant
> > loop of sync writes (D) by kworker/2:1H-kblockd:
>
> That clearly won't go well.  It's not reproducible on newer systems,
> unfortunately :(. Or well, fortunately maybe.
>
>
> I wonder if a trick to avoid this could be to memorialize the fact that we
> bulk extended before and extend by that much going forward? That'd avoid the
> swapping back and forth.

I haven't seen this thread [1] "Question on slow fallocate", from XFS
mailing list being mentioned here (it was started by Masahiko), but I
do feel it contains very important hints even challenging the whole
idea of zeroing out files (or posix_fallocate()). Please especially
see Dave's reply. He also argues that posix_fallocate() !=
fallocate().  What's interesting is that it's by design and newer
kernel versions should not prevent such behaviour, see my testing
result below.

All I can add is that those kernel versions (4.18.0) seem to be very
popular across customers (RHEL, Rocky) right now and that I've tested
on the most recent available one (4.18.0-477.15.1.el8_8.x86_64) using
Masahiko's test.c and still got a 6-7x slower time when using XFS on that
kernel. After installing kernel-ml (6.4.2) the test.c result seems to
be the same (note it occurs only when first allocating space; of
course it doesn't if the same file is rewritten/"reallocated"):
[root@rockyora ~]# uname -r
6.4.2-1.el8.elrepo.x86_64
[root@rockyora ~]# time ./test test.0 0
total   20
fallocate   0
filewrite   20

real0m0.405s
user0m0.006s
sys 0m0.391s
[root@rockyora ~]# time ./test test.0 1
total   20
fallocate   20
filewrite   0

real0m0.137s
user0m0.005s
sys 0m0.132s
[root@rockyora ~]# time ./test test.1 1
total   20
fallocate   20
filewrite   0

real0m0.968s
user0m0.020s
sys 0m0.928s
[root@rockyora ~]# time ./test test.2 2
total   20
fallocate   10
filewrite   10

real0m6.059s
user0m0.000s
sys 0m0.788s
[root@rockyora ~]# time ./test test.2 2
total   20
fallocate   10
filewrite   10

real0m0.598s
user0m0.003s
sys 0m0.225s
[root@rockyora ~]#

iostat -x reports during the first "time ./test test.2 2" (as you can
see, w_await is not that high but it accumulates):
Device   r/s       w/s  rMB/s   wMB/s  rrqm/s  wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm   %util
sda     0.00  15394.00   0.00  122.02    0.00   13.00   0.00   0.08     0.00     0.05    0.75      0.00      8.12   0.06  100.00
dm-0    0.00  15407.00   0.00  122.02    0.00    0.00   0.00   0.00     0.00     0.06    0.98      0.00      8.11   0.06  100.00

So maybe that's just a hint that you should try on slower storage
instead? (I think that on NVMe this issue would be hardly noticeable
due to low IO latency, not like here)

-J.

[1] - https://www.spinics.net/lists/linux-xfs/msg73035.html




Re: Doc limitation update proposal: include out-of-line OID usage per TOAST-ed columns

2023-07-05 Thread Jakub Wartak
On Tue, Jun 13, 2023 at 10:20 AM John Naylor 
wrote:

Hi John,

v3 is attached for review.

> >
> >- 
> >+ see note below on TOAST
>
> Maybe:
> "further limited by the number of TOAST-ed values; see note below"

Fixed.

> > I've wrongly put it, I've meant that pg_largeobject also consume OID
> > and as such are subject to 32TB limit.
> No, OID has nothing to do with the table size limit, they have to do with
the max number of LOs in a DB.

Clearly I needed more coffee back then...

> Also, perhaps the LO entries should be split into a separate patch. Since
they are a special case and documented elsewhere, it's not clear their
limits fit well here. Maybe they could.

Well, but those are *limits* of the engine, and large objects seem to be
pretty widely used, especially in migration scenarios (because they are
the only ones allowed to store over 1GB). I think we should warn about
the dangers of using pg_largeobject.

> > + 
> > +  For every TOAST-ed column (that is for field values wider than
TOAST_TUPLE_TARGET
> > +  [2040 bytes by default]), due to internal PostgreSQL implementation
of using one
> > +  shared global OID counter - today you cannot have more than
4,294,967,296 out-of-line
> > +  values in a single table.
> > + 
> > +
> > + 

> "column" != "field value". (..)"Today" is irrelevant. Needs polish.

Fixed.

> Also the shared counter is the cause of the slowdown, but not the reason
for the numeric limit.

Isn't it both? typedef Oid is an unsigned int = 2^32, and according to
the GetNewOidWithIndex() logic, if we exhaust the whole OID space it
will hang indefinitely, which has the same semantics as "being
impossible"/a permanent hang (?)

> +  out-of-line value (The search for free OIDs won't stop until it finds
the free OID).

> Still too much detail, and not very illuminating. If it *did* stop, how
does that make it any less of a problem?

OK I see your point - so it's removed. As for the question: well, maybe we
could document that one day in known-performance-cliffs.sgml (or via Wiki)
instead of limits.sgml.

> +  Therefore, the OID shortages will eventually cause slowdowns to the
> +  INSERTs/UPDATE/COPY statements.

> Maybe this whole sentence is better as "This will eventually cause
slowdowns for INSERT, UPDATE, and COPY statements."

Yes, it flows much better that way.

> > > +  Please note that that the limit of 2^32
> > > +  out-of-line TOAST values applies to the sum of both visible and
invisible tuples.
> > >
> > > We didn't feel the need to mention this for normal tuples...
> >
> > Right, but this somewhat points reader to the queue-like scenario
> > mentioned by Nikita.

> That seems to be in response to you mentioning "especially to steer
people away from designing very wide non-partitioned tables". In any case,
I'm now thinking that everything in this sentence and after doesn't belong
here. We don't need to tell people to vacuum, and we don't need to tell
them about partitioning as a workaround -- it's a workaround for the table
size limit, too, but we are just documenting the limits here.

OK, I've removed the visible/invisible fragments and workaround techniques.

-J.


v3-0001-doc-Add-some-OID-TOAST-related-limitations-to-the.patch
Description: Binary data


Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-03 Thread Jakub Wartak
Hi Masahiko,

Out of curiosity I've tried and it is reproducible as you have stated : XFS
@ 4.18.0-425.10.1.el8_7.x86_64:

[root@rockyora ~]# time ./test test.1 1
total   20
fallocate   20
filewrite   0

real0m5.868s
user0m0.035s
sys 0m5.716s
[root@rockyora ~]# time ./test test.2 2
total   20
fallocate   10
filewrite   10

real0m25.858s
user0m0.108s
sys 0m3.596s
[root@rockyora ~]# time ./test test.3 2
total   20
fallocate   10
filewrite   10

real0m25.927s
user0m0.091s
sys 0m3.621s
[root@rockyora ~]# time ./test test.4 1
total   20
fallocate   20
filewrite   0

real0m3.044s
user0m0.043s
sys 0m2.934s

According to iostat and blktrace -d /dev/sda -o - | blkparse -i - output,
XFS issues sync writes while ext4 does not; XFS looks like a constant
loop of sync writes (D) by kworker/2:1H-kblockd:
[..]
  8,0  2  34172  24.115364875   312  D  WS 44624928 + 16 [kworker/2:1H]
  8,0  2  34173  24.115482679     0  C  WS 44624928 + 16 [0]
  8,0  2  34174  24.115548251  6501  A  WS 42525760 + 16 <- (253,0) 34225216
  8,0  2  34175  24.115548660  6501  A  WS 44624960 + 16 <- (8,2) 42525760
  8,0  2  34176  24.115549111  6501  Q  WS 44624960 + 16 [test]
  8,0  2  34177  24.115551351  6501  G  WS 44624960 + 16 [test]
  8,0  2  34178  24.115552111  6501  I  WS 44624960 + 16 [test]
  8,0  2  34179  24.115559713   312  D  WS 44624960 + 16 [kworker/2:1H]
  8,0  2  34180  24.115677217     0  C  WS 44624960 + 16 [0]
  8,0  2  34181  24.115743150  6501  A  WS 42525792 + 16 <- (253,0) 34225248
  8,0  2  34182  24.115743502  6501  A  WS 44624992 + 16 <- (8,2) 42525792
  8,0  2  34183  24.115743949  6501  Q  WS 44624992 + 16 [test]
  8,0  2  34184  24.115746175  6501  G  WS 44624992 + 16 [test]
  8,0  2  34185  24.115746918  6501  I  WS 44624992 + 16 [test]
  8,0  2  34186  24.115754492   312  D  WS 44624992 + 16 [kworker/2:1H]

So it looks like you are onto something.
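
(For context, a minimal sketch of the kind of loop such a test presumably
runs - alternating posix_fallocate() against pwrite() extension depending on
the mode argument - purely illustrative, not the actual test.c from this
thread, and the block count is made up:)

    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(int argc, char **argv)
    {
        char        buf[8192];
        int         fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0600);
        int         mode = atoi(argv[2]);       /* 1 = fallocate only, 2 = mix */
        long        nblocks = 200000;           /* illustrative count */

        memset(buf, 0, sizeof(buf));
        for (long i = 0; i < nblocks; i++)
        {
            off_t       off = (off_t) i * sizeof(buf);

            /* extend the file either by fallocate or by writing zeroes */
            if (mode == 1 || i % 2 == 0)
                posix_fallocate(fd, off, sizeof(buf));
            else
                pwrite(fd, buf, sizeof(buf), off);
        }
        fsync(fd);
        close(fd);
        return 0;
    }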

Regards,
-J.


Re: memory leak in trigger handling (since PG12)

2023-05-24 Thread Jakub Wartak
Hi, just two cents:

On Tue, May 23, 2023 at 8:01 PM Andres Freund  wrote:
>
> Hi,
>
> On 2023-05-23 13:28:30 -0400, Tom Lane wrote:
> > Andres Freund  writes:
> > > Could it help to have a mode where the executor shutdown hook checks how 
> > > much
> > > memory is allocated in ExecutorState and warns if its too much?
> >
> > It'd be very hard to set a limit for what's "too much", since the amount
> > of stuff created initially will depend on the plan size.
>
> I was thinking of some limit that should really never be reached outside of a
> leak or work_mem based allocations, say 2GB or so.

RE: instrumentation subthread:
if that helps, then the technique below can work reasonably well on normal
binaries for end users (given that debug symbols are installed), so
maybe we don't need that much added infrastructure to see the hot
code path:

perf probe -x /path/to/postgres 'palloc' 'size=%di:u64' # RDI on
x86_64(palloc size arg0)
perf record -avg --call-graph dwarf -e probe_postgres:palloc -aR -p
 sleep 3 # cannot be longer, huge overhead (~3s=~2GB)

it produces:
50.27%  (563d0e380670) size=24
|
---palloc
   bms_copy
   ExecUpdateLockMode
   ExecBRUpdateTriggers
   ExecUpdate
[..]

49.73%  (563d0e380670) size=16
|
---palloc
   bms_copy
   RelationGetIndexAttrBitmap
   ExecUpdateLockMode
   ExecBRUpdateTriggers
   ExecUpdate
[..]

Now we know that those small palloc() calls are the culprit, but we didn't
know that at the time with Tomas. The problem here is that palloc() does not
receive - via its own arguments - the MemoryContext the allocation is going
to be made in. This is problematic for perf, because on RHEL8 I was not able
to generate an uprobe capable of reaching a global variable
(CurrentMemoryContext) at that time.

Additionally, what was even more frustrating when diagnosing that case on
the customer's end system was that such OOMs were crashing other
PostgreSQL clusters on the same OS. Even knowing the exact guilty
statement, it was impossible to limit the RSS memory usage of that
particular backend. So, what you are proposing makes a lot of sense.
It also got me thinking of implementing a safety-net memory GUC,
debug_query_limit_backend_memory=X MB, that would inject
setrlimit(RLIMIT_DATA) through an external extension via hook(s) and
unset it later. The man page states that it covers mmap() only on
Linux 4.7+, so it is future-proof but won't work e.g. on RHEL7 - maybe
that's still good enough? Or, well, maybe try to hack palloc() a little,
but that probably has too big an overhead, right? (just thinking out loud)
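
Roughly what I have in mind for the setrlimit() idea (sketch only; the GUC
name and the hook placement are made up, error handling omitted):

    #include <sys/resource.h>

    /* hypothetical GUC, in MB; 0 = disabled */
    static int  debug_query_limit_backend_memory = 0;

    /* to be called e.g. from an ExecutorStart hook of an external extension */
    static void
    limit_backend_memory(void)
    {
        struct rlimit rl;

        if (debug_query_limit_backend_memory <= 0)
            return;
        if (getrlimit(RLIMIT_DATA, &rl) == 0)
        {
            rl.rlim_cur = (rlim_t) debug_query_limit_backend_memory * 1024 * 1024;
            setrlimit(RLIMIT_DATA, &rl);    /* covers mmap() only on Linux >= 4.7 */
        }
    }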

-Jakub Wartak.




Re: In-placre persistance change of a relation

2023-04-27 Thread Jakub Wartak
On Tue, Apr 25, 2023 at 9:55 AM Kyotaro Horiguchi
 wrote:
>
> Rebased.
>
> I fixed some code comments and commit messages. I fixed the wrong
> arrangement of some changes among patches.  Most importantly, I fixed
> the a bug based on a wrong assumption that init-fork is not resides on
> shared buffers. Now smgrDoPendingCleanups drops buffer for a init-fork
> to be removed.
>
> The new fourth patch is a temporary fix for recently added code, which
> will soon be no longer needed.
>

Hi Kyotaro,

I've retested v28 of the patch with everything that came to my mind
(basic tests, --enable-tap-tests, restarts/crashes while adding
data, checking whether any files were left over, and I've checked the
things that were causing problems earlier: GiST on geometry [PostGIS]).
The only thing I've not tested this time were the performance runs
done earlier. The patch passed all my very limited tests along with
make check-world. The patch looks good to me on the surface from a
usability point of view. I haven't looked at the code, so the patch
might still need an in-depth review.

Regards,
-Jakub Wartak.




Re: Doc limitation update proposal: include out-of-line OID usage per TOAST-ed columns

2023-04-27 Thread Jakub Wartak
Hi John,

Thanks for your review. Here's v2 attached.

> -partition keys
> -32
> -can be increased by recompiling 
> PostgreSQL
> + partition keys
> + 32
> + can be increased by recompiling 
> PostgreSQL
>
> Spurious whitespace.

Hopefully fixed; I've tried to align with the other entries' tags.

> - limited by the number of tuples that can fit onto 4,294,967,295 
> pages
> - 
> + limited by the number of tuples that can fit onto 4,294,967,295 
> pages or using up to 2^32 OIDs for TOASTed values
> + please see discussion below about OIDs
>
> I would keep the first as is, and change the second for consistency to "see 
> note below on TOAST".

Fixed.

> Also, now that we have more than one note, we should make them more separate. 
> That's something to discuss, no need to do anything just yet.

OK.

> The new note needs a lot of editing to fit its new home. For starters:
>
> + 
> +  For every TOAST-ed columns
>
> column

Fixed.

> + (that is for field values wider than TOAST_TUPLE_TARGET
> +  [2040 bytes by default]), due to internal PostgreSQL implementation of 
> using one
> +  shared global OID counter - today you cannot have more than
>
> + 2^32
>
> Perhaps it should match full numbers elsewhere in the page.

Fixed.

>
> +(unsigned integer;
>
> True but irrelevant.
>
> +  4 billion)
>
> Imprecise and redundant.

Removed both.

> + out-of-line values in a single table, because there would have to be
> +  duplicated OIDs in its TOAST table.
>
> The part after "because" should be left off.

Removed.

> +  Please note that that the limit of 2^32
> +  out-of-line TOAST values applies to the sum of both visible and invisible 
> tuples.
>
> We didn't feel the need to mention this for normal tuples...

Right, but this somewhat points reader to the queue-like scenario
mentioned by Nikita.

> +  It is therefore crucial that the autovacuum manages to keep up with 
> cleaning the
> +  bloat and free the unused OIDs.
> + 
>
> Out of place.

I have somewhat reworded it, again just to reference the above.

> + 
> +  In practice, you want to have considerably less than that many TOASTed 
> values
> +  per table, because as the OID space fills up the system might spend large
> +  amounts of time searching for the next free OID when it needs to generate 
> a new
> +  out-of-line value.
>
> s/might spend large/will spend larger/ ?

Fixed.

> + After 100 failed attempts to get a free OID, a first log
> +  message is emitted "still searching for an unused OID in relation", but 
> operation
> +  won't stop and will try to continue until it finds the free OID.
>
> Too much detail?

OK - partially removed.

> + Therefore,
> +  the OID shortages may (in very extreme cases) cause slowdowns to the
> +  INSERTs/UPDATE/COPY statements.
>
> s/may (in very extreme cases)/will eventually/

Fixed.

> + It's also worth emphasizing that
>
> Unnecessary.

Removed.

> + only field
> +  values wider than 2KB
>
> TOAST_TUPLE_TARGET

Good catch, fixed.

> + will consume TOAST OIDs in this way. So, in practice,
> +  reaching this limit would require many terabytes of data in a single table,
>
> It may be worth mentioning what Nikita said above about updates.

I've tried (via the statement above about visible and invisible tuples).

> +  especially if you have a wide range of value widths.
>
> I never understood this part.

I've changed it, but I wonder if the new "large number of wide
columns" isn't too ambiguous due to "large" (?)

> +   
> + large objects size
> + subject to the same limitations as single relation 
> size
> + LOs are stored in single pg_largeobjects relation
> +   
>
> Are you under the impression that we can store a single large object up to 
> table size? The limit is 4TB, as documented elsewhere.

I've put it wrongly; I meant that large objects also consume OIDs and that
pg_largeobject, being a single relation, is subject to the 32TB size limit.

>
> +   
> + large objects number
>
> "large objects per database"

Fixed.

> + subject to the same limitations as rows per 
> table
>
> That implies table size is the only factor. Max OID is also a factor, which 
> was your stated reason to include LOs here in the first place.

Exactly..

Regards,
-Jakub Wartak.


v2-0001-doc-Add-some-OID-TOAST-related-limitations-to-the.patch
Description: Binary data


Re: Doc limitation update proposal: include out-of-line OID usage per TOAST-ed columns

2023-04-26 Thread Jakub Wartak
Hi,

>> These 2 discussions show that it's a painful experience to run into
>> this problem, and that the hackers have ideas on how to fix it, but
>> those fixes haven't materialized for years. So I would say that, yes,
>> this info belongs in the hard-limits section, because who knows how
>> long it'll take this to be fixed.
>>
>> Please submit a patch.
>>
> This is a production case for large databases with high update rates, but is 
> mistaken
> with reaching table size limit, although size limit is processed correctly.
>
> The note on TOAST limitation does not mention that TOAST values are not 
> actually
> updated on UPDATE operation - old value is marked as dead and new one is 
> inserted,
> and dead values should be vacuumed before value OID could be reused. The worst
> is that the INSERT/UPDATE clause does not fail if there is no OID available - 
> it is
> looped in an infinite loop of sorting out OIDs.

OK, so here is the documentation patch proposal. I've also added two
rows touching the subject of pg_largeobjects, as it is also related to
the OIDs topic. Please feel free to send adjusted patches.

Regards,
-J.


v1-0001-doc-Add-some-OID-TOAST-related-limitations-to-the.patch
Description: Binary data


Doc limitation update proposal: include out-of-line OID usage per TOAST-ed columns

2023-04-21 Thread Jakub Wartak
Hi -hackers,

I would like to ask whether it wouldn't be a good idea to copy the
https://wiki.postgresql.org/wiki/TOAST#Total_table_size_limit
discussion (out-of-line OID usage per TOAST-ed column / potential
limitation) to the official "Appendix K. PostgreSQL Limits", with the
little bonus of mentioning the "still searching for an unused OID in
relation" notice. Although this is pretty obvious information for some,
and follows from commit 7fbcee1b2d5f1012c67942126881bd492e95077e and the
discussion in [1], I wonder if it shouldn't be a little more widely known
via the limits page (especially to steer people away from designing very
wide non-partitioned tables).

Regards,
-J.

[1] - 
https://www.postgresql.org/message-id/flat/16722-93043fb459a41073%40postgresql.org




Re: doc: mentioned CREATE+ATTACH PARTITION as an alternative to CREATE TABLE..PARTITION OF

2023-03-14 Thread Jakub Wartak
Hi, I've tested the attached patch from Justin; it applied almost
cleanly to master, but there was a tiny typo and make
postgres-A4.pdf refused to build:
Note that creating a partition using PARTITION OF
=> (note the lack of a closing literal tag) =>
Note that creating a partition using PARTITION OF

Attached is version v0002 that contains this fix. @Justin, maybe you
could set the status to Ready for Committer
( https://commitfest.postgresql.org/42/3790/ )?

On Thu, Jan 19, 2023 at 10:58 PM Justin Pryzby  wrote:
>
> On Thu, Jan 19, 2023 at 04:47:59PM -0500, Robert Treat wrote:
> > I think all of that feedback is useful, I guess the immediate question
> > becomes if Justin wants to try to proceed with his patch implementing
> > the change, or if adjusting the documentation for the current
> > implementation is the right move for now.
>
> The docs change is desirable in any case, since it should be
> backpatched, and any patch to change CREATE..PARTITION OF would be for
> v17+ anyway.
>
> --
> Justin
>
>


v0002-Justin-doc_mention_CREATE+ATTACH_PARTITION.patch
Description: Binary data


Re: Syncrep and improving latency due to WAL throttling

2023-02-02 Thread Jakub Wartak
On Thu, Feb 2, 2023 at 11:03 AM Tomas Vondra
 wrote:

> > I agree that some other concurrent backend's
> > COMMIT could fsync it, but I was wondering if that's sensible
> > optimization to perform (so that issue_fsync() would be called for
> > only commit/rollback records). I can imagine a scenario with 10 such
> > concurrent backends running - all of them with this $thread-GUC set -
> > but that would cause 20k unnecessary fsyncs (?) -- (assuming single
> > HDD with IOlat=20ms and standby capable of sync-ack < 0.1ms , that
> > would be wasted close to 400s just due to local fsyncs?). I don't have
> > a strong opinion or in-depth on this, but that smells like IO waste.
> >
>
> Not sure what optimization you mean,

Let me clarify: let's say something like the below (on top of v3), just
to save IOPS:

--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2340,6 +2340,7 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
        if (sync_method != SYNC_METHOD_OPEN &&
            sync_method != SYNC_METHOD_OPEN_DSYNC)
        {
+           bool        openedLogFile = false;
            if (openLogFile >= 0 &&
                !XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo,
                                 wal_segment_size))
@@ -2351,9 +2352,15 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
                openLogTLI = tli;
                openLogFile = XLogFileOpen(openLogSegNo, tli);
                ReserveExternalFD();
+               openedLogFile = true;
            }

-           issue_xlog_fsync(openLogFile, openLogSegNo, tli);
+           /* can we bypass fsyncing the XLOG from the backend if
+            * we have been called without a commit request?
+            * usually the feature will be off here (XLogDelayPending == false)
+            */
+           if (openedLogFile == true || XLogDelayPending == false)
+               issue_xlog_fsync(openLogFile, openLogSegNo, tli);
        }

Plus maybe some additional logic to ensure that this micro-optimization
for saving IOPS is not applied when the backend is calling
XLogFlush()/XLogWrite() for an actual COMMIT record.

> But I think the backends still have to sleep at some point, so that they
> don't queue too much unflushed WAL - that's kinda the whole point, no?

Yes, but the WAL can be sent to the standby and written locally without being
fsynced locally (?) - provided that it was not a COMMIT - I'm just wondering
whether that makes sense (Question 1)

> The issue is more about triggering the throttling too early, before we
> hit the bandwidth limit. Which happens simply because we don't have a
> very good way to decide whether the latency is growing, so the patch
> just throttles everything.

The maximum TCP bandwidth seems to fluctuate in the real world, I suppose,
so it couldn't be a hard limit. On the other hand, I can imagine operators
setting "throttle-those-backends-if-global-WAL-latency-or-rate>XXX"
(an administrative decision). That would be cool to have, but yes, it would
require WAL latency and rate measurement first (which on its own would
make a very nice addition to pg_stat_replication). One thing
to note is that there could be many potential latencies (and WAL
throughput rates) to consider (e.g. a quorum of 3 sync standbys having
different latencies) - which one to choose?

(Question 2) I think we have simply reached a decision point on
whether the WIP/PoC is good enough as it is (as Andres wanted and
you +1'd), whether it should work as you propose, or whether we should
keep it as an idea for the future.

-J.




Re: Syncrep and improving latency due to WAL throttling

2023-02-01 Thread Jakub Wartak
On Wed, Feb 1, 2023 at 2:14 PM Tomas Vondra
 wrote:

> > Maybe we should avoid calling fsyncs for WAL throttling? (by teaching
> > HandleXLogDelayPending()->XLogFlush()->XLogWrite() to NOT to sync when
> > we are flushing just because of WAL thortting ?) Would that still be
> > safe?
>
> It's not clear to me how could this work and still be safe. I mean, we
> *must* flush the local WAL first, otherwise the replica could get ahead
> (if we send unflushed WAL to replica and then crash). Which would be
> really bad, obviously.

Well it was just a thought: in this particular test - with no other
concurrent activity happening - we are fsyncing() uncommitted
Heap/INSERT data much earlier than the final Transaction/COMMIT WAL
record comes into play. I agree that some other concurrent backend's
COMMIT could fsync it, but I was wondering if that's sensible
optimization to perform (so that issue_fsync() would be called for
only commit/rollback records). I can imagine a scenario with 10 such
concurrent backends running - all of them with this $thread-GUC set -
but that would cause 20k unnecessary fsyncs (?) -- (assuming single
HDD with IOlat=20ms and standby capable of sync-ack < 0.1ms , that
would be wasted close to 400s just due to local fsyncs?). I don't have
a strong opinion or in-depth on this, but that smells like IO waste.

-J.




Re: Syncrep and improving latency due to WAL throttling

2023-02-01 Thread Jakub Wartak
On Mon, Jan 30, 2023 at 9:16 AM Bharath Rupireddy
 wrote:

Hi Bharath, thanks for reviewing.

> I think measuring the number of WAL flushes with and without this
> feature that the postgres generates is great to know this feature
> effects on IOPS. Probably it's even better with variations in
> synchronous_commit_flush_wal_after or the throttling limits.

It's the same as point 19, so I covered it there.

> Although v2 is a WIP patch, I have some comments:
> 1. Coding style doesn't confirm to the Postgres standard:

Fixed all those that you mentioned, and I've removed the elog() and timing
code. BTW: is there a way to pgindent only my changes (on a git diff)?

> 2. It'd be better to add a TAP test hitting the WAL throttling.

TODO; any hints on what that test should look like are welcome
(checking pg_stat_wal?). I could spot only one place for it --
src/test/recovery/007_sync_rep.pl

> 3. We generally don't need timings to be calculated explicitly when we
> emit before and after log messages. If needed one can calculate the
> wait time from timestamps of the log messages. If it still needs an
> explicit time difference which I don't think is required, because we
> have a special event and before/after log messages, I think it needs
> to be under track_wal_io_timing to keep things simple.

Removed, it was just debugging aid to demonstrate the effect in the
session waiting.

> 4. XLogInsertRecord() can be a hot path for high-write workload, so
> effects of adding anything in it needs to be measured. So, it's better
> to run benchmarks with this feature enabled and disabled.

When the GUC is off (0 / default), in my tests the impact is none (it's
just a set of simple IFs). However, if the feature is enabled, the
INSERT is slowed down: I've repeated the initial test from the 1st post,
and the single-statement INSERT generating 50MB of WAL still takes the
same 4-5s, versus ~23s +/- 1s when the feature is activated with an RTT
of 10.1ms - but that slowdown is intentional.

> 5. Missing documentation of this feature and the GUC. I think we can
> describe extensively why we'd need a feature of this kind in the
> documentation for better adoption or usability.

TODO; I'm planning to add documentation when we're a little bit
closer to adding this to the CF.

> 6. Shouldn't the max limit be MAX_KILOBYTES?
> +_commit_flush_wal_after,
> +0, 0, 1024L*1024L,

Fixed.

> 7. Can this GUC name be something like
> synchronous_commit_wal_throttle_threshold to better reflect what it
> does for a backend?
> +{"synchronous_commit_flush_wal_after", PGC_USERSET,
> REPLICATION_SENDING,

Done.

> 8. A typo - s/confiration/confirmation
[..]
> 9. This
> "Sets the maximum logged WAL in kbytes, after which wait for sync
> commit confiration even without commit "
> better be something like below?
> "Sets the maximum amount of WAL in kilobytes a backend generates after
> which it waits for synchronous commit confiration even without commit
> "

Fixed as you have suggested.

> 10. I think there's nothing wrong in adding some assertions in
> HandleXLogDelayPending():
> Assert(synchronous_commit_flush_wal_after > 0);
> Assert(backendWalInserted > synchronous_commit_flush_wal_after * 1024L);
> Assert(XactLastThrottledRecEnd is not InvalidXLogRecPtr);

Sure, added.

> 11. Specify the reason in the comments as to why we had to do the
> following things:
> Here:
> +/* flush only up to the last fully filled page */
> +XLogRecPtr LastFullyWrittenXLogPage = XactLastThrottledRecEnd
> - (XactLastThrottledRecEnd % XLOG_BLCKSZ);
> Talk about how this avoids multiple-flushes for the same page.
>
> Here:
> + * Called from ProcessMessageInterrupts() to avoid waiting while
> being in critical section
> + */
> +void HandleXLogDelayPending()
> Talk about how waiting in a critical section, that is in
> XLogInsertRecord() causes locks to be held longer durations and other
> effects.

Added.

> Here:
> +/* WAL throttling */
> +backendWalInserted += rechdr->xl_tot_len;
> Be a bit more verbose about why we try to throttle WAL and why only
> for sync replication, the advantages, effects etc.

Added.

> 12. This better be a DEBUG1 or 2 message instead of WARNINGs to not
> clutter server logs.
> +/* XXX Debug for now */
> +elog(WARNING, "throttling WAL down on this session
> (backendWalInserted=%d, LSN=%X/%X flushingTo=%X/%X)",
> +backendWalInserted,
> +LSN_FORMAT_ARGS(XactLastThrottledRecEnd),
> +LSN_FORMAT_ARGS(LastFullyWrittenXLogPage));
>
> +elog(WARNING, "throttling WAL down on this session - end (%f
> ms)", timediff);

OK, that's entirely removed.

> 13. BTW, we don't need to hold interrupts while waiting for sync
> replication ack as it may block query cancels or proc die pendings.
> You can remove these, unless there's a strong reason.
> +//HOLD_INTERRUPTS();
> +//RESUME_INTERRUPTS();

Sure, removed. However, one problem I've discovered is that we were
hitting Assert(InterruptHoldoffCount > 0) in 

Re: Syncrep and improving latency due to WAL throttling

2023-01-27 Thread Jakub Wartak
Hi Bharath,

On Fri, Jan 27, 2023 at 12:04 PM Bharath Rupireddy
 wrote:
>
> On Fri, Jan 27, 2023 at 2:03 PM Alvaro Herrera  
> wrote:
> >
> > On 2023-Jan-27, Bharath Rupireddy wrote:
> >
> > > Looking at the patch, the feature, in its current shape, focuses on
> > > improving replication lag (by throttling WAL on the primary) only when
> > > synchronous replication is enabled. Why is that? Why can't we design
> > > it for replication in general (async, sync, and logical replication)?
> > >
> > > Keeping replication lag under check enables one to provide a better
> > > RPO guarantee

Sorry for not answering earlier; although the title of the thread refers
to SyncRep only, I think everyone would be open to also covering the
more advanced async scenarios that Satyanarayana proposed in those
earlier threads (as Simon did much earlier). I was proposing
wal_throttle_larger_transactions=<..> (for lack of a better name),
however v2 of the patch from today again contains a reference
to syncrep (it could be changed of course). It's just the code that is
missing, which could be added on top of v2, so we could combine our
efforts. It's just the competency and time that I lack to
implement such async-scenario code paths (maybe Tomas V. has something
in mind, in his own words [1]), so any feedback from senior
hackers is more than welcome ...

-J.

[1] - "The solution I've imagined is something like autovacuum
throttling - do some accounting of how much "WAL bandwidth" each
process consumed, and then do the delay/sleep in a suitable place. "




Re: Syncrep and improving latency due to WAL throttling

2023-01-27 Thread Jakub Wartak
Hi,

v2 is attached.

On Thu, Jan 26, 2023 at 4:49 PM Andres Freund  wrote:

> Huh? Why did you remove the GUC?

After reading previous threads, my optimism about ever getting this into
a shape that is widely accepted dropped significantly (mainly due
to the discussion of the wider category of 'WAL I/O throttling', especially
in the async case, RPO targets in the async case, and potentially
calculating global bandwidth). I've assumed that it is a working sketch,
and as such not having the GUC name right now (just for the sync case)
would still allow covering various other async cases in the future without
breaking compatibility with potential GUC name changes (see my previous
"wal_throttle_larger_transactions=" proposal). Although
I've restored the synchronous_commit_flush_wal_after GUC in the v2
patch, sticking to such a name closes off using this code for
async WAL throttling in the future.

> Clearly this isn't something we can default to on..

Yes, I agree. It's never meant to be on by default.

> I don't like the resets around like this. Why not just move it into
> XLogFlush()?

Fixed.

> Hm - wonder if it'd be better to determine the LSN to wait for in
> XLogInsertRecord(). There'll be plenty cases where we'll insert multiple (but
> bounded number of) WAL records before processing interrupts. No need to flush
> more aggressively than necessary...

Fixed.

> SyncRepWaitForLSN() has this comment:
>  * 'lsn' represents the LSN to wait for.  'commit' indicates whether this LSN
>  * represents a commit record.  If it doesn't, then we wait only for the WAL
>  * to be flushed if synchronous_commit is set to the higher level of
>  * remote_apply, because only commit records provide apply feedback.

Hm, not sure if I understand: are you saying that we should (in the
throttled scenario) have some special feedback messages or not --
irrespective of the setting? To be honest, the throttling shouldn't have
to wait for the full configured standby level; it's just about the slowdown
itself (so IMHO it would be fine even in the remote_write/remote_apply
scenario if the remote walreceiver had just received the data, not
necessarily written it to a file or waited for it to be applied). Just
this waiting for a round-trip ack about LSN progress would be enough to
slow down the writer (?). I've added some timing log output to the draft
and it shows a more or less constant, solid RTT even as it stands:

psql:insertBIG.sql:2: WARNING:  throttling WAL down on this session
(backendWalInserted=262632, LSN=2/A6CB70B8 flushingTo=2/A6CB6000)
psql:insertBIG.sql:2: WARNING:  throttling WAL down on this session -
end (10.500052 ms)
psql:insertBIG.sql:2: WARNING:  throttling WAL down on this session
(backendWalInserted=262632, LSN=2/A6CF7C08 flushingTo=2/A6CF6000)
psql:insertBIG.sql:2: WARNING:  throttling WAL down on this session -
end (10.655370 ms)
psql:insertBIG.sql:2: WARNING:  throttling WAL down on this session
(backendWalInserted=262632, LSN=2/A6D385E0 flushingTo=2/A6D38000)
psql:insertBIG.sql:2: WARNING:  throttling WAL down on this session -
end (10.627334 ms)
psql:insertBIG.sql:2: WARNING:  throttling WAL down on this session
(backendWalInserted=262632, LSN=2/A6D78FA0 flushingTo=2/A6D78000)
[..]

 > I think we'd want a distinct wait event for this.

Added and tested that it shows up.

> > +volatile sig_atomic_t XLogDelayPending = false;
> I don't think the new variable needs to be volatile, or even
> sig_atomic_t. We'll just manipulate it from outside signal handlers.

Changed to bool, previously I wanted it to "fit" the above ones.

-J.


v2-0001-WIP-Syncrep-and-improving-latency-due-to-WAL-thro.patch
Description: Binary data


Re: Syncrep and improving latency due to WAL throttling

2023-01-26 Thread Jakub Wartak
> On 1/25/23 20:05, Andres Freund wrote:
> > Hi,
> >
> > Such a feature could be useful - but I don't think the current place of
> > throttling has any hope of working reliably:
[..]
> > You're blocking in the middle of an XLOG insertion.
[..]
> Yeah, I agree the sleep would have to happen elsewhere.

Fixed.

> > My best idea for how to implement this in a somewhat safe way would be for
> > XLogInsertRecord() to set a flag indicating that we should throttle, and set
> > InterruptPending = true. Then the next CHECK_FOR_INTERRUPTS that's allowed 
> > to
> > proceed (i.e. we'll not be in a critical / interrupts off section) can
> > actually perform the delay.  That should fix the hard deadlock danger and
> > remove most of the increase in lock contention.
> >
>
> The solution I've imagined is something like autovacuum throttling - do
> some accounting of how much "WAL bandwidth" each process consumed, and
> then do the delay/sleep in a suitable place.
>

By the time you replied, I had already tried what Andres recommended.

[..]
>> At the very least this'd
> > have to flush only up to the last fully filled page.
> >
> Same for the flushes of partially flushed pages - if there's enough of
> small OLTP transactions, we're already having this issue. I don't see
> why would this make it measurably worse. But if needed, we can simply
> backoff to the last page boundary, so that we only flush the complete
> page. That would work too - the goal is not to flush everything, but to
> reduce how much of the lag is due to the current process (i.e. to wait
> for sync replica to catch up).

I've introduced the rounding to the last written page (hopefully).

> > Just counting the number of bytes inserted by a backend will make the 
> > overhead
> > even worse, as the flush will be triggered even for OLTP sessions doing tiny
> > transactions, even though they don't contribute to the problem you're trying
> > to address.  How about counting how many bytes of WAL a backend has inserted
> > since the last time that backend did an XLogFlush()?
> >
>
> No, we should reset the counter at commit, so small OLTP transactions
> should not reach really trigger this. That's kinda the point, to only
> delay "large" transactions producing a lot of WAL.

Fixed.

> > I also suspect the overhead will be more manageable if you were to force a
> > flush only up to a point further back than the last fully filled page. We
> > don't want to end up flushing WAL for every page, but if you just have a
> > backend-local accounting mechanism, I think that's inevitably what you'd end
> > up with when you have a large number of sessions. But if you'd limit the
> > flushing to be done to synchronous_commit_flush_wal_after / 2 boundaries, 
> > and
> > only ever to a prior boundary, the amount of unnecessary WAL flushes would 
> > be
> > proportional to synchronous_commit_flush_wal_after.
> >
>
> True, that's kinda what I suggested above as a solution for partially
> filled WAL pages.
>
> I agree this is crude and we'd probably need some sort of "balancing"
> logic that distributes WAL bandwidth between backends, and throttles
> backends producing a lot of WAL.

OK - that's not included (yet?), as it would make this much more complex.

In summary:  Attached is a slightly reworked version of this patch.
1. Moved logic outside XLogInsertRecord() under ProcessInterrupts()
2. Flushes up to the last page boundary, still uses SyncRepWaitForLSN()
3. Removed GUC for now (always on->256kB); potentially to be restored
4. Resets backendWal counter on commits

It's still crude, but first tests indicate that it behaves similarly
(to the initial one with GUC = 256kB).
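
In code terms, the flow is roughly the following (sketch only, reusing the
names mentioned in this thread; THROTTLE_THRESHOLD_BYTES just stands in for
the eventual GUC):

    /* XLogInsertRecord(): accounting only, no waiting inside the critical section */
    backendWalInserted += rechdr->xl_tot_len;
    if (backendWalInserted >= THROTTLE_THRESHOLD_BYTES)
    {
        XactLastThrottledRecEnd = EndPos;
        XLogDelayPending = true;
        InterruptPending = true;        /* ask CHECK_FOR_INTERRUPTS() to act on it */
    }

    /* ProcessInterrupts() -> HandleXLogDelayPending(): a safe place to wait */
    XLogRecPtr  flushTo = XactLastThrottledRecEnd -
                          (XactLastThrottledRecEnd % XLOG_BLCKSZ);  /* last full page */

    XLogFlush(flushTo);
    SyncRepWaitForLSN(flushTo, false);  /* wait for the sync standby to ack */
    backendWalInserted = 0;
    XLogDelayPending = false;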

Also, via Bharath's email I've found another patch proposal by
Simon [1], and I would like to avoid opening that Pandora's box again,
but to stress this: this feature is specifically aimed at solving OLTP
latency on *sync*rep. Some code could be added/generalized
later on and the feature could be extended to the async case, but that
then opens the question of managing the possible WAL throughput
budget/back throttling - this was also raised in the first thread here [2]
by Konstantin.

IMHO it could implement various strategies under user-settable GUC
"wal_throttle_larger_transactions=(sync,256kB)" or
"wal_throttle_larger_transactions=off" , and someday later someone
could teach the code the async case (let's say under
"wal_throttle_larger_transactions=(asyncMaxRPO, maxAllowedLag=8MB,
256kB)"). Thoughts?

[1] - 
https://www.postgresql.org/message-id/flat/CA%2BU5nMLfxBgHQ1VLSeBHYEMjHXz_OHSkuFdU6_1quiGM0RNKEg%40mail.gmail.com
[2] - 
https://www.postgresql.org/message-id/71f3e6fb-2fca-a798-856a-f23c8ede2333%40garret.ru


0001-WIP-Syncrep-and-improving-latency-due-to-WAL-throttl.patch
Description: Binary data


Syncrep and improving latency due to WAL throttling

2023-01-25 Thread Jakub Wartak
v 4.320, 0 failed, lag 0.063 ms
progress: 4.0 s, 44.0 tps, lat 14.429 ms stddev 4.715, 0 failed, lag 0.063 ms
progress: 5.0 s, 49.0 tps, lat 18.111 ms stddev 8.536, 0 failed, lag 0.062 ms
progress: 6.0 s, 58.0 tps, lat 17.929 ms stddev 8.198, 0 failed, lag 0.060 ms
progress: 7.0 s, 65.0 tps, lat 20.186 ms stddev 12.973, 0 failed, lag 0.059 ms
progress: 8.0 s, 47.0 tps, lat 16.174 ms stddev 6.508, 0 failed, lag 0.061 ms
progress: 9.0 s, 45.0 tps, lat 14.485 ms stddev 4.736, 0 failed, lag 0.061 ms
progress: 10.0 s, 53.0 tps, lat 16.879 ms stddev 8.783, 0 failed, lag 0.061 ms
progress: 11.0 s, 42.0 tps, lat 13.711 ms stddev 4.464, 0 failed, lag 0.062 ms
progress: 12.0 s, 49.0 tps, lat 13.252 ms stddev 4.082, 0 failed, lag 0.062 ms
progress: 13.0 s, 48.0 tps, lat 14.179 ms stddev 6.238, 0 failed, lag 0.058 ms
progress: 14.0 s, 43.0 tps, lat 12.210 ms stddev 2.993, 0 failed, lag 0.060 ms
progress: 15.0 s, 34.0 tps, lat 14.811 ms stddev 6.544, 0 failed, lag 0.062 ms
transaction type: 
[..]
latency average = 15.454 ms
latency stddev = 7.354 ms

session2 output (notice the INSERT took much longer but did NOT affect
the pgbench's stddev at all):
postgres=# set synchronous_commit_flush_wal_after = 256;
SET
postgres=# \i insertBIG.sql
Timing is on.
[..]
INSERT 0 50
Time: 22737.729 ms (00:22.738)

Without this feature (or with synchronous_commit_flush_wal_after=0)
the TCP SendQ on the walsender-->walreceiver socket grows, and as a
result any subsequent sendto() by the OLTP backends/walwriter ends up
being queued for too long, causing stalls of activity.

-Jakub Wartak.


0001-GUC-synchronous_commit_flush_wal_after.patch
Description: Binary data


Re: psql's FETCH_COUNT (cursor) is not being respected for CTEs

2023-01-10 Thread Jakub Wartak
On Wed, Jan 4, 2023 at 6:38 PM Robert Haas  wrote:
>
> On Wed, Jan 4, 2023 at 11:36 AM Tom Lane  wrote:
> > As you well know, psql's FETCH_COUNT mechanism is far older than
> > single-row mode.  I don't think anyone's tried to transpose it
> > onto that.  I agree that it seems like a good idea to try.
> > There will be more per-row overhead, but the increase in flexibility
> > is likely to justify that.
>
> Yeah, I was vaguely worried that there might be more per-row overhead,
> not that I know a lot about this topic. I wonder if there's a way to
> mitigate that. I'm a bit suspicious that what we want here is really
> more of an incremental mode than a single-row mode i.e. yeah, you want
> to fetch rows without materializing the whole result, but maybe not in
> batches of exactly size one.

Given the low importance and very low priority of this, how about
adding it as a TODO wiki item and maybe just adding some warning
instead? I've intentionally avoided parsing grammar and regexps, so it's
not perfect (not that I care about this too much either, as web
crawlers have already indexed this $thread). BTW, I've found two related
threads if you know what you are looking for [1][2]

-Jakub Wartak.

[1] - 
https://www.postgresql.org/message-id/flat/a0a854b6-563c-4a11-bf1c-d6c6f924004d%40manitou-mail.org
[2] - 
https://www.postgresql.org/message-id/flat/1274761885.4261.233.camel%40minidragon


0001-psql-warn-about-CTE-queries-to-be-executed-without-u.patch
Description: Binary data


psql's FETCH_COUNT (cursor) is not being respected for CTEs

2023-01-04 Thread Jakub Wartak
Hi -hackers,

I've spent some time fighting "out of memory" errors coming
out of psql when trying to use a cursor via FETCH_COUNT. It might be
a not-so-well-known fact (?) that CTEs are not executed with a cursor
when asked to do so, but are instead silently executed with a potentially
huge memory allocation going on. A patch is attached. My one doubt is that
not every statement starting with "WITH" is a WITH(..) SELECT, of course.
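
Presumably the gist of it is just teaching psql's "is this a SELECT-ish
query" test to also recognize WITH, roughly like below (illustrative sketch,
not the actual hunk; it assumes the SELECT/VALUES keyword check in
src/bin/psql/common.c as the place to extend, and that leading
whitespace/comments have already been skipped):

    if (pg_strncasecmp(query, "select", 6) == 0 ||
        pg_strncasecmp(query, "values", 6) == 0 ||
        pg_strncasecmp(query, "with", 4) == 0)  /* new: route CTEs through the cursor too */
        return true;    /* eligible for DECLARE ... CURSOR FOR + FETCH FORWARD n */
    return false;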

Demo (one might also get the "out of memory for query result"):

postgres@hive:~$ psql -Ant --variable='FETCH_COUNT=100' -c "WITH data
AS (SELECT  generate_series(1, 2000) as Total) select repeat('a',
100) || data.Total || repeat('b', 800) as total_pat from data;"
Killed
postgres@hive:~$ tail -4 /var/log/postgresql/postgresql-14-main.log
[..]
2023-01-04 12:46:20.193 CET [32936] postgres@postgres LOG:  could not
send data to client: Broken pipe
[..]
2023-01-04 12:46:20.195 CET [32936] postgres@postgres FATAL:
connection to client lost

With the patch:
postgres@hive:~$ /tmp/psql16-with-patch -Ant
--variable='FETCH_COUNT=100' -c "WITH data AS (SELECT
generate_series(1, 2000) as Total) select repeat('a', 100) ||
data.Total || repeat('b', 800) as total_pat from data;" | wc -l
2000
postgres@hive:~$

Regards,
-Jakub Wartak.


0001-psql-allow-CTE-queries-to-be-executed-also-using-cur.patch
Description: Binary data


Re: CREATE UNLOGGED TABLE seq faults when debug_discard_caches=1

2022-11-25 Thread Jakub Wartak
Hi hackers,

Mr Lane, thank you for backporting this also to version 13. It seems
to be occurring in the wild (without debug_discard_caches) for a real
user too, when doing a lot of "CREATE INDEX i ON
unlogged_table_truncated_after_crash (x,y)", which sometimes (rarely)
results in SIGSEGV (signal 11). I've reproduced it also on 13.9 recently,
thanks to "break *btbuildempty / call InvalidateSystemCaches()".

I'm leaving a partial stack trace so that others might find it (note the
smgrwrite reln=0x0):

#0  smgrwrite (reln=0x0, forknum=INIT_FORKNUM, blocknum=0,
buffer=0xeef828 "", skipFsync=true) at smgr.c:516
#1  0x004e5492 in btbuildempty (index=0x7f201fc3c7e0) at nbtree.c:178
#2  0x005417f4 in index_build
(heapRelation=heapRelation@entry=0x7f201fc49dd0,
indexRelation=indexRelation@entry=0x7f201fc3c7e0,
indexInfo=indexInfo@entry=0x1159dd8,
#3  0x00542838 in index_create
(heapRelation=heapRelation@entry=0x7f201fc49dd0,
indexRelationName=indexRelationName@entry=0x1159f38 "",
indexRelationId=y..)
#4  0x005db9c8 in DefineIndex
(relationId=relationId@entry=1804880199, stmt=stmt@entry=0xf2fab8,
indexRelationId=indexRelationId@entry=0,
parentIndexId=parentIndexId@entry=0

-Jakub Wartak.

On Fri, Nov 25, 2022 at 9:48 AM Tomas Vondra
 wrote:
>
>
>
> On 11/18/22 15:43, Tom Lane wrote:
> > David Geier  writes:
> >> On a different note: are we frequently running our tests suites with
> >> debug_discard_caches=1 enabled?
> >> It doesn't seem like.
> >
> > Hmm.   Buildfarm members avocet and trilobite are supposed to be
> > doing that, but their runtimes of late put the lie to it.
> > Configuration option got lost somewhere?
> >
>
> Yup, my bad - I forgot to tweak CPPFLAGS when upgrading the buildfarm
> client to v12. Fixed, next run should be with
>
> CPPFLAGS => '-DCLOBBER_CACHE_ALWAYS',
>
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
>
>
>




Re: Damage control for planner's get_actual_variable_endpoint() runaway

2022-11-22 Thread Jakub Wartak
Hi all,

apologies, the patch was rushed - my bad. I'm attaching a
fixed one as v0004 (as it is the 4th patch floating around here).

-Jakub Wartak

On Mon, Nov 21, 2022 at 9:55 PM Robert Haas  wrote:
>
> On Mon, Nov 21, 2022 at 1:17 PM Andres Freund  wrote:
> > On November 21, 2022 9:37:34 AM PST, Robert Haas  
> > wrote:
> > >On Mon, Nov 21, 2022 at 12:30 PM Andres Freund  wrote:
> > >> This can't quite be right - isn't this only applying the limit if we 
> > >> found a
> > >> visible tuple?
> > >
> > >It doesn't look that way to me, but perhaps I'm just too dense to see
> > >the problem?
> >
> > The earlier version didn't have the issue, but the latest seems to only 
> > limit after a visible tuple has been found. Note the continue; when 
> > fetching a heap tuple fails.
>
> Oh, that line was removed in Simon's patch but not in Jakub's version,
> I guess. Jakub's version also leaves out the last_block = block line
> which seems pretty critical.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com


0001-Damage-control-for-planner-s-get_actual_variable_end_v0004.patch
Description: Binary data


Re: Damage control for planner's get_actual_variable_endpoint() runaway

2022-11-21 Thread Jakub Wartak
Hi,

Draft version of the patch attached (it is based on Simon's).
I would be happier if we could turn that #define into a GUC (just in
case), although I do understand the effort to reduce the number of
knobs (as their high count causes complexity of its own).

-Jakub Wartak.

On Mon, Nov 21, 2022 at 4:35 PM Robert Haas  wrote:
>
> On Mon, Nov 21, 2022 at 10:32 AM Tom Lane  wrote:
> > Robert Haas  writes:
> > > Is there any reason to tie this into page costs? I'd be more inclined
> > > to just make it a hard limit on the number of pages. I think that
> > > would be more predictable and less prone to surprising (bad) behavior.
> >
> > Agreed, a simple limit of N pages fetched seems appropriate.
> >
> > > And to be honest I would be inclined to make it quite a small number.
> > > Perhaps 5 or 10. Is there a good argument for going any higher?
> >
> > Sure: people are not complaining until it gets into the thousands.
> > And you have to remember that the entire mechanism exists only
> > because of user complaints about inaccurate estimates.  We shouldn't
> > be too eager to resurrect that problem.
> >
> > I'd be happy with a limit of 100 pages.
>
> OK.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com


0001-Damage-control-for-planner-s-get_actual_variable_end.patch
Description: Binary data


Damage control for planner's get_actual_variable_endpoint() runaway

2022-11-21 Thread Jakub Wartak
Hi hackers,

the case of the planner's
src/backend/utils/adt/selfuncs.c:get_actual_variable_endpoint()
spending literally seconds seems to be a well-known fact among hackers
(in extreme wild cases it can take over an hour on VLDBs). For those
unfamiliar: it is a planner estimation step that reads the real table
index (including dead rows) to find the min/max value. It is a blackbox
mechanism that works without any warning, is often hugely affected by
the number of dead tuples in indexes, and has no on/off switch or
built-in limit on how far it can go. It was discussed on the pgsql
mailing lists several times [1]-[5]. It almost seems to work fine
in 99.9% of cases, until it doesn't and blows up big time on larger
systems, and from there the operator doesn't have a lot of choices [a lot
of time having already been wasted on identifying the planner as the
root cause]:
1) one can properly VACUUM (which everybody seems to agree is the
proper way to go, but it is often problematic due to various other
circumstances, especially on big tables without a serious partitioning
strategy) - again this might be very time consuming
2) one cannot trade a lot of CPU/IO burned on planning (actually
fetching indexes on multi-TB tables) for less accurate plans, and
realistically speaking rewriting queries is often impossible
3) the application might not support prepared statements, and even
if it does, simple queries/reports are also affected
4) there is no visibility into how much activity is spent on the btree
index in get_actual_variable_endpoint() alone, so one cannot estimate the
system-wide impact

I would like to trigger a discussion on how to give the end user at least
partial control over what the optimizer performs here. I was thinking
about several things, and each of them has pros and cons:

a) the attached quick patch (by Simon Riggs) that puts a maximum allowed
cost constraint on the index-lookup machinery as a safeguard (that
#define is debatable; in my testing it reduced the hit from ~850ms to
0.6ms +/- 0.3ms at the current value of 20) - see the sketch after this list
b) I was wondering about creating a new wait class "Planner" with the
event "ReadingMinMaxIndex" (or similar). The obvious drawback is the
naming/categorization, as wait events are ... well ... as the name "WAIT"
implies, while those btree lookups could easily be ON-CPU activity.
c) Any other idea, e.g. see [3] or [5] (a cache was being proposed).
d) For completeness: a new GUC/knob to completely disable the
functionality (debug_optimizer_minmax_est), but that's really just a
trimmed-down version of the patch.
e) I was also wondering about some DEBUG/NOTICE elog() when this takes
more than, let's say, an arbitrary 10s, but that could easily spam the log
file
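
To illustrate option a): the essence is a hard cap on the number of heap
pages visited while chasing the endpoint (sketch only; the constant name and
value are debatable, and the real loop in get_actual_variable_endpoint() has
more to it than this):

#define VISITED_PAGES_LIMIT 100

    while ((tid = index_getnext_tid(index_scan, indexscandir)) != NULL)
    {
        BlockNumber block = ItemPointerGetBlockNumber(tid);

        if (block != last_block && n_visited_heap_pages++ > VISITED_PAGES_LIMIT)
            break;              /* give up and fall back to the regular estimate */
        last_block = block;

        /* ... existing visibility checks and endpoint extraction ... */
    }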

A reproducer on a very small dataset follows. Please note that the reproducer
here shows the effect on the first EXPLAIN run; however, in the real observed
situation (multi-TB unpartitioned table) each consecutive planner
operation (just EXPLAIN) on that index was affected (I don't know why
LP_DEAD/hint-bit cleanup was not kicking in - maybe it was, but given
the scale of the problem it was not helping much).
-Jakub Wartak.

[1] - 
https://www.postgresql.org/message-id/flat/54446AE2.6080909%40BlueTreble.com#f436bb41cf044b30eeec29472a13631e
[2] - 
https://www.postgresql.org/message-id/flat/db7111f2-05ef-0ceb-c013-c34adf4f4121%40gmail.com
[3] - 
https://www.postgresql.org/message-id/flat/05C72CF7-B5F6-4DB9-8A09-5AC897653113%40yandex.ru
(SnapshotAny vs SnapshotDirty discussions between Tom and Robert)
[4] - 
https://www.postgresql.org/message-id/flat/CAECtzeVPM4Oi6dTdqVQmjoLkDBVChNj7ed3hNs1RGrBbwCJ7Cw%40mail.gmail.com
[5] - https://postgrespro.com/list/thread-id/2436130 (cache)

s1:
=# drop table test;
=# create table test (id bigint primary key) with (autovacuum_enabled = 'off');
=# insert into test select generate_series(1,1000); -- ~310MB
table, ~210MB index

s2/start the long running transaction:
=# begin;
=# select min(id) from test;

s1:
=# delete from test where id>100;
=# analyze test;
=# set enable_indexonlyscan = 'off'; -- just in case, to force
BitmapHeapScans, which according to backend/access/nbtree/README won't
set LP_DEAD, but my bt_page_items() tests indicate that it does (??)
=# set enable_indexscan = 'off';
=# explain (buffers, verbose) select * from test where id > 1100;
=>  Planning: Buffers: shared hit=9155 read=55276 dirtied=55271
written=53617 / Time: 851.761 ms
=# explain (buffers, verbose) select * from test where id > 1100;
=>  Planning: Buffers: shared read=24594 / Time: 49.172 ms
=# vacuum (verbose) test; => index scan needed: 39824 pages from table
(90.00% of total) had 900 dead item identifiers removed
=# explain (buffers, verbose) select * from test where id > 1100;
=>  Planning: Buffers: shared hit=14 read=3 / Time: 0.550 ms

with patch, the results are:
p=# explain (buffers, verbose) select * from test where id > 1100;
=> Planning: / Buffers: shared hit=17 read=6 dirtied=3 writt

RE: Use fadvise in wal replay

2022-06-23 Thread Jakub Wartak
Hey Andrey,

> > On 23 June 2022, at 13:50, Jakub Wartak wrote:
> >
> > Thoughts?
> The patch leaves 1st 128KB chunk unprefetched. Does it worth to add and extra
> branch for 120KB after 1st block when readOff==0?
> Or maybe do
> + posix_fadvise(readFile, readOff + XLOG_BLCKSZ, RACHUNK,
> POSIX_FADV_WILLNEED);
> instead of
> + posix_fadvise(readFile, readOff + RACHUNK, RACHUNK,
> POSIX_FADV_WILLNEED);
> ?

> > Notes:
> > - no GUC, as the default/identical value seems to be the best
> I think adding this performance boost on most systems definitely worth 1 
> syscall
> per 16 pages. And I believe 128KB to be optimal for most storages. And having
> no GUCs sounds great.
> 
> But storage systems might be different, far beyond benchmarks.
> All in all, I don't have strong opinion on having 1 or 0 GUCs to configure 
> this.
> 
> I've added patch to the CF.

Cool. As for a GUC, I'm afraid there's going to be resistance to adding yet 
another GUC (to avoid having too many knobs). Ideally it would be nice if we 
had some advanced/deep/hidden parameters, but there is no such thing.
Maybe another option would be to use (N * maintenance_io_concurrency * 
XLOG_BLCKSZ), so for N=1 that's 80kB and for N=2 160kB (pretty close to the 
default value, and still tweakable by the end user). Let's wait and see what 
others say.
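
i.e. something along these lines inside the WAL read path (sketch only; the
N multiplier, the chunk definition, and the exact trigger condition are all
up for debate):

#define RACHUNK     (1 * maintenance_io_concurrency * XLOG_BLCKSZ) /* N = 1 -> 80kB by default */

#if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_WILLNEED)
    /* issue one readahead hint per RACHUNK instead of one per 8kB page */
    if ((readOff % RACHUNK) == 0)
        (void) posix_fadvise(readFile, readOff + RACHUNK, RACHUNK,
                             POSIX_FADV_WILLNEED);
#endif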

-J.


RE: Use fadvise in wal replay

2022-06-23 Thread Jakub Wartak
>> > On 21 Jun 2022, at 16:59, Jakub Wartak  wrote:
>> Oh, wow, your benchmarks show really impressive improvement.
>> 
>> > I think that 1 additional syscall is not going to be cheap just for
>> > non-standard OS configurations
>> Also we can reduce number of syscalls by something like
>> 
>> #if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_WILLNEED)
>> if ((readOff % (8 * XLOG_BLCKSZ)) == 0)
>> posix_fadvise(readFile, readOff + XLOG_BLCKSZ, XLOG_BLCKSZ * 8,
>> POSIX_FADV_WILLNEED); #endif
>> 
>> and maybe define\reuse the some GUC to control number of prefetched pages
>> at once.

Hi, I was thinking the same, so I got the patch (attached) to the point where 
it achieves identical performance with and without readahead enabled:

baseline, master, default Linux readahead (128kb):
33.979, 0.478
35.137, 0.504
34.649, 0.518

master+patched, readahead disabled:
34.338, 0.528
34.568, 0.575
34.007, 1.136

master+patched, readahead enabled (as default):
33.935, 0.523
34.109, 0.501
33.408, 0.557

Thoughts?

Notes:
- no GUC, as the default/identical value seems to be the best
- POSIX_FADV_SEQUENTIAL is apparently much slower and doesn't seem to have any 
effect from xlogreader.c at all, while _WILLNEED does (testing once again 
contradicts "common wisdom"?)

-J.


0001-Use-fadvise-to-prefetch-WAL-in-xlogrecovery.patch
Description: 0001-Use-fadvise-to-prefetch-WAL-in-xlogrecovery.patch


RE: Use fadvise in wal replay

2022-06-21 Thread Jakub Wartak
> On Tue, Jun 21, 2022 at 10:33 PM Jakub Wartak 
> wrote:
> > > > Maybe the important question is why would be readahead mechanism
> > > > be
> > > disabled in the first place via /sys | blockdev ?
> > >
> > > Because database should know better than OS which data needs to be
> > > prefetched and which should not. Big OS readahead affects index scan
> > > performance.
> >
> > OK fair point, however the patch here is adding 1 syscall per XLOG_BLCKSZ
> which is not cheap either. The code is already hot and there is example from 
> the
> past where syscalls were limiting the performance [1]. Maybe it could be
> prefetching in larger batches (128kB? 1MB? 16MB?)  ?
> 
> I've always thought we'd want to tell it about the *next* segment file, to
> smooth the transition from one file to the next, something like the attached 
> (not
> tested).

Hey Thomas! 

Apparently that theory is false. Redo-bench [1] results (1st number is total 
recovery time in seconds; 3.1GB pgdata, out of which 2.6GB is pg_wals/166 files). 
Redo-bench was slightly hacked to always drop fs caches after copying, so that 
there is nothing in the fs cache (neither pgdata nor pg_wals; shared fs). 
maintenance_io_concurrency is at the default (10), recovery_prefetch likewise 
(try; on by default)

master, default Linux readahead (128kb):
33.979, 0.478
35.137, 0.504
34.649, 0.518

master, blockdev --setra 0 /dev/nvme0n1: 
53.151, 0.603
58.329, 0.525
52.435, 0.536

master, with yours patch (readaheads disabled) -- double checked, calls to 
fadvise64(offset=0 len=0) were there
58.063, 0.593
51.369, 0.574
51.716, 0.59

master, with Kirill's original patch (readaheads disabled)
38.25, 1.134
36.563, 0.582
37.711, 0.584

I've also noted that in both cases POSIX_FADV_SEQUENTIAL is being used instead 
of WILLNEED (?). 
I haven't quantified the tradeoff of master vs Kirill's patch with readahead, 
but I think that 1 additional syscall is not going to be cheap just to cover 
non-standard OS configurations (?)

-J.

[1] - https://github.com/macdice/redo-bench


RE: Use fadvise in wal replay

2022-06-21 Thread Jakub Wartak
> > Maybe the important question is why would be readahead mechanism be
> disabled in the first place via /sys | blockdev ?
> 
> Because database should know better than OS which data needs to be
> prefetched and which should not. Big OS readahead affects index scan
> performance.

OK, fair point; however, the patch here adds 1 syscall per XLOG_BLCKSZ, which 
is not cheap either. The code is already hot and there is an example from the 
past where syscalls were limiting performance [1]. Maybe it could prefetch in 
larger batches (128kB? 1MB? 16MB?)?

-J.

[1] - https://commitfest.postgresql.org/28/2606/






RE: Use fadvise in wal replay

2022-06-21 Thread Jakub Wartak
>> > On 21 Jun 2022, at 12:35, Amit Kapila  wrote:
>> >
>> > I wonder if the newly introduced "recovery_prefetch" [1] for PG-15 can
>> > help your case?
>> 
>> AFAICS recovery_prefetch tries to prefetch main fork, but does not try to
>> prefetch WAL itself before reading it. Kirill is trying to solve the problem 
>> of
>> reading WAL segments that are our of OS page cache.

It seems that it is set to 128 (kB) by default; another thing is that having 
(default) 16MB WAL segments might also hinder the readahead heuristics 
compared to a configured bigger WAL segment size.

Maybe the important question is why the readahead mechanism would be disabled 
in the first place via /sys | blockdev ?

-J.




RE: pgcon unconference / impact of block size on performance

2022-06-10 Thread Jakub Wartak
> On 6/9/22 13:23, Jakub Wartak wrote:
> >>>>>>> The really
> >>>>>> puzzling thing is why is the filesystem so much slower for
> >>>>>> smaller pages. I mean, why would writing 1K be 1/3 of writing 4K?
> >>>>>> Why would a filesystem have such effect?
> >>>>>
> >>>>> Ha! I don't care at this point as 1 or 2kB seems too small to
> >>>>> handle many real world scenarios ;)
> >>> [..]
> >>>> Independently of that, it seems like an interesting behavior and it
> >>>> might tell us something about how to optimize for larger pages.
> >>>
> >>> OK, curiosity won:
> >>>
> >>> With randwrite on ext4 directio using 4kb the avgqu-sz reaches
> >>> ~90-100 (close to fio's 128 queue depth?) and I'm getting ~70k IOPS
> >>> [with maxdepth=128] With randwrite on ext4 directio using 1kb the
> >>> avgqu-sz is just
> >> 0.7 and I'm getting just ~17-22k IOPS [with maxdepth=128] ->  conclusion:
> >> something is being locked thus preventing queue to build up With
> >> randwrite on
> >> ext4 directio using 4kb the avgqu-sz reaches ~2.3 (so something is
> >> queued) and I'm also getting ~70k IOPS with minimal possible
> >> maxdepth=4 ->  conclusion: I just need to split the lock contention by 4.
> >>>
> >>> The 1kB (slow) profile top function is aio_write() ->  ->
> >>> iov_iter_get_pages()
> >> -> internal_get_user_pages_fast() and there's sadly plenty of "lock"
> >> -> keywords
> >> inside {related to memory manager, padding to full page size, inode
> >> locking} also one can find some articles / commits related to it [1]
> >> which didn't made a good feeling to be honest as the fio is using
> >> just 1 file (even while I'm on kernel 5.10.x). So I've switched to 4x
> >> files and numjobs=4 and got easily 60k IOPS, contention solved
> >> whatever it was :) So I would assume PostgreSQL (with it's splitting
> >> data files by default on 1GB boundaries and multiprocess
> >> architecture) should be relatively safe from such ext4 inode(?)/mm(?)
> contentions even with smallest 1kb block sizes on Direct I/O some day.
> >>>
> >>
> >> Interesting. So what parameter values would you suggest?
> >
> > At least have 4x filename= entries and numjobs=4
> >
> >> FWIW some of the tests I did were on xfs, so I wonder if that might
> >> be hitting similar/other bottlenecks.
> >
> > Apparently XFS also shows same contention on single file for 1..2kb 
> > randwrite,
> see [ZZZ].
> >
> 
> I don't have any results yet, but after thinking about this a bit I find this 
> really
> strange. Why would there be any contention with a single fio job?  Doesn't 
> contention 
> imply multiple processes competing for the same resource/lock etc.?

Maybe 1 job throws a lot of concurrent random I/Os that contend against the 
same inodes / pages (?) 
 
> Isn't this simply due to the iodepth increase? IIUC with multiple fio jobs, 
> each
> will use a separate iodepth value. So with numjobs=4, we'll really use 
> iodepth*4,
> which can make a big difference.

I was thinking the same (it should be enough to have big queue depth), but 
apparently one needs many files (inodes?) too:

On 1 file I'm not getting a lot of IOPS on small blocksize (even with numjobs), 
< 20k IOPS always:
numjobs=1/ext4/io_uring/nvme/randwrite/128/1k/1.txt:  write: IOPS=13.5k, 
BW=13.2MiB/s (13.8MB/s)(396MiB/30008msec); 0 zone resets
numjobs=1/ext4/io_uring/nvme/randwrite/128/4k/1.txt:  write: IOPS=49.1k, 
BW=192MiB/s (201MB/s)(5759MiB/30001msec); 0 zone resets
numjobs=1/ext4/libaio/nvme/randwrite/128/1k/1.txt:  write: IOPS=16.8k, 
BW=16.4MiB/s (17.2MB/s)(494MiB/30001msec); 0 zone resets
numjobs=1/ext4/libaio/nvme/randwrite/128/4k/1.txt:  write: IOPS=62.5k, 
BW=244MiB/s (256MB/s)(7324MiB/30001msec); 0 zone resets
numjobs=1/xfs/io_uring/nvme/randwrite/128/1k/1.txt:  write: IOPS=14.7k, 
BW=14.3MiB/s (15.0MB/s)(429MiB/30008msec); 0 zone resets
numjobs=1/xfs/io_uring/nvme/randwrite/128/4k/1.txt:  write: IOPS=46.4k, 
BW=181MiB/s (190MB/s)(5442MiB/30002msec); 0 zone resets
numjobs=1/xfs/libaio/nvme/randwrite/128/1k/1.txt:  write: IOPS=22.3k, 
BW=21.8MiB/s (22.9MB/s)(654MiB/30001msec); 0 zone resets
numjobs=1/xfs/libaio/nvme/randwrite/128/4k/1.txt:  write: IOPS=59.6k, 
BW=233MiB/s (244MB/s)(6988MiB/30001msec); 0 zone resets
numjobs=4/ext4/io_uring/nvme/randwrite/128/1k/1.txt:  write: IOPS=13.9k, 
BW=13.6MiB/s (14.2MB/s)(407MiB/30035msec); 0 zone resets [FAIL 4*qdepth]
numjobs=4/ext4/io_uring

RE: pgcon unconference / impact of block size on performance

2022-06-09 Thread Jakub Wartak
> >  The really
>  puzzling thing is why is the filesystem so much slower for smaller
>  pages. I mean, why would writing 1K be 1/3 of writing 4K?
>  Why would a filesystem have such effect?
> >>>
> >>> Ha! I don't care at this point as 1 or 2kB seems too small to handle
> >>> many real world scenarios ;)
> > [..]
> >> Independently of that, it seems like an interesting behavior and it
> >> might tell us something about how to optimize for larger pages.
> >
> > OK, curiosity won:
> >
> > With randwrite on ext4 directio using 4kb the avgqu-sz reaches ~90-100
> > (close to fio's 128 queue depth?) and I'm getting ~70k IOPS [with
> > maxdepth=128] With randwrite on ext4 directio using 1kb the avgqu-sz is just
> 0.7 and I'm getting just ~17-22k IOPS [with maxdepth=128] ->  conclusion:
> something is being locked thus preventing queue to build up With randwrite on
> ext4 directio using 4kb the avgqu-sz reaches ~2.3 (so something is queued) and
> I'm also getting ~70k IOPS with minimal possible maxdepth=4 ->  conclusion: I
> just need to split the lock contention by 4.
> >
> > The 1kB (slow) profile top function is aio_write() ->  -> 
> > iov_iter_get_pages()
> -> internal_get_user_pages_fast() and there's sadly plenty of "lock" keywords
> inside {related to memory manager, padding to full page size, inode locking}
> also one can find some articles / commits related to it [1] which didn't made 
> a
> good feeling to be honest as the fio is using just 1 file (even while I'm on 
> kernel
> 5.10.x). So I've switched to 4x files and numjobs=4 and got easily 60k IOPS,
> contention solved whatever it was :) So I would assume PostgreSQL (with it's
> splitting data files by default on 1GB boundaries and multiprocess 
> architecture)
> should be relatively safe from such ext4 inode(?)/mm(?) contentions even with
> smallest 1kb block sizes on Direct I/O some day.
> >
> 
> Interesting. So what parameter values would you suggest?

At least have 4x filename= entries and numjobs=4
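Something along these lines is what I mean - just a sketch, the paths, per-file size and runtime are placeholders:

cat > fio.4files <<'EOF'
[global]
ioengine=libaio
direct=1
rw=randwrite
bs=1k
iodepth=128
runtime=30
time_based
size=16g
group_reporting

[j1]
filename=/mnt/nvme/fio_f1
[j2]
filename=/mnt/nvme/fio_f2
[j3]
filename=/mnt/nvme/fio_f3
[j4]
filename=/mnt/nvme/fio_f4
EOF
fio fio.4files

Four job sections, each with its own file, spread the load over four inodes, while group_reporting keeps the summary aggregated as before.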

> FWIW some of the tests I did were on xfs, so I wonder if that might be hitting
> similar/other bottlenecks.

Apparently XFS also shows the same contention on a single file for 1..2kB randwrite, see [ZZZ].

[root@x ~]# mount|grep /mnt/nvme
/dev/nvme0n1 on /mnt/nvme type xfs 
(rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)

# using 1 fio job and 1 file
[root@x ~]# grep -r -e 'read :' -e 'write:' libaio
libaio/nvme/randread/128/1k/1.txt:  read : io=5779.1MB, bw=196573KB/s, 
iops=196573, runt= 30109msec
libaio/nvme/randread/128/2k/1.txt:  read : io=10335MB, bw=352758KB/s, 
iops=176379, runt= 30001msec
libaio/nvme/randread/128/4k/1.txt:  read : io=0MB, bw=758408KB/s, 
iops=189601, runt= 30001msec
libaio/nvme/randread/128/8k/1.txt:  read : io=28914MB, bw=986896KB/s, 
iops=123361, runt= 30001msec
libaio/nvme/randwrite/128/1k/1.txt:  write: io=694856KB, bw=23161KB/s, 
iops=23161, runt= 30001msec [ZZZ]
libaio/nvme/randwrite/128/2k/1.txt:  write: io=1370.7MB, bw=46782KB/s, 
iops=23390, runt= 30001msec [ZZZ]
libaio/nvme/randwrite/128/4k/1.txt:  write: io=8261.3MB, bw=281272KB/s, 
iops=70318, runt= 30076msec [OK]
libaio/nvme/randwrite/128/8k/1.txt:  write: io=11598MB, bw=394320KB/s, 
iops=49289, runt= 30118msec

# but it's all ok using 4 fio jobs and 4 files
[root@x ~]# grep -r -e 'read :' -e 'write:' libaio
libaio/nvme/randread/128/1k/1.txt:  read : io=6174.6MB, bw=210750KB/s, 
iops=210750, runt= 30001msec
libaio/nvme/randread/128/2k/1.txt:  read : io=12152MB, bw=413275KB/s, 
iops=206637, runt= 30110msec
libaio/nvme/randread/128/4k/1.txt:  read : io=24382MB, bw=832116KB/s, 
iops=208028, runt= 30005msec
libaio/nvme/randread/128/8k/1.txt:  read : io=29281MB, bw=985831KB/s, 
iops=123228, runt= 30415msec
libaio/nvme/randwrite/128/1k/1.txt:  write: io=1692.2MB, bw=57748KB/s, 
iops=57748, runt= 30003msec
libaio/nvme/randwrite/128/2k/1.txt:  write: io=3601.9MB, bw=122940KB/s, 
iops=61469, runt= 30001msec
libaio/nvme/randwrite/128/4k/1.txt:  write: io=8470.8MB, bw=285857KB/s, 
iops=71464, runt= 30344msec
libaio/nvme/randwrite/128/8k/1.txt:  write: io=11449MB, bw=390603KB/s, 
iops=48825, runt= 30014msec
 

> >>> Both scenarios (raw and fs) have had direct=1 set. I just cannot
> >>> understand
> >> how having direct I/O enabled (which disables caching) achieves
> >> better read IOPS on ext4 than on raw device... isn't it contradiction?
> >>>
> >>
> >> Thanks for the clarification. Not sure what might be causing this.
> >> Did you use the same parameters (e.g. iodepth) in both cases?
> >
> > Explanation: it's the CPU scheduler migrations mixing the performance result
> during the runs of fio  (as you have in your framework). Various VCPUs seem to
> be having varying max IOPS characteristics (sic!) and CPU scheduler seems to 
> be
> unaware of it. At least on 1kB and 4kB blocksize this happens also notice that
> some VCPUs [ marker] don't reach 100% CPU reaching almost twice the
> result; while cores 0, 3 do reach 100% and lack CPU 

RE: pgcon unconference / impact of block size on performance

2022-06-08 Thread Jakub Wartak
Hi, got some answers!

TL;DR: for fio it would make sense to use many stress files (instead of 1) and likewise set numjobs to roughly the number of VCPUs, to avoid various pitfalls.

> >> The really
> >> puzzling thing is why is the filesystem so much slower for smaller
> >> pages. I mean, why would writing 1K be 1/3 of writing 4K?
> >> Why would a filesystem have such effect?
> >
> > Ha! I don't care at this point as 1 or 2kB seems too small to handle
> > many real world scenarios ;)
[..]
> Independently of that, it seems like an interesting behavior and it might 
> tell us
> something about how to optimize for larger pages.

OK, curiosity won:

With randwrite on ext4 directio using 4kb the avgqu-sz reaches ~90-100 (close 
to fio's 128 queue depth?) and I'm getting ~70k IOPS [with maxdepth=128]
With randwrite on ext4 directio using 1kb the avgqu-sz is just 0.7 and I'm 
getting just ~17-22k IOPS [with maxdepth=128] ->  conclusion: something is 
being locked thus preventing queue to build up
With randwrite on ext4 directio using 4kb the avgqu-sz reaches ~2.3 (so 
something is queued) and I'm also getting ~70k IOPS with minimal possible 
maxdepth=4 ->  conclusion: I just need to split the lock contention by 4.
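The avgqu-sz numbers come from watching the device with extended iostat in a second terminal while fio runs (device name assumed):

iostat -x nvme0n1 1   # watch the avgqu-sz (aqu-sz on newer sysstat) and avgrq-sz columns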

The 1kB (slow) profile's top function is aio_write() -> ... -> iov_iter_get_pages() -> internal_get_user_pages_fast(), and there's sadly plenty of "lock" keywords inside {related to the memory manager, padding to full page size, inode locking}. One can also find some articles / commits related to it [1], which honestly didn't leave a good feeling, as fio is using just 1 file (even though I'm on kernel 5.10.x). So I've switched to 4x files and numjobs=4 and easily got 60k IOPS - contention solved, whatever it was :) So I would assume PostgreSQL (with its splitting of data files on 1GB boundaries by default and its multiprocess architecture) should be relatively safe from such ext4 inode(?)/mm(?) contention even with the smallest 1kB block sizes on Direct I/O some day.

[1] - https://www.phoronix.com/scan.php?page=news_item=EXT4-DIO-Faster-DBs

> > Both scenarios (raw and fs) have had direct=1 set. I just cannot understand
> how having direct I/O enabled (which disables caching) achieves better read
> IOPS on ext4 than on raw device... isn't it contradiction?
> >
> 
> Thanks for the clarification. Not sure what might be causing this. Did you 
> use the
> same parameters (e.g. iodepth) in both cases?

Explanation: it's the CPU scheduler migrations mixing up the performance results during the fio runs (as you have in your framework). Various VCPUs seem to have varying max IOPS characteristics (sic!) and the CPU scheduler seems to be unaware of it. This happens at least on 1kB and 4kB blocksize; also notice that some VCPUs (marked below) don't reach 100% CPU yet achieve almost twice the result, while cores 0 and 3 do reach 100% and lack the CPU power to do more. The only thing I don't get is that it doesn't make sense from the extended lscpu output (but maybe it's AWS XEN mixing up real CPU mappings, who knows).

[root@x ~]# for((x=0; x<=3; x++)) ; do echo "$x:"; taskset -c $x fio fio.ext4 | 
grep -e 'read :' -e 'cpu '; done
0:
  read : io=2416.8MB, bw=123730KB/s, iops=123730, runt= 20001msec
  cpu  : usr=42.98%, sys=56.52%, ctx=2317, majf=0, minf=41 [: 100% 
cpu bottleneck and just 123k IOPS]
1:
  read : io=4077.9MB, bw=208774KB/s, iops=208773, runt= 20001msec
  cpu  : usr=29.47%, sys=51.43%, ctx=2993, majf=0, minf=42  [, some 
idle power and 208k IOPS just by switching to core1...]
2:
  read : io=4036.7MB, bw=206636KB/s, iops=206636, runt= 20001msec
  cpu  : usr=31.00%, sys=52.41%, ctx=2815, majf=0, minf=42 []
3:
  read : io=2398.4MB, bw=122791KB/s, iops=122791, runt= 20001msec
  cpu  : usr=44.20%, sys=55.20%, ctx=2522, majf=0, minf=41
[root@x ~]# for((x=0; x<=3; x++)) ; do echo "$x:"; taskset -c $x fio fio.raw | 
grep -e 'read :' -e 'cpu '; done
0:
  read : io=2512.3MB, bw=128621KB/s, iops=128620, runt= 20001msec
  cpu  : usr=47.62%, sys=51.58%, ctx=2365, majf=0, minf=42
1:
  read : io=4070.2MB, bw=206748KB/s, iops=206748, runt= 20159msec
  cpu  : usr=29.52%, sys=42.86%, ctx=2808, majf=0, minf=42 []
2:
  read : io=4101.3MB, bw=209975KB/s, iops=209975, runt= 20001msec
  cpu  : usr=28.05%, sys=45.09%, ctx=3419, majf=0, minf=42 []
3:
  read : io=2519.4MB, bw=128985KB/s, iops=128985, runt= 20001msec
  cpu  : usr=46.59%, sys=52.70%, ctx=2371, majf=0, minf=41

[root@x ~]# lscpu --extended
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ MINMHZ
0   0    0      0    0:0:0:0       yes    3000.  1200.
1   0    0      1    1:1:1:0       yes    3000.  1200.
2   0    0      0    0:0:0:0       yes    3000.  1200.
3   0    0      1    1:1:1:0       yes    3000.  1200.
[root@x ~]# lscpu | grep -e ^Model -e ^NUMA -e ^Hyper
NUMA node(s):1
Model:   79
Model name:  Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz

RE: effective_io_concurrency and NVMe devices

2022-06-08 Thread Jakub Wartak
> >> The attached patch is a trivial version that waits until we're at
> >> least
> >> 32 pages behind the target, and then prefetches all of them. Maybe give it 
> >> a
> try?
> >> (This pretty much disables prefetching for e_i_c below 32, but for an
> >> experimental patch that's enough.)
> >
> > I've tried it at e_i_c=10 initially on David's setup.sql, and most defaults
> s_b=128MB, dbsize=8kb but with forced disabled parallel query (for easier
> inspection with strace just to be sure//so please don't compare times).
> >
> > run:
> > a) master (e_i_c=10)  181760ms, 185680ms, 185384ms @ ~ 340MB/s and 44k
> > IOPS (~122k IOPS practical max here for libaio)
> > b) patched(e_i_c=10)  237774ms, 236326ms, ..as you stated it disabled
> > prefetching, fadvise() not occurring
> > c) patched(e_i_c=128) 90430ms, 88354ms, 85446ms, 78475ms, 74983ms,
> > 81432ms (mean=83186ms +/- 5947ms) @ ~570MB/s and 75k IOPS (it even
> > peaked for a second on ~122k)
> > d) master (e_i_c=128) 116865ms, 101178ms, 89529ms, 95024ms, 89942ms
> > 99939ms (mean=98746ms +/- 10118ms) @ ~510MB/s and 65k IOPS (rare peaks
> > to 90..100k IOPS)
> >
> > ~16% benefit sounds good (help me understand: L1i cache?). Maybe it is
> > worth throwing that patch onto more advanced / complete performance
> > test farm too ? (although it's only for bitmap heap scans)

I hope you have some future plans for this patch :)

> Yes, kernel certainly does it's own read-ahead, which works pretty well for
> sequential patterns. What does
> 
>blockdev --getra /dev/...
> 
> say?

It's the default, 256 sectors (128kB), so it matches.
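If one wants to take kernel read-ahead out of the picture while benchmarking the prefetching itself, it can be toggled per device - a sketch, device name assumed:

blockdev --getra /dev/nvme0n1      # current read-ahead, in 512-byte sectors
blockdev --setra 0 /dev/nvme0n1    # disable it for the test
blockdev --setra 256 /dev/nvme0n1  # restore the 128kB default afterwards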

-J.




RE: pgcon unconference / impact of block size on performance

2022-06-07 Thread Jakub Wartak
Hi,

> The really
> puzzling thing is why is the filesystem so much slower for smaller pages. I 
> mean,
> why would writing 1K be 1/3 of writing 4K?
> Why would a filesystem have such effect?

Ha! I don't care at this point as 1 or 2kB seems too small to handle many real 
world scenarios ;)

> > b) Another thing that you could also include in testing is that I've 
> > spotted a
> couple of times single-threaded fio might could be limiting factor (numjobs=1 
> by
> default), so I've tried with numjobs=2,group_reporting=1 and got this below
> ouput on ext4 defaults even while dropping caches (echo 3) each loop 
> iteration -
> - something that I cannot explain (ext4 direct I/O caching effect? how's that
> even possible? reproduced several times even with numjobs=1) - the point being
> 206643 1kb IOPS @ ext4 direct-io > 131783 1kB IOPS @ raw, smells like some
> caching effect because for randwrite it does not happen. I've triple-checked 
> with
> iostat -x... it cannot be any internal device cache as with direct I/O that 
> doesn't
> happen:
> >
> > [root@x libaio-ext4]# grep -r -e 'write:' -e 'read :' *
> > nvme/randread/128/1k/1.txt:  read : io=12108MB, bw=206644KB/s,
> > iops=206643, runt= 60001msec [b]
> > nvme/randread/128/2k/1.txt:  read : io=18821MB, bw=321210KB/s,
> > iops=160604, runt= 60001msec [b]
> > nvme/randread/128/4k/1.txt:  read : io=36985MB, bw=631208KB/s,
> > iops=157802, runt= 60001msec [b]
> > nvme/randread/128/8k/1.txt:  read : io=57364MB, bw=976923KB/s,
> > iops=122115, runt= 60128msec
> > nvme/randwrite/128/1k/1.txt:  write: io=1036.2MB, bw=17683KB/s,
> > iops=17683, runt= 60001msec [a, as before]
> > nvme/randwrite/128/2k/1.txt:  write: io=2023.2MB, bw=34528KB/s,
> > iops=17263, runt= 60001msec [a, as before]
> > nvme/randwrite/128/4k/1.txt:  write: io=16667MB, bw=282977KB/s,
> > iops=70744, runt= 60311msec [reproduced benefit, as per earlier email]
> > nvme/randwrite/128/8k/1.txt:  write: io=22997MB, bw=391839KB/s,
> > iops=48979, runt= 60099msec
> >
> 
> No idea what might be causing this. BTW so you're not using direct-io to 
> access
> the raw device? Or am I just misreading this?

Both scenarios (raw and fs) had direct=1 set. I just cannot understand how having direct I/O enabled (which disables caching) achieves better read IOPS on ext4 than on the raw device... isn't that a contradiction?

-J.





RE: effective_io_concurrency and NVMe devices

2022-06-07 Thread Jakub Wartak
Hi Tomas,

> > I have a machine here with 1 x PCIe 3.0 NVMe SSD and also 1 x PCIe 4.0
> > NVMe SSD. I ran a few tests to see how different values of
> > effective_io_concurrency would affect performance. I tried to come up
> > with a query that did little enough CPU processing to ensure that I/O
> > was the clear bottleneck.
> >
> > The test was with a 128GB table on a machine with 64GB of RAM.  I
> > padded the tuples out so there were 4 per page so that the aggregation
> > didn't have much work to do.
> >
> > The query I ran was: explain (analyze, buffers, timing off) select
> > count(p) from r where a = 1;
 
> The other idea I had while looking at batching a while back, is that we should
> batch the prefetches. The current logic interleaves prefetches with other 
> work -
> prefetch one page, process one page, ... But once reading a page gets
> sufficiently fast, this means the queues never get deep enough for
> optimizations. So maybe we should think about batching the prefetches, in some
> way. Unfortunately posix_fadvise does not allow batching of requests, but we
> can at least stop interleaving the requests.

.. for now it doesn't, but IORING_OP_FADVISE is on the long-term horizon.  

> The attached patch is a trivial version that waits until we're at least
> 32 pages behind the target, and then prefetches all of them. Maybe give it a 
> try?
> (This pretty much disables prefetching for e_i_c below 32, but for an
> experimental patch that's enough.)

I've tried it at e_i_c=10 initially on David's setup.sql, with mostly default settings (s_b=128MB, dbsize=8kB) but with parallel query forcibly disabled (for easier inspection with strace, just to be sure // so please don't compare times).
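Roughly, the knobs were set like this before each run (just a sketch of one way to do it; values per run a..d):

psql -c "ALTER SYSTEM SET effective_io_concurrency = 10"        # 128 for runs c/d
psql -c "ALTER SYSTEM SET max_parallel_workers_per_gather = 0"  # force non-parallel bitmap heap scans
psql -c "SELECT pg_reload_conf()"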

run:
a) master (e_i_c=10)  181760ms, 185680ms, 185384ms @ ~ 340MB/s and 44k IOPS 
(~122k IOPS practical max here for libaio)
b) patched(e_i_c=10)  237774ms, 236326ms, ..as you stated it disabled 
prefetching, fadvise() not occurring
c) patched(e_i_c=128) 90430ms, 88354ms, 85446ms, 78475ms, 74983ms, 81432ms 
(mean=83186ms +/- 5947ms) @ ~570MB/s and 75k IOPS (it even peaked for a second 
on ~122k)
d) master (e_i_c=128) 116865ms, 101178ms, 89529ms, 95024ms, 89942ms 99939ms 
(mean=98746ms +/- 10118ms) @ ~510MB/s and 65k IOPS (rare peaks to 90..100k IOPS)

~16% benefit sounds good (help me understand: L1i cache?). Maybe it is worth 
throwing that patch onto more advanced / complete performance test farm too ? 
(although it's only for bitmap heap scans)

run a: looked interleaved as you said:
fadvise64(160, 1064157184, 8192, POSIX_FADV_WILLNEED) = 0
pread64(160, "@\0\0\0\200\303/_\0\0\4\0(\0\200\0\0 \4 \0\0\0\0 
\230\300\17@\220\300\17"..., 8192, 1064009728) = 8192
fadvise64(160, 1064173568, 8192, POSIX_FADV_WILLNEED) = 0
pread64(160, "@\0\0\0\0\0040_\0\0\4\0(\0\200\0\0 \4 \0\0\0\0 
\230\300\17@\220\300\17"..., 8192, 1064026112) = 8192
[..]

BTW, an interesting note: for run b, the avgrq-sz from extended iostat keeps flipping between 16 (*512 = 8kB) and ~256 (*512 = ~128kB!), as if the kernel were switching its own prefetching heuristics on and off in cycles, while when e_i_c/fadvise() is in action it seems to be always 8kB requests. So with fadvise() disabled, one IMHO might have problems deterministically benchmarking short queries, as kernel voodoo might be happening (?)

-J.




RE: pgcon unconference / impact of block size on performance

2022-06-07 Thread Jakub Wartak
[..]
>I doubt we could ever
> make the default smaller than it is today as it would nobody would be able to
> insert rows larger than 4 kilobytes into a table anymore. 

Add error "values larger than 1/3 of a buffer page cannot be indexed" to that 
list...

-J.




RE: pgcon unconference / impact of block size on performance

2022-06-07 Thread Jakub Wartak
Hi Tomas,

> Well, there's plenty of charts in the github repositories, including the 
> charts I
> think you're asking for:

Thanks.

> I also wonder how is this related to filesystem page size - in all the 
> benchmarks I
> did I used the default (4k), but maybe it'd behave if the filesystem page 
> matched
> the data page.

That may be it - using fio on raw NVMe device (without fs/VFS at all) shows:

[root@x libaio-raw]# grep -r -e 'write:' -e 'read :' *
nvme/randread/128/1k/1.txt:  read : io=7721.9MB, bw=131783KB/s, iops=131783, 
runt= 60001msec [b]
nvme/randread/128/2k/1.txt:  read : io=15468MB, bw=263991KB/s, iops=131995, 
runt= 60001msec [b] 
nvme/randread/128/4k/1.txt:  read : io=30142MB, bw=514408KB/s, iops=128602, 
runt= 60001msec [b]
nvme/randread/128/8k/1.txt:  read : io=56698MB, bw=967635KB/s, iops=120954, 
runt= 60001msec
nvme/randwrite/128/1k/1.txt:  write: io=4140.9MB, bw=70242KB/s, iops=70241, 
runt= 60366msec [a]
nvme/randwrite/128/2k/1.txt:  write: io=8271.5MB, bw=141161KB/s, iops=70580, 
runt= 60002msec [a]
nvme/randwrite/128/4k/1.txt:  write: io=16543MB, bw=281164KB/s, iops=70291, 
runt= 60248msec
nvme/randwrite/128/8k/1.txt:  write: io=22924MB, bw=390930KB/s, iops=48866, 
runt= 60047msec

So, I've found two interesting things while playing with raw vs ext4:
a) I always get 70k IOPS for randwrite, even at 1k/2k/4k, without ext4 (so, as expected, this was the impact of ext4's default 4kB fs page size, as you were thinking, when fio's 1k writes were hitting ext4's 4kB blocks)
b) Another thing that you could also include in testing: I've spotted a couple of times that single-threaded fio might be the limiting factor (numjobs=1 by default), so I've tried numjobs=2,group_reporting=1 and got the output below on ext4 defaults, even while dropping caches (echo 3) in each loop iteration -- something that I cannot explain (an ext4 direct I/O caching effect? how's that even possible? reproduced several times, even with numjobs=1) - the point being 206643 1kB IOPS @ ext4 direct-io > 131783 1kB IOPS @ raw, which smells like some caching effect, because for randwrite it does not happen. I've triple-checked with iostat -x... it cannot be any internal device cache, as with direct I/O that doesn't happen:

[root@x libaio-ext4]# grep -r -e 'write:' -e 'read :' *
nvme/randread/128/1k/1.txt:  read : io=12108MB, bw=206644KB/s, iops=206643, 
runt= 60001msec [b]
nvme/randread/128/2k/1.txt:  read : io=18821MB, bw=321210KB/s, iops=160604, 
runt= 60001msec [b]
nvme/randread/128/4k/1.txt:  read : io=36985MB, bw=631208KB/s, iops=157802, 
runt= 60001msec [b]
nvme/randread/128/8k/1.txt:  read : io=57364MB, bw=976923KB/s, iops=122115, 
runt= 60128msec
nvme/randwrite/128/1k/1.txt:  write: io=1036.2MB, bw=17683KB/s, iops=17683, 
runt= 60001msec [a, as before]
nvme/randwrite/128/2k/1.txt:  write: io=2023.2MB, bw=34528KB/s, iops=17263, 
runt= 60001msec [a, as before]
nvme/randwrite/128/4k/1.txt:  write: io=16667MB, bw=282977KB/s, iops=70744, 
runt= 60311msec [reproduced benefit, as per earlier email]
nvme/randwrite/128/8k/1.txt:  write: io=22997MB, bw=391839KB/s, iops=48979, 
runt= 60099msec

> > One way or another it would be very nice to be able to select the
> > tradeoff using initdb(1) without the need to recompile, which then
> > begs for some initdb --calibrate /mnt/nvme (effective_io_concurrency,
> > DB page size, ...).> Do you envision any plans for this we still in a
> > need to gather more info exactly why this happens? (perf reports?)
> >
> 
> Not sure I follow. Plans for what? Something that calibrates cost parameters?
> That might be useful, but that's a rather separate issue from what's discussed
> here - page size, which needs to happen before initdb (at least with how 
> things
> work currently).
[..]

Sorry, that was too far-fetched on my part - I assumed you guys were talking very long term.

-J.



RE: pgcon unconference / impact of block size on performance

2022-06-06 Thread Jakub Wartak
Hi Tomas,

> Hi,
> 
> At on of the pgcon unconference sessions a couple days ago, I presented a
> bunch of benchmark results comparing performance with different data/WAL
> block size. Most of the OLTP results showed significant gains (up to 50%) with
> smaller (4k) data pages.

Nice. I just saw this 
https://wiki.postgresql.org/wiki/PgCon_2022_Developer_Unconference , do you 
have any plans for publishing those other graphs too (e.g. WAL block size 
impact)?

> This opened a long discussion about possible explanations - I claimed one of 
> the
> main factors is the adoption of flash storage, due to pretty fundamental
> differences between HDD and SSD systems. But the discussion concluded with an
> agreement to continue investigating this, so here's an attempt to support the
> claim with some measurements/data.
> 
> Let me present results of low-level fio benchmarks on a couple different HDD
> and SSD drives. This should eliminate any postgres-related influence (e.g. 
> FPW),
> and demonstrates inherent HDD/SSD differences.
> All the SSD results show this behavior - the Optane and Samsung nicely show
> that 4K is much better (in random write IOPS) than 8K, but 1-2K pages make it
> worse.
> 
[..]
Can you share what Linux kernel version, filesystem, mount options, and LVM setup (if any) you were using?

I've hastily tried your script on 4VCPU/32GB RAM/1xNVMe device @ ~900GB (AWS 
i3.xlarge), kernel 5.x, ext4 defaults, no LVM, libaio only, fio deviations: 
runtime -> 1min, 64GB file, 1 iteration only. Results are attached, w/o graphs. 

> Now, compare this to the SSD. There are some differences between the models, 
> manufacturers, interface etc. but the impact of page size on IOPS is pretty 
> clear. On the Optane you can get +20-30% by using 4K pages, on the Samsung 
> it's even more, etc. This means that workloads dominated by random I/O get 
> significant benefit from smaller pages.

Yup, same here, reproduced, 1.42x faster on writes:
[root@x ~]# cd libaio/nvme/randwrite/128/ # 128=queue depth
[root@x 128]# grep -r "write:" * | awk '{print $1, $4, $5}'  | sort -n
1k/1.txt: bw=24162KB/s, iops=24161,
2k/1.txt: bw=47164KB/s, iops=23582,
4k/1.txt: bw=280450KB/s, iops=70112, <<<
8k/1.txt: bw=393082KB/s, iops=49135,
16k/1.txt: bw=393103KB/s, iops=24568,
32k/1.txt: bw=393283KB/s, iops=12290,
BTW it's interesting to compare to your Optane 900P result (same two high bars for IOPS @ 4,8kB), but in my case it's even more important to select 4kB - it behaves more like the Samsung 860 in your case.

# 1.41x on randreads
[root@x ~]# cd libaio/nvme/randread/128/ # 128=queue depth
[root@x 128]# grep -r "read :" | awk '{print $1, $5, $6}' | sort -n
1k/1.txt: bw=169938KB/s, iops=169937,
2k/1.txt: bw=376653KB/s, iops=188326,
4k/1.txt: bw=691529KB/s, iops=172882, <<<
8k/1.txt: bw=976916KB/s, iops=122114,
16k/1.txt: bw=990524KB/s, iops=61907,
32k/1.txt: bw=974318KB/s, iops=30447,

I think the above is just a demonstration of device bandwidth saturation: 32k * 30k IOPS =~ 1GB/s of random reads. So if the DB were tuned to 4kB for the app (OLTP), then the occasional Parallel Seq Scan "critical report" could only achieve ~70% of what it would achieve on 8kB, correct? (I'm assuming most real systems are really OLTP, but with some reporting/data-exporting needs.) One way or another, it would be very nice to be able to select the tradeoff using initdb(1) without the need to recompile, which then begs for some initdb --calibrate /mnt/nvme (effective_io_concurrency, DB page size, ...).

Do you envision any plans for this, or do we still need to gather more info on exactly why this happens? (perf reports?)

Also, have you guys by any chance discussed at that meeting any long-term future plans for the storage layer? If sticking to 4kB for the DB page size / hardware sector size, wouldn't it also be possible to win by disabling FPWs in the longer run using io_uring (assuming O_DIRECT | O_ATOMIC one day)?
I recall that Thomas M. was researching O_ATOMIC; I think he wrote some of that up pretty nicely in [1].

[1] - https://wiki.postgresql.org/wiki/FreeBSD/AtomicIO


libaio-2022-06-06.tgz
Description: libaio-2022-06-06.tgz


RE: effective_io_concurrency and NVMe devices

2022-06-02 Thread Jakub Wartak
Hi Nathan,

> > NVMe devices have a maximum queue length of 64k:
[..]
> > but our effective_io_concurrency maximum is 1,000:
[..]
> > Should we increase its maximum to 64k?  Backpatched?  (SATA has a
> > maximum queue length of 256.)
> 
> If there are demonstrable improvements with higher values, this seems
> reasonable to me.  I would even suggest removing the limit completely so
> this doesn't need to be revisited in the future.

Well, are there any? I remember playing with this (although for Stephen's ANALYZE case [1]) and got quite contrary results [2] -- see how going from 8 to 16 actually degraded performance.
I somehow struggle to understand how 1000+ fadvise() syscalls would be a net benefit on storage with a latency of ~0.1..0.3ms, as each syscall is overhead on its own (quite the contrary, it should help on a very slow device?).
Pardon me if I'm wrong (I don't have time to look up the code now), but maybe the Bitmap Scan/fadvise() logic would first need to aggregate fadvise() offsets/lengths into bigger fadvise() syscalls, so that in the end the I/O concurrency observable at the hardware level would be bigger (assuming that the fs/LVM/dm/mq layers would split that into more parallel I/Os).

-J.

[1] - https://commitfest.postgresql.org/30/2799/
[2] - 
https://www.postgresql.org/message-id/flat/vi1pr0701mb69603a433348edcf783c6ecbf6...@vi1pr0701mb6960.eurprd07.prod.outlook.com


 





RE: strange slow query - lost lot of time somewhere

2022-05-05 Thread Jakub Wartak
Hi Pavel,

> I have not debug symbols, so I have not more details now
> Breakpoint 1 at 0x7f557f0c16c0
> (gdb) c
> Continuing.

> Breakpoint 1, 0x7f557f0c16c0 in mmap64 () from /lib64/libc.so.6
> (gdb) bt
> #0  0x7f557f0c16c0 in mmap64 () from /lib64/libc.so.6
> #1  0x7f557f04dd91 in sysmalloc () from /lib64/libc.so.6
> #2  0x7f557f04eaa9 in _int_malloc () from /lib64/libc.so.6
> #3  0x7f557f04fb1e in malloc () from /lib64/libc.so.6
> #4  0x00932134 in AllocSetAlloc ()
> #5  0x009376cf in MemoryContextAllocExtended ()
> #6  0x006ad915 in ExecInitMemoize ()

Well, the PGDG repo has debuginfo (e.g. postgresql14-debuginfo) rpms / dpkgs(?), so I hope you are basically 1 command away from being able to debug further what happens in ExecInitMemoize().
Those packages seem to be safe, as they modify only /usr/lib/debug, so they should not have any impact on the production workload.
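On RHEL-style PGDG installs that is roughly (exact package/repo names depend on the major version, and the debuginfo repo may need to be enabled):

sudo yum install postgresql14-debuginfo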

-J.






RE: strange slow query - lost lot of time somewhere

2022-05-04 Thread Jakub Wartak
> I do agree that the perf report does indicate that the extra time is taken 
> due to
> some large amount of memory being allocated. I just can't quite see how that
> would happen in Memoize given that
> estimate_num_groups() clamps the distinct estimate as the number of input
> rows, which is 91 in both cases in your problem query.
> 
> Are you able to run the Memoize query in psql with \watch 0.1 for a few 
> seconds
> while you do:
> 
> perf record --call-graph dwarf --pid  sleep 2
> 
> then send along the perf report.
> 
> I locally hacked build_hash_table() in nodeMemoize.c to make the hashtable 100
> million elements and I see my perf report for a trivial Memoize query come up
> as:
> 
[..]
> 
> Failing that, are you able to pg_dump these tables and load them into a
> PostgreSQL instance that you can play around with and patch?
> Provided you can actually recreate the problem on that instance.
> 

+1 to what David says, we need a reproducer. In [1] Pavel wrote that he's seeing a lot of clear_page_erms(), so maybe this will help a little: I recall a similar issue with a lot of minor page faults and high %sys when raising work_mem. For me it was a different issue some time ago, but it was something like build_hash_table() being used by recursive UNION calls -> BuildTupleHashTable() -> .. malloc() -> mmap64(). When mmap() is issued with MAP_ANONYMOUS, the kernel will zero out the memory (more memory -> potentially bigger CPU waste, visible as minor page faults; erms stands for "Enhanced REP MOVSB/STOSB"; this is on the kernel side). The culprit was the planner allocating something that wouldn't be used later.

Three additional ways to figure that out (all IMHO production-safe):
a) the already mentioned perf with --call-graph dwarf -p PID
b) strace -p PID -e 'mmap' # check whether the mmap(NULL, ...) calls carry the MAP_ANONYMOUS flag; the size of the mmap() requests will roughly match the work_mem sizing
c) gdb -p PID, then set a breakpoint on mmap and inspect each mmap() # check MAP_ANONYMOUS as above
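For example (PID and durations are placeholders):

# a) sample allocation call stacks for a few seconds
perf record --call-graph dwarf -p $PID -- sleep 5 && perf report --stdio | head -50
# b) log mmap() calls; large MAP_ANONYMOUS requests roughly matching work_mem are the suspects
strace -f -p $PID -e trace=mmap -o /tmp/mmap.log
grep MAP_ANONYMOUS /tmp/mmap.log | head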

[1] - 
https://www.postgresql.org/message-id/CAFj8pRAo5CrF8mpPxMvnBYFSqu4HYDqRsQnLqGphckNHkHosFg%40mail.gmail.com

-J.


Re: In-placre persistance change of a relation

2022-01-11 Thread Jakub Wartak
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:not tested

I've retested v15 of the patch with everything that came to my mind. The patch passes all my tests (well, there's just this Windows / cfbot issue). The patch looks good to me. I haven't looked in depth at the code, so it might still need review.

FYI, about potential usage of this patch: the most advanced test that I did was continually bouncing wal_level - it works. So a chain of:
1. wal_level=replica->minimal
2. alter table set unlogged and load a lot of data, set logged
3. wal_level=minimal->replica
4. barman incremental backup # rsync(1) backs up only the files changed in steps 2 and 3 (not the whole DB)
5. some other (logged) work
The idea is that when performing mass alterations to the DB (think nightly ETL/ELT on TB-sized DBs), one could skip backing up most of the DB and quickly back up only the changed files - during the maintenance window - e.g. thanks to barman's local-rsync mode. This is the output of barman show-backups after loading data into the unlogged table in each such cycle:
mydb 20220110T100236 - Mon Jan 10 10:05:14 2022 - Size: 144.1 GiB - WAL Size: 
16.0 KiB
mydb 20220110T094905 - Mon Jan 10 09:50:12 2022 - Size: 128.5 GiB - WAL Size: 
80.2 KiB
mydb 20220110T094016 - Mon Jan 10 09:40:17 2022 - Size: 109.1 GiB - WAL Size: 
496.3 KiB
And dedupe ratio of the last one: Backup size: 144.1 GiB. Actual size on disk: 
36.1 GiB (-74.96% deduplication ratio).  
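For reference, each such cycle boiled down to roughly this (a sketch - server/table names and paths are placeholders, barman configured in local rsync mode):

psql -c "ALTER SYSTEM SET wal_level = 'minimal'" -c "ALTER SYSTEM SET max_wal_senders = 0"
pg_ctl -D "$PGDATA" restart
psql -d etl -c "ALTER TABLE big_t SET UNLOGGED"
# ... bulk load into big_t here ...
psql -d etl -c "ALTER TABLE big_t SET LOGGED"
psql -c "ALTER SYSTEM SET wal_level = 'replica'" -c "ALTER SYSTEM SET max_wal_senders = 10"
pg_ctl -D "$PGDATA" restart
barman backup mydb    # rsync picks up only the files changed above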

The only thing I've found is that bouncing wal_level also forces max_wal_senders=X -> 0 -> X, which in turn requires dropping the replication slot for pg_receivewal (e.g. barman receive-wal --create-slot/--drop-slot/--reset). I have tested the restore of backup 20220110T094905 using barman recover afterwards, and indeed it worked OK using this patch too.

The new status of this patch is: Needs review


RE: In-placre persistance change of a relation

2021-12-22 Thread Jakub Wartak
Hi Kyotaro,

> At Tue, 21 Dec 2021 13:07:28 +0000, Jakub Wartak
>  wrote in
> > So what's suspicious is that 122880 -> 0 file size truncation. I've
> > investigated WAL and it seems to contain TRUNCATE records after logged
> FPI images, so when the crash recovery would kick in it probably clears this
> table (while it shouldn't).
> 
> Darn..  It is too silly that I wrongly issued truncate records for the target
> relation of the function (rel) instaed of the relation on which we're 
> currently
> operating at that time (r).
> 
> [..]
> The following fix works.

Cool, I have verified the basic stuff that came to my mind, and now even cfbot is happy with v11. You should be happy too, I hope :)

> I made another change in this version. Previously only btree among all index
> AMs was processed in the in-place manner.  In this version we do that all
> AMs except GiST.  Maybe if gistGetFakeLSN behaved the same way for
> permanent and unlogged indexes, we could skip index rebuild in exchange of
> some extra WAL records emitted while it is unlogged.

I think there's a slight omission:

-- unlogged table -> logged with GiST:
DROP TABLE IF EXISTS testcase;
CREATE UNLOGGED TABLE testcase(geom geometry not null);
CREATE INDEX idx_testcase_gist ON testcase USING gist(geom);
INSERT INTO testcase(geom) SELECT ST_Buffer(ST_SetSRID(ST_MakePoint(-1.0, 
2.0),4326), 0.0001);
ALTER TABLE testcase SET LOGGED;

-- crashes with:
(gdb) where
#0  reindex_index (indexId=indexId@entry=65541, 
skip_constraint_checks=skip_constraint_checks@entry=true, 
persistence=persistence@entry=112 'p', params=params@entry=0x0) at index.c:3521
#1  0x0062f494 in RelationChangePersistence (tab=tab@entry=0x1947258, 
persistence=112 'p', lockmode=lockmode@entry=8) at tablecmds.c:5434
#2  0x00642819 in ATRewriteTables (context=0x7ffc19c04520, 
lockmode=, wqueue=0x7ffc19c04388, parsetree=0x1925ec8) at 
tablecmds.c:5644
[..]
#10 0x007f078f in exec_simple_query (query_string=0x1925340 "ALTER 
TABLE testcase SET LOGGED;") at postgres.c:1215

apparently reindex_index()'s params cannot be NULL - the same happens when switching a persistent table to an unlogged one too (with GiST).

I'll also try to give the patch another shot early next year - as we are starting a long Christmas/holiday break here - verifying WAL for GiST and a more advanced setup (more crashes, plus standby/archiving/barman, to see how wal_level=minimal <-> replica transitions can be used).

-J.








RE: In-placre persistance change of a relation

2021-12-21 Thread Jakub Wartak
Hi Kyotaro,

> I took a bit too long detour but the patch gets to pass make-world for me.

Good news: v10 passes all the tests for me (including the TAP recovery ones). There's a major problem though, I think:

drop table t6;
create unlogged table t6 (id bigint, t text);
create sequence s1;
insert into t6 select nextval('s1'), repeat('A', 1000) from generate_series(1, 
100);
alter table t6 set logged;
select pg_sleep(1);
<--optional checkpoint, more on this later.
/usr/pgsql-15/bin/pg_ctl -D /var/lib/pgsql/15/data -l logfile -m immediate stop
/usr/pgsql-15/bin/pg_ctl -D /var/lib/pgsql/15/data -l logfile start
select count(*) from t6; -- shows 0 rows

But if I perform a checkpoint before the crash, the data is there... apparently the missing steps done by the checkpointer seem to help. If no checkpoint is done, then some peeking reveals that upon startup there is a truncation (?!):

$ /usr/pgsql-15/bin/pg_ctl -D /var/lib/pgsql/15/data -l logfile -m immediate 
stop
$ find /var/lib/pgsql/15/data/ -name '73741*' -ls
112723206  120 -rw---   1 postgres postgres   122880 Dec 21 12:42 
/var/lib/pgsql/15/data/base/73740/73741
112723202   24 -rw---   1 postgres postgres24576 Dec 21 12:42 
/var/lib/pgsql/15/data/base/73740/73741_fsm
$ /usr/pgsql-15/bin/pg_ctl -D /var/lib/pgsql/15/data -l logfile start
waiting for server to start done
server started
$ find /var/lib/pgsql/15/data/ -name '73741*' -ls
1127232060 -rw---   1 postgres postgres0 Dec 21 12:42 
/var/lib/pgsql/15/data/base/73740/73741
112723202   16 -rw---   1 postgres postgres16384 Dec 21 12:42 
/var/lib/pgsql/15/data/base/73740/73741_fsm

So what's suspicious is that 122880 -> 0 file size truncation. I've investigated the WAL and it seems to contain TRUNCATE records after the logged FPI images, so when crash recovery kicks in it probably clears this table (while it shouldn't).

However, if I perform a CHECKPOINT just before the crash, the WAL stream contains just RUNNING_XACTS and CHECKPOINT_ONLINE redo records, which probably prevents the truncation. I'm a newbie here, so please take this theory with a grain of salt; it could be something completely different.

-J.





RE: In-placre persistance change of a relation

2021-12-20 Thread Jakub Wartak
Hi Kyotaro,

> At Mon, 20 Dec 2021 17:39:27 +0900 (JST), Kyotaro Horiguchi
>  wrote in
> > At Mon, 20 Dec 2021 07:59:29 +, Jakub Wartak
> >  wrote in
> > > BTW fast feedback regarding that ALTER patch  (there were 4 unlogged
> tables):
> > > #  ALTER TABLE ALL IN TABLESPACE tbs1 set logged;
> > > WARNING:  unrecognized node type: 349
> >
> > lol  I met a server crash. Will fix. Thanks!
> 
> That crash vanished after a recompilation for me and I don't see that error.  
> On
> my dev env node# 349 is T_ALterTableSetLoggedAllStmt, which
> 0002 adds.  So perhaps make clean/make all would fix that.

As fast as I could - I've repeated the whole cycle for that one with a fresh v9 (make clean, configure, make install, fresh initdb) and found two problems:

1) check-world seems OK, but make -C src/test/recovery check shows a couple of failing tests here locally and in https://cirrus-ci.com/task/4699985735319552?logs=test#L807 :
t/009_twophase.pl  (Wstat: 256 Tests: 24 Failed: 1)
  Failed test:  21
  Non-zero exit status: 1
t/014_unlogged_reinit.pl   (Wstat: 512 Tests: 12 Failed: 2)
  Failed tests:  9-10
  Non-zero exit status: 2
t/018_wal_optimize.pl  (Wstat: 7424 Tests: 0 Failed: 0)
  Non-zero exit status: 29
  Parse errors: Bad plan.  You planned 38 tests but ran 0.
t/022_crash_temp_files.pl  (Wstat: 7424 Tests: 6 Failed: 0)
  Non-zero exit status: 29
  Parse errors: Bad plan.  You planned 9 tests but ran 6.

018 made no sense to me, so I've tried to take a quick look with wal_level=minimal at why it is failing; it is a mystery to me, as the sequence seems pretty basic but the outcome is not:
~> cat repro.sql
create tablespace tbs1 location '/tbs1';
CREATE TABLE moved (id int);
INSERT INTO moved VALUES (1);
BEGIN;
ALTER TABLE moved SET TABLESPACE tbs1;
CREATE TABLE originated (id int);
INSERT INTO originated VALUES (1);
CREATE UNIQUE INDEX ON originated(id) TABLESPACE tbs1;
COMMIT;

~> psql -f repro.sql z3; sleep 1; /usr/pgsql-15/bin/pg_ctl -D 
/var/lib/pgsql/15/data -l logfile -m immediate stop
CREATE TABLESPACE
CREATE TABLE
INSERT 0 1
BEGIN
ALTER TABLE
CREATE TABLE
INSERT 0 1
CREATE INDEX
COMMIT
waiting for server to shut down done
server stopped
~> /usr/pgsql-15/bin/pg_ctl -D /var/lib/pgsql/15/data -l logfile start
waiting for server to start done
server started
z3# select * from moved;
ERROR:  could not open file "pg_tblspc/32834/PG_15_202112131/32833/32838": No 
such file or directory
z3=# select * from originated;
ERROR:  could not open file "base/32833/32839": No such file or directory
z3=# \dt+
  List of relations
 Schema |Name| Type  |  Owner   | Persistence |  Size   | Description
++---+--+-+-+-
 public | moved  | table | postgres | permanent   | 0 bytes |
 public | originated | table | postgres | permanent   | 0 bytes |

This happens even without placing it on the tablespace at all {for the "originated" table, but not for the "moved" one}; either some major mishap is there (commit should guarantee correctness) or I'm tired and have sloppy fingers.

2) a minor test case; still, something is odd.

drop tablespace tbs1;
create tablespace tbs1 location '/tbs1';
CREATE UNLOGGED TABLE t4 (a int) tablespace tbs1;
CREATE UNLOGGED TABLE t5 (a int) tablespace tbs1;
CREATE UNLOGGED TABLE t6 (a int) tablespace tbs1;
CREATE TABLE t7 (a int) tablespace tbs1;
insert into t7 values (1);
insert into t5 values (1);
insert into t6 values (1);
\dt+
 List of relations
 Schema | Name | Type  |  Owner   | Persistence |Size| Description
+--+---+--+-++-
 public | t4   | table | postgres | unlogged| 0 bytes|
 public | t5   | table | postgres | unlogged| 8192 bytes |
 public | t6   | table | postgres | unlogged| 8192 bytes |
 public | t7   | table | postgres | permanent   | 8192 bytes |
(4 rows)

ALTER TABLE ALL IN TABLESPACE tbs1 set logged; 
==> STILL WARNING:  unrecognized node type: 349
\dt+
 List of relations
 Schema | Name | Type  |  Owner   | Persistence |Size| Description
+--+---+--+-++-
 public | t4   | table | postgres | permanent   | 0 bytes|
 public | t5   | table | postgres | permanent   | 8192 bytes |
 public | t6   | table | postgres | permanent   | 8192 bytes |
 public | t7   | table | postgres | permanent   | 8192 bytes |

So it did rewrite; however, this warning seems to be unfixed. I've tested on e2c52beecdea152ca680a22ef35c6a7da55aa30f.

-J.


RE: In-placre persistance change of a relation

2021-12-19 Thread Jakub Wartak
Hi Kyotaro, I'm glad you are still into this

> I didn't register for some reasons.

Right now in v8 there's a typo in ./src/backend/catalog/storage.c :

storage.c: In function 'RelationDropInitFork':
storage.c:385:44: error: expected statement before ')' token
pending->unlink_forknum != INIT_FORKNUM)) <-- here, one ')' too many

> 1. I'm not sure that we want to have the new mark files.

I can't help with such a design decision, but if there are doubts, maybe add checks of the return codes around:
a) pg_fsync() and fsync_parent_path() (??) inside mdcreatemark()
b) mdunlinkmark() inside mdunlinkmark()
and PANIC if something goes wrong?

> 2. Aside of possible bugs, I'm not confident that the crash-safety of
>  this patch is actually water-tight. At least we need tests for some
>  failure cases.
>
> 3. As mentioned in transam/README, failure in removing smgr mark files
>leads to immediate shut down.  I'm not sure this behavior is acceptable.

Doesn't that happen for most of the stuff already? There's even the data_sync_retry GUC.

> 4. Including the reasons above, this is not fully functionally.
>For example, if we execute the following commands on primary,
>replica dones't work correctly. (boom!)
> 
>=# CREATE UNLOGGED TABLE t (a int);
>=# ALTER TABLE t SET LOGGED;
> 
> 
> The following fixes are done in the attched v8.
> 
> - Rebased. Referring to Jakub and Justin's work, I replaced direct
>   access to ->rd_smgr with RelationGetSmgr() and removed calls to
>   RelationOpenSmgr(). I still separate the "ALTER TABLE ALL IN
>   TABLESPACE SET LOGGED/UNLOGGED" statement part.
> 
> - Fixed RelationCreate/DropInitFork's behavior for non-target
>   relations. (From Jakub's work).
> 
> - Fixed wording of some comments.
> 
> - As revisited, I found a bug around recovery. If the logged-ness of a
>   relation gets flipped repeatedly in a transaction, duplicate
>   pending-delete entries are accumulated during recovery and work in a
>   wrong way. sgmr_redo now adds up to one entry for a action.
> 
> - The issue 4 above is not fixed (yet).

Thanks again. If you have any list of crash-test ideas, maybe I'll have some minutes to try them out. Is there any go-to list of stuff to be checked to add confidence to this patch (as per point #2)?

BTW fast feedback regarding that ALTER patch  (there were 4 unlogged tables):
#  ALTER TABLE ALL IN TABLESPACE tbs1 set logged;
WARNING:  unrecognized node type: 349

-J.




RE: In-placre persistance change of a relation

2021-12-17 Thread Jakub Wartak
>  Justin wrote:
> On Fri, Dec 17, 2021 at 09:10:30AM +, Jakub Wartak wrote:
> > As the thread didn't get a lot of traction, I've registered it into current
> commitfest https://commitfest.postgresql.org/36/3461/ with You as the author and
> in 'Ready for review' state.
> 
> The patch is failing:
[..] 
> I think you ran "make check", but should run something like this:
> make check-world -j8 >check-world.log 2>&1 && echo Success

Hi Justin,

I've repeated check-world and everything is OK locally (also with --enable-cassert --enable-debug, at least on Amazon Linux 2), and installcheck with default params seems to be OK as well.
I don't understand why the test farm reports errors for tests like "path, polygon, rowsecurity" e.g. on Linux/graviton2 and FreeBSD but not on MacOS(?).
Could someone point me to where to start looking/fixing?

-J.




RE: In-placre persistance change of a relation

2021-12-17 Thread Jakub Wartak
> Kyotaro wrote:
> > In this version, I got rid of the "CLEANUP FORK"s, and added a new
> > system "Smgr marks".  The mark files have the name of the
> > corresponding fork file followed by ".u" (which means Uncommitted.).
> > "Uncommited"-marked main fork means the same as the
> CLEANUP2_FORKNUM
> > and uncommitted-marked init fork means the same as the
> CLEANUP_FORKNUM
> > in the previous version.x
> >
> > I noticed that the previous version of the patch still leaves an
> > orphan main fork file after "BEGIN; CREATE TABLE x; ROLLBACK; (crash
> > before checkpoint)" since the "mark" file (or CLEANUP2_FORKNUM) is
> > revmoed at rollback.  In this version the responsibility to remove the
> > mark files is moved to SyncPostCheckpoint, where main fork files are
> > actually removed.
> 
> For the record, I noticed that basebackup could be confused by the mark files
> but I haven't looked that yet.
> 

Good morning Kyotaro,

the patch didn't apply clean (it's from March; some hunks were failing), so 
I've fixed it and the combined git format-patch is attached. It did conflict 
with the following:
b0483263dda - Add support for SET ACCESS METHOD in ALTER TABLE
7b565843a94 - Add call to object access hook at the end of table 
rewrite in ALTER TABLE
9ce346eabf3 - Report progress of startup operations that take a long 
time.
f10f0ae420 - Replace RelationOpenSmgr() with RelationGetSmgr().

I'm especially worried that I screwed up or forgot something related to the last one (the rd->rd_smgr changes), but I'm getting "All 210 tests passed".

Basic demonstration of this patch (with wal_level=minimal):
create unlogged table t6 (id bigint, t text);
-- produces 110GB table, takes ~5mins
insert into t6 select nextval('s1'), repeat('A', 1000) from 
generate_series(1, 1);
alter table t6 set logged;
on baseline SET LOGGED takes: ~7min10s
on patched SET LOGGED takes: 25s

So basically, thanks to this patch, one can use the application as-is (performing classic INSERTs/UPDATEs/DELETEs, without the need to rewrite it to use COPY), perform what is literally a batch upload, and then just switch the tables to LOGGED.

Some more intensive testing also looks good, given a table prepared to put pressure on the WAL:
create unlogged table t_unlogged (id bigint, t text) partition by hash 
(id);
create unlogged table t_unlogged_h0 partition of t_unlogged  FOR VALUES 
WITH (modulus 4, remainder 0);
[..]
create unlogged table t_unlogged_h3 partition of t_unlogged  FOR VALUES 
WITH (modulus 4, remainder 3);

The workload would still be pretty heavy on LWLock/BufferContent, WALInsert and Lock/extend.
t_logged.sql = insert into t_logged select nextval('s1'), repeat('A', 
1000) from generate_series(1, 1000); # according to pg_wal_stats.wal_bytes 
generates ~1MB of WAL
t_unlogged.sql = insert into t_unlogged select nextval('s1'), 
repeat('A', 1000) from generate_series(1, 1000); # according to 
pg_wal_stats.wal_bytes generates ~3kB of WAL

so using: pgbench -f .sql -T 30 -P 1 -c 32 -j 3 t 
with synchronous_commit =ON(default):
with t_logged.sql:   tps = 229 (lat avg = 138ms)
with t_unlogged.sql  tps = 283 (lat avg = 112ms) # almost all on 
LWLock/WALWrite
with synchronous_commit =OFF:
with t_logged.sql:   tps = 413 (lat avg = 77ms)
with t_unloged.sql:  tps = 782 (lat avg = 40ms)
Afterwards switching the unlogged ~16GB partitions takes 5s per partition. 

As the thread didn't get a lot of traction, I've registered it in the current commitfest, https://commitfest.postgresql.org/36/3461/, with you as the author and in 'Ready for review' state.
I think it behaves as an almost-finished one, and after reading all those discussions about this feature going back over a 10+ year time span, plus the lot of failed effort towards wal_level=noWAL, I think it would be nice to finally start getting some of it into core.

-Jakub Wartak.


v7-0001-In-place-table-persistence-change-with-new-comman.patch
Description:  v7-0001-In-place-table-persistence-change-with-new-comman.patch


RE: track_io_timing default setting

2021-12-10 Thread Jakub Wartak
> Can we change the default setting of track_io_timing to on?

+1 for better observability by default.

> I can't imagine a lot of people who care much about its performance impact 
> will be running the latest version of PostgreSQL on ancient/weird systems 
> that have slow clock access. (And the few who do can just turn it off for 
> their system).
> For systems with fast user-space clock access, I've never seen this setting 
> being turned on make a noticeable dent in performance.  Maybe I just never 
> tested enough in the most adverse scenario (which I guess would be a huge FS 
> cache, a small shared buffers, and a high CPU count with constant churning of 
> pages that hit the FS cache but miss shared buffers--not a system I have 
> handy to do a lot of tests with.)

Coincidentally, I have some quick notes on measuring the impact of changing the "clocksource" on Linux 5.10.x (real syscall vs. vDSO optimization) with PgSQL 13.x, as input to the discussion. The thing is that the slow "xen" implementation (at least on AWS i3, Amazon Linux 2) is the default, apparently because with the faster TSC/RDTSC ones time can potentially drift backwards, e.g. during a potential(?) VM live migration. I haven't seen a better way to see what happens under the hood than strace and/or measuring a huge number of calls. Of course this only shows the impact on PgSQL as a whole (with track_io_timing=on), not just the difference between track_io_timing=on vs. off. IMHO the better knowledge (in EXPLAIN ANALYZE, autovacuum) is worth more than this potential degradation on slow clocksources.
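For reference, checking and switching it is just (root; takes effect immediately, but does not survive a reboot unless set via the clocksource= kernel parameter):

cat /sys/bus/clocksource/devices/clocksource0/available_clocksource
cat /sys/bus/clocksource/devices/clocksource0/current_clocksource
echo tsc > /sys/bus/clocksource/devices/clocksource0/current_clocksource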

With /sys/bus/clocksource/devices/clocksource0/current_clocksource=xen  
(default on most AWS instances; ins.pgb = simple insert to table with PK only 
from sequencer.):
# time ./testclock # 10e9 calls of gettimeofday()
real0m58.999s
user0m35.796s
sys 0m23.204s

//pgbench 
transaction type: ins.pgb
scaling factor: 1
query mode: simple
number of clients: 8
number of threads: 2
duration: 100 s
number of transactions actually processed: 5511485
latency average = 0.137 ms
latency stddev = 0.034 ms
tps = 55114.743913 (including connections establishing)
tps = 55115.999449 (excluding connections establishing)

With /sys/bus/clocksource/devices/clocksource0/current_clocksource=tsc :
# time ./testclock # 10e9 calls of gettimeofday()
real0m2.415s
user0m2.415s
sys 0m0.000s # XXX: notice, userland only workload, no %sys part

//pgbench:
transaction type: ins.pgb
scaling factor: 1
query mode: simple
number of clients: 8
number of threads: 2
duration: 100 s
number of transactions actually processed: 6190406
latency average = 0.123 ms
latency stddev = 0.035 ms
tps = 61903.863938 (including connections establishing)
tps = 61905.261175 (excluding connections establishing)

In addition, what could be done here - if that XXX note holds true on more platforms - is to measure many gettimeofday() calls via rusage() during startup, and log a warning suggesting to check the OS clock implementation if they take relatively too long and/or the %sys part is > 0. I don't know what to suggest for time potentially going backwards, but switching track_io_timing=on doesn't feel like it is going to make stuff crash, so again I think it is a good idea.
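BTW, a quick way to gauge exactly this per-call overhead on a given box is the bundled pg_test_timing tool:

pg_test_timing -d 3    # reports the per-loop timing cost and a histogram of observed deltas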

-Jakub Wartak.




RE: prevent immature WAL streaming

2021-10-13 Thread Jakub Wartak
On 2021-Sep-25, Alvaro Herrera wrote:
>> On 2021-Sep-24, Alvaro Herrera wrote:
>> 
>> > Here's the set for all branches, which I think are really final, in
>> > case somebody wants to play and reproduce their respective problem
>> scenarios.
>> 
>> I forgot to mention that I'll wait until 14.0 is tagged before getting 
>> anything
>> pushed.

Hi Alvaro, sorry for being late to the party, but to add some reassurance that the committed v2 fix really solves the initial production problem, I've done a limited test of it (just like with the v1-patch idea earlier: using wal_keep_segments, wal_init_zero=on, archive_mode=on and archive_command='/bin/true').

- On 12.8, I was able, like last time, to manually reproduce it on 3 out of 3 tries, and got: 2x "invalid contrecord length", 1x "there is no contrecord flag" on the standby.

- On soon-to-be-become-12.9 REL_12_STABLE (with commit 
1df0a914d58f2bdb03c11dfcd2cb9cd01c286d59 ) on 4 out of 4 tries, I've got 
beautiful insight into what happened:
LOG:  started streaming WAL from primary at 1/EC00 on timeline 1
LOG:  sucessfully skipped missing contrecord at 1/EBF8, overwritten at 
2021-10-13 11:22:37.48305+00
CONTEXT:  WAL redo at 1/EC28 for XLOG/OVERWRITE_CONTRECORD: lsn 1/EBF8; 
time 2021-10-13 11:22:37.48305+00
...and the standby was able to carry on automatically. In the 4th test, the cascade was tested too (m -> s1 -> s11) and both {s1,s11} behaved properly and logged the above message. An additional check also proved that after simulating the ENOSPC crash on the master, the data contents were identical everywhere (m1=s1=s11).

Thank you Alvaro and also to everybody else who participated in solving this 
challenging and really edge-case nasty bug.

-J.




RE: prevent immature WAL streaming

2021-08-25 Thread Jakub Wartak
Hi Álvaro, -hackers, 

> I attach the patch with the change you suggested.

I've given the v02 patch a shot on top of REL_12_STABLE (already including 5065aeafb0b7593c04d3bc5bc2a86037f32143fc). Previously (yesterday), without the v02 patch, I was always getting standby corruption via a simulation with a separate dedicated fs for /pg_xlog, archive_mode=on, wal_keep_segments=120, archive_command set to rsync to a different dir on the same fs, and wal_init_zero at its default (true).

Today (with v02) I got corruption on the standby only in the initial 2 runs out of ~>30 tries. Probably the 2 failures were somehow my fault (?) or some rare condition (and in 1 of those 2 cases simply restarting the standby did help). To be honest, I've tried hard to force this error, but with v02 I simply cannot force it anymore, so that's good! :)

> I didn't have a lot of luck with a reliable reproducer script.  I was able to
> reproduce the problem starting with Ryo Matsumura's script and attaching
> a replica; most of the time the replica would recover by restarting from a
> streaming position earlier than where the problem occurred; but a few
> times it would just get stuck with a WAL segment containing a bogus
> record.  

In order to get a reliable reproducer and proper fault injection, instead of playing with actually filling up the fs, one can apparently substitute the fd with an fd for /dev/full using e.g. dup2(), so that every write throws this error too:

root@hive:~# ./t & # simple while(1) { fprintf() flush () } testcase
root@hive:~# ls -l /proc/27296/fd/3
lrwx-- 1 root root 64 Aug 25 06:22 /proc/27296/fd/3 -> /tmp/testwrite
root@hive:~# gdb -q -p 27296
-- 1089 is bitmask O_WRONLY|..
(gdb) p dup2(open("/dev/full", 1089, 0777), 3)
$1 = 3
(gdb) c
Continuing.
==>
fflush/write(): : No space left on device
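
For completeness, the same /dev/full trick can be wrapped in a tiny standalone C 
program instead of gdb - a minimal sketch (the file name and flags are 
illustrative only):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    /* open a scratch file, then redirect its fd to /dev/full */
    int     fd = open("/tmp/testwrite", O_WRONLY | O_CREAT | O_TRUNC, 0666);
    int     full = open("/dev/full", O_WRONLY);

    if (fd < 0 || full < 0)
        return 1;
    dup2(full, fd);             /* fd now refers to /dev/full */
    if (write(fd, "x", 1) < 0)
        printf("write(): %s\n", strerror(errno));   /* ENOSPC */
    return 0;
}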

So I've also tried to be malicious while writing to the DB and inject ENOSPC 
near places like:
 
a) XLogWrite()->XLogFileInit() near line 3322 // assuming: if (wal_init_zero) 
is true, one gets classic "PANIC:  could not write to file 
"pg_wal/xlogtemp.90670": No space left on device"
b) XLogWrite() near line 2547 just after pg_pwrite // one can get "PANIC:  
could not write to log file 0001003B00A8 at offset 0, length 
15466496: No space left on device" (that would be possible with 
wal_init_zero=false?)
c) XLogWrite() near line 2592 // just before issue_xlog_fsync to get "PANIC:  
could not fdatasync file "000100430004": Invalid argument" that 
would pretty much mean same as above but with last possible offset near end of 
WAL? 

This was done with gdb voodoo:
handle SIGUSR1 noprint nostop
break xlog.c: // 
https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/transam/xlog.c#L3311
c
print fd or openLogFile -- to verify it is 3
p dup2(open("/dev/full", 1089, 0777), 3) -- during most of walwriter runtime it 
has current log as fd=3

After restarting the master and inspecting the standby - in all of the above 3 
cases - the standby didn't exhibit the "invalid contrecord length" error, at 
least here, while without this v02 patch the corruption is notorious. So if it 
passes the worst-case code review assumptions I would wonder if it shouldn't 
even be committed as it stands right now.

-J.




RE: Background writer and checkpointer in crash recovery

2021-08-02 Thread Jakub Wartak
> On Fri, Jul 30, 2021 at 4:00 PM Andres Freund  wrote:
> > I don't agree with that? If (user+system) << wall then it is very
> > likely that recovery is IO bound. If system is a large percentage of
> > wall, then shared buffers is likely too small (or we're replacing the
> > wrong
> > buffers) because you spend a lot of time copying data in/out of the
> > kernel page cache. If user is the majority, you're CPU bound.
> >
> > Without user & system time it's much harder to figure that out - at
> > least for me.
> 
> Oh, that's an interesting point. At least now I'll know why I am supposed to 
> care
> about that log line the next time I see it. I guess we could include both 
> things,
> though the line might get a little long.
> Or maybe there's some other subset that would make sense.

Hi Robert,

The email threads from [1] can serve as an indication that having a complete 
view of resource usage (user+system+elapsed) is advantageous in different 
situations and pretty platform-generic. Also, as Andres and Simon pointed out 
earlier, the insight into crash recovery/replication performance is next to 
nothing judging just by the currently emitted log messages. The more there is, 
the better, I think.

BTW, now that there's this big push for refactoring StartupXLOG(), one 
frustrating^H^H^H^H^H thing that could be done better - at least from the 
end-user point of view - is the lack of near-real-time cyclic messages (every 
1 min?) about current status, performance and maybe even ETA (simplistic case; 
assuming it is linear).

[1] - https://commitfest.postgresql.org/30/2799/ 




RE: Cosmic ray hits integerset

2021-07-07 Thread Jakub Wartak
Hi, asking out of pure technical curiosity about "the rhinoceros" - what kind of 
animal is it? Physical box or VM? How could one get dmidecode(1) / dmesg(1) / 
mcelog(1) output from what's out there (e.g. does it run ECC or not)?

-J.

> -Original Message-
> From: Alvaro Herrera 
> Sent: Tuesday, June 22, 2021 4:21 PM
> To: Thomas Munro 
> Cc: pgsql-hackers 
> Subject: Re: Cosmic ray hits integerset
> 
> On 2021-Jun-22, Thomas Munro wrote:
> 
> > Hi,
> >
> > Here's a curious one-off failure in test_integerset:
> >
> > +ERROR:  iterate returned wrong value; got 519985430528, expected
> > +485625692160
> 
> Cosmic rays indeed.  The base-2 representation of the expected value is
> 11100010001000110001100
> and that of the actual value is
> 0010001000110001100
> 
> There's a single bit of difference.
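
For what it's worth, a quick standalone check confirms that the two reported 
values differ in exactly one bit (the XOR is 2^35):

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    uint64_t expected = 485625692160ULL;
    uint64_t got = 519985430528ULL;
    uint64_t diff = expected ^ got;

    /* a non-zero power of two means exactly one flipped bit */
    printf("xor = %llu, single bit flipped: %s\n",
           (unsigned long long) diff,
           (diff != 0 && (diff & (diff - 1)) == 0) ? "yes" : "no");
    return 0;
}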





RE: Use simplehash.h instead of dynahash in SMgr

2021-05-05 Thread Jakub Wartak
Hey David,

> I think you'd have to batch by filenode and transaction in that case. Each 
> batch might be pretty small on a typical OLTP workload, so it might not help 
> much there, or it might hinder.

True, it is very workload dependent (I was chasing mainly multi-VALUES INSERTs 
and INSERT-SELECT) that often hit the same $block, certainly not OLTP. I would 
even say that INSERT-as-SELECT is better suited for DWH-like processing.

> But having said that, I don't think any of those possibilities should stop us 
> speeding up smgropen().

Of course! I've tried a couple of much smaller ideas, but without any big gains. 
I was able to squeeze out something like 300-400k function calls per second (WAL 
records/s); that was the point where I think smgropen() got abused.

> > Another potential option that we've discussed is that the redo generation
> itself is likely a brake of efficient recovery performance today (e.g. INSERT-
> SELECT on table with indexes, generates interleaved WAL records that touch
> often limited set of blocks that usually put Smgr into spotlight).
> 
> I'm not quite sure if I understand what you mean here.  Is this queuing up
> WAL records up during transactions and flush them out to WAL every so
> often after rearranging them into an order that's more optimal for replay?

Why not both? We were very concentrated on the standby side, but on the primary 
side one could also change how WAL records are generated:

1) Minimization of records touching the same repeated $block, e.g. the Heap2 
table_multi_insert() API already does this, and it matters for generating a more 
optimal stream for replay:

postgres@test=# create table t (id bigint primary key);
postgres@test=# insert into t select generate_series(1, 10);

results in many calls, due to interleaving heap and btree records for the same 
block from the Smgr perspective (this is especially visible on highly indexed 
tables) =>
rmgr: Btree   len (rec/tot): 64/64, tx:   17243284, lsn: 
4/E7000108, prev 4/E7A0, desc: INSERT_LEAF off 1, blkref #0: rel 
1663/16384/32794 blk 1
rmgr: Heaplen (rec/tot): 63/63, tx:   17243284, lsn: 
4/E7000148, prev 4/E7000108, desc: INSERT off 2 flags 0x00, blkref #0: rel 
1663/16384/32791 blk 0
rmgr: Btree   len (rec/tot): 64/64, tx:   17243284, lsn: 
4/E7000188, prev 4/E7000148, desc: INSERT_LEAF off 2, blkref #0: rel 
1663/16384/32794 blk 1
rmgr: Heaplen (rec/tot): 63/63, tx:   17243284, lsn: 
4/E70001C8, prev 4/E7000188, desc: INSERT off 3 flags 0x00, blkref #0: rel 
1663/16384/32791 blk 0
rmgr: Btree   len (rec/tot): 64/64, tx:   17243284, lsn: 
4/E7000208, prev 4/E70001C8, desc: INSERT_LEAF off 3, blkref #0: rel 
1663/16384/32794 blk 1
rmgr: Heaplen (rec/tot): 63/63, tx:   17243284, lsn: 
4/E7000248, prev 4/E7000208, desc: INSERT off 4 flags 0x00, blkref #0: rel 
1663/16384/32791 blk 0
rmgr: Btree   len (rec/tot): 64/64, tx:   17243284, lsn: 
4/E7000288, prev 4/E7000248, desc: INSERT_LEAF off 4, blkref #0: rel 
1663/16384/32794 blk 1
rmgr: Heaplen (rec/tot): 63/63, tx:   17243284, lsn: 
4/E70002C8, prev 4/E7000288, desc: INSERT off 5 flags 0x00, blkref #0: rel 
1663/16384/32791 blk 0
[..]
Similar stuff happens for UPDATE. It basically prevents the recent-buffer 
optimization that avoids repeated calls to smgropen().

And here's a sample of the already existing table_multi_insert v2 API (Heap2), 
with obvious elimination of unnecessary individual calls to smgropen() via one 
big MULTI_INSERT instead (for CTAS/COPY/REFRESH MV):
postgres@test=# create table t (id bigint primary key);
postgres@test=# copy (select generate_series (1, 10)) to '/tmp/t';
postgres@test=# copy t from '/tmp/t';
=>
rmgr: Heap2   len (rec/tot):210/   210, tx:   17243290, lsn: 
4/E928, prev 4/E8004410, desc: MULTI_INSERT+INIT 10 tuples flags 0x02, 
blkref #0: rel 1663/16384/32801 blk 0
rmgr: Btree   len (rec/tot):102/   102, tx:   17243290, lsn: 
4/E9000100, prev 4/E928, desc: NEWROOT lev 0, blkref #0: rel 
1663/16384/32804 blk 1, blkref #2: rel 1663/16384/32804 blk 0
rmgr: Btree   len (rec/tot): 64/64, tx:   17243290, lsn: 
4/E9000168, prev 4/E9000100, desc: INSERT_LEAF off 1, blkref #0: rel 
1663/16384/32804 blk 1
rmgr: Btree   len (rec/tot): 64/64, tx:   17243290, lsn: 
4/E90001A8, prev 4/E9000168, desc: INSERT_LEAF off 2, blkref #0: rel 
1663/16384/32804 blk 1
rmgr: Btree   len (rec/tot): 64/64, tx:   17243290, lsn: 
4/E90001E8, prev 4/E90001A8, desc: INSERT_LEAF off 3, blkref #0: rel 
1663/16384/32804 blk 1
[..]
Here the Btree part is very localized (at least when concurrent sessions are not 
generating WAL) and it enables Thomas's recent-buffer optimization to kick in.

DELETE is much simpler (thanks to not chewing out those Btree records) and, also 
thanks to Thomas's recent-buffer, should theoretically put much less stress on 
smgropen() already:
rmgr: Heaplen (rec/tot): 54/54, tx:   

RE: Use simplehash.h instead of dynahash in SMgr

2021-05-05 Thread Jakub Wartak
Hi David, Alvaro, -hackers 

> Hi David,
> 
> You're probably aware of this, but just to make it explicit: Jakub Wartak was
> testing performance of recovery, and one of the bottlenecks he found in
> some of his cases was dynahash as used by SMgr.  It seems quite possible
> that this work would benefit some of his test workloads.

I might be a little bit out of the loop, but as Alvaro stated, Thomas did plenty 
of excellent work related to recovery performance in that thread. In my humble 
opinion, and if I'm not mistaken (I'm speculating here), it might be *not* how 
the Smgr hash works, but how often it is being exercised, and that would also 
explain the relatively lower-than-expected(?) gains here. There are at least two 
very important emails from him that I'm aware of that touch the topic of 
reordering/compacting/batching calls to Smgr:
https://www.postgresql.org/message-id/CA%2BhUKG%2B2Vw3UAVNJSfz5_zhRcHUWEBDrpB7pyQ85Yroep0AKbw%40mail.gmail.com
https://www.postgresql.org/message-id/flat/CA%2BhUKGK4StQ%3DeXGZ-5hTdYCmSfJ37yzLp9yW9U5uH6526H%2BUeg%40mail.gmail.com

Another potential option that we've discussed is that the redo generation 
itself is likely a brake of efficient recovery performance today (e.g. 
INSERT-SELECT on table with indexes, generates interleaved WAL records that 
touch often limited set of blocks that usually put Smgr into spotlight).

-Jakub Wartak.




RE: Improve the performance to create END_OF_RECOVERY checkpoint

2020-12-22 Thread Jakub Wartak
Hi Ray,

> So can we delete the limit of ArchiveRecoveryRequested, and enable launch 
> bgwriter in master node ?

Please take a look on https://commitfest.postgresql.org/29/2706/ and the 
related email thread.

-J. 




RE: pg_preadv() and pg_pwritev()

2020-12-20 Thread Jakub Wartak


> > I'm drawing a blank on trivial candidate uses for preadv(), without
> > infrastructure from later patches.
> 
> Can't immediately think of something either.

This might not be that trivial, but maybe acquire_sample_rows() from analyze.c?

Please note however that there's a patch 
https://www.postgresql.org/message-id/20201109180644.GJ16415%40tamriel.snowman.net
 ( https://commitfest.postgresql.org/30/2799/ ) for adding fadvise, but maybe 
those two could even be combined, so you would be doing e.g. 16x fadvise() and 
then grabbing 8 pages in one preadv() call? I find it unlikely however that 
preadv() would give any additional performance benefit there after having 
fadvise, but it's clearly a potential place to test.
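
Just to illustrate what "grab 8 pages in one preadv() call" could look like, 
here is a minimal sketch (Linux-specific; BLCKSZ/NBLOCKS and the file name are 
assumptions, not anything from the patches discussed). Note that preadv() only 
covers a contiguous file range scattered into separate buffers, which is part of 
why the extra benefit over fadvise+pread is doubtful:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

#define BLCKSZ  8192
#define NBLOCKS 8

/* read NBLOCKS contiguous 8kB blocks starting at 'offset' into separate buffers */
static ssize_t
read_block_run(int fd, off_t offset, char bufs[NBLOCKS][BLCKSZ])
{
    struct iovec iov[NBLOCKS];

    for (int i = 0; i < NBLOCKS; i++)
    {
        iov[i].iov_base = bufs[i];
        iov[i].iov_len = BLCKSZ;
    }
    return preadv(fd, iov, NBLOCKS, offset);
}

int
main(void)
{
    static char bufs[NBLOCKS][BLCKSZ];
    int     fd = open("some_relation_segment", O_RDONLY);  /* illustrative path */

    if (fd < 0)
        return 1;
    return read_block_run(fd, 0, bufs) < 0;
}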

-J.




Re: automatic analyze: readahead - add "IO read time" log message

2020-11-04 Thread Jakub Wartak

Greetings Stephen,

I saw up to 410MB/s for a few seconds with this patch on NVMe, and that's a huge 
~5.2x improvement, which is amazing for such a simple patch.

The system and data were identical to last time, so the results are directly 
comparable to the previous post. The only change is that I've applied your patch 
on top of REL_12_STABLE, as this is the cluster where I had the original data. I 
think it deserves its own commitfest entry, right? Or do we combine those two?

Each test was conducted with 
1. alter system set effective_io_concurrency=$N; 
2. echo 3 > /proc/sys/vm/drop_caches; 3x sync; /etc/init.d/postgresql-12 
restart;
3. analyze verbose t;
4. the measurements were observed by pidstat -d and iostat -dxm.

no readahead (blockdev --setra 0), cold cache:
effective_io_concurrency=64; => ~26MB/s
effective_io_concurrency=32; => ~29MB/s
effective_io_concurrency=8; => ~31MB/s
effective_io_concurrency=4; => ~32MB/s
effective_io_concurrency=1; => ~32MB/s
effective_io_concurrency=0; => ~32MB/s

readahead=256 (128kB, Linux defaults), cold cache:
effective_io_concurrency=0; => ~60..68MB/s (analyze total elapsed time 3min 22s 
, baseline)
effective_io_concurrency=1; => ~62MB/s
effective_io_concurrency=4; => ~370MB/s
effective_io_concurrency=8; => ~380..410MB/s <-- wow! up to 52k read IOPS, CPU 
util 100%
effective_io_concurrency=16; => ~320..390MB/s (analyze total elapsed time ~39s)

I think one needs to be sure to restart, so that the autovacuum workers get the 
new effective_io_concurrency; maybe that's a documentation thing only.

I have just one question: with this patch, the strace of analyze I/O looks like 
below:

pread(72, ..., 8192, 738181120) = 8192
fadvise64(72, 738500608, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 738607104, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 738754560, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 738762752, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739115008, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739155968, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739385344, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739696640, 8192, POSIX_FADV_WILLNEED) = 0
pread(72, "..., 8192, 738500608) = 8192
fadvise64(72, 738607104, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 738754560, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 738762752, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739115008, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739155968, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739385344, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739696640, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739934208, 8192, POSIX_FADV_WILLNEED) = 0
pread(72, .., 8192, 738607104) = 8192
fadvise64(72, 738754560, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 738762752, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739115008, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739155968, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739385344, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739696640, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739934208, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 740122624, 8192, POSIX_FADV_WILLNEED) = 0
pread(72, ..., 8192, 738754560) = 8192
fadvise64(72, 738762752, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739115008, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739155968, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739385344, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739696640, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739934208, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 740122624, 8192, POSIX_FADV_WILLNEED) = 0

If you follow "738754560" in the output, it appears that the fadvise syscalls 
for an offset are repeated until it is finally pread() - that offset was asked 
for 3 times. Also, I wouldn't have imagined in my wildest dreams that 
posix_fadvise(POSIX_FADV_WILLNEED) is such a cheap syscall.
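
To make that pattern concrete, here is my rough reconstruction (not the actual 
patch) of a "hint a window ahead, then pread the current block" loop. Written 
naively like this it re-advises the whole window on every iteration - which is 
exactly the duplication visible in the strace above - whereas hinting only the 
block that newly enters the window would avoid it:

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <unistd.h>

#define BLCKSZ 8192

/* offsets[] holds the sorted byte offsets of the sampled blocks */
static void
sample_with_prefetch(int fd, const off_t *offsets, int nblocks, int window, char *buf)
{
    for (int i = 0; i < nblocks; i++)
    {
        /* hint the next 'window' sample blocks ... */
        for (int j = i + 1; j < nblocks && j <= i + window; j++)
            (void) posix_fadvise(fd, offsets[j], BLCKSZ, POSIX_FADV_WILLNEED);

        /* ... then synchronously read the current one */
        (void) pread(fd, buf, BLCKSZ, offsets[i]);
    }
}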

-J.



From: Stephen Frost
Sent: Tuesday, November 3, 2020 6:47 PM
To: Jakub Wartak
Cc: pgsql-hackers
Subject: Re: automatic analyze: readahead - add "IO read time" log message 

Greetings,

* Jakub Wartak (jakub.war...@tomtom.com) wrote:
> >Interesting that you weren't seeing any benefit to disabling readahead.
> 
> I've got some free minutes and I have repeated the exercise in more realistic
> and strict environment that previous one to conclude that the current 
> situation is preferable:

Thanks for spending time on this!

> Analyzed table was having 171GB (as reported by \dt+) + indexes: 35GB, 147GB, 
> 35GB, 65GB (as reported by \di+)
> Linux kernel 4.14.x, 2x NVME under dm-0 (it might matter as /dev/dm-0 might 
> is different layer and might have different storage settings), VG on top of 
> dm-0, LV with stripe-size 8kB, ext4.
> s_b=128MB, RAM=128GB (- ~30GB which were reserved for HugePages), typical 
> output of PgSQL12: 
> INFO

Re: automatic analyze: readahead - add "IO read time" log message

2020-11-03 Thread Jakub Wartak
Hi Stephen, hackers,

>> > With all those 'readahead' calls it certainly makes one wonder if the
>> > Linux kernel is reading more than just the block we're looking for
>> > because it thinks we're doing a sequential read and will therefore want
>> > the next few blocks when, in reality, we're going to skip past them,
>> > meaning that any readahead the kernel is doing is likely just wasted
>> > I/O.
>> I've done some quick tests with blockdev --setra/setfra 0 after 
>> spending time looking at the smgr/md/fd API changes required to find 
>> shortcut, but I'm getting actually a little bit worse timings at least on 
>> "laptop DB tests". One thing that I've noticed is that needs to be only for 
>> automatic-analyze, but not for automatic-vacuum where apparently there is 
>> some boost due to readahead.

>Interesting that you weren't seeing any benefit to disabling readahead.

I've got some free minutes and have repeated the exercise in a more realistic 
and strict environment than the previous one, to conclude that the current 
situation is preferable:

Analyzed table was having 171GB (as reported by \dt+) + indexes: 35GB, 147GB, 
35GB, 65GB (as reported by \di+)
Linux kernel 4.14.x, 2x NVME under dm-0 (it might matter as /dev/dm-0 is a 
different layer and might have different storage settings), VG on top of dm-0, 
LV with stripe-size 8kB, ext4.
s_b=128MB, RAM=128GB (- ~30GB which were reserved for HugePages), typical 
output of PgSQL12: 
INFO:  "x": scanned 150 of 22395442 pages, containing 112410444 live rows 
and 0 dead rows; 150 rows in sample, 1678321053 estimated total rows

Hot VFS cache:
Run0: Defaults, default RA on dm-1=256 (*512=128kB), most of the time is spent 
heapam_scan_analyze_next_block() -> .. -> pread() which causes ~70..80MB/s as 
reported by pidstat, maximum 22-25% CPU, ~8k IOPS in iostat with average 
request size per IO=25 sectors(*512/1024 = ~12kB), readahead on, hot caches, 
total elapsed ~3m
Run1: Defaults, similar as above (hot VFS cache), total elapsed 2m:50s
Run2: Defaults, similar as above (hot VFS cache), total elapsed 2m:42s
Run3: Defaults, similar as above (hot VFS cache), total elapsed 2m:40s

No VFS cache:
Run4: echo 3 > drop_caches, still with s_b=128MB: maximum 18-23% CPU, ~70MB/s 
read, ondemand_readahead visible in perf, total elapsed 3m30s
Run5: echo 3 > drop_caches, still with s_b=128MB: same as above, total elapsed 
3m29s
Run6: echo 3 > drop_caches, still with s_b=128MB: same as above, total elapsed 
3m28s

No VFS cache, readahead off:
Run7: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 /dev/dm-0: 
reads at 33MB/s, ~13% CPU, 8.7k read IOPS @ avgrq-sz = 11 sectors (*512=5.5kB), 
total elapsed 5m59s
Run8: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 /dev/dm-0: 
as above, double-confirmed no readaheads [ 
pread()->generic_file_read_iter()->ext4_mpage_readpages()-> bio.. ], total 
elapsed 5m56s
Run9: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 /dev/dm-0: 
as above, total elapsed 5m55s

One thing not clear here, and maybe worth measuring in the future, is how 
striped LVs are affected by readahead.

>Were you able to see where the time in the kernel was going when
>readahead was turned off for the ANALYZE?

Yes, my interpretation is that the time is spent directly waiting in the block 
I/O layer.

54.67% 1.33%  postgres  postgres[.] FileRead
---FileRead
--53.33%--__pread_nocancel
   --50.67%--entry_SYSCALL_64_after_hwframe
 do_syscall_64
 sys_pread64
 |--49.33%--vfs_read
 |   --48.00%--__vfs_read
 | 
|--45.33%--generic_file_read_iter
 | |  
|--42.67%--ondemand_readahead
 | |  | 
 __do_page_cache_readahead
 | |  | 
 |--25.33%--ext4_mpage_readpages
 | |  | 
 |  |--10.67%--submit_bio
 | |  | 
 |  |  generic_make_request
 | |  | 
 |  |  |--8.00%--blk_mq_make_request
 | |  | 
 |  |  |  |--4.00%--blk_mq_get_request
 | |  | 
 |  |  |  |  |--1.33%--blk_mq_get_tag
 | |  | 
 |  |  |  |   

Re: automatic analyze: readahead - add "IO read time" log message

2020-10-27 Thread Jakub Wartak
Hi Stephen, hackers,

> The analyze is doing more-or-less random i/o since it's skipping through
> the table picking out select blocks, not doing regular sequential i/o.
VS
>> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, 
>> blockno=19890910, bstrategy=0x1102278) at heapam_handler.c:984
>> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, 
>> blockno=19890912, bstrategy=0x1102278) at heapam_handler.c:984
>> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, 
>> blockno=19890922, bstrategy=0x1102278) at heapam_handler.c:984
>Not really sure what's interesting here, but it does look like we're
>skipping through the table as expected.

Yes, but not randomly in this case. I wanted to point out that the block numbers 
are incrementing (hence the debug output I included), which might trigger the 
readahead heuristics. Perhaps this depends on how the table was built / 
vacuumed? (In this case, pure-INSERT-only; I would expect the same in 
time-series DBs and DWHs.)

> With all those 'readahead' calls it certainly makes one wonder if the
> Linux kernel is reading more than just the block we're looking for
> because it thinks we're doing a sequential read and will therefore want
> the next few blocks when, in reality, we're going to skip past them,
> meaning that any readahead the kernel is doing is likely just wasted
> I/O.

I've done some quick tests with blockdev --setra/setfra 0 after spending time 
looking at the smgr/md/fd API changes required to find a shortcut, but I'm 
actually getting slightly worse timings, at least in "laptop DB tests". One 
thing I've noticed is that this would need to apply only to automatic analyze, 
not to automatic vacuum, where apparently there is some boost due to readahead.

> That definitely seems like a useful thing to include and thanks for the
> patch!  Please be sure to register it in the commitfest app:
> https://commitfest.postgresql.org

Thank you! Thread is now registered.

> I would think that, ideally, we'd teach analyze.c to work in the same
> way that bitmap heap scans do- that is, use posix_fadvise to let the
> kernel know what pages we're going to want next instead of the kernel
> guessing (incorrectly) or not doing any pre-fetching.  I didn't spend a
> lot of time poking, but it doesn't look like analyze.c tries to do any
> prefetching today.  In a similar vein, I wonder if VACUUM should be
> doing prefetching too today, at least when it's skipping through the
> heap based on the visibility map and jumping over all-frozen pages.

My only idea would be that a lot of those blocks could be read asynchronously in 
batches (AIO), with POSIX_FADV_RANDOM issued on the block range beforehand, so 
maybe the optimization is possible, but not until we have AIO ;)

> Haven't looked too closely at this but in general +1 on the idea and
> this approach looks pretty reasonable to me.  Only thing I can think of
> off-hand is to check how it compares to other places where we report IO
> read time and make sure that it looks similar.

Ok, I've changed the output in 002 version to include "avg read rate" just like 
in the autovacuum case but still maintaining single line output, e.g:  
automatic analyze of table "test.public.t1_default" avg read rate: 96.917 MB/s 
(read time: 2.52 s), system usage: CPU: user: 0.28 s, system: 0.26 s, elapsed: 
2.94 s

-J.diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 8af12b5c6b..38d69d887e 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -312,6 +312,8 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
Oid save_userid;
int save_sec_context;
int save_nestlevel;
+   PgStat_Counter startreadtime = 0;
+   longstartblksread = 0;
 
if (inh)
ereport(elevel,
@@ -347,6 +349,8 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
{
pg_rusage_init();
+   startreadtime = pgStatBlockReadTime;
+   startblksread = pgBufferUsage.shared_blks_read;
if (params->log_min_duration > 0)
starttime = GetCurrentTimestamp();
}
@@ -685,12 +689,27 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
if (params->log_min_duration == 0 ||
TimestampDifferenceExceeds(starttime, 
GetCurrentTimestamp(),
   
params->log_min_duration))
-   ereport(LOG,
-   (errmsg("automatic analyze of table 
\"%s.%s.%s\" system usage: %s",
-   
get_database_name(MyDatabaseId),
-   

automatic analyze: readahead - add "IO read time" log message

2020-10-26 Thread Jakub Wartak
Greetings hackers,

I have, I hope, an interesting observation (and a nano patch proposal) from a 
system where statistics freshness is a critical factor. Autovacuum/statistics 
auto-gathering was tuned to be very aggressive:
autovacuum_vacuum_cost_delay=0 (makes autovacuum_vacuum_cost_limit irrelevant)
autovacuum_naptime=1s
autovacuum_max_workers=4

Some critical table partitions are configured with:  
autovacuum_analyze_scale_factor=0.001, autovacuum_analyze_threshold=5 to 
force auto-analyze jobs pretty often. The interesting logs are like this:
automatic analyze of table "t1" system usage: CPU: user: 37.52 s, system: 23.01 
s, elapsed: 252.14 s
automatic analyze of table "t2" system usage: CPU: user: 38.70 s, system: 22.63 
s, elapsed: 317.33 s
automatic analyze of table "t2" system usage: CPU: user: 39.38 s, system: 21.43 
s, elapsed: 213.58 s
automatic analyze of table "t1" system usage: CPU: user: 37.91 s, system: 24.49 
s, elapsed: 229.45 s

and this is the root cause of my question. As you can see, there is a huge 3x-4x 
discrepancy between "elapsed" and user+system time, which is strange, at least 
to me, as there should be no waiting (autovacuum_vacuum_cost_delay=0). According 
to various tools it is true: time was wasted somewhere else, but not in the 
PostgreSQL analyze itself. ps(1) and pidstat(1) also report the same for the 
worker:

06:56:12 AM   PID%usr %system  %guest%CPU   CPU  Command
06:56:13 AM1147748.00   10.000.00   18.0018  postgres
06:56:14 AM1147748.00   11.000.00   19.0015  postgres
06:56:15 AM1147745.00   13.000.00   18.0018  postgres

06:56:17 AM   PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
06:56:18 AM114774  63746.53  0.00  0.00  postgres
06:56:19 AM114774  62896.00  0.00  0.00  postgres
06:56:20 AM114774  62920.00  0.00  0.00  postgres

One could argue that such an autoanalyze worker could perform 5x more work (%CPU 
-> 100%) here. The I/O system is not doing much (total = 242MB/s reads @ 22k 
IOPS, 7MB/s writes @ 7k IOPS, with service time 0.04ms); although it reports 
high utilization, I'm pretty sure it could push much more. There can be up to 
3x-4x of such 70-80MB/s analyzes in parallel (let's say 300MB/s for statistics 
collection alone).

According to various gdb backtraces, a lot of time is spent here:
#0  0x7f98cdfc9f73 in __pread_nocancel () from /lib64/libpthread.so.0
#1  0x00741a16 in pread (__offset=811253760, __nbytes=8192, 
__buf=0x7f9413ab7280, __fd=) at /usr/include/bits/unistd.h:84
#2  FileRead (file=, buffer=0x7f9413ab7280 "\037\005", 
amount=8192, offset=811253760, wait_event_info=167772173) at fd.c:1883
#3  0x00764b8f in mdread (reln=, forknum=, blocknum=19890902, buffer=0x7f9413ab7280 "\037\005") at md.c:596
#4  0x0073d69b in ReadBuffer_common (smgr=, 
relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=19890902, 
mode=RBM_NORMAL, strategy=0x1102278, hit=0x7fffba7e2d4f)
at bufmgr.c:897
#5  0x0073e27e in ReadBufferExtended (reln=0x7f98c0c9ded0, 
forkNum=MAIN_FORKNUM, blockNum=19890902, mode=, 
strategy=) at bufmgr.c:665
#6  0x004c2e2f in heapam_scan_analyze_next_block (scan=, 
blockno=, bstrategy=) at heapam_handler.c:998
#7  0x00597de1 in table_scan_analyze_next_block (bstrategy=, blockno=, scan=0x10c8098) at 
../../../src/include/access/tableam.h:1462
#8  acquire_sample_rows (onerel=0x7f98c0c9ded0, elevel=13, rows=0x1342e08, 
targrows=150, totalrows=0x7fffba7e3160, totaldeadrows=0x7fffba7e3158) at 
analyze.c:1048
#9  0x00596a50 in do_analyze_rel (onerel=0x7f98c0c9ded0, 
params=0x10744e4, va_cols=0x0, acquirefunc=0x597ca0 , 
relpages=26763525, inh=false,
in_outer_xact=false, elevel=13) at analyze.c:502
[..]
#12 0x006e76b4 in autovacuum_do_vac_analyze (bstrategy=0x1102278, 
tab=) at autovacuum.c:3118
[..]

The interesting thing is that targrows=1.5 million and that blocks are accessed 
(as expected) in sorted order:

Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890910, 
bstrategy=0x1102278) at heapam_handler.c:984
Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890912, 
bstrategy=0x1102278) at heapam_handler.c:984
Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890922, 
bstrategy=0x1102278) at heapam_handler.c:984
Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890935, 
bstrategy=0x1102278) at heapam_handler.c:984
Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890996, 
bstrategy=0x1102278) at heapam_handler.c:984

And this probably explains the discrepancy: perf with CPU usage reporting shows 
a lot of frames waiting on I/O in readaheads done by ext4 (trimmed for clarity):

# Children  Self   sys   usr  Command   Shared Object   Symbol
63.64% 0.00% 0.00% 0.00%  postgres  [kernel.kallsyms]   [k] 
entry_SYSCALL_64_after_hwframe

Re: Parallelize stream replication process

2020-09-17 Thread Jakub Wartak
Li Japin wrote:

> If we can improve the efficiency of replay, then we can shorten the database 
> recovery time (streaming replication or database crash recovery). 
(..)
> For streaming replication, we may need to improve the transmission of WAL 
> logs to improve the entire recovery process.
> I’m not sure if this is correct.

Hi, 

If you are interested in increased efficiency of WAL replay internals/startup 
performance then you might be interested in following threads:

Cache relation sizes in recovery - 
https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BNPZeEdLXAcNr%2Bw0YOZVb0Un0_MwTBpgmmVDh7No2jbg%40mail.gmail.com#feace7ccbb8e3df8b086d0a2217df91f
Faster compactify_tuples() - 
https://www.postgresql.org/message-id/flat/ca+hukgkmqfvpjr106grhwk6r-nxv0qoctrezuqzxgphesal...@mail.gmail.com
Handing off SLRU fsyncs to the checkpointer - 
https://www.postgresql.org/message-id/flat/CA%2BhUKGLJ%3D84YT%2BNvhkEEDAuUtVHMfQ9i-N7k_o50JmQ6Rpj_OQ%40mail.gmail.com
Optimizing compactify_tuples() - 
https://www.postgresql.org/message-id/flat/CA%2BhUKGKMQFVpjr106gRhwk6R-nXv0qOcTreZuQzxgpHESAL6dw%40mail.gmail.com
Background bgwriter during crash recovery - 
https://www.postgresql.org/message-id/flat/ca+hukgj8nrsqgkzensnrc2mfrobv-jcnacbyvtpptk2a9yy...@mail.gmail.com
WIP: WAL prefetch (another approach) - 
https://www.postgresql.org/message-id/flat/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
Division in dynahash.c due to HASH_FFACTOR - 
https://www.postgresql.org/message-id/flat/VI1PR0701MB696044FC35013A96FECC7AC8F62D0%40VI1PR0701MB6960.eurprd07.prod.outlook.com
[PATCH] guc-ify the formerly hard-coded MAX_SEND_SIZE to max_wal_send - 
https://www.postgresql.org/message-id/flat/CACJqAM2uAUnEAy0j2RRJOSM1UHPdGxCr%3DU-HbqEf0aAcdhUoEQ%40mail.gmail.com
Unnecessary delay in streaming replication due to replay lag - 
https://www.postgresql.org/message-id/flat/CANXE4Tc3FNvZ_xAimempJWv_RH9pCvsZH7Yq93o1VuNLjUT-mQ%40mail.gmail.com
WAL prefetching in future combined with AIO (IO_URING) - longer term future, 
https://anarazel.de/talks/2020-05-28-pgcon-aio/2020-05-28-pgcon-aio.pdf

A good way to start is to profile the system to see what is taking time during 
your failover situation OR your normal hot-standby behavior, and then proceed to 
identifying and characterizing the main bottleneck - there can be many, 
depending on the situation (inefficient single-process PostgreSQL code, 
CPU-bound startup/recovery, IOPS/VFS/syscall/API limitations, single-TCP-stream 
throughput limitations, single-TCP-stream latency impact over WAN, contention on 
locks in the hot-standby case...).

Some of the above are already committed for 14/master, some are not and require 
further discussion and testing. Without real identification of the bottleneck 
and the WAL stream statistics you are facing, it's hard to say how parallel WAL 
recovery would improve the situation.

-J.




Re: Optimising compactify_tuples()

2020-09-15 Thread Jakub Wartak
David Rowley wrote:

> I've attached patches in git format-patch format. I'm proposing to commit 
> these in about 48 hours time unless there's some sort of objection before 
> then.

Hi David, no objections at all - I've just got reaffirming results here. As per 
[1] (the SLRU thread, but with combined results from the qsort testing), I've 
repeated the crash-recovery tests here again:

TEST0a: check-world passes
TEST0b: brief check: the DB after recovery returns correct data which was 
present only in the WAL stream - SELECT sum(c) from sometable

TEST1: workload profile test as per standard TPC-B [2], with the majority of 
records in the WAL stream being Heap/HOT_UPDATE, on the same system with NVMe as 
described there.

results of master (62e221e1c01e3985d2b8e4b68c364f8486c327ab) @ 15/09/2020 as 
baseline:
15.487, 1.013
15.789, 1.033
15.942, 1.118

the profile looks mostly similar:
17.14%  postgres  libc-2.17.so[.] __memmove_ssse3_back
---__memmove_ssse3_back
   compactify_tuples
   PageRepairFragmentation
   heap2_redo
   StartupXLOG
 8.16%  postgres  postgres[.] hash_search_with_hash_value
---hash_search_with_hash_value
   |--4.49%--BufTableLookup
[..]
--3.67%--smgropen

master with 2 patches by David 
(v8-0001-Optimize-compactify_tuples-function.patch + 
v8-0002-Report-resource-usage-at-the-end-of-recovery.patch): 
14.236, 1.02
14.431, 1.083
14.256, 1.02

so 9-10% faster in this simple verification check. If I had pgbench running, the 
result would probably be better. The profile is similar:

13.88%  postgres  libc-2.17.so[.] __memmove_ssse3_back
---__memmove_ssse3_back
--13.47%--compactify_tuples

10.61%  postgres  postgres[.] hash_search_with_hash_value
---hash_search_with_hash_value
   |--5.31%--smgropen
[..]
--5.31%--BufTableLookup


TEST2: update-only test, just as you performed in [3] to trigger the hotspot, 
with table fillfactor=85 and update.sql (100% updates, ~40% Heap/HOT_UPDATE 
[N], ~40-50% [record sizes]) with slightly different amount of data.

results of master as baseline:
233.377, 0.727
233.233, 0.72
234.085, 0.729

with profile:
24.49%  postgres  postgres  [.] pg_qsort
17.01%  postgres  postgres  [.] PageRepairFragmentation
12.93%  postgres  postgres  [.] itemoffcompare
(sometimes I saw also a ~13% swapfunc)

results of master with above 2 patches, 2.3x speedup:
101.6, 0.709
101.837, 0.71
102.243, 0.712

with profile (so yup the qsort is gone, hurray!):

32.65%  postgres  postgres  [.] PageRepairFragmentation
---PageRepairFragmentation
   heap2_redo
   StartupXLOG
10.88%  postgres  postgres  [.] compactify_tuples
---compactify_tuples
 8.84%  postgres  postgres  [.] hash_search_with_hash_value

BTW: this message "redo done at 0/9749FF70 system usage: CPU: user: 13.46 s, 
system: 0.78 s, elapsed: 14.25 s" is a priceless addition :)

-J.

[1] - 
https://www.postgresql.org/message-id/flat/VI1PR0701MB696023DA7815207237196DC8F6570%40VI1PR0701MB6960.eurprd07.prod.outlook.com#188ad4e772615999ec427486d1066948
[2] - pgbench -i -s 100, pgbench -c8 -j8 -T 240, ~1.6GB DB with 2.3GB after 
crash in pg_wal to be replayed
[3] - 
https://www.postgresql.org/message-id/CAApHDvoKwqAzhiuxEt8jSquPJKDpH8DNUZDFUSX9P7DXrJdc3Q%40mail.gmail.com
 , in my case: pgbench -c 16 -j 16 -T 240 -f update.sql , ~1GB DB with 4.3GB 
after crash in pg_wal to be replayed



Re: Division in dynahash.c due to HASH_FFACTOR

2020-09-08 Thread Jakub Wartak
Hi Thomas, Tomas :), Alvaro, hackers,

>> > After removing HASH_FFACTOR PostgreSQL still compiles...  Would
>> > removing it break some external API/extensions ?
>>
>> FWIW, HASH_FFACTOR has *never* been used in Postgres core code.
[..]
>> I think if we know that there's a 4% performance increase by removing
>> the division that comes with an ability to use a custom fillfactor that
>> nobody uses or has ever used, it makes sense to remove that ability.

Thomas wrote:
>I think Tomas is right, and the correct fix for this would be to
>compute it up front every time the hash table size changes, but since
>apparently no one ever used it, +1 for just removing it and leaving it
>hard-coded.
[..]
>For what it's worth, for dshash.c I just hard coded it to double the
>hash table size whenever the number of elements in a partition
>exceeded 75%.  Popular hash tables in standard libraries seem to use
>either .75 or 1.0 as a default or only setting.  There's probably room
>for discussion about numbers in those ranges, (..)

Tomas also asked the same question - "what if you lower the fill factor, e.g. to 
0.8? Does that improve performance or not?" - and there were some interesting 
links about avoiding bias. I couldn't find any other simple way to benchmark the 
real impact of it other than checking DB crash recovery (if anyone has an 
artificial workload generator that stresses PinBuffer()->hash_search(), please 
let me know).

So I've been testing it using WAL crash recovery with Thomas's approach [1]: 
always the same workload to replay, always on the same machine, same env: the 
same 13GB of WAL to recover, 8GB DB, 24GB shared buffers on top of 128GB RAM, 
NVMe with fsync off to make dynahash the primary bottleneck 
(BufTableLookup()+smgropen()), append-only workload, gcc -O3. Results are <1st 
time is the WAL recovery timing, 2nd time is the checkpoint before opening the 
DB>. There were four measurements per scenario; the first one is discarded to 
avoid noise:

1. DEF_FFACTOR=1 idiv on
103.971, 3.719
103.937, 3.683
104.934, 3.691

2. DEF_FFACTOR=1 idiv off
100.565, 3.71
101.595, 3.807
100.794, 3.691

3. DEF_FFACTOR=1.2 idiv on // some proof that nobody uses it
1599552729.041 postmaster 94510 FATAL:  failed to initialize hash table 
"SERIALIZABLEXID hash"

4. DEF_FFACTOR=0.8 idiv on // another proof? The reason for the crash below is 
that ffactor is a long and as such cannot hold floats - 0.8 truncates to 0, so 
init_htab() divides by zero; it would work if it were a float

Program terminated with signal 8, Arithmetic exception.
#0  0x009a4119 in init_htab (nelem=4, hashp=0x17cd238) at dynahash.c:677
677 nbuckets = next_pow2_int((nelem - 1) / hctl->ffactor + 1);
Missing separate debuginfos, use: debuginfo-install 
glibc-2.17-292.180.amzn1.x86_64

#0  0x009a4119 in init_htab (nelem=4, hashp=0x17cd238) at dynahash.c:677
#1  hash_create (tabname=tabname@entry=0xb77f9a "Timezones", 
nelem=nelem@entry=4, info=info@entry=0x7ffd3055cf30, flags=flags@entry=16) at 
dynahash.c:529
#2  0x009ecddf in init_timezone_hashtable () at pgtz.c:211
#3  pg_tzset (name=name@entry=0xb49fd5 "GMT") at pgtz.c:248
#4  0x009ecfee in pg_timezone_initialize () at pgtz.c:372

5. DEF_FFACTOR=1 idiv on // just in case reproducer to validate measurement #1
104.333, 3.804
104.313, 3.772
103.951, 3.687

6. DEF_FFACTOR=2 idiv on
105.406, 3.783
105.33, 3.721
105.403, 3.777

7. DEF_FFACTOR=4 idiv on
105.803, 3.722
105.73, 3.728
106.104, 3.756

Some notes:
- SMgrRelationHash is initialized from the start at 400, while the DB here is a 
small 8GB one and has only a couple of tables -> no expansion needed in the 
above test.
- the local backend-private overflow hash table for buffers, 
PrivateRefCountHash, is initialized at 100 and maybe it grows during the run
- googling for DEF_FFACTOR I've found only 1 entry with a value <> 1, from 1993 
(see [2]; what's interesting is the same DEF_SEGSIZE value after all those years)

Alvaro wrote:
>> ... however, if we're really tense about improving some hash table's
>> performance, it might be worth trying to use simplehash.h instead of
>> dynahash.c for it.

Thomas wrote:
> Sure, but removing the dynahash IDIV also seems like a slam dunk removal of a 
> useless expensive feature.

I agree with both; I just thought it might be an interesting finding, as this 
idiv might (?) be present in other common paths like ReadBuffer*() / PinBuffer() 
(some recent discussions, maybe on NUMA boxes), not just WAL recovery, and it 
seems relatively easy to improve.
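
Just to spell out the "compute it up front when the table resizes" idea in code 
form - a toy sketch with made-up names, not the actual dynahash structures:

#include <stdbool.h>

typedef struct ToyHashHdr
{
    long    nentries;           /* current number of entries */
    long    max_bucket;         /* highest bucket number in use */
    long    ffactor;            /* target fill factor */
    long    expand_threshold;   /* precomputed: (max_bucket + 1) * ffactor */
} ToyHashHdr;

/* recompute only when the table grows (max_bucket changes), not per lookup */
static void
toy_update_threshold(ToyHashHdr *hctl)
{
    hctl->expand_threshold = (hctl->max_bucket + 1) * hctl->ffactor;
}

/* hot path: a plain compare instead of an idiv on every hash_search() */
static bool
toy_needs_expansion(const ToyHashHdr *hctl)
{
    return hctl->nentries >= hctl->expand_threshold;
}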

-J.

[1] - https://github.com/macdice/redo-bench
[2] - https://fuhrwerks.com/csrg/info/93c40a660b6cdf74


From: Thomas Munro 
Sent: Tuesday, September 8, 2020 2:55 AM
To: Alvaro Herrera 
Cc: Jakub Wartak ; pgsql-hackers 

Subject: Re: Division in dynahash.c due to HASH_FFACTOR

On Sat, Sep 5, 2020 at 2:34 AM Alvaro Herrera  wrote:
> On 2020-Sep-04, Jakub Wartak wrote:

Division in dynahash.c due to HASH_FFACTOR

2020-09-04 Thread Jakub Wartak

Greetings hackers,

I have mixed feelings about whether this is a welcome contribution, as the 
potential gain is relatively small in my tests, but I would still like to point 
out that the HASH_FFACTOR functionality in dynahash.c could be removed or 
optimized (the default fill factor is always 1; there's not a single place that 
uses a custom fill factor other than DEF_FFACTOR=1 inside the PostgreSQL 
repository). Because the functionality is present, there is a division for every 
buffer access [BufTableLookup()] and every smgropen() call (every call to 
hash_search() is affected, provided it's not ShmemInitHash/HASH_PARTITION). 
This division is especially visible via perf on the single-process StartupXLOG 
WAL recovery process on a standby in heavy-duty 100% CPU conditions, as the 
top-1 entry is inside hash_search:
   0x00888751 <+449>:   idiv   r8
   0x00888754 <+452>:   cmp    rax,QWORD PTR [r15+0x338] <<-- in perf 
annotate shows as 30-40%, even on default -O2, probably CPU pipelining for idiv 
above

I've made a PoC test to skip that division assuming ffactor would be gone:
   if (!IS_PARTITIONED(hctl) && !hashp->frozen &&
-                       hctl->freeList[0].nentries / (long) (hctl->max_bucket + 
1) >= hctl->ffactor &&
+                       hctl->freeList[0].nentries >= (long) (hctl->max_bucket 
+ 1) &&

For a 3.7GB stream of WAL I'm getting a consistent improvement of ~4% (yes, I 
know it's small; that's why I have mixed feelings):
gcc -O3: 104->100s
gcc -O2: 108->104s
pgbench -S -c 16 -j 4 -T 30 -M prepared: stays more or less the same (-s 100), 
so no positive impact there

After removing HASH_FFACTOR, PostgreSQL still compiles... Would removing it 
break some external API/extensions? I saw several ways the "idiv" itself could 
be optimized, e.g. see https://github.com/ridiculousfish/libdivide. Or maybe 
there is some other idea for exposing the bottlenecks of BufTableLookup()? I 
also saw the codepath PinBuffer()->GetPrivateRefCountEntry()->dynahash, which 
could be called pretty often; I have no idea what kind of pgbench stress test 
could be used to demonstrate the gain (or lack of it).

-Jakub Wartak.



Re: Handing off SLRU fsyncs to the checkpointer

2020-08-31 Thread Jakub Wartak
Hi Thomas, hackers,

>> ... %CPU ... COMMAND
>> ... 97.4 ... postgres: startup recovering 00010089
> So, what else is pushing this thing off CPU, anyway?  For one thing, I
> guess it might be stalling while reading the WAL itself, because (1)
> we only read it 8KB at a time, relying on kernel read-ahead, which
> typically defaults to 128KB I/Os unless you cranked it up, but for
> example we know that's not enough to saturate a sequential scan on
> NVME system, so maybe it hurts here too (2) we keep having to switch
> segment files every 16MB.  Increasing WAL segment size and kernel
> readahead size presumably help with that, if indeed it is a problem,
> but we could also experiment with a big POSIX_FADV_WILLNEED hint for a
> future segment every time we cross a boundary, and also maybe increase
> the size of our reads.

All of the above (1,2) would make sense, and the effects IMHO are partially 
achievable via ./configure compile options, but from previous correspondence [1] 
on this particular workload, it looked like it was not WAL reading but reading 
random DB blocks into shared buffers: in that case I suppose it was the price of 
too many syscalls to the OS/VFS cache itself, as the DB was small and fully 
cached there - so problem (3): 
copy_user_enhanced_fast_string <- 17.79%--copyout (!) <- __pread_nocancel <- 
16.56%--FileRead / mdread / ReadBuffer_common (!). Without some 
micro-optimization or some form of vectorized [batched] I/O in recovery, it's a 
dead end when it comes to small changes. Things that come to my mind for 
enhancing recovery:
- preadv() - works only on 1 fd, while the WAL stream might require reading a 
lot of random pages into s_b (many relations/fds; even btree inserts into a 
single relation might put data into many 1GB [default] forks). This would only 
micro-optimize INSERT (pk) SELECT nextval(seq) kinds of processing on the 
recovery side, I suppose - of course provided that StartupXLOG worked in a more 
batched way: (a) reading a lot of blocks from WAL at once, (b) then issuing 
preadv() to get all the DB blocks into s_b for the same rel/fd, (c) applying 
WAL. Sounds like a major refactor just to save syscalls :(
- mmap() - even more unrealistic
- IO_URING - gives a lot of promise here I think; is it even planned to be shown 
for the PgSQL 14 cycle? Or is it more like PgSQL 15?
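
As for the POSIX_FADV_WILLNEED hint on a future WAL segment mentioned in the 
quoted text above, a minimal sketch could be as simple as this (the helper name 
and the idea of opening the segment separately are just illustration):

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <unistd.h>

/* hint the kernel that we will soon read the whole next WAL segment */
static void
hint_next_wal_segment(const char *path)
{
    int     fd = open(path, O_RDONLY);

    if (fd >= 0)
    {
        (void) posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);    /* len 0 = to EOF */
        close(fd);
    }
}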

-Jakub Wartak

[1] - 
https://www.postgresql.org/message-id/VI1PR0701MB6960EEB838D53886D8A180E3F6520%40VI1PR0701MB6960.eurprd07.prod.outlook.com
 please see profile after "0001+0002 s_b(at)128MB: 85.392"



Re: Handing off SLRU fsyncs to the checkpointer

2020-08-28 Thread Jakub Wartak
ckpoint before the DB was open for users (it had to write 269462 buffers =~ 2GB 
=~ 140MB/s, which is close to the xvda device speed): the complete output in 
14master looks similar to this:

1598609928.620 startup 22543 LOG:  redo done at 1/12201C88
1598609928.624 checkpointer 22541 LOG:  checkpoint starting: end-of-recovery 
immediate wait
1598609944.908 checkpointer 22541 LOG:  checkpoint complete: wrote 269462 
buffers (8.6%); 0 WAL file(s) added, 0 removed, 273 recycled; write=15.145 s, 
sync=0.138 s, total=16.285 s; sync files=11, longest=0.133 s, average=0.012 s; 
distance=4468855 kB, estimate=4468855 kB
1598609944.912 postmaster 22538 LOG:  database system is ready to accept 
connections

s_b@24000MB: 39.96, 0, with end_of_recovery_checkpoint_wait = off - the DB is 
open 15s earlier

1598610331.556 startup 29499 LOG:  redo done at 1/12201C88
1598610331.559 checkpointer 29497 LOG:  checkpoint starting: immediate force
1598610331.562 postmaster 29473 LOG:  database system is ready to accept 
connections
1598610347.202 checkpointer 29497 LOG:  checkpoint complete: wrote 269462 
buffers (8.6%); 0 WAL file(s) added, 0 removed, 273 recycled; write=15.092 s, 
sync=0.149 s, total=15.643 s; sync files=12, longest=0.142 s, average=0.012 s; 
distance=4468855 kB, estimate=4468855 kB

I suppose a checkpoint for large shared_buffers (hundreds of GB) might take a 
lot of time, and this 0002 patch bypasses that. I would find it quite useful in 
some scenarios (e.g. testing backups, PITR recoveries, opening a DB from storage 
snapshots / storage replication, maybe with DWH-after-crash too).

TEST bgwriterPOC2b: FYI, I was also testing the hot_standby code path -- to see 
if it would reduce the time of starting / opening a fresh standby for read-only 
queries, but this parameter doesn't seem to influence that in my tests. As I've 
learned, it's apparently much more complex to reproduce what I'm after and 
involves a lot of reading about LogStandbySnapshot() / standby recovery points 
on my side.

Now, back to smgropen() hash_search_by_values() reproducer...

-Jakub Wartak.




Re: Handing off SLRU fsyncs to the checkpointer

2020-08-27 Thread Jakub Wartak
Hi Alvaro, Thomas, hackers

>> 14.69%  postgres  postgres[.] hash_search_with_hash_value
>> ---hash_search_with_hash_value
>>|--9.80%--BufTableLookup
[..]
>> --4.90%--smgropen
>>   |--2.86%--ReadBufferWithoutRelcache
> Looking at an earlier report of this problem I was thinking whether it'd
> make sense to replace SMgrRelationHash with a simplehash table; I have a
> half-written patch for that, but I haven't completed that work.
> However, in the older profile things were looking different, as
> hash_search_with_hash_value was taking 35.25%, and smgropen was 33.74%
> of it.  BufTableLookup was also there but only 1.51%.  So I'm not so
> sure now that that'll pay off as clearly as I had hoped.

Yes, quite frankly my expectation was to see the 
hash_search_with_hash_value()<-smgropen() outcome as the 1st one, but in the 
simplified redo-bench script that's not the case. The original scenario was much 
more complex, with plenty of differences (in no particular order: TB-sized DB vs 
~500GB RAM -> thousands of forks, multiple tables, huge btrees, multiple INSERTs 
with plenty of data in VALUES() thrown in as one commit, real 
primary->hot-standby replication [not a closed DB in recovery], sorted rather 
than random UUIDs) - I'm going to try to nail down these differences and maybe 
I'll manage to produce a more realistic "pgbench reproducer" (this may take some 
time though).

-Jakub Wartak.


