Re: Some performance degradation in REL_16 vs REL_15

2023-11-15 Thread Andres Freund
Hi,

On 2023-11-15 10:09:06 -0500, Tom Lane wrote:
> "Anton A. Melnikov"  writes:
> > I can't understand why I get the opposite results on my PC and on the
> > server. It is clear that the absolute TPS values will be different for
> > various configurations. This is normal. But the relative differences?
> > It is unlikely that some kind of reference configuration is needed to
> > accurately measure the difference in performance. Probably something is
> > wrong with my PC, but for now I cannot figure out what.
>
> > Would be very grateful for any advice or comments to clarify this problem.
>
> Benchmarking is hard :-(.

Indeed.


> IME it's absolutely typical to see variations of a couple of percent even
> when "nothing has changed", for example after modifying some code that's
> nowhere near any hot code path for the test case.  I usually attribute this
> to cache effects, such as a couple of bits of hot code now sharing or not
> sharing a cache line.

FWIW, I think we're overusing that explanation in our community. Of course you
can encounter things like this, but the replacement policies of cpu caches
have gotten a lot better and the caches have gotten bigger too.

IME this kind of thing is typically dwarfed by much bigger variations from
things like

- cpu scheduling - whether the relevant pgbench thread is colocated on the
  same core as the relevant backend can make a huge difference,
  particularly when CPU power saving modes are not disabled (a pinning
  sketch follows after this list).  Just looking at tps from a fully cached
  readonly pgbench, with a single client:

  Power savings enabled, same core:
  37493

  Power savings enabled, different core:
  28539

  Power savings disabled, same core:
  38167

  Power savings disabled, different core:
  37365


- can transparent huge pages be used for the executable mapping, or not

  On newer kernels, Linux (and some filesystems) can use huge pages for the
  executable. To what degree that succeeds is a large factor in performance
  (a quick way to check is sketched after this list).

  Single threaded read-only pgbench

  postgres mapped without huge pages:
  37155 TPS

  with 2MB of postgres as huge pages:
  37695 TPS

  with 6MB of postgres as huge pages:
  42733 TPS

  The really annoying thing about this is that it's entirely unpredictable
  whether huge pages are used or not. Building the same way, sometimes 0,
  sometimes 2MB, sometimes 6MB end up mapped huge, even though the on-disk
  contents are precisely the same.  And it can even change without
  rebuilding, if the binary is evicted from the page cache.

  This alone makes benchmarking extremely annoying. It basically can't be
  controlled and has huge effects.


- How long has the server been started

  If, e.g., in one run your benchmark happens to use the first connection to
  a database after a restart, and in another run it does not (because, say,
  autovacuum started up beforehand), you can get a fairly different memory
  layout and cache situation, due to [not] using the relcache init file. In
  the one case you'll end up with a populated catcache, in the other not.

  Another mean one is whether you start your benchmark within a relatively
  short time of the server starting. Readonly pgbench with a single client,
  started immediately after the server:

  progress: 12.0 s, 37784.4 tps, lat 0.026 ms stddev 0.001, 0 failed
  progress: 13.0 s, 37779.6 tps, lat 0.026 ms stddev 0.001, 0 failed
  progress: 14.0 s, 37668.2 tps, lat 0.026 ms stddev 0.001, 0 failed
  progress: 15.0 s, 32133.0 tps, lat 0.031 ms stddev 0.113, 0 failed
  progress: 16.0 s, 37564.9 tps, lat 0.027 ms stddev 0.012, 0 failed
  progress: 17.0 s, 37731.7 tps, lat 0.026 ms stddev 0.001, 0 failed

  There's a dip at 15s, odd - turns out that's due to bgwriter writing a WAL
  record, which triggers walwriter to write it out and then initialize the
  whole of wal_buffers with 0s - happens once.  In this case I've exaggerated
  the effect a bit by using a 1GB wal_buffers, but it's visible otherwise too.
  Whether your benchmark period includes that dip or not adds a fair bit of
  noise.

  You can even see the effects of autovacuum workers launching - even if
  there's nothing to do!  Not as a huge dip, but enough to add some "run to
  run" variation.


- How much other dirty data is there in the kernel pagecache. If you e.g. just
  built a new binary, even with just minor changes, the kernel will need to
  flush those pages eventually, which may contend for IO and increase page
  faults (see the sketch after this list).

  Rebuilding an optimized build generates something like 1GB of dirty
  data. Particularly with ccache, that'll typically not yet be flushed by the
  time you run a benchmark. That's not nothing, even with a decent NVMe SSD.

- many more, unfortunately
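
As a concrete illustration of the scheduling point above: pinning both the
server and pgbench explicitly takes that variable out of the equation. A
minimal sketch (core numbers and the database name are examples, not the
ones used for the measurements above):

  # pin the postmaster (and therefore all backends) to core 10
  numactl --physcpubind=10 pg_ctl -D $PGDATA -l logfile start

  # pgbench on the same core ...
  numactl --physcpubind=10 pgbench -n -S -M prepared -c 1 -T 10 bench

  # ... and on a different core, for comparison
  numactl --physcpubind=11 pgbench -n -S -M prepared -c 1 -T 10 bench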
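
For the huge-page point, whether the postgres executable actually ended up
mapped with huge pages can be checked from the running server's smaps, e.g.
(a sketch; needs a reasonably recent kernel for smaps_rollup and the
FilePmdMapped field):

  # non-zero FilePmdMapped means file-backed (incl. executable) mappings use huge pages
  grep FilePmdMapped /proc/$(head -1 $PGDATA/postmaster.pid)/smaps_rollup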
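
And for the dirty-pagecache point, it's cheap to check and flush whatever is
pending before starting a run (sketch):

  grep -E '^(Dirty|Writeback):' /proc/meminfo   # how much is waiting to be written back
  sync                                          # force it out before benchmarking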

Greetings,

Andres Freund




Re: Some performance degradation in REL_16 vs REL_15

2023-11-15 Thread Andres Freund
Hi,

On 2023-11-15 11:33:44 +0300, Anton A. Melnikov wrote:
> The configure options and test scripts on my pc and server were the same:
> export CFLAGS="-O2"
> ./configure --enable-debug --with-perl --with-icu --enable-depend 
> --enable-tap-tests
> #reinstall
> #reinitdb
> #create database bench
> for ((i=0; i<100; i++)); do pgbench -U postgres -i -s8 bench> /dev/null 2>&1;
> psql -U postgres -d bench -c "checkpoint"; RES=$(pgbench -U postgres -c6 -T20 
> -j6 bench;

Even with scale 8 you're likely significantly impacted by contention. And
obviously WAL write latency. See below for why that matters.



> I can't understand why I get the opposite results on my PC and on the
> server. It is clear that the absolute TPS values will be different for
> various configurations. This is normal. But the relative differences?
> It is unlikely that some kind of reference configuration is needed to
> accurately measure the difference in performance. Probably something is
> wrong with my PC, but for now I cannot figure out what.

One very common reason for symptoms like this is power-saving measures by the
CPU. In workloads where the CPU is not meaningfully utilized, the CPU will go
into a power-saving mode - which can cause workloads that are latency sensitive
to be badly affected.  Both because initially the CPU will just work at a
lower frequency and because it takes time to shift to a higher frequency.


Here's an example:
I bound the server and psql to the same CPU core (nothing else is allowed to
use that core) and ran the following:

\o /dev/null
SELECT 1; SELECT 1; SELECT 1; SELECT pg_sleep(0.1); SELECT 1; SELECT 1; SELECT 1;
Time: 0.181 ms
Time: 0.085 ms
Time: 0.071 ms
Time: 100.474 ms
Time: 0.153 ms
Time: 0.077 ms
Time: 0.069 ms

You can see how the first query timing was slower, the next two were faster,
and then after the pg_sleep() it's slow again.


# tell the CPU to optimize for performance not power
cpupower frequency-set --governor performance

# disable going to lower power states
cpupower idle-set -D0

# disable turbo mode for consistent performance
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo

Now the timings are:
Time: 0.038 ms
Time: 0.028 ms
Time: 0.025 ms
Time: 1000.262 ms (00:01.000)
Time: 0.027 ms
Time: 0.024 ms
Time: 0.023 ms

Look, fast and reasonably consistent timings.

Switching back:
Time: 0.155 ms
Time: 0.123 ms
Time: 0.074 ms
Time: 1001.235 ms (00:01.001)
Time: 0.120 ms
Time: 0.077 ms
Time: 0.068 ms
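
(Switching back presumably just means undoing the three settings above,
something along these lines - an assumption, the exact commands aren't shown
here:)

cpupower frequency-set --governor powersave
cpupower idle-set -E
echo 0 > /sys/devices/system/cpu/intel_pstate/no_turbo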


The perverse thing is that this often means that *reducing* the number of
instructions executed yields *worse* behaviour under non-sustained load,
because from the CPU's point of view there is less need to increase the clock
speed.


To show how much of a difference that can make, I ran pgbench with a single
client on one core, and the server on another (so the CPU is idle in between):
numactl --physcpubind 11 pgbench -n -M prepared -P1 -S -c 1 -T10

With power optimized configuration:
latency average = 0.035 ms
latency stddev = 0.002 ms
initial connection time = 5.255 ms
tps = 28434.334672 (without initial connection time)

With performance optimized configuration:
latency average = 0.025 ms
latency stddev = 0.001 ms
initial connection time = 3.544 ms
tps = 40079.995935 (without initial connection time)

That's a whopping 1.4x in throughput!


Now, the same thing, except that I used a custom workload where pgbench
transactions are executed in a pipelined fashion, 100 read-only transactions
in one script execution (a sketch of such a script is shown further below):

With power optimized configuration:
latency average = 1.055 ms
latency stddev = 0.125 ms
initial connection time = 6.915 ms
tps = 947.985286 (without initial connection time)

(this means we actually executed 94798.5286 readonly pgbench transactions/s)

With performance optimized configuration:
latency average = 1.376 ms
latency stddev = 0.083 ms
initial connection time = 3.759 ms
tps = 726.849018 (without initial connection time)

Suddenly the super-duper performance-optimized settings are worse (but note
that stddev is down)! I suspect the problem is that now, because we disabled
idle states, the CPU ends up clocking *lower*, due to power usage.
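
The exact script isn't shown above, but a workload like that can be
approximated with pgbench's pipeline meta-commands, roughly like this (a
sketch, not the script actually used; \startpipeline requires -M extended or
-M prepared):

# generate a script with 100 pipelined read-only statements
{
  printf '%s\n' '\set aid random(1, 100000 * :scale)'
  printf '%s\n' '\startpipeline'
  for i in $(seq 1 100); do
    printf '%s\n' 'SELECT abalance FROM pgbench_accounts WHERE aid = :aid;'
  done
  printf '%s\n' '\endpipeline'
} > pipeline.sql

numactl --physcpubind 11 pgbench -n -M prepared -P1 -f pipeline.sql -c 1 -T10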

If I just change the relevant *cores* to the performance optimized
configuration:

cpupower -c 10,11 idle-set -D0
cpupower -c 10,11 frequency-set --governor performance

latency average = 0.940 ms
latency stddev = 0.061 ms
initial connection time = 3.311 ms
tps = 1063.719116 (without initial connection time)

It wins again.


Now, realistically you'd never use -D0 (i.e. disabling idle states entirely,
not just the deeper ones) - the power differential is quite big and, as shown
here, it can hurt performance as well.

On an idle system, looking at the cpu power usage with:
powerstat -D -R 5 1000

  Time    User  Nice   Sys  Idle    IO  Run Ctxt/s  IRQ/s Fork Exec Exit  Watts  pkg-0   dram  pkg-1
09:45:03   0.6   0.0   0.2  99.2   0.0    1   2861   2823    0    0    0  46.84  24.82   3.68  18.33
09:45:08   1.0   0.0   0.1  99.0

Re: Some performance degradation in REL_16 vs REL_15

2023-11-15 Thread Tom Lane
"Anton A. Melnikov"  writes:
> I can't understand why I get the opposite results on my PC and on the
> server. It is clear that the absolute TPS values will be different for
> various configurations. This is normal. But the relative differences?
> It is unlikely that some kind of reference configuration is needed to
> accurately measure the difference in performance. Probably something is
> wrong with my PC, but for now I cannot figure out what.

> Would be very grateful for any advice or comments to clarify this problem.

Benchmarking is hard :-(.  IME it's absolutely typical to see
variations of a couple of percent even when "nothing has changed",
for example after modifying some code that's nowhere near any
hot code path for the test case.  I usually attribute this to
cache effects, such as a couple of bits of hot code now sharing or
not sharing a cache line.  If you use two different compiler versions
then that situation is likely to occur all over the place even with
exactly the same source code.  NUMA creates huge reproducibility
problems too on multisocket machines (which your server is IIUC).
When I had a multisocket workstation I'd usually bind all the server
processes to one socket if I wanted more-or-less-repeatable numbers.
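
For reference, one way to do that kind of binding is to start the postmaster
under numactl, e.g. (illustrative; node 0 chosen arbitrarily):

numactl --cpunodebind=0 --membind=0 pg_ctl -D $PGDATA -l logfile start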

I wouldn't put a lot of faith in the idea that measured pgbench
differences of up to several percent are meaningful at all,
especially when comparing across different hardware and different
OS+compiler versions.  There are too many variables that have
little to do with the theoretical performance of the source code.

regards, tom lane




Re: Some performance degradation in REL_16 vs REL_15

2023-11-15 Thread Anton A. Melnikov

On 30.10.2023 22:51, Andres Freund wrote:


There's really no point in comparing performance with assertions enabled
(leaving aside assertions that cause extreme performance differences, making
development harder). We very well might have added assertions making things
more expensive, without affecting performance in optimized/non-assert builds.



Thanks for the advice! I repeated the measurements on my PC without asserts
and with CFLAGS="-O2".
Also I reduced the number of clients to -c6 to leave a reserve of two cores
on my 8-core CPU, and used -j6 accordingly.

The results were similar: on my PC REL_10_STABLE (c18c12c9) was faster than
REL_16_STABLE (07494a0d), but the effect became weaker:
REL_10_STABLE gives ~965+-15 TPS (+-2%) while REL_16_STABLE gives ~920+-30
TPS (+-3%) in the test: pgbench -s8 -c6 -T20 -j6
So 10 is faster than 16 by ~5%. (See raw-my-pc.txt attached for the raw data.)

Then, thanks to my colleagues, I carried out similar measurements on a more
powerful 24-core standalone server.
There REL_10_STABLE gives 8260+-100 TPS (+-1%) while REL_16_STABLE gives
8580+-90 TPS (+-1%) in the same test: pgbench -s8 -c6 -T20 -j6
The test gave the opposite result: on that server, 16 is faster than 10 by ~4%.

When I scaled the test on the server to keep the same reserve of two cores,
the results became:
REL_10_STABLE gives ~16000+-300 TPS (+-2%) while REL_16_STABLE gives
~18500+-200 TPS (+-1%) in the scaled test: pgbench -s24 -c22 -T20 -j22
Here the difference is more noticeable: 16 is faster than 10 by ~15%.
(raw-server.txt)

The configure options and test scripts on my pc and server were the same:
export CFLAGS="-O2"
./configure --enable-debug --with-perl --with-icu --enable-depend 
--enable-tap-tests
#reinstall
#reinitdb
#create database bench
for ((i=0; i<100; i++)); do pgbench -U postgres -i -s8 bench> /dev/null 2>&1;
psql -U postgres -d bench -c "checkpoint"; RES=$(pgbench -U postgres -c6 -T20 
-j6 bench;

Configurations:
my pc:  8-core AMD Ryzen 7 4700U @ 1.4GHz, 64GB RAM, NVMe M.2 SSD drive.
Linux 5.15.0-88-generic #98~20.04.1-Ubuntu SMP Mon Oct 9 16:43:45 UTC 2023 
x86_64 x86_64 x86_64 GNU/Linux
server: 2x 12-hyperthreaded cores Intel(R) Xeon(R) CPU X5675 @ 3.07GHz, 24GB 
RAM, RAID from SSD drives.
Linux 5.10.0-21-amd64 #1 SMP Debian 5.10.162-1 (2023-01-21) x86_64 GNU/Linux

I can't understand why I get the opposite results on my PC and on the server.
It is clear that the absolute TPS values will be different for various
configurations. This is normal. But the relative differences?
It is unlikely that some kind of reference configuration is needed to
accurately measure the difference in performance. Probably something is wrong
with my PC, but for now I cannot figure out what.

Would be very grateful for any advice or comments to clarify this problem.

With the best wishes!

--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

REL_10_STABLE c18c12c983a84d55e58b176969782c7ffac3272b
pgbench -s8 -c6 -T20 -j6
940.47198
954.621585
902.319686
965.387517
959.44536
970.882218
922.012141
969.642272
964.549628
935.639076
958.835093
976.912892
975.618375
960.599515
981.900039
973.34447
964.563699
960.321335
962.643262
975.631214
971.78315
965.226256
961.106572
968.520002
973.825485
978.49579
963.863368
973.906058
966.676175
965.186708
954.572371
977.620229
981.419347
962.751969
963.676599
967.966257
974.68403
955.342462
957.832817
984.065968
972.364891
977.489316
957.352265
969.463156
974.320994
949.679765
969.081674
963.190554
962.394456
966.84177
975.840044
954.471689
977.764019
968.67597
963.203923
964.752374
965.957151
979.17749
915.412491
975.120789
962.105916
980.343235
957.180492
953.552183
979.783099
967.906392
966.926945
962.962301
965.53471
971.030289
954.21045
961.266889
973.367193
956.736464
980.317352
911.188865
979.274233
980.267316
982.029926
977.731543
937.327052
978.161778
978.575841
962.661776
914.896072
966.902901
973.539272
980.418576
966.073472
963.196341
962.718863
977.062467
958.303102
959.937588
959.52382
934.876632
971.899844
979.71
964.154208
960.051284

REL_16_STABLE 07494a0df9a66219e5f1029de47ecabc34c9cb55
pgbench -s8 -c6 -T20 -j6
952.061203
905.964458
921.009294
921.970342
924.810464
935.988344
917.110599
925.62075
933.423024
923.445651
932.740532
927.994569
913.773152
922.955946
917.680486
923.145074
925.133017
922.36253
907.656249
927.980182
924.249294
933.355461
923.359649
919.694726
923.178731
929.250348
921.643735
916.546247
930.960814
913.333819
773.157522
945.293028
924.839608
932.228796
912.846096
924.01411
924.341422
909.823505
882.105606
920.337305
930.297982
909.238148
880.839643
939.582115
927.263785
927.921499
932.897521
931.77316
943.261293
853.433421
921.813303
916.463003
919.652647
914.662188
912.137913
923.279822
922.967526
936.344334
946.281347
801.718759
950.571673
928.845848
888.181388
885.603875
939.763546
896.841216
934.904546
929.369005
884.065433
874.953048
933.411683
930.654935
952.833611
942.193108
930.491705

Re: Some performance degradation in REL_16 vs REL_15

2023-10-30 Thread Andres Freund
Hi,

On 2023-10-30 15:28:53 +0300, Anton A. Melnikov wrote:
> For REL_16_STABLE at 7cc2f59dd the average TPS was: 2020+-70,
> for REL_10_STABLE at c18c12c98 - 2260+-70
> 
> The percentage difference was approximately 11%.
> Please see the 16vs10.png picture with the graphical representation of the 
> data obtained.
> Also there are the raw data in the raw_data_s21.txt.
> 
> In a few days I hope to perform the additional measurements that were
> mentioned above in this letter.
> It would be interesting to establish the reason for this difference. And I
> would be very grateful if you could advise me what other settings can be
> tweaked.

There's really no point in comparing performance with assertions enabled
(leaving aside assertions that cause extreme performance differences, making
development harder). We very well might have added assertions making things
more expensive, without affecting performance in optimized/non-assert builds.

Greetings,

Andres Freund




Re: Some performance degradation in REL_16 vs REL_15

2023-10-17 Thread Tom Lane
邱宇航  writes:
> I wrote a script and tested branches REL_[10-16]_STABLE, and do see a
> performance drop in REL_13_STABLE, which is about 1~2%.

I'm really skeptical that we should pay much attention to these numbers.
You've made several of the mistakes that we typically tell people not to
make when using pgbench:

* scale <= number of sessions means you're measuring a lot of
row-update contention

* once you crank up the scale enough to avoid that problem, running
with the default shared_buffers seems like a pretty poor choice

* 10-second runtime is probably an order of magnitude too small
to get useful, reliable numbers
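
For illustration, a run that avoids those pitfalls might look roughly like
this (values are examples for a mid-sized machine, not recommendations from
this thread):

* initialize with a scale well above the client count:
  pgbench -i -s 100 bench
* set a non-default shared_buffers in postgresql.conf, e.g. shared_buffers = '8GB'
* run for minutes, not seconds: pgbench -c 20 -j 8 -T 300 -P 10 bench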

On top of all that, discrepancies on the order of a percent or two
commonly arise from hard-to-control-for effects like the cache
alignment of hot spots in different parts of the code.  That means
that you can see changes of that size from nothing more than
day-to-day changes in completely unrelated parts of the code.

I'd get excited about say a 10% performance drop, because that's
probably more than noise; but I'm not convinced that any of the
differences you show here are more than noise.

regards, tom lane




Re: Some performance degradation in REL_16 vs REL_15

2023-10-17 Thread Andres Freund
Hi,

On 2023-10-16 11:04:25 +0300, Anton A. Melnikov wrote:
> On 13.10.2023 05:05, Andres Freund wrote:
> > Could you provide a bit more details about how you ran the benchmark?  The
> > reason I am asking is that ~330 TPS is pretty slow for -c20.  Even on
> > spinning rust and using the default settings, I get considerably higher
> > results.
> > 
> > Oh - I do get results closer to yours if I use pgbench scale 1, causing a
> > lot of row level contention. What scale did you use?
> 
> 
> I use the default scale of 1.

That means you're largely going to be bottlenecked due to row level
contention. For read/write pgbench you normally want to use a scale that's
bigger than the client count, best by at least 2x.

Have you built postgres with assertions enabled or such?

What is the server configuration for both versions?


> And run the command sequence:
> $pgbench -i bench
> $sleep 1
> $pgbench -c20 -T10 -j8

I assume you also specify the database name here, given you specified it for
pgbench -i?

As you're not doing a new initdb here, the state of the cluster will
substantially depend on what has run before. This can matter substantially
because a cluster with prior substantial write activity will already have
initialized WAL files and can reuse them cheaply, whereas one without that
activity needs to initialize new files.  Although that matters a bit less with
scale 1, because there's just not a whole lot of writes.

At the very least you should trigger a checkpoint before or after pgbench
-i. The performance between having a checkpoint during the pgbench run or not
is substantially different, and if you're not triggering one explicitly, it'll
be up to random chance whether it happens during the run or not. It's less
important if you run pgbench for an extended time, but if you do it just for
10s...

E.g. on my workstation, if there's no checkpoint, I get around 633 TPS across
repeated runs, but if there's a checkpoint between pgbench -i and the pgbench
run, it's around 615 TPS.
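
Putting that together, a sketch of a less chance-dependent sequence (scale
and duration are illustrative):

pgbench -i -s 50 bench
psql -d bench -c 'CHECKPOINT;'    # make the checkpoint explicit, not up to random chance
pgbench -c 20 -j 8 -T 60 bench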

Greetings,

Andres Freund




Re: Some performance degradation in REL_16 vs REL_15

2023-10-17 Thread 邱宇航
I wrote a script and tested branches REL_[10-16]_STABLE, and do see a
performance drop in REL_13_STABLE, which is about 1~2%.

scale   round   10  11  12  13  14  15  16
1   1   7922.2  8018.3  8102.8  7838.3  7829.2  7870.0  7846.1
2   7922.4  7923.5  8090.3  7887.7  7912.4  7815.2  7865.6
3   7937.6  7964.9  8012.8  7918.5  7879.4  7786.4  7981.1
4   8000.4  7959.5  8141.1  7886.3  7840.9  7863.5  8022.4
5   7921.8  7945.5  8005.2  7993.7  7957.0  7803.8  7899.8
6   7893.8  7895.1  8017.2  7879.8  7880.9  7911.4  7909.2
7   7879.3  7853.5  8071.7  7956.2  7876.7  7863.3  7986.3
8   7980.5  7964.1  8119.2  8015.2  7877.6  7784.9  7923.6
9   8083.9  7946.4  7960.3  7913.9  7924.6  7867.7  7928.6
10  7971.2  7991.8  7999.5  7812.4  7824.3  7831.0  7953.4
AVG 7951.3  7946.3  8052.0  7910.2  7880.3  7839.7  7931.6
MED 7930.0  7952.9  8044.5  7900.8  7878.5  7847.1  7926.1
10  1   41221.5 41394.8 40926.8 40566.6 41661.3 40511.9 40961.8
2   40974.0 40697.9 40842.4 40269.2 41127.7 40795.5 40814.9
3   41453.5 41426.4 41066.2 40890.9 41018.6 40897.3 40891.7
4   41691.9 40294.9 41189.8 40873.8 41539.7 40943.2 40643.8
5   40843.4 40855.5 41243.8 40351.3 40863.2 40839.6 40795.5
6   40969.3 40897.9 41380.8 40734.7 41269.3 41301.0 41061.0
7   40981.1 41119.5 41158.0 40834.6 40967.1 40790.6 41061.6
8   41006.4 41205.9 40740.3 40978.7 40742.4 40951.6 41242.1
9   41089.9 41129.7 40648.3 40622.1 40782.0 40460.5 40877.9
10  41280.3 41462.7 41316.4 40728.0 40983.9 40747.0 40964.6
AVG 41151.1 41048.5 41051.3 40685.0 41095.5 40823.8 40931.5
MED 41048.2 41124.6 41112.1 40731.3 41001.3 40817.6 40926.7
100 1   43429.0 43190.2 44099.3 43941.5 43883.3 44215.0 44604.9
2   43281.7 43795.2 44963.6 44331.5 43559.7 43571.5 43403.9
3   43749.0 43614.1 44616.7 43759.5 43617.8 43530.3 43362.4
4   43362.0 43197.3 44296.7 43692.4 42020.5 43607.3 43081.8
5   43373.4 43288.0 44240.9 43795.0 43630.6 43576.7 43512.0
6   43637.0 43385.2 45130.1 43792.5 43635.4 43905.2 43371.2
7   43621.2 43474.2 43735.0 43592.2 43889.7 43947.7 43369.8
8   43351.0 43937.5 44285.6 43877.2 43771.1 43879.1 43680.4
9   43481.3 43700.5 44119.9 43786.9 43440.8 44083.1 43563.2
10  43238.7 43559.5 44310.8 43406.0 44306.6 43376.3 43242.7
AVG 43452.4 43514.2 44379.9 43797.5 43575.6 43769.2 43519.2
MED 43401.2 43516.8 44291.2 43789.7 43633.0 43743.2 43387.5

The script looks like:
initdb data >/dev/null 2>&1 #initdb on every round
pg_ctl -D data -l logfile start >/dev/null 2>&1 #start without changing any setting
pgbench -i -s $scale postgres >/dev/null 2>&1
sleep 1 >/dev/null 2>&1
pgbench -c20 -T10 -j8

And here is the pg_config output:
...
CONFIGURE =  '--enable-debug' '--prefix=/home/postgres/base' '--enable-depend' 
'PKG_CONFIG_PATH=/usr/local/lib64/pkgconfig::/usr/lib/pkgconfig'
CC = gcc
CPPFLAGS = -D_GNU_SOURCE
CFLAGS = -Wall -Wmissing-prototypes -Wpointer-arith 
-Wdeclaration-after-statement -Werror=vla -Wendif-labels 
-Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type 
-Wshadow=compatible-local -Wformat-security -fno-strict-aliasing -fwrapv 
-fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -g 
-O2
CFLAGS_SL = -fPIC
LDFLAGS = -Wl,--as-needed 
-Wl,-rpath,'/home/postgres/base/lib',--enable-new-dtags
LDFLAGS_EX = 
LDFLAGS_SL = 
LIBS = -lpgcommon -lpgport -lz -lreadline -lpthread -lrt -ldl -lm 
VERSION = PostgreSQL 16.0

--
Yuhang Qiu

Re: Some performance degradation in REL_16 vs REL_15

2023-10-16 Thread Anton A. Melnikov

On 13.10.2023 05:05, Andres Freund wrote:

Could you provide a bit more details about how you ran the benchmark?  The
reason I am asking is that ~330 TPS is pretty slow for -c20.  Even on spinning
rust and using the default settings, I get considerably higher results.

Oh - I do get results closer to yours if I use pgbench scale 1, causing a lot
of row level contention. What scale did you use?



I use the default scale of 1.
And run the command sequence:
$pgbench -i bench
$sleep 1
$pgbench -c20 -T10 -j8
in a loop to get similar initial conditions for every "pgbench -c20 -T10 -j8" 
run.

Thanks for your interest!

With the best wishes,

--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Some performance degradation in REL_16 vs REL_15

2023-10-12 Thread Andres Freund
Hi,

On 2023-10-12 11:00:22 +0300, Anton A. Melnikov wrote:
> Found that simple test pgbench -c20 -T20 -j8 gives approximately
> for REL_15_STABLE at 5143f76:  336+-1 TPS
> and
> for REL_16_STABLE at 4ac7635f: 324+-1 TPS
> 
> The performance drop is approximately 3.5%, while the corrected standard
> deviation is only 0.3%.
> See the raw_data.txt attached.

Could you provide a bit more details about how you ran the benchmark?  The
reason I am asking is that ~330 TPS is pretty slow for -c20.  Even on spinning
rust and using the default settings, I get considerably higher results.

Oh - I do get results closer to yours if I use pgbench scale 1, causing a lot
of row level contention. What scale did you use?

Greetings,

Andres Freund




Re: Some performance degradation in REL_16 vs REL_15

2023-10-12 Thread Michael Paquier
On Thu, Oct 12, 2023 at 09:20:36PM +1300, David Rowley wrote:
> It would be interesting to know what's to blame here and if you can
> attribute it to a certain commit.

+1.
--
Michael




Re: Some performance degradation in REL_16 vs REL_15

2023-10-12 Thread David Rowley
On Thu, 12 Oct 2023 at 21:01, Anton A. Melnikov  wrote:
>
> Greetings!
>
> Found that simple test pgbench -c20 -T20 -j8 gives approximately
> for REL_15_STABLE at 5143f76:  336+-1 TPS
> and
> for REL_16_STABLE at 4ac7635f: 324+-1 TPS
>
> And is it worth spending time bisecting for the commit where this degradation 
> may have occurred?

It would be interesting to know what's to blame here and if you can
attribute it to a certain commit.

David