Re: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?

2019-01-29 Thread Jeff Janes
On Mon, Jan 28, 2019 at 12:03 AM Saurabh Nanda wrote:

> All this benchmarking has led me to a philosophical question, why does PG
> need shared_buffers in the first place?
>

PostgreSQL cannot let the OS get its hands on a dirty shared buffer until
the WAL record "protecting" that buffer has been flushed to disk. If a
dirty shared buffer got written to disk, but then a crash happened before
the WAL record got flushed to disk, the data could be corrupted when the
server comes back up. So shared_buffers effectively serves as a cooling
pond where dirty buffers wait for their WAL to be flushed naturally, so
they can be written without instigating a performance-reducing flush just
for them.
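
One hedged way to watch this behaviour, a sketch using the standard
pg_stat_bgwriter view (the query is my illustration, not something from
this thread): compare how many buffers get written at checkpoint time
against how many are flushed early by the background writer or by
backends.

    $ psql -c "SELECT buffers_checkpoint,  -- written during checkpoints
                      buffers_clean,       -- written early by the bgwriter
                      buffers_backend      -- written by a backend needing a free buffer
               FROM pg_stat_bgwriter;"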

Also, concurrent clients need to access the same disk pages at overlapping
times without corrupting each other.  Perhaps that could be implemented by
keeping just the buffer headers in shared memory to coordinate the locking,
without keeping the buffers themselves in shared memory.  But that is not
how it is currently implemented.


> What's wrong with letting the OS do the caching/buffering?
>

Nothing, and that is what it does.  Which is why the advice for
shared_buffers is often to use a small fraction of RAM, leaving the rest
for the OS to do its thing.  But PostgreSQL still needs a way to lock those
pages, both against concurrent access by its own clients, and against
getting flushed out of order by the OS.  There is no performant way to
release the dirty pages immediately to the OS while still constraining the
order in which the OS flushes them to disk.

Finally, while reading a page from the OS cache into shared_buffers is much
faster than reading it from disk, it is still much slower than finding it
already located in shared_buffers.  So if your entire database fits in RAM,
you will get better performance if shared_buffers is large enough for the
entire thing to fit in there, as well.  This is an exception to the rule
that shared_buffers should be a small fraction of RAM.
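
As a sketch of how you might check whether it fits (pg_buffercache is a
standard contrib extension; the exact query is my assumption, not
something from this thread):

    $ psql -c "CREATE EXTENSION IF NOT EXISTS pg_buffercache;"
    $ psql -c "SELECT count(*) FILTER (WHERE relfilenode IS NOT NULL) AS buffers_used,
                      count(*) AS buffers_total
               FROM pg_buffercache;"
    # If buffers_used stays well below buffers_total under load, the
    # working set fits in shared_buffers with room to spare.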


> Isn't it optimised for this kind of stuff?
>

Maybe.  But you might be surprised at how poorly optimized it is.  It
depends on your OS and version of it, of course.  If you have a high
usage_count buffer which is re-dirtied constantly, it will only get written
and flushed to disk once per checkpoint if under PostgreSQL control.  But
I've seen pages like that get written many times per second under kernel
control.  Whatever optimization the kernel tried to do, it wasn't very good
at it.  Also, if many contiguous pages are dirtied in a close time-frame,
but not dirtied in their physical order, the kernel should be able to
re-order them into long sequential writes, correct?  But empirically, it
doesn't, at least back in the late 2.* series kernels when I did the
experiments.  I don't know if it didn't even try, or tried but failed.
(Of course, back then PostgreSQL didn't do a good job of it either.)

Cheers,

Jeff


Re: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?

2019-01-29 Thread Nicolas Charles

On 29/01/2019 at 07:15, Saurabh Nanda wrote:


c) I tried another cloud hosting provider (E2E Networks) and just
the raw performance numbers (with default configuration) are
blowing Hetzner out of the water.


I noticed that on E2E, the root filesystem is mounted with the 
following options:


    /dev/xvda on / type ext4
    (rw,noatime,nodiratime,nobarrier,errors=remount-ro,stripe=512,data=ordered)

whereas on Hetzner, it is mounted with the following options:

    /dev/nvme0n1p3 on / type ext4
    (rw,relatime,data=ordered)

How much of a difference can this have on absolute TPS numbers?



Differences can be significant. noatime never updates the inode access
time, while relatime updates it only when the previous access time is older
than the change or modification time (which can often be the case for a
database).


nobarrier disables block-level write barriers. Barriers ensure that data
is actually stored on the device. The mount man page says: "If disabled on
a device with a volatile (non-battery-backed) write-back cache, the
nobarrier option will lead to filesystem corruption on a system crash or
power loss."


You should probably consider noatime rather than relatime, and whether to
use nobarrier depends on whether you have a battery-backed cache or not.


Also, this is an SSD, so you should TRIM it, either with a periodic
fstrim, or using the discard mount option.
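
A minimal sketch of checking and acting on this (the commands are my
suggestion, not from this thread; fstrim.timer is the systemd mechanism
on Ubuntu-like systems):

    $ findmnt -o TARGET,SOURCE,OPTIONS /        # confirm current mount options
    $ sudo fstrim -v /                          # one-off TRIM of the filesystem
    $ sudo systemctl enable --now fstrim.timer  # schedule periodic TRIM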



Nicolas







Re: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?

2019-01-28 Thread Saurabh Nanda
Yet another update:

a) I've tried everything with my EX41-SSD server on Hetzner and nothing is
increasing the performance over & above the default configuration.
b) I tried commissioning a new EX41-SSD server and was able to replicate
the same pathetic performance numbers.
c) I tried another cloud hosting provider (E2E Networks) and just the raw
performance numbers (with default configuration) are blowing Hetzner out of
the water.

This leads me to believe that my assumption of the first hardware (or SSD)
being faulty is incorrect. Something is wrong with either the EX41-SSD
hardware or the out-of-box configuration. I'm commissioning something from
their PX line (which is marked as "Datacenter Edition") and checking if
that makes things better.

+--------+--------------+------------------+
| client | Hetzner      | E2E Networks     |
|        | EX41-SSD [1] | Cloud Server [2] |
+--------+--------------+------------------+
|      1 | ~160         | ~400             |
+--------+--------------+------------------+
|      4 | ~460         | ~1450            |
+--------+--------------+------------------+
|      8 | ~850         | ~2600            |
+--------+--------------+------------------+
|     12 | ~1200        | ~4000            |
+--------+--------------+------------------+


[1] lshw output for Hetzner -
https://gist.github.com/saurabhnanda/613813d0d58fe1a406a8ce9b62ad10a9
[2] lshw output for E2E -
https://gist.github.com/saurabhnanda/d276603990aa773269bad35f335344eb -
since this is a cloud server, low-level hardware info is not available.
It's advertised as a 9 vCPU + 30GB RAM + SSD cloud instance.

-- Saurabh.


Re: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?

2019-01-28 Thread Saurabh Nanda
>
> Do you know which of the settings is causing the lower TPS?
>
> I suggest checking shared_buffers.
>
> If you haven't done it, disabling THP and KSM can resolve performance
> issues,
> esp. with large RAM like shared_buffers, at least with older kernels.
>
> https://www.postgresql.org/message-id/20170718180152.GE17566%40telsasoft.com


I've tried reducing the number of variables to a bare minimum and have the
following three cases now with RAID disabled:

a) only default settings [1]
b) default settings with shared_buffers=2G [2]
c) default settings with shared_buffers=2G & huge_pages=on [3]

The numbers are still not making any sense whatsoever.

+--------+--------------+----------------+-------------------+
| client | Defaults [1] | buffers=2G [2] | buffers=2G        |
|        |              |                | huge_pages=on [3] |
+--------+--------------+----------------+-------------------+
|      1 | 348-475 (??) | 529-583 (??)   | 155-290           |
+--------+--------------+----------------+-------------------+
|      4 | 436-452      | 451-452        | 388-403           |
+--------+--------------+----------------+-------------------+
|      8 | 862-869      | 859-861        | 778-781           |
+--------+--------------+----------------+-------------------+
|     12 | 1210-1219    | 1220-1225      | 1110-             |
+--------+--------------+----------------+-------------------+



[1] Default settings
 checkpoint_completion_target=0.5
 default_statistics_target=100
 effective_io_concurrency=1
 max_parallel_workers=8
 max_parallel_workers_per_gather=2
 max_wal_size=1024 MB
 max_worker_processes=20
 min_wal_size=80 MB
 random_page_cost=4
* shared_buffers=1024 8kB (= 8MB)*
 wal_buffers=32 8kB
 work_mem=4096 kB

[2] Increased shared_buffers
 checkpoint_completion_target=0.5
 default_statistics_target=100
 effective_io_concurrency=1
 max_parallel_workers=8
 max_parallel_workers_per_gather=2
 max_wal_size=1024 MB
 max_worker_processes=20
 min_wal_size=80 MB
 random_page_cost=4
* shared_buffers=262144 8kB (= 2GB)*
 wal_buffers=2048 8kB
 work_mem=4096 kB

[3] Same settings as [2] with huge_pages=on and the following changes:

 $ cat /sys/kernel/mm/transparent_hugepage/enabled
 always madvise [never]

 $ cat /proc/meminfo |grep -i huge
 AnonHugePages: 0 kB
 ShmemHugePages:0 kB
 HugePages_Total:5000
 HugePages_Free: 3940
 HugePages_Rsvd:1
 HugePages_Surp:0
 Hugepagesize:   2048 kB
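
For reference, a rough sizing sketch (the arithmetic and the value 1300
are mine, not from the thread): with 2MB huge pages, shared_buffers=2GB
needs a bit more than 2GB / 2MB = 1024 huge pages, since the shared memory
segment holds more than just the buffer pool, so the 5000 pages reserved
above leave ample headroom.

    $ grep Hugepagesize /proc/meminfo              # confirm the huge page size
    $ sudo sysctl -w vm.nr_hugepages=1300          # reserve ~2.5GB, with headroom
    $ echo "vm.nr_hugepages=1300" | sudo tee -a /etc/sysctl.conf   # persist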

-- Saurabh.


Re: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?

2019-01-28 Thread Saurabh Nanda
I've disabled transparent huge pages and enabled huge_pages as given below.
Let's see what happens. (I feel like a monkey pressing random buttons
trying to turn a light bulb on... and I'm sure the monkey would've had it
easier!)

AnonHugePages: 0 kB
ShmemHugePages:0 kB
HugePages_Total:5000
HugePages_Free: 4954
HugePages_Rsvd: 1015
HugePages_Surp:0
Hugepagesize:   2048 kB

-- Saurabh.


Re: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?

2019-01-28 Thread Saurabh Nanda
>
> You should probably include the detailed hardware you are working on -
> especially for the SSD, the model can have a big impact, as well as its
> wear.
>

What's the best tool to get meaningful information for SSD drives?

-- Saurabh.


Re: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?

2019-01-28 Thread Nicolas Charles

On 28/01/2019 at 15:03, Saurabh Nanda wrote:
An update. It seems (to my untrained eye) that something is wrong with 
the second SSD in the RAID configuration. Here's my question on 
serverfault related to what I saw with iostat - 
https://serverfault.com/questions/951096/difference-in-utilisation-reported-by-iostat-for-two-identical-disks-in-raid1


I've disabled RAID and rebooted the server to run the benchmarks with 
client=1,4,8,12 with shared_buffers=8MB (default) vs 
shared_buffers=2GB (optimised?) and will report back.



You should probably include the detailed hardware you are working on - 
especially for the SSD, the model can have a big impact, as well as its 
wear.



Nicolas





Re: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?

2019-01-27 Thread Saurabh Nanda
All this benchmarking has led me to a philosophical question, why does PG
need shared_buffers in the first place? What's wrong with letting the OS do
the caching/buffering?  Isn't it optimised for this kind of stuff?


Re: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?

2019-01-27 Thread Jeff Janes
>
>
> You could also try pg_test_fsync to get low-level information, to
>> supplement the high level you get from pgbench.
>
>
> Thanks for pointing me to this tool. Never knew pg_test_fsync existed!
> I've run `pg_test_fsync -s 60` two times and this is the output -
> https://gist.github.com/saurabhnanda/b60e8cf69032b570c5b554eb50df64f8 I'm
> not sure what to make of it?
>

I don't know what to make of that either.  I'd expect fdatasync using two
8kB writes to be about the same throughput as using one 8kB write, but
instead it is 4 times slower.  Also, I'd expect open_datasync to get slower
by a factor of 2, not a factor of 8, when going from one to two 8kB writes
(that is not directly relevant, as you aren't using open_datasync, but is
curious nonetheless).  Is this reproducible with different run lengths?  I
wonder if your write cache (or something) gets "tired" during the first
part of pg_test_fsync and thus degrades the subsequent parts of the test.
I would say something in your IO stack is not optimal, maybe some component
is "consumer grade" rather than "server grade".  Maybe you can ask Hetzner
about that.
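
A quick sketch of testing that reproducibility question (the specific
durations are arbitrary choices of mine):

    $ pg_test_fsync -s 5      # short runs: cache barely stressed
    $ pg_test_fsync -s 60     # the runs reported above
    $ pg_test_fsync -s 120    # long runs: sustained pressure
    # If ops/sec for the later sub-tests drop only in the longer runs,
    # something in the IO stack is degrading under sustained load.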


> The effects of max_wal_size are going to depend on how you have IO
>> configured, for example does pg_wal share the same devices and controllers
>> as the base data?  It is mostly about controlling disk usage and
>> crash-recovery performance, neither of which is of primary importance to
>> pgbench performance.
>
>
>  The WAL and the data-directory reside on the same SSD disk -- is this a
> bad idea?
>

If you are trying to squeeze out every last bit of performance, then I
think it is a bad idea.  Or at least, it is worth trying the alternative
and seeing.  The flushing that occurs during checkpoints and the flushing
that occurs for every commit can interfere with each other.
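
A hedged sketch of trying the alternative (paths assume the Ubuntu 18.04 /
PG 11 layout from this thread; /mnt/wal-disk stands in for a hypothetical
second device):

    $ sudo systemctl stop postgresql
    $ sudo mv /var/lib/postgresql/11/main/pg_wal /mnt/wal-disk/pg_wal
    $ sudo ln -s /mnt/wal-disk/pg_wal /var/lib/postgresql/11/main/pg_wal
    $ sudo chown -h postgres:postgres /var/lib/postgresql/11/main/pg_wal
    $ sudo systemctl start postgresql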


> I was under the impression that smaller values for max_wal_size cause
> pg-server to do "maintenance work" related to wal rotation, etc. more
> frequently and would lead to lower pgbench performance.
>

If you choose ridiculously small values it would.  But once the value is
sufficient, increasing it further wouldn't do much.  Given your low level
of throughput, I would think the default is already sufficient.

Thanks for including the storage info.  Nothing about it stands out to me
as either good or bad, but I'm not a hardware maven; hopefully one will be
reading along and speak up.


> PS: Cc-ing the list back again because I assume you didn't intend for your
> reply to be private, right?
>

Yes, I had intended to include the list but hit the wrong button, sorry.

Cheers,

Jeff

>


Re: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?

2019-01-27 Thread Saurabh Nanda
> Do you know which of the settings is causing the lower TPS?


> I suggest checking shared_buffers.
>

I'm trying to find this, but re-running the benchmarks while changing one
config setting at a time is taking a lot of time. Thanks for the tip
related to shared_buffers.


>
> If you haven't done it, disabling THP and KSM can resolve performance
> issues,
> esp. with large RAM like shared_buffers, at least with older kernels.
>
> https://www.postgresql.org/message-id/20170718180152.GE17566%40telsasoft.com


Is this a well-known performance "hack"? Is there any reason why it is not
mentioned at https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
? Are the stability implications of fiddling with THP and KSM well-known?
Also, wrt KSM, my understanding was that when a process forks, the
process's memory is "copy on write" anyway, right? What other kinds of
pages would end up being de-duplicated by ksmd? (Caveat: This is the first
time I'm hearing about KSM and my knowledge is based off a single reading
of https://www.kernel.org/doc/html/latest/admin-guide/mm/ksm.html )

-- Saurabh.


Re: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?

2019-01-27 Thread Justin Pryzby
On Sun, Jan 27, 2019 at 01:09:16PM +0530, Saurabh Nanda wrote:
> It seems that PGOPTIONS="-c synchronous_commit=off" has a significant
> impact. However, I still cannot understand why the TPS for the optimised
> case is LOWER than the default for higher concurrency levels!

Do you know which of the settings is causing the lower TPS?

I suggest checking shared_buffers.

If you haven't done it, disabling THP and KSM can resolve performance issues,
esp. with large RAM like shared_buffers, at least with older kernels.
https://www.postgresql.org/message-id/20170718180152.GE17566%40telsasoft.com
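
For completeness, a minimal sketch of disabling both at runtime (these are
the standard sysfs interfaces; the settings revert on reboot unless
persisted via your init system or the kernel command line):

    $ echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
    $ echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
    $ echo 0 | sudo tee /sys/kernel/mm/ksm/run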

Justin



Re: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?

2019-01-26 Thread Saurabh Nanda
>
>
> PGOPTIONS="-c synchronous_commit=off" pgbench -T 3600 -P 10 
>
>
> I am currently running all my benchmarks with synchronous_commit=off and
> will get back with my findings.
>


It seems that PGOPTIONS="-c synchronous_commit=off" has a significant
impact. However, I still cannot understand why the TPS for the optimised
case is LOWER than the default for higher concurrency levels!

+--------+---------------------+------------------------+
| client | Mostly defaults [1] | Optimised settings [2] |
+--------+---------------------+------------------------+
|      1 | 80-86               | 169-180                |
+--------+---------------------+------------------------+
|      6 | 350-376             | 1265-1397              |
+--------+---------------------+------------------------+
|     12 | 603-619             | 1746-2352              |
+--------+---------------------+------------------------+
|     24 | 947-1015            | 1869-2518              |
+--------+---------------------+------------------------+
|     48 | 1435-1512           | 1912-2818              |
+--------+---------------------+------------------------+
|     96 | 1769-1811           | 1546-1753              |
+--------+---------------------+------------------------+
|    192 | 1857-1992           | 1332-1508              |
+--------+---------------------+------------------------+
|    384 | 1667-1793           | 1356-1450              |
+--------+---------------------+------------------------+


[1] "Mostly default" settings are whatever ships with Ubuntu 18.04 + PG 11.
A snippet of the relevant setts are given below:

max_connections=400
work_mem=4MB
maintenance_work_mem=64MB
shared_buffers=128MB
temp_buffers=8MB
effective_cache_size=4GB
wal_buffers=-1
wal_sync_method=fsync
max_wal_size=1GB
*autovacuum=off   # Auto-vacuuming was disabled*


[2] An optimised version of the settings was obtained from
https://pgtune.leopard.in.ua/#/ and, along with that, the benchmarks were
run with *PGOPTIONS="-c synchronous_commit=off"*

max_connections = 400
shared_buffers = 8GB
effective_cache_size = 24GB
maintenance_work_mem = 2GB
checkpoint_completion_target = 0.7
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 1.1
effective_io_concurrency = 200
work_mem = 3495kB
min_wal_size = 1GB
max_wal_size = 2GB
max_worker_processes = 12
max_parallel_workers_per_gather = 6
max_parallel_workers = 12
*autovacuum=off   # Auto-vacuuming was disabled*
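
For reference, a sketch of how such a run would be invoked (the host,
database name, and client count are placeholders of mine; the thread only
shows the PGOPTIONS prefix):

    $ PGOPTIONS="-c synchronous_commit=off" \
        pgbench -h pg-server -c 12 -j 12 -T 3600 -P 10 bench_db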


Re: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?

2019-01-24 Thread Saurabh Nanda
Is there any material on how to benchmark Postgres meaningfully? I'm
getting very frustrated with the numbers that `pgbench` is reporting:

-- allocating more resources to Postgres seems to randomly drop performance
-- there seems to be no repeatability in the benchmarking numbers [1]
-- there is no way to figure out what is causing a bottleneck and which
knob/setting is going to alleviate it.

How do the PG wizards figure all this out?

[1]
https://dba.stackexchange.com/questions/227790/pgbench-20-30-variation-in-benchmark-results-non-repeatable-benchmarks

-- Saurabh.

On Thu, Jan 24, 2019 at 12:46 AM Saurabh Nanda wrote:

> Hi,
>
> Please pardon me if this question is already answered in the
> documentation, Wiki, or the mailing list archive. The problem is that I
> don't know the exact term to search for - I've tried searching for "linear
> scalability" and "concurrency vs performance" but didn't find what I was
> looking for.
>
> ## MAIN QUESTION
>
> pgbench -c 1 achieves approx 80 TPS
> pgbench -c 6 should achieve approx 480 TPS, but only achieves 360 TPS
> pgbench -c 12, should achieve approx 960 TPS, but only achieves 610 TPS
>
> If pgbench is being run on a 4c/8t machine and pg-server is being run on a
> 6c/12t machine with 32GB RAM [1], and the two servers are connected with 1
> Gbit/s connection, I don't think either pgbench or pg-server is being
> constrained by hardware, right?
>
> *In that case why is it not possible to achieve linear scalability, at
> least till 12 concurrent connections (i.e. the thread-count of pg-server)?*
> What is an easy way to identify the limiting factor? Is it network
> connectivity? Disk IOPS? CPU load? Some config parameter?
>
> ## SECONDARY QUESTION
>
> *At what level of concurrent connections should settings like
> shared_buffers, effective_cache_size, max_wal_size start making a
> difference?* With my hardware [1], I'm seeing a difference only after 48
> concurrent connections. And that too it's just a 15-30% improvement over
> the default settings that ship with the Ubuntu 18.04 package. Is this
> expected? Isn't this allocating too many resources for too little gain?
>
> ## CONTEXT
>
> I am currently trying to benchmark PG 11 (via pgbench) to figure out the
> configuration parameters that deliver optimum performance for my hardware
> [1] and workload [2]
>
> Based on https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
> I've made the following relevant changes to the default PG config on Ubuntu
> 18.04:
>
> max_connections=400
> work_mem=4MB
> maintenance_work_mem=64MB
> shared_buffers=12288MB
> temp_buffers=8MB
> effective_cache_size=16GB
> wal_buffers=-1
> wal_sync_method=fsync
> max_wal_size=5GB
> autovacuum=off  # NOTE: Only for benchmarking
>
> [1] 32 GB RAM - 6 core/12 thread - 2x SSD in RAID1
> [2] SaaS webapp -- it's a mixed workload which looks a lot like TPC-B
>
> Thanks,
> Saurabh.
>


-- 
http://www.saurabhnanda.com


Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?

2019-01-23 Thread Saurabh Nanda
Hi,

Please pardon me if this question is already answered in the documentation,
Wiki, or the mailing list archive. The problem is that I don't know the
exact term to search for - I've tried searching for "linear scalability"
and "concurrency vs performance" but didn't find what I was looking for.

## MAIN QUESTION

pgbench -c 1 achieves approx 80 TPS
pgbench -c 6 should achieve approx 480 TPS, but only achieves 360 TPS
pgbench -c 12, should achieve approx 960 TPS, but only achieves 610 TPS

If pgbench is being run on a 4c/8t machine and pg-server is being run on a
6c/12t machine with 32GB RAM [1], and the two servers are connected with 1
Gbit/s connection, I don't think either pgbench or pg-server is being
constrained by hardware, right?

*In that case why is it not possible to achieve linear scalability, at
least till 12 concurrent connections (i.e. the thread-count of pg-server)?*
What is an easy way to identify the limiting factor? Is it network
connectivity? Disk IOPS? CPU load? Some config parameter?
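
One hedged way to look for the limiting factor while pgbench runs (these
are standard sysstat/procps utilities, not tools named in this thread):

    $ iostat -x 5      # per-device utilisation and await
    $ vmstat 5         # run queue, context switches, CPU idle/iowait
    $ sar -n DEV 5     # throughput on the 1 Gbit/s link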

## SECONDARY QUESTION

*At what level of concurrent connections should settings like
shared_buffers, effective_cache_size, max_wal_size start making a
difference?* With my hardware [1], I'm seeing a difference only after 48
concurrent connections. And that too it's just a 15-30% improvement over
the default settings that ship with the Ubuntu 18.04 package. Is this
expected? Isn't this allocating too many resources for too little gain?

## CONTEXT

I am currently trying to benchmark PG 11 (via pgbench) to figure out the
configuration parameters that deliver optimum performance for my hardware
[1] and workload [2]

Based on https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
I've made the following relevant changes to the default PG config on Ubuntu
18.04:

max_connections=400
work_mem=4MB
maintenance_work_mem=64MB
shared_buffers=12288MB
temp_buffers=8MB
effective_cache_size=16GB
wal_buffers=-1
wal_sync_method=fsync
max_wal_size=5GB
autovacuum=off  # NOTE: Only for benchmarking

[1] 32 GB RAM - 6 core/12 thread - 2x SSD in RAID1
[2] SaaS webapp -- it's a mixed workload which looks a lot like TPC-B

Thanks,
Saurabh.