Re: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?
On Mon, Jan 28, 2019 at 12:03 AM Saurabh Nanda wrote:

> All this benchmarking has led me to a philosophical question, why does PG need shared_buffers in the first place?

PostgreSQL cannot let the OS get its hands on a dirty shared buffer until the WAL record "protecting" that buffer has been flushed to disk. If a dirty shared buffer were written to disk, but a crash happened before the WAL record got flushed to disk, then the data could be corrupted when the system comes back up. So shared_buffers effectively serves as a cooling pond where dirty buffers wait for their WAL to be flushed naturally, so they can be written out without instigating a performance-reducing flush just for them.

Also, concurrent clients need to access the same disk pages at overlapping times without corrupting each other. Perhaps that could be implemented by keeping just the buffer headers in shared memory to coordinate the locking, without the buffers themselves being in shared memory. But that is not how it is currently implemented.

> What's wrong with letting the OS do the caching/buffering?

Nothing, and that is what it does. Which is why the advice for shared_buffers is often to use a small fraction of RAM, leaving the rest for the OS to do its thing. But PostgreSQL still needs a way to lock those pages, both against concurrent access by its own clients, and against getting flushed out of order by the OS. There is no performant way to release dirty pages immediately to the OS while still constraining the order in which the OS flushes them to disk.

Finally, while reading a page from the OS cache into shared_buffers is much faster than reading it from disk, it is still much slower than finding it already located in shared_buffers. So if your entire database fits in RAM, you will get better performance if shared_buffers is large enough for the entire thing to fit in there as well. This is an exception to the rule that shared_buffers should be a small fraction of RAM.
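As an aside, one way to see how much of shared_buffers your working set actually occupies is the contrib pg_buffercache extension. A minimal sketch (the database name "bench" is an assumption; pg_buffercache ships with PostgreSQL's contrib packages):

```shell
# Count how many shared buffers hold a page vs. how many exist in total;
# if used_buffers sits well below total_buffers, shared_buffers is oversized.
psql bench <<'SQL'
CREATE EXTENSION IF NOT EXISTS pg_buffercache;
SELECT count(*) FILTER (WHERE relfilenode IS NOT NULL) AS used_buffers,
       count(*)                                        AS total_buffers
FROM pg_buffercache;
SQL
```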
> Isn't it optimised for this kind of stuff?

Maybe. But you might be surprised at how poorly optimized it is. It depends on your OS and its version, of course. If you have a high usage_count buffer which is re-dirtied constantly, it will only get written and flushed to disk once per checkpoint if it is under PostgreSQL's control. But I've seen pages like that get written many times per second under kernel control. Whatever optimization the kernel tried to do, it wasn't very good at it.

Also, if many contiguous pages are dirtied in a close time-frame, but not dirtied in their physical order, the kernel should be able to re-order them into long sequential writes, correct? But empirically it doesn't, at least back in the late 2.* series kernels when I did the experiments. I don't know whether it didn't even try, or tried but failed. (Of course back then, PostgreSQL didn't do a good job of it either.)

Cheers, Jeff
Re: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?
On 29/01/2019 at 07:15, Saurabh Nanda wrote:

> c) I tried another cloud hosting provider (E2E Networks) and just the raw performance numbers (with default configuration) are blowing Hetzner out of the water. I noticed that on E2E, the root filesystem is mounted with the following options:
>
> /dev/xvda on / type ext4 (rw,noatime,nodiratime,nobarrier,errors=remount-ro,stripe=512,data=ordered)
>
> whereas on Hetzner, it is mounted with the following options:
>
> /dev/nvme0n1p3 on / type ext4 (rw,relatime,data=ordered)
>
> How much of a difference can this have on absolute TPS numbers?

The differences can be significant.

noatime does not update the inode access time at all, while relatime updates the inode access time only if the change time was before the access time (which can often be the case for a database).

nobarrier disables block-level write barriers. Barriers ensure that data is effectively stored on the device; the man page says: "If disabled on a device with a volatile (non-battery-backed) write-back cache, the nobarrier option will lead to filesystem corruption on a system crash or power loss."

You should probably consider noatime over relatime, and whether to use nobarrier depends on whether you have a battery-backed cache or not.

Also, this is an SSD, so you should TRIM it, either with a periodic fstrim, or using the discard mount option.

Nicolas
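The options above can be applied like this; a sketch only (run as root, test with a remount before making anything permanent in /etc/fstab, and note that nobarrier is only safe with a power-loss-protected write cache):

```shell
# Switch the root filesystem from relatime to noatime without a reboot:
mount -o remount,noatime /

# Prefer periodic TRIM over the "discard" mount option (systemd distros
# ship a weekly fstrim.timer in util-linux):
systemctl enable --now fstrim.timer

# ONLY with a battery/power-loss-protected write-back cache -- otherwise
# this risks filesystem corruption on power loss:
#   mount -o remount,nobarrier /
```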
Re: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?
Yet another update:

a) I've tried everything with my EX41-SSD server on Hetzner and nothing is increasing the performance over & above the default configuration.

b) I tried commissioning a new EX41-SSD server and was able to replicate the same pathetic performance numbers.

c) I tried another cloud hosting provider (E2E Networks) and just the raw performance numbers (with default configuration) are blowing Hetzner out of the water.

This leads me to believe that my assumption of the first hardware (or SSD) being faulty is incorrect. Something is wrong with either the EX41-SSD hardware or the out-of-box configuration. I'm commissioning something from their PX line (which is marked as "Datacenter Edition") and checking if that makes things better.

+--------+---------------+------------------+
| client | Hetzner       | E2E Networks     |
|        | EX41-SSD [1]  | Cloud Server [2] |
+--------+---------------+------------------+
| 1      | ~160          | ~400             |
| 4      | ~460          | ~1450            |
| 8      | ~850          | ~2600            |
| 12     | ~1200         | ~4000            |
+--------+---------------+------------------+

[1] lshw output for Hetzner - https://gist.github.com/saurabhnanda/613813d0d58fe1a406a8ce9b62ad10a9
[2] lshw output for E2E - https://gist.github.com/saurabhnanda/d276603990aa773269bad35f335344eb - since this is a cloud server, low-level hardware info is not available. It's advertised as a 9vCPU + 30GB RAM + SSD cloud instance.

-- Saurabh.
Re: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?
> Do you know which of the settings is causing lower TPS ?
> I suggest to check shared_buffers.
>
> If you haven't done it, disabling THP and KSM can resolve performance issues, esp. with large RAM like shared_buffers, at least with older kernels.
>
> https://www.postgresql.org/message-id/20170718180152.GE17566%40telsasoft.com

I've tried reducing the number of variables to a bare minimum and have the following three cases now with RAID disabled:

a) only default settings [1]
b) default settings with shared_buffers=2G [2]
c) default settings with shared_buffers=2G & huge_pages=on [3]

The numbers are still not making any sense whatsoever.

+--------+--------------+----------------+---------------+
| client | Defaults [1] | buffers=2G [2] | buffers=2G    |
|        |              |                | huge_pages=on |
+--------+--------------+----------------+---------------+
| 1      | 348-475 (??) | 529-583 (??)   | 155-290       |
| 4      | 436-452      | 451-452        | 388-403       |
| 8      | 862-869      | 859-861        | 778-781       |
| 12     | 1210-1219    | 1220-1225      | 1110-         |
+--------+--------------+----------------+---------------+

[1] Default settings

checkpoint_completion_target=0.5
default_statistics_target=100
effective_io_concurrency=1
max_parallel_workers=8
max_parallel_workers_per_gather=2
max_wal_size=1024 MB
max_worker_processes=20
min_wal_size=80 MB
random_page_cost=4
* shared_buffers=1024 8kB *
wal_buffers=32 8kB
work_mem=4096 kB

[2] Increased shared_buffers

checkpoint_completion_target=0.5
default_statistics_target=100
effective_io_concurrency=1
max_parallel_workers=8
max_parallel_workers_per_gather=2
max_wal_size=1024 MB
max_worker_processes=20
min_wal_size=80 MB
random_page_cost=4
* shared_buffers=262144 8kB *
wal_buffers=2048 8kB
work_mem=4096 kB

[3] Same settings as [2] with huge_pages=on and the following changes:

$ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

$ cat /proc/meminfo | grep -i huge
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
HugePages_Total:    5000
HugePages_Free:     3940
HugePages_Rsvd:        1
HugePages_Surp:        0
Hugepagesize:       2048 kB

-- Saurabh.
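For what it's worth, vm.nr_hugepages can be sized from shared_buffers directly rather than guessed. A sketch, assuming the 2 GB shared_buffers case above, 2048 kB huge pages, and an arbitrary ~5% headroom for PostgreSQL's other shared memory:

```shell
# Estimate huge pages needed for shared_buffers=2GB with 2048 kB pages.
shared_buffers_kb=$((2 * 1024 * 1024))    # 2 GB expressed in kB
hugepage_kb=2048                          # from Hugepagesize in /proc/meminfo
pages=$(( (shared_buffers_kb + hugepage_kb - 1) / hugepage_kb ))  # round up
pages=$(( pages + pages / 20 ))           # ~5% headroom for other segments
echo "$pages"
# To apply (as root): sysctl -w vm.nr_hugepages="$pages", then persist it
# in /etc/sysctl.conf.
```

This is consistent with the meminfo output above: HugePages_Total minus HugePages_Free is roughly 1060 pages in use, so reserving 5000 is far more than the 2G case needs.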
Re: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?
I've disabled transparent huge-pages and enabled huge_pages as given below. Let's see what happens. (I feel like a monkey pressing random buttons trying to turn a light bulb on... and I'm sure the monkey would've had it easier!)

AnonHugePages:         0 kB
ShmemHugePages:        0 kB
HugePages_Total:    5000
HugePages_Free:     4954
HugePages_Rsvd:     1015
HugePages_Surp:        0
Hugepagesize:       2048 kB

-- Saurabh.
Re: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?
> > You should probably include the detailed hardware you are working on - > especially for the SSD, the model can have a big impact, as well as its > wear. > What's the best tool to get meaningful information for SSD drives? -- Saurabh.
Re: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?
On 28/01/2019 at 15:03, Saurabh Nanda wrote:

> An update. It seems (to my untrained eye) that something is wrong with the second SSD in the RAID configuration. Here's my question on serverfault related to what I saw with iostat - https://serverfault.com/questions/951096/difference-in-utilisation-reported-by-iostat-for-two-identical-disks-in-raid1
>
> I've disabled RAID and rebooted the server to run the benchmarks with client=1,4,8,12 with shared_buffers=8MB (default) vs shared_buffers=2GB (optimised?) and will report back.

You should probably include the detailed hardware you are working on - especially for the SSD, the model can have a big impact, as well as its wear.

Nicolas
Re: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?
All this benchmarking has led me to a philosophical question, why does PG need shared_buffers in the first place? What's wrong with letting the OS do the caching/buffering? Isn't it optimised for this kind of stuff?
Re: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?
>> You could also try pg_test_fsync to get low-level information, to supplement the high-level information you get from pgbench.
>
> Thanks for pointing me to this tool. Never knew pg_test_fsync existed! I've run `pg_test_fsync -s 60` two times and this is the output - https://gist.github.com/saurabhnanda/b60e8cf69032b570c5b554eb50df64f8 I'm not sure what to make of it?

I don't know what to make of that either. I'd expect fdatasync using two 8kB writes to be about the same throughput as using one 8kB write, but instead it is 4 times slower. Also, I'd expect open_datasync to get slower by a factor of 2, not a factor of 8, when going from one to two 8kB writes (that is not directly relevant, as you aren't using open_datasync, but it is curious nonetheless). Is this reproducible with different run lengths? I wonder if your write cache (or something) gets "tired" during the first part of pg_test_fsync and thus degrades the subsequent parts of the test.

I would say something in your IO stack is not optimal; maybe some component is "consumer grade" rather than "server grade". Maybe you can ask Hetzner about that.

>> The effects of max_wal_size are going to depend on how you have IO configured, for example does pg_wal share the same devices and controllers as the base data? It is mostly about controlling disk usage and crash-recovery performance, neither of which is of primary importance to pgbench performance.
>
> The WAL and the data-directory reside on the same SSD disk -- is this a bad idea?

If you are trying to squeeze out every last bit of performance, then I think it is a bad idea. Or at least, something where it is worth trying the alternative and seeing. The flushing that occurs during checkpoints and the flushing that occurs for every commit can interfere with each other.

> I was under the impression that smaller values for max_wal_size cause pg-server to do "maintenance work" related to wal rotation, etc.
> more frequently and would lead to lower pgbench performance.

If you choose ridiculously small values, it would. But once the value is sufficient, increasing it further wouldn't do much. Given your low level of throughput, I would think the default is already sufficient.

Thanks for including the storage info. Nothing about it stands out to me as either good or bad, but I'm not a hardware maven; hopefully one will be reading along and will speak up.

> PS: Cc-ing the list back again because I assume you didn't intend for your reply to be private, right?

Yes, I had intended to include the list but hit the wrong button, sorry.

Cheers, Jeff
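Separating WAL from the data directory, as discussed above, can be done without re-initializing the cluster. A sketch (the paths are assumptions for a Debian/Ubuntu PG 11 layout; adjust for yours, and the server must be stopped first or the cluster will be corrupted):

```shell
# Move pg_wal to a dedicated device and leave a symlink behind.
systemctl stop postgresql
mv /var/lib/postgresql/11/main/pg_wal /mnt/waldisk/pg_wal
ln -s /mnt/waldisk/pg_wal /var/lib/postgresql/11/main/pg_wal
systemctl start postgresql
# For a brand-new cluster, "initdb --waldir=/mnt/waldisk/pg_wal ..."
# achieves the same layout in one step.
```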
Re: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?
> Do you know which of the settings is causing lower TPS?
> I suggest to check shared_buffers.

I'm trying to find this, but it's taking a lot of time to re-run the benchmarks changing one config setting at a time. Thanks for the tip related to shared_buffers.

> If you haven't done it, disabling THP and KSM can resolve performance issues, esp. with large RAM like shared_buffers, at least with older kernels.
>
> https://www.postgresql.org/message-id/20170718180152.GE17566%40telsasoft.com

Is this a well-known performance "hack"? Is there any reason why it is not mentioned at https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server ? Are the stability implications of fiddling with THP and KSM well-known?

Also, wrt KSM, my understanding was that when a process forks, the process's memory is copy-on-write anyway, right? What other kind of pages would end up being de-duplicated by ksmd? (Caveat: this is the first time I'm hearing about KSM and my knowledge is based off a single reading of https://www.kernel.org/doc/html/latest/admin-guide/mm/ksm.html )

-- Saurabh.
Re: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?
On Sun, Jan 27, 2019 at 01:09:16PM +0530, Saurabh Nanda wrote:
> It seems that PGOPTIONS="-c synchronous_commit=off" has a significant impact. However, I still cannot understand why the TPS for the optimised case is LOWER than the default for higher concurrency levels!

Do you know which of the settings is causing lower TPS? I suggest to check shared_buffers.

If you haven't done it, disabling THP and KSM can resolve performance issues, esp. with large RAM like shared_buffers, at least with older kernels.

https://www.postgresql.org/message-id/20170718180152.GE17566%40telsasoft.com

Justin
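For reference, THP and KSM can be inspected and disabled at runtime through sysfs; a sketch (run as root; these settings revert at boot, so persist them via your init system or the kernel command line):

```shell
# Show the active THP mode (the bracketed value is the current one):
cat /sys/kernel/mm/transparent_hugepage/enabled

# Disable THP and its background defragmentation:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

# Stop KSM scanning:
echo 0 > /sys/kernel/mm/ksm/run
```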
Re: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?
>> PGOPTIONS="-c synchronous_commit=off" pgbench -T 3600 -P 10
>
> I am currently running all my benchmarks with synchronous_commit=off and will get back with my findings.

It seems that PGOPTIONS="-c synchronous_commit=off" has a significant impact. However, I still cannot understand why the TPS for the optimised case is LOWER than the default for higher concurrency levels!

+--------+---------------------+------------------------+
| client | Mostly defaults [1] | Optimised settings [2] |
+--------+---------------------+------------------------+
| 1      | 80-86               | 169-180                |
| 6      | 350-376             | 1265-1397              |
| 12     | 603-619             | 1746-2352              |
| 24     | 947-1015            | 1869-2518              |
| 48     | 1435-1512           | 1912-2818              |
| 96     | 1769-1811           | 1546-1753              |
| 192    | 1857-1992           | 1332-1508              |
| 384    | 1667-1793           | 1356-1450              |
+--------+---------------------+------------------------+

[1] "Mostly default" settings are whatever ships with Ubuntu 18.04 + PG 11. A snippet of the relevant settings is given below:

max_connections=400
work_mem=4MB
maintenance_work_mem=64MB
shared_buffers=128MB
temp_buffers=8MB
effective_cache_size=4GB
wal_buffers=-1
wal_sync_method=fsync
max_wal_size=1GB
autovacuum=off  # Auto-vacuuming was disabled

[2] An optimised version of settings was obtained from https://pgtune.leopard.in.ua/#/ and along with that the benchmarks were run with PGOPTIONS="-c synchronous_commit=off":

max_connections = 400
shared_buffers = 8GB
effective_cache_size = 24GB
maintenance_work_mem = 2GB
checkpoint_completion_target = 0.7
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 1.1
effective_io_concurrency = 200
work_mem = 3495kB
min_wal_size = 1GB
max_wal_size = 2GB
max_worker_processes = 12
max_parallel_workers_per_gather = 6
max_parallel_workers = 12
autovacuum=off  # Auto-vacuuming was disabled
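The sweep above can be scripted in one loop. A sketch, assuming a pgbench-initialized database named "bench" and capping pgbench's own thread count at the 12 hardware threads of the server:

```shell
# Run pgbench at each concurrency level with synchronous_commit=off.
for c in 1 6 12 24 48 96 192 384; do
  j=$(( c < 12 ? c : 12 ))    # never more pgbench threads than clients/cores
  PGOPTIONS="-c synchronous_commit=off" \
    pgbench -c "$c" -j "$j" -T 3600 -P 10 bench
done
```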
Re: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?
Is there any material on how to benchmark Postgres meaningfully? I'm getting very frustrated with the numbers that `pgbench` is reporting:

-- allocating more resources to Postgres seems to be randomly dropping performance
-- there seems to be no repeatability in the benchmarking numbers [1]
-- there is no way to figure out what is causing a bottleneck and which knob/setting is going to alleviate it.

How do the PG wizards figure all this out?

[1] https://dba.stackexchange.com/questions/227790/pgbench-20-30-variation-in-benchmark-results-non-repeatable-benchmarks

-- Saurabh.

On Thu, Jan 24, 2019 at 12:46 AM Saurabh Nanda wrote:

> Hi,
>
> Please pardon me if this question is already answered in the documentation, Wiki, or the mailing list archive. The problem is that I don't know the exact term to search for - I've tried searching for "linear scalability" and "concurrency vs performance" but didn't find what I was looking for.
>
> ## MAIN QUESTION
>
> pgbench -c 1 achieves approx 80 TPS
> pgbench -c 6 should achieve approx 480 TPS, but only achieves 360 TPS
> pgbench -c 12 should achieve approx 960 TPS, but only achieves 610 TPS
>
> If pgbench is being run on a 4c/8t machine and pg-server is being run on a 6c/12t machine with 32GB RAM [1], and the two servers are connected with a 1 Gbit/s connection, I don't think either pgbench or pg-server is being constrained by hardware, right?
>
> *In that case why is it not possible to achieve linear scalability, at least till 12 concurrent connections (i.e. the thread-count of pg-server)?* What is an easy way to identify the limiting factor? Is it network connectivity? Disk IOPS? CPU load? Some config parameter?
>
> ## SECONDARY QUESTION
>
> *At what level of concurrent connections should settings like shared_buffers, effective_cache_size, max_wal_size start making a difference?* With my hardware [1], I'm seeing a difference only after 48 concurrent connections.
> And that too, it's just a 15-30% improvement over the default settings that ship with the Ubuntu 18.04 package. Is this expected? Isn't this allocating too many resources for too little gain?
>
> ## CONTEXT
>
> I am currently trying to benchmark PG 11 (via pgbench) to figure out the configuration parameters that deliver optimum performance for my hardware [1] and workload [2].
>
> Based on https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server I've made the following relevant changes to the default PG config on Ubuntu 18.04:
>
> max_connections=400
> work_mem=4MB
> maintenance_work_mem=64MB
> shared_buffers=12288MB
> temp_buffers=8MB
> effective_cache_size=16GB
> wal_buffers=-1
> wal_sync_method=fsync
> max_wal_size=5GB
> autovacuum=off # NOTE: Only for benchmarking
>
> [1] 32 GB RAM - 6 core/12 thread - 2x SSD in RAID1
> [2] SaaS webapp -- it's a mixed workload which looks a lot like TPC-B
>
> Thanks,
> Saurabh.

-- http://www.saurabhnanda.com
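On the repeatability point, one way to quantify the run-to-run spread is to repeat each pgbench run several times and summarize the reported TPS figures. A sketch (the database name "bench" and the parsing of pgbench's "tps = ..." line are assumptions; the awk summary itself needs no server):

```shell
# Summarize one TPS value per input line as min/max/mean.
summarize() {
  awk '{ s += $1; if (NR == 1 || $1 < min) min = $1; if ($1 > max) max = $1 }
       END { printf "min=%s max=%s mean=%.1f\n", min, max, s / NR }'
}

# Intended use (needs a running server and a pgbench-initialized "bench" DB):
#   for i in 1 2 3 4 5; do
#     pgbench -c 12 -j 12 -T 300 bench | awk '/^tps/ {print $3; exit}'
#   done | summarize

# Server-independent demonstration with TPS numbers like those reported:
printf '610\n598\n655\n' | summarize    # prints: min=598 max=655 mean=621.0
```

If the spread is more than a few percent of the mean, comparing single runs of two configurations tells you very little.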
Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?
Hi,

Please pardon me if this question is already answered in the documentation, Wiki, or the mailing list archive. The problem is that I don't know the exact term to search for - I've tried searching for "linear scalability" and "concurrency vs performance" but didn't find what I was looking for.

## MAIN QUESTION

pgbench -c 1 achieves approx 80 TPS
pgbench -c 6 should achieve approx 480 TPS, but only achieves 360 TPS
pgbench -c 12 should achieve approx 960 TPS, but only achieves 610 TPS

If pgbench is being run on a 4c/8t machine and pg-server is being run on a 6c/12t machine with 32GB RAM [1], and the two servers are connected with a 1 Gbit/s connection, I don't think either pgbench or pg-server is being constrained by hardware, right?

*In that case why is it not possible to achieve linear scalability, at least till 12 concurrent connections (i.e. the thread-count of pg-server)?* What is an easy way to identify the limiting factor? Is it network connectivity? Disk IOPS? CPU load? Some config parameter?

## SECONDARY QUESTION

*At what level of concurrent connections should settings like shared_buffers, effective_cache_size, max_wal_size start making a difference?* With my hardware [1], I'm seeing a difference only after 48 concurrent connections. And that too, it's just a 15-30% improvement over the default settings that ship with the Ubuntu 18.04 package. Is this expected? Isn't this allocating too many resources for too little gain?
## CONTEXT

I am currently trying to benchmark PG 11 (via pgbench) to figure out the configuration parameters that deliver optimum performance for my hardware [1] and workload [2].

Based on https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server I've made the following relevant changes to the default PG config on Ubuntu 18.04:

max_connections=400
work_mem=4MB
maintenance_work_mem=64MB
shared_buffers=12288MB
temp_buffers=8MB
effective_cache_size=16GB
wal_buffers=-1
wal_sync_method=fsync
max_wal_size=5GB
autovacuum=off # NOTE: Only for benchmarking

[1] 32 GB RAM - 6 core/12 thread - 2x SSD in RAID1
[2] SaaS webapp -- it's a mixed workload which looks a lot like TPC-B

Thanks,
Saurabh.