[PERFORM] Varying performance results after instance restart
Hello everybody,

I'm experiencing a performance-related issue during some validation measurements. Let me first clarify what kind of problem I'm facing. I've set up a plain TPC-H database without any additional indexes or similar tuning. To get some performance impressions I wrote a little script that 1) uses pg_ctl to start postgres, 2) executes the same query several times, and finally 3) stops the postgres instance using pg_ctl again. These three steps are executed several times, resulting in performance results that vary more than I think they should (taking into account runtime differences I have no influence on):

Run 1: real 0m15.005s  user 0m0.000s  sys 0m0.000s
Run 2: real 0m14.012s  user 0m0.000s  sys 0m0.000s

Please be aware that these numbers are taken from the middle of each run (to exclude any I/O-related effects), so the buffers are warm; I disabled Intel's Turbo Boost feature, and I have exclusive access to the machine. I used explain (analyze, buffers) to make sure no data has to be loaded from disk during runtime (top and vmstat did not show any movement either). I flushed the OS buffers (no effect on the performance difference), I disabled autovacuum, and I adapted my buffer sizes to hold all the data required, but I have not been able to sort the performance difference out yet.

The query I'm currently using is quite simple:

select min(l_extendedprice * l_discount * l_tax) from lineitem;

Note that I'm aware my shared_buffers setting has to be bigger than 1/4 of the table to get past the ring buffer. It would be great if anyone could give me some advice on where to look, or provide me with some information about postgres performance behaviour. Thank you very much.

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Varying-performacne-results-after-instance-restart-tp5801717.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.
-- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
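The measurement harness the post describes (start instance, repeat query, stop instance) might look roughly like the following sketch; PGDATA, the database name, and the repeat counts are placeholders, not details from the original script:

```shell
# Sketch of the measurement loop described above; PGDATA and DB are
# placeholders for your own setup. Each outer iteration cold-starts the
# instance, runs the same query several times, and shuts it down again.
PGDATA="${PGDATA:-/path/to/data}"
DB="${DB:-tpch}"
QUERY="select min(l_extendedprice * l_discount * l_tax) from lineitem;"

for run in 1 2 3; do
    pg_ctl -D "$PGDATA" -w start
    for i in $(seq 1 10); do
        # Middle iterations are the interesting ones: buffers are warm.
        ( time psql -d "$DB" -c "$QUERY" >/dev/null ) 2>&1 | grep real
    done
    pg_ctl -D "$PGDATA" -w stop
done
```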
Re: [PERFORM] Checkpoints and slow queries
Elanchezhiyan Elango elanela...@gmail.com writes:
>> The problem is that while this makes the checkpoints less frequent,
>> it accumulates more changes that need to be written to disk during
>> the checkpoint. Which means the impact is more severe.
> True. But the checkpoints finish in approximately 5-10 minutes every
> time (even with checkpoint_completion_target of 0.9).

There's something wrong with that. I wonder whether you need to kick checkpoint_segments up some more to keep the checkpoint from being run too fast. Even so, though, a checkpoint spread over 5-10 minutes ought to provide the kernel with enough breathing room to flush things. It sounds like the kernel is just sitting on the dirty buffers until it gets hit with fsyncs, and then it's dumping them as fast as it can. So you need some more work on tuning the kernel parameters.

regards, tom lane
[PERFORM] Revisiting disk layout on ZFS systems
I've been doing a bit of benchmarking and real-world performance testing, and have found some curious results. The load in question is a fairly busy machine hosting a web service that uses PostgreSQL as its back end.

Conventional wisdom is that you want to run an 8k record size to match PostgreSQL's inherent write size for the database. However, operational experience says this may no longer be the case now that modern ZFS systems support LZ4 compression, because modern CPUs can compress fast enough that they overrun raw I/O capacity. This in turn means that the recordsize is no longer really the record size, and PostgreSQL's on-disk file format is rather compressible -- indeed, on my dataset it compresses at roughly 1.24x, which is nothing to sneeze at.

The odd thing is that I am getting better performance with a 128k record size on this application than I get with an 8k one! Not only is the system subjectively faster to respond and objectively able to sustain a higher TPS load, but the I/O busy percentage as measured during operation is MARKEDLY lower (by nearly an order of magnitude!). This is not expected behavior!

What I am curious about, however, is the xlog -- that appears to suffer pretty badly from the 128k record size, although it compresses even more materially: 1.94x (!). The files in the xlog directory are large (16MB each), so at first blush a larger record size for that storage area ought to help. It appears that instead it hurts.

Ideas?

--
Karl
k...@denninger.net
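For anyone wanting to reproduce this kind of test, the usual approach on ZFS is to give the data directory and pg_xlog their own datasets so recordsize can be tuned independently; the pool and dataset names below are hypothetical:

```shell
# Hypothetical pool/dataset names -- a sketch of per-dataset tuning so
# the main data files and the WAL can use different record sizes.
zfs create -o recordsize=128k -o compression=lz4 tank/pgdata
zfs create -o recordsize=8k   -o compression=lz4 tank/pgxlog
zfs get recordsize,compression tank/pgdata tank/pgxlog
```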
Re: [PERFORM] Slow queries on 9.3.1 despite use of index
Michael van Rooyen mich...@loot.co.za writes:
> I'm trying to get to the bottom of a performance issue on a server
> running PostgreSQL 9.3.1 on Centos 5.

Hm ... it seems pretty suspicious that all of these examples take just about exactly 1 second longer than you might expect. I'm wondering if there is something sitting on an exclusive table lock somewhere, and releasing it after 1 second. In particular, this looks quite a bit like the old behavior of autovacuum when it was trying to truncate empty pages off the end of a relation --- it would hold off other accesses to the table until deadlock_timeout elapsed, whereupon it'd get kicked off the exclusive lock (and have to retry the truncation next time). Are you *sure* this server is running 9.3.1, and not something pre-9.3?

regards, tom lane
Re: [PERFORM] Slow queries on 9.3.1 despite use of index
On Mon, Apr 28, 2014 at 10:12 AM, Michael van Rooyen mich...@loot.co.za wrote:

> I'm trying to get to the bottom of a performance issue on a server
> running PostgreSQL 9.3.1 on Centos 5. The machine is a dual quad-core
> Xeon E5620 with 24GB ECC RAM and four enterprise SATA Seagate
> Constellation ES drives configured as 2 software RAID1 volumes. The main
> DB is on one volume, and some more-used indexes and the WAL logs are on
> the other. Overall, the load doesn't appear to be heavy, but we're still
> getting slow queries. For example, here's an extract from the log:
>
> 2014-04-28 16:51:02.904 GMT 25998: LOG: checkpoint starting: time
> 2014-04-28 16:53:37.204 GMT 30053: LOG: duration: 1067.464 ms execute unnamed: select id from ProductSupplier where product='25553848082928'
> 2014-04-28 16:54:12.701 GMT 30053: LOG: duration: 1105.844 ms execute unnamed: select id from ProductSupplier where product='1626407082928'
> 2014-04-28 16:54:46.789 GMT 30053: LOG: duration: 1060.585 ms execute unnamed: select ...
> 2014-04-28 16:55:58.058 GMT 30053: LOG: duration: 1309.192 ms execute unnamed: select id from ProductCategory where product='1932872082928'
> 2014-04-28 16:56:06.019 GMT 25998: LOG: checkpoint complete: wrote 6647 buffers (2.5%); 0 transaction log file(s) added, 0 removed, 2 recycled; write=298.862 s, sync=4.072 s, total=303.115 s; sync files=155, longest=0.234 s, average=0.026 s

It looks like something is causing your IO to seize up briefly. It is common for the sync phase of the checkpoint to do that, but that would only explain 3 of the 4 reports above. Is this causing an actual problem for your users, or are you just trying to be proactive? You could change the kernel setting dirty_background_bytes to try to reduce this problem.
> Overall the load on the server seems quite low, for example, typical
> vmstat 1 output is:
>
> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
>  r  b swpd  free   buff  cache     si so  bi   bo    in   cs   us sy id wa st
>  0  1  304  77740  11960 17709156  0  0   998  222              2  1 89  8  0
>  1  0  304  75228  11960 17711164  0  0   1256 635   1418 6498  0  0 94  6  0
>  0  1  304  73440  11968 17712036  0  0   1232 149   1253 6232  1  0 94  6  0
>  0  2  304  78472  11968 17706016  0  0   1760 89    1325 6361  1  0 94  5  0
>  0  1  304  75616  11968 17708480  0  0   2164 72    1371 6855  1  0 94  5  0
>  0  1  304  73292  11968 17710320  0  0   1760 112   1335 6673  1  0 94  5  0
>  0  2  304  77956  11964 17703528  0  0   1204 5614  1867 6712  0  0 94  6  0

Is that typical for when the problem is not occurring, or typical for when it is occurring? Without timestamps to correlate the vmstat output back to the log file, it is very hard to make use of this info. Some versions of vmstat have a -t flag.

> I've tried to optimise postgresql.conf for performance:
>
> max_connections = 1000              # (change requires restart)

1000 is extremely high. How many connections do you actually use at any one time?

> shared_buffers = 2GB                # min 128kB or max_connections*16kB
> work_mem = 100MB                    # min 64kB

100MB is also very high, at least in conjunction with the high max_connections.

> maintenance_work_mem = 100MB        # min 1MB
> synchronous_commit = off            # immediate fsync at commit
> wal_buffers = 16MB                  # min 32kB
> checkpoint_segments = 64            # in logfile segments, min 1, 16MB each
> checkpoint_timeout = 10min          # range 30s-1h
> effective_cache_size = 16GB
> logging_collector = on              # Enable capturing of stderr and csvlog
> log_directory = 'pg_log'            # directory where log files are written
> log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log' # log file name pattern
> log_rotation_age = 1d               # Automatic rotation of logfiles
> log_min_duration_statement = 1000   # -1 is disabled, 0 logs all statements

I would lower this. You can see that a few statements were just over 1000 ms, but you can't tell if there are a lot at 800 ms, or if you have a bimodal distribution with most being 1 ms and a few being 1200 ms.

Cheers, Jeff
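Jeff's dirty_background_bytes suggestion might look like the following sketch; the byte values are illustrative placeholders, not tuned recommendations for this machine:

```shell
# Illustrative values only. Lowering dirty_background_bytes makes the
# kernel start background writeback sooner, so a checkpoint's fsyncs
# find less accumulated dirty data to flush at once.
sysctl -w vm.dirty_background_bytes=67108864   # start writeback at ~64MB
sysctl -w vm.dirty_bytes=536870912             # hard cap at ~512MB
# Add the same settings to /etc/sysctl.conf to persist across reboots.
```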
Re: [PERFORM] Revisiting disk layout on ZFS systems
On 4/28/2014 1:04 PM, Heikki Linnakangas wrote:
> On 04/28/2014 06:47 PM, Karl Denninger wrote:
>> What I am curious about, however, is the xlog -- that appears to
>> suffer pretty badly from 128k record size, although it compresses even
>> more-materially; 1.94x (!) The files in the xlog directory are large
>> (16MB each) and thus first blush would be that having a larger record
>> size for that storage area would help. It appears that instead it hurts.
>
> The WAL is fsync'd frequently. My guess is that that causes a lot of
> extra work to repeatedly recompress the same data, or something like
> that.

It shouldn't, as ZFS re-writes on change, and what's showing up is not high I/O *count* but rather percentage-busy, which implies lots of head movement (that is, lots of sub-allocation-unit writes.) Isn't WAL essentially sequential writes during normal operation?

--
Karl
k...@denninger.net
Re: [PERFORM] Revisiting disk layout on ZFS systems
On 04/28/2014 06:47 PM, Karl Denninger wrote:
> What I am curious about, however, is the xlog -- that appears to
> suffer pretty badly from 128k record size, although it compresses even
> more-materially; 1.94x (!) The files in the xlog directory are large
> (16MB each) and thus first blush would be that having a larger record
> size for that storage area would help. It appears that instead it hurts.

The WAL is fsync'd frequently. My guess is that that causes a lot of extra work to repeatedly recompress the same data, or something like that.

- Heikki
Re: [PERFORM] Revisiting disk layout on ZFS systems
On 04/28/2014 09:07 PM, Karl Denninger wrote:
>> The WAL is fsync'd frequently. My guess is that that causes a lot of
>> extra work to repeatedly recompress the same data, or something like
>> that.
> It shouldn't as ZFS re-writes on change, and what's showing up is not
> high I/O *count* but rather percentage-busy, which implies lots of head
> movement (that is, lots of sub-allocation unit writes.)

That sounds consistent with frequent fsyncs.

> Isn't WAL essentially sequential writes during normal operation?

Yes, it's totally sequential. But it's fsync'd at every commit, which means a lot of small writes.

- Heikki
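Heikki's point can be made concrete with a toy write-amplification model (entirely an illustration; the 2kB commit size is an assumption, not a measured number from this thread):

```python
# Toy model: on a copy-on-write filesystem, each fsync rewrites at least
# one full record, no matter how small the appended WAL data is, so
# per-commit cost scales with the dataset's recordsize.
def rewritten_per_fsync(write_bytes, recordsize):
    records = -(-write_bytes // recordsize)  # ceiling division
    return records * recordsize

commit = 2 * 1024  # assumed size of one small commit's WAL, illustrative
print(rewritten_per_fsync(commit, 8 * 1024),      # 8k recordsize
      rewritten_per_fsync(commit, 128 * 1024))    # 128k recordsize
# -> 8192 131072: the same tiny commit costs 16x more rewritten bytes at
# 128k, consistent with the higher busy% observed on the xlog dataset.
```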
Re: [PERFORM] Revisiting disk layout on ZFS systems
On Mon, Apr 28, 2014 at 11:07 AM, Karl Denninger k...@denninger.net wrote:
> On 4/28/2014 1:04 PM, Heikki Linnakangas wrote:
>> On 04/28/2014 06:47 PM, Karl Denninger wrote:
>>> What I am curious about, however, is the xlog -- that appears to
>>> suffer pretty badly from 128k record size, although it compresses
>>> even more-materially; 1.94x (!) The files in the xlog directory are
>>> large (16MB each) and thus first blush would be that having a larger
>>> record size for that storage area would help. It appears that instead
>>> it hurts.
>>
>> The WAL is fsync'd frequently. My guess is that that causes a lot of
>> extra work to repeatedly recompress the same data, or something like
>> that.
>
> It shouldn't as ZFS re-writes on change, and what's showing up is not
> high I/O *count* but rather percentage-busy, which implies lots of head
> movement (that is, lots of sub-allocation unit writes.)
>
> Isn't WAL essentially sequential writes during normal operation?

Only if you have some sort of non-volatile intermediary, or are willing to risk your data integrity. Otherwise, the fsync nature trumps the sequential nature.

Cheers, Jeff
Re: [PERFORM] Revisiting disk layout on ZFS systems
On 4/28/2014 1:22 PM, Heikki Linnakangas wrote:
> On 04/28/2014 09:07 PM, Karl Denninger wrote:
>>> The WAL is fsync'd frequently. My guess is that that causes a lot of
>>> extra work to repeatedly recompress the same data, or something like
>>> that.
>> It shouldn't as ZFS re-writes on change, and what's showing up is not
>> high I/O *count* but rather percentage-busy, which implies lots of
>> head movement (that is, lots of sub-allocation unit writes.)
>
> That sounds consistent with frequent fsyncs.
>
>> Isn't WAL essentially sequential writes during normal operation?
>
> Yes, it's totally sequential. But it's fsync'd at every commit, which
> means a lot of small writes.

Makes sense; I'll muse on whether there's a way to optimize this further... I'm not running into performance problems at present, but I'd rather stay ahead of it.

--
Karl
k...@denninger.net
Re: [PERFORM] Revisiting disk layout on ZFS systems
On 4/28/2014 1:26 PM, Jeff Janes wrote:
> On Mon, Apr 28, 2014 at 11:07 AM, Karl Denninger k...@denninger.net wrote:
>> Isn't WAL essentially sequential writes during normal operation?
>
> Only if you have some sort of non-volatile intermediary, or are willing
> to risk your data integrity. Otherwise, the fsync nature trumps the
> sequential nature.

That would be a no on the data integrity :-)

--
Karl
k...@denninger.net
Re: [PERFORM] Checkpoints and slow queries
On 28.4.2014 16:07, Tom Lane wrote:
> Elanchezhiyan Elango elanela...@gmail.com writes:
>>> The problem is that while this makes the checkpoints less frequent,
>>> it accumulates more changes that need to be written to disk during
>>> the checkpoint. Which means the impact is more severe.
>> True. But the checkpoints finish in approximately 5-10 minutes every
>> time (even with checkpoint_completion_target of 0.9).
>
> There's something wrong with that. I wonder whether you need to kick
> checkpoint_segments up some more to keep the checkpoint from being run
> too fast. Even so, though, a checkpoint spread over 5-10 minutes ought
> to provide the kernel with enough breathing room to flush things. It
> sounds like the kernel is just sitting on the dirty buffers until it
> gets hit with fsyncs, and then it's dumping them as fast as it can. So
> you need some more work on tuning the kernel parameters.

There's certainly something fishy, because although this is the supposed configuration:

checkpoint_segments = 250
checkpoint_timeout = 1h
checkpoint_completion_target = 0.9

the checkpoint logs typically finish in much shorter periods of time. Like this, for example:
Re: [PERFORM] Checkpoints and slow queries
On Mon, Apr 28, 2014 at 1:41 PM, Tomas Vondra t...@fuzzy.cz wrote:
> On 28.4.2014 16:07, Tom Lane wrote:
>> There's something wrong with that. I wonder whether you need to kick
>> checkpoint_segments up some more to keep the checkpoint from being run
>> too fast. Even so, though, a checkpoint spread over 5-10 minutes ought
>> to provide the kernel with enough breathing room to flush things. It
>> sounds like the kernel is just sitting on the dirty buffers until it
>> gets hit with fsyncs, and then it's dumping them as fast as it can. So
>> you need some more work on tuning the kernel parameters.
>
> There's certainly something fishy, because although this is the
> supposed configuration:
>
> checkpoint_segments = 250
> checkpoint_timeout = 1h
> checkpoint_completion_target = 0.9
>
> the checkpoint logs typically finish in much shorter periods of time.

That doesn't look fishy to me. The checkpointer will not take more than one nap between buffers, so it will always write at least 10 buffers per second (of napping time) even if that means it finishes early. Which seems to be the case here -- the length of the write cycle seems to be about one tenth the number of buffers written.

Even if that were not the case, it also doesn't count buffers written by the backends or the background writer as having been written, so that is another reason for it to finish early. Perhaps the fsync queue should pass on information about whether the written buffers were marked for the checkpointer. There is no reason to think this would improve performance, but it might reduce confusion.

Cheers, Jeff
Re: [PERFORM] Checkpoints and slow queries
Sorry, hit send too early by accident.

On 28.4.2014 16:07, Tom Lane wrote:
> Elanchezhiyan Elango elanela...@gmail.com writes:
>>> The problem is that while this makes the checkpoints less frequent,
>>> it accumulates more changes that need to be written to disk during
>>> the checkpoint. Which means the impact is more severe.
>> True. But the checkpoints finish in approximately 5-10 minutes every
>> time (even with checkpoint_completion_target of 0.9).
>
> There's something wrong with that. I wonder whether you need to kick
> checkpoint_segments up some more to keep the checkpoint from being run
> too fast.

Too fast? All the checkpoints listed in the log were timed, pretty much exactly at 1h intervals:

Apr 26 00:12:57 LOG: checkpoint starting: time
Apr 26 01:12:57 LOG: checkpoint starting: time
Apr 26 02:12:57 LOG: checkpoint starting: time
Apr 26 03:12:57 LOG: checkpoint starting: time
Apr 26 04:12:58 LOG: checkpoint starting: time
Apr 26 05:12:57 LOG: checkpoint starting: time
Apr 26 06:12:57 LOG: checkpoint starting: time

There's certainly something fishy, because although this is the supposed configuration:

checkpoint_segments = 250
checkpoint_timeout = 1h
checkpoint_completion_target = 0.9

the checkpoint logs typically finish in much shorter periods of time. Like this, for example:

Apr 26 10:12:57 LOG: checkpoint starting: time
Apr 26 10:26:27 LOG: checkpoint complete: wrote 9777 buffers (15.3%); 0 transaction log file(s) added, 0 removed, 153 recycled; write=800.377 s, sync=8.605 s, total=809.834 s; sync files=719, longest=1.034 s, average=0.011 s

And that's one of the longer runs -- most of the others run in ~5-6 minutes. Now, maybe I'm mistaken, but I'd expect the checkpoints to finish in ~54 minutes, which is (checkpoint_completion_target * checkpoint_timeout).

> Even so, though, a checkpoint spread over 5-10 minutes ought to provide
> the kernel with enough breathing room to flush things. It sounds like
> the kernel is just sitting on the dirty buffers until it gets hit with
> fsyncs, and then it's dumping them as fast as it can. So you need some
> more work on tuning the kernel parameters.

I'm not sure about this - the /proc/meminfo snapshots sent in the previous post show that the amount of Dirty memory is usually well below ~20MB, with a max of ~36MB at 22:24:26, and within a matter of seconds it drops back down to ~10MB of dirty data. Also, the kernel settings seem quite aggressive to me:

vm.dirty_background_ratio = 1
vm.dirty_background_bytes = 0
vm.dirty_ratio = 20
vm.dirty_bytes = 0
vm.dirty_writeback_centisecs = 500
vm.dirty_expire_centisecs = 500

regards
Tomas
Re: [PERFORM] Checkpoints and slow queries
On 28.4.2014 07:50, Elanchezhiyan Elango wrote:
>> So how much data in total are we talking about? OK, so there are
>> multiple tables, and you're updating 50k rows in all tables in total?
>
> Every 5 minutes:
> 50K rows are updated in 4 tables.
> 2K rows are updated in 39 tables.
>
> Every 1 hour (on top of the hour):
> 50K rows are updated in 8 tables.
> 2K rows are updated in 78 tables.
>
> If every update will take up space equivalent to 1 row, then there are
> 278K rows updated across all tables every 5 minutes. And 556K (278 * 2)
> rows updated across all tables every 1 hour.
>
> All tables follow the same schema except some tables don't have the
> 'port' field. And the data[] column on each row could have a maximum of
> 48 values.

I wasn't really asking about the amount of updates (that's reasonably well seen in the checkpoint logs), but about the size of the database.

>> Can you post \dt+ and \di+ so that we get an idea of table/index sizes?
>
> \dt+: http://pastebin.com/Dvg2vSeb
> \di+: http://pastebin.com/586MGn0U

According to the output, it seems you're dealing with ~20GB of data and ~30GB of indexes. Is that about right?

> Thanks for your input on ext3 filesystem and having WAL on a different
> disk. I'll see if these can be changed. I cannot change these in the
> short term.

What kernel version is this, actually?

Also, have you done some basic performance tests to see how the disk array behaves? I mean something like

dd if=/dev/zero of=/mnt/raid/test.file bs=1M count=16000
dd if=/mnt/raid/test.file of=/dev/null bs=1M count=16000

to test sequential performance, pgbench to test something more random, etc. Because we're trying to solve this as a checkpoint issue, when in reality it might be something completely different. Also, are you sure there's no other source of significant I/O activity? Try running iotop to watch what's happening there.

regards
Tomas
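Tomas's dd test can be wrapped up as below; TESTFILE and SIZE_MB are placeholders (point TESTFILE at the volume under test and make SIZE_MB roughly 2x RAM so the page cache can't hide the I/O), and conv=fdatasync is added so the write figure reflects the disk rather than the cache:

```shell
# Sequential throughput check along the lines suggested above.
# TESTFILE/SIZE_MB are placeholders; the small defaults are only so the
# script runs anywhere -- use the real array and ~2x RAM for a real test.
TESTFILE="${TESTFILE:-/tmp/seqtest.bin}"
SIZE_MB="${SIZE_MB:-64}"

# conv=fdatasync makes dd flush before reporting its rate.
dd if=/dev/zero of="$TESTFILE" bs=1M count="$SIZE_MB" conv=fdatasync
dd if="$TESTFILE" of=/dev/null bs=1M count="$SIZE_MB"
rm -f "$TESTFILE"
```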
Re: [PERFORM] Checkpoints and slow queries
On 28.4.2014 22:54, Jeff Janes wrote:
> On Mon, Apr 28, 2014 at 1:41 PM, Tomas Vondra t...@fuzzy.cz wrote:
>> There's certainly something fishy, because although this is the
>> supposed configuration:
>>
>> checkpoint_segments = 250
>> checkpoint_timeout = 1h
>> checkpoint_completion_target = 0.9
>>
>> the checkpoint logs typically finish in much shorter periods of time.
>
> That doesn't look fishy to me. The checkpointer will not take more than
> one nap between buffers, so it will always write at least 10 buffers
> per second (of napping time) even if that means it finishes early.
> Which seems to be the case here -- the length of the write cycle seems
> to be about one tenth the number of buffers written.

Oh, makes sense I guess. Apparently I'm tuning this only on systems doing a lot of I/O, so this behaviour somehow escaped me.

Tomas
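Jeff's pacing argument can be sanity-checked against the log excerpt quoted earlier in the thread (9777 buffers, write=800.377 s), using his 10-buffers-per-second floor:

```python
# Check the pacing argument against the numbers from the thread: the
# checkpointer naps at most ~100 ms between buffer writes, so the write
# phase runs at no less than ~10 buffers per second of napping time.
buffers_written = 9777           # from the "checkpoint complete" line
write_seconds = 800.377          # ditto

target_seconds = 0.9 * 3600      # completion_target * timeout = 54 min
floor_seconds = buffers_written / 10   # minimum-pace bound, ~16 min

rate = buffers_written / write_seconds
print(round(target_seconds), round(floor_seconds), round(rate, 1))
# -> 3240 978 12.2: the observed write phase sits near the pacing floor,
# far below the 54-minute target; the remaining gap is plausibly buffers
# written by backends/bgwriter, which the checkpointer doesn't count.
```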
Re: [PERFORM] Slow queries on 9.3.1 despite use of index
On 2014/04/28 07:50 PM, Tom Lane wrote:
> Michael van Rooyen mich...@loot.co.za writes:
>> I'm trying to get to the bottom of a performance issue on a server
>> running PostgreSQL 9.3.1 on Centos 5.
>
> Hm ... it seems pretty suspicious that all of these examples take just
> about exactly 1 second longer than you might expect. I'm wondering if
> there is something sitting on an exclusive table lock somewhere, and
> releasing it after 1 second.

I do have log_min_duration_statement = 1000, which may cause this.

> In particular, this looks quite a bit like the old behavior of
> autovacuum when it was trying to truncate empty pages off the end of a
> relation --- it would hold off other accesses to the table until
> deadlock_timeout elapsed, whereupon it'd get kicked off the exclusive
> lock (and have to retry the truncation next time). Are you *sure* this
> server is running 9.3.1, and not something pre-9.3?

Definitely 9.3.1. The strange thing is I have other servers with similar configurations and load, and with the same database, where performance is great, so it's hard for me to know what's different here. Maybe I'm expecting too much from these SATA drives, or it's time to add lots of RAM...
Re: [PERFORM] Slow queries on 9.3.1 despite use of index
Michael van Rooyen mich...@loot.co.za writes:
> On 2014/04/28 07:50 PM, Tom Lane wrote:
>> Hm ... it seems pretty suspicious that all of these examples take just
>> about exactly 1 second longer than you might expect. I'm wondering if
>> there is something sitting on an exclusive table lock somewhere, and
>> releasing it after 1 second.
> I do have log_min_duration_statement = 1000, which may cause this.

Ah, I overlooked that detail. Never mind that theory then. Although it might still be worth turning on log_lock_waits for a while, just to eliminate the possibility of lock-induced delays.

regards, tom lane
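Tom's log_lock_waits suggestion as a postgresql.conf fragment (the deadlock_timeout line only shows the default, which doubles as the lock-wait logging threshold):

```
# Log a message whenever a session waits longer than deadlock_timeout
# for a lock; cheap to enable and pinpoints lock-induced delays.
log_lock_waits = on
deadlock_timeout = 1s    # default; also the lock-wait logging threshold
```

Both settings take effect on a reload (pg_ctl reload); no restart is needed.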