Re: [PERFORM] Slow queries / commits, mis-configuration or hardware issues?
On Nov 16, 2011, at 8:52 AM, Tomas Vondra wrote:

> On 16 Listopad 2011, 2:21, Cody Caughlan wrote:
>> How did you build your RAID array? Maybe I have a fundamental flaw /
>> misconfiguration. I am doing it via:
>>
>> $ yes | mdadm --create /dev/md0 --level=10 -c256 --raid-devices=4 /dev/xvdb /dev/xvdc /dev/xvdd /dev/xvde
>> $ pvcreate /dev/md0
>> $ vgcreate lvm-raid10 /dev/md0
>> $ lvcreate -l 215021 lvm-raid10 -n lvm0
>> $ blockdev --setra 65536 /dev/lvm-raid10/lvm0
>> $ mkfs.xfs -f /dev/lvm-raid10/lvm0
>> $ mkdir -p /data
>> $ mount -t xfs -o noatime /dev/lvm-raid10/lvm0 /data
>
> I'm not using EC2 much, and those were my first attempts with
> ephemeral storage, so this may be a stupid question, but why are you
> building a RAID-10 array on ephemeral storage anyway? You already have
> a standby, so if the primary instance fails you can easily fail over.

Yes, the slave will become master if the master goes down. We have no
plan to try to resurrect the master in the case of failure, hence the
choice of ephemeral over EBS. We chose RAID10 over RAID0 to get the
best combination of performance while minimizing the probability of a
single drive failure bringing down the house. So, yes, RAID0 would
ultimately deliver the best performance, with more risk.

> What are you going to do in case of a drive failure? With a physical
> server this is rather easy - just put in a new drive and you're done -
> but can you do that on EC2? I guess you can't do that while the
> instance is running, so you'll have to switch to the standby anyway,
> right? Have you ever tried this (how it affects performance etc.)?

As far as I know one cannot alter the ephemeral drives in a running
instance, so yes, the whole instance would have to be written off.

> So what additional protection does that give you? Wouldn't RAID-0 be
> a better utilization of the resources?

Too much risk.

> Tomas
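For what it's worth, a degraded md array can at least be detected early
so the failover is proactive rather than forced. A minimal sketch using
standard mdadm/Linux tooling (the mail address is a placeholder):

# Quick health check: a missing member or a "degraded" state means the
# instance should be rotated out and the standby promoted.
$ cat /proc/mdstat
$ mdadm --detail /dev/md0

# Or let mdadm watch the array and mail on failure events:
$ mdadm --monitor --scan --daemonise --mail=ops@example.com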
Re: [PERFORM] Slow queries / commits, mis-configuration or hardware issues?
On Tue, Nov 15, 2011 at 5:16 PM, Tomas Vondra t...@fuzzy.cz wrote:

> Dne 14.11.2011 22:58, Cody Caughlan napsal(a):
>> I ran bonnie++ on a slave node, doing active streaming replication
>> but otherwise idle:
>> http://batch-files-test.s3.amazonaws.com/sql03.prod.html
>>
>> bonnie++ on the master node:
>> http://batch-files-test.s3.amazonaws.com/sql01.prod.html
>>
>> If I am reading this right (this is my first time using it), the
>> numbers don't look too good.
>
> I've done some benchmarks on my own (m1.xlarge instance), and the
> results are these (http://pastebin.com/T1LXHru0):
>
> single drive
>   dd writes:            62 MB/s
>   dd reads:            110 MB/s
>   bonnie seq. writes:   55 MB/s
>   bonnie seq. rewrite:  33 MB/s
>   bonnie seq. reads:    91 MB/s
>   bonnie seeks:        370/s
>
> raid 0 (4 devices)
>   dd writes:           220 MB/s
>   dd reads:            380 MB/s
>   bonnie seq. writes:  130 MB/s
>   bonnie seq. rewrite: 114 MB/s
>   bonnie seq. reads:   280 MB/s
>   bonnie seeks:        570/s
>
> raid 10 (4 devices)
>   dd writes:            90 MB/s
>   dd reads:            200 MB/s
>   bonnie seq. writes:   49 MB/s
>   bonnie seq. rewrite:  56 MB/s
>   bonnie seq. reads:   160 MB/s
>   bonnie seeks:        590/s

Interesting. I spun up a new m1.xlarge and did the same RAID10 config
(4 drives), except with a chunk size of 512K (instead of 256K), and the
machine was completely idle. Bonnie:

http://batch-files-test.s3.amazonaws.com/idle-512k-chunk.html

That has similar-ish performance to yours, except for worse seeks and
slightly better seq. reads.

The other bonnies I sent over were NOT on idle systems. This one is the
master, which receives a heavy stream of writes and some reads:

http://batch-files-test.s3.amazonaws.com/sql01.prod.html

And this is the slave, which is all writes and no reads:

http://batch-files-test.s3.amazonaws.com/sql03.prod.html

How did you build your RAID array? Maybe I have a fundamental flaw /
misconfiguration. I am doing it via:

$ yes | mdadm --create /dev/md0 --level=10 -c256 --raid-devices=4 /dev/xvdb /dev/xvdc /dev/xvdd /dev/xvde
$ pvcreate /dev/md0
$ vgcreate lvm-raid10 /dev/md0
$ lvcreate -l 215021 lvm-raid10 -n lvm0
$ blockdev --setra 65536 /dev/lvm-raid10/lvm0
$ mkfs.xfs -f /dev/lvm-raid10/lvm0
$ mkdir -p /data
$ mount -t xfs -o noatime /dev/lvm-raid10/lvm0 /data
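For reproducibility, a typical bonnie++ invocation on a 16GB box looks
something like this (a sketch with standard bonnie++ flags, not
necessarily the exact command used for the runs above):

$ bonnie++ -d /data/tmp -s 32g -n 0 -m $(hostname) -u postgres

-d is the working directory, -s the test file size (about twice RAM so
the page cache can't absorb it), -n 0 skips the small-file creation
tests, and -u the user to run as.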
[PERFORM] Slow queries / commits, mis-configuration or hardware issues?
Hi, running Postgres 9.1.1 on an EC2 m1.xlarge instance. The machine is
a dedicated master with 2 streaming replication nodes, and has 16GB of
RAM and 4 cores.

We're starting to see some slow queries, especially COMMITs, happening
more frequently. The slow queries are against seemingly well-indexed
tables. I have log_min_duration_statement = 150ms.

Slow commits like:

2011-11-14 17:47:11 UTC pid:14366 (44/0-0) LOG: duration: 3062.784 ms statement: COMMIT
2011-11-14 17:47:11 UTC pid:14604 (48/0-0) LOG: duration: 2593.351 ms statement: COMMIT

These slow COMMITs are against tables that receive a large number of
UPDATEs and are growing fairly rapidly.

And slow queries like:

2011-11-14 17:50:20 UTC pid:6519 (16/0-0) LOG: duration: 1694.456 ms statement: SELECT facebook_wall_posts.* FROM facebook_wall_posts WHERE facebook_wall_posts.token = '984c44e75975b224b38197cf8f8fc76a' LIMIT 1

query plan: http://explain.depesz.com/s/wbm

The # of rows in facebook_wall_posts is 5841.

Structure of facebook_wall_posts:

                    Table "public.facebook_wall_posts"
       Column       |            Type             | Modifiers
--------------------+-----------------------------+------------------------------------------------------------------
 id                 | integer                     | not null default nextval('facebook_wall_posts_id_seq'::regclass)
 album_id           | integer                     | not null
 token              | character varying(32)       | not null
 fb_recipient_id    | character varying(64)       | not null
 post_id            | character varying(100)      | not null
 fb_post_created_at | timestamp without time zone |
 data               | text                        |
 created_at         | timestamp without time zone |
 updated_at         | timestamp without time zone |
 fb_post_deleted_at | timestamp without time zone |
Indexes:
    "facebook_wall_posts_pkey" PRIMARY KEY, btree (id)
    "index_facebook_wall_posts_on_token" UNIQUE, btree (token)
    "index_facebook_wall_posts_on_album_id" btree (album_id)

And another slow query:

2011-11-14 17:52:44 UTC pid:14912 (58/0-0) LOG: duration: 979.740 ms statement: SELECT facebook_friends.friend_id FROM facebook_friends WHERE facebook_friends.user_id = 9134671

Query plan: http://explain.depesz.com/s/x1q

The # of rows in facebook_friends is 27075088.

Structure of facebook_friends:

              Table "public.facebook_friends"
   Column   |            Type             | Modifiers
------------+-----------------------------+----------------------------------------------------------------
 id         | integer                     | not null default nextval('facebook_friends_id_seq'::regclass)
 user_id    | integer                     | not null
 friend_id  | integer                     | not null
 created_at | timestamp without time zone |
Indexes:
    "facebook_friends_pkey" PRIMARY KEY, btree (id)
    "index_facebook_friends_on_user_id_and_friend_id" UNIQUE, btree (user_id, friend_id)

We have autovacuum enabled and running, but yesterday I manually ran
vacuum on the database. Autovacuum settings:

autovacuum                      | on
autovacuum_analyze_scale_factor | 0.1
autovacuum_analyze_threshold    | 50
autovacuum_freeze_max_age       | 200000000
autovacuum_max_workers          | 3
autovacuum_naptime              | 60
autovacuum_vacuum_cost_delay    | 20
autovacuum_vacuum_cost_limit    | -1
autovacuum_vacuum_scale_factor  | 0.2
autovacuum_vacuum_threshold     | 50

Other postgresql.conf settings:

shared_buffers = 3584MB
wal_buffers = 16MB
checkpoint_segments = 32
max_wal_senders = 10
checkpoint_completion_target = 0.9
wal_keep_segments = 1024
maintenance_work_mem = 256MB
work_mem = 88MB
effective_cache_size = 10GB

The PGDATA dir is a RAID10 on 4 local (ephemeral in EC2 speak) drives.
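Going back to the slow SELECTs for a moment: 9.1's EXPLAIN can report
buffer hits vs. disk reads per plan node, which shows whether the time
is going to I/O. A sketch (not necessarily how the depesz plans above
were generated):

$ psql -c "EXPLAIN (ANALYZE, BUFFERS)
           SELECT facebook_wall_posts.*
           FROM facebook_wall_posts
           WHERE facebook_wall_posts.token = '984c44e75975b224b38197cf8f8fc76a'
           LIMIT 1"

If "Buffers: shared read=..." dominates "shared hit=...", the query is
going to disk rather than being served from cache.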
I ran some dd tests and received the following output:

--- WRITING ---
root@sql03:/data# time sh -c "dd if=/dev/zero of=/data/tmp/bigfile bs=8k count=4000000 && sync"
4000000+0 records in
4000000+0 records out
32768000000 bytes (33 GB) copied, 670.663 s, 48.9 MB/s

real    11m52.199s
user    0m2.720s
sys     0m45.330s

--- READING ---
root@sql03:/data# time dd of=/dev/zero if=/data/tmp/bigfile bs=8k
4000000+0 records in
4000000+0 records out
32768000000 bytes (33 GB) copied, 155.429 s, 211 MB/s

real    2m35.434s
user    0m2.400s
sys     0m33.160s

I have enabled log_checkpoints and here is a recent sample from the log:

2011-11-14 17:38:48 UTC pid:3965 (-0) LOG: checkpoint complete: wrote 15121 buffers (3.3%); 0 transaction log file(s) added, 0 removed, 8 recycled; write=270.101 s, sync=2.989 s, total=273.112 s; sync files=60, longest=1.484 s, average=0.049 s
2011-11-14 17:39:15 UTC pid:3965 (-0) LOG: checkpoint starting: time
2011-11-14 17:43:49 UTC pid:3965 (-0) LOG: checkpoint complete: wrote 16462 buffers (3.6%); 0 transaction log file(s) added, 0 removed, 9 recycled; write=269.978 s, sync=4.106 s, total=274.117 s; sync files=82, longest=2.943 s, average=0.050 s
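(A caveat on the dd numbers above: without forcing the data out, the
page cache can flatter the write figure, which is why the write test
appends a sync. GNU dd can instead fold the flush into its own timing -
a suggested variant, same file and sizes as above:

$ dd if=/dev/zero of=/data/tmp/bigfile bs=8k count=4000000 conv=fdatasync

conv=fdatasync makes dd call fdatasync() on the file before printing
the rate, so the reported MB/s reflects data actually on disk.)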
Re: [PERFORM] Slow queries / commits, mis-configuration or hardware issues?
Thanks for your response. Please see below for answers to your questions.

On Mon, Nov 14, 2011 at 11:22 AM, Tomas Vondra t...@fuzzy.cz wrote:
> On 14 Listopad 2011, 19:16, Cody Caughlan wrote:
>> shared_buffers = 3584MB
>> wal_buffers = 16MB
>> checkpoint_segments = 32
>> max_wal_senders = 10
>> checkpoint_completion_target = 0.9
>> wal_keep_segments = 1024
>> maintenance_work_mem = 256MB
>> work_mem = 88MB
>> effective_cache_size = 10GB
>
> Seems reasonable, although I'd bump up the checkpoint_timeout (the 5m
> is usually too low).

Ok, will do.

>> The PGDATA dir is a RAID10 on 4 local (ephemeral in EC2 speak)
>> drives. I ran some dd tests and received the following output:
>>
>> --- WRITING ---
>> root@sql03:/data# time sh -c "dd if=/dev/zero of=/data/tmp/bigfile bs=8k count=4000000 && sync"
>> 4000000+0 records in
>> 4000000+0 records out
>> 32768000000 bytes (33 GB) copied, 670.663 s, 48.9 MB/s
>>
>> real    11m52.199s
>> user    0m2.720s
>> sys     0m45.330s
>
> This measures sequential write performance (and the same holds for the
> read test). We need to know the random I/O performance too - use
> bonnie++ or a similar tool.
>
> Based on the AWS benchmarks I've seen so far, I'd expect about 90 MB/s
> for sequential reads/writes, and about twice that for a 4-drive
> RAID10. So while the reads (211 MB/s) seem perfectly OK, the writes
> (50 MB/s) are rather slow. Have you measured this on an idle system,
> or while the db was running?

I ran bonnie++ on a slave node, doing active streaming replication but
otherwise idle:
http://batch-files-test.s3.amazonaws.com/sql03.prod.html

bonnie++ on the master node:
http://batch-files-test.s3.amazonaws.com/sql01.prod.html

If I am reading this right (this is my first time using it), the
numbers don't look too good.

> See for example these:
> [1] http://victortrac.com/EC2_Ephemeral_Disks_vs_EBS_Volumes
> [2] http://www.gabrielweinberg.com/blog/2011/05/raid0-ephemeral-storage-on-aws-ec2.html

>> I have enabled log_checkpoints and here is a recent sample from the
>> log:
>>
>> 2011-11-14 17:39:15 UTC pid:3965 (-0) LOG: checkpoint starting: time
>> 2011-11-14 17:43:49 UTC pid:3965 (-0) LOG: checkpoint complete: wrote 16462 buffers (3.6%); 0 transaction log file(s) added, 0 removed, 9 recycled; write=269.978 s, sync=4.106 s, total=274.117 s; sync files=82, longest=2.943 s, average=0.050 s
>
> Nothing special here - this just says that the checkpoints were timed
> and finished on time (the default checkpoint timeout is 5 minutes;
> with completion target 0.9 the expected checkpoint write time is about
> 270s). Not a checkpoint issue, probably.

>> I've been collecting random samples from pg_stat_bgwriter:
>> https://gist.github.com/4faec2ca9a79ede281e1
>
> It's a bit difficult to interpret this (collect the data at regular
> intervals - e.g. every hour - and post the differences, please), but
> it seems reasonable.

Ok, I have a cron running every hour to grab this data. I will post
back in a few hours or tomorrow.

>> So given all this information (if you need more just let me know), is
>> there something fundamentally wrong or mis-configured? Do I have an
>> I/O issue?
>
> Probably - the discrepancy between read/write performance is a bit
> suspicious. Try to watch the I/O performance when this happens, i.e.
> run "iostat -x" and watch the output (especially %util, r_await,
> w_await) and post several lines of the output.

Here's a gist of running "iostat -x 3" for a few minutes:
https://gist.github.com/f94d98f2ef498a522ac2

Indeed, the %iowait and await values can spike up drastically.
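The hourly collection is just a timestamped snapshot appended to a
file, roughly like this (a sketch; the cron file and output path are
made up):

# /etc/cron.d/pg-bgwriter-stats (hypothetical location)
0 * * * * postgres psql -At -c "SELECT now(), * FROM pg_stat_bgwriter" >> /var/lib/postgresql/bgwriter_stats.log

Diffing consecutive rows then gives per-hour deltas for
buffers_checkpoint, buffers_clean and buffers_backend.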
Re: [PERFORM] Slow queries / commits, mis-configuration or hardware issues?
On Mon, Nov 14, 2011 at 2:57 PM, Tomas Vondra t...@fuzzy.cz wrote:
> On 14 Listopad 2011, 22:58, Cody Caughlan wrote:
>>> Seems reasonable, although I'd bump up the checkpoint_timeout (the
>>> 5m is usually too low).
>>
>> Ok, will do.
>
> Yes, but find out what that means and think about the possible impact
> first. It usually improves checkpoint behaviour but increases recovery
> time, and you may need more checkpoint segments. And I doubt this will
> fix the issue you've described.

Ok, I understand the ramifications of increasing the checkpoint
timeout. I will investigate more before I jump in.

>> I ran bonnie++ on a slave node, doing active streaming replication
>> but otherwise idle:
>> http://batch-files-test.s3.amazonaws.com/sql03.prod.html
>>
>> bonnie++ on the master node:
>> http://batch-files-test.s3.amazonaws.com/sql01.prod.html
>>
>> If I am reading this right (this is my first time using it), the
>> numbers don't look too good.
>
> Are those instances equal, i.e. do they use the same RAID10 config
> etc.? It surprises me a bit that the slave performs much better than
> the master; for example the sequential reads are much faster (210MB/s
> vs. 60MB/s) and it handles about twice the number of seeks (345 vs.
> 170). But this may be skewed by the workload.

Yes, these two machines are the same. sql01 is the master node and is
quite busy; running bonnie++ on it during its normal workload spiked
I/O for the duration. sql03 is a pure slave and is quite idle, save
for receiving WAL segments.

>> Here's a gist of running "iostat -x 3" for a few minutes:
>> https://gist.github.com/f94d98f2ef498a522ac2
>>
>> Indeed, the %iowait and await values can spike up drastically.
>
> OK, so xvdb-xvde are individual drives and dm-0 is the RAID10 device,
> right?
>
> According to the log_checkpoint info, you're writing about 15000
> buffers (120MB) in 270s, i.e. about 440kB/s. But according to the
> iostat you're writing up to 4MB/s, so it's not just about the
> checkpoints. What else is going on there? How much WAL do you write?

Yes, dm-0 is the RAID10 device. The WAL config is:

wal_buffers = 16MB
checkpoint_segments = 32
max_wal_senders = 10
checkpoint_completion_target = 0.9
checkpoint_timeout = 300
wal_keep_segments = 1024

> Do you have iotop installed? That might give you a hint what processes
> are writing data etc.

I do have iotop and have been watching it. The only I/O users are
postgres and its backends; I don't see anything else consuming any
I/O. Eyeballing iotop, the big consumers of disk writes are backends
in "idle in transaction", SELECT and COMMIT states. The first two are
what I would think would be largely read operations (certainly the
SELECT), so it's not clear why a SELECT consumes write time.

Here is the output of some pg_stat_bgwriter stats from the last couple
of hours: https://gist.github.com/41ee26caca01471a9b77

One thing that I might not have made very clear earlier is that this
DB, especially a single table, receives a very large number of
UPDATEs. However, it seems to be well cached; I have shared_buffers =
3584MB and a view of pg_buffercache shows:
https://gist.github.com/53c520571290cae14613

Is it possible that we're just trying to handle too many UPDATEs, and
they are all trying to hit disk all at once - causing this I/O
contention? Here is a view of pg_stat_user_tables that shows the
amount of live/dead tuples: https://gist.github.com/5ac1ae7d11facd72913f
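In case it helps anyone following along, the cached-contents and
dead-tuple views in those gists can be pulled with queries along these
lines (a sketch based on the documented pg_buffercache example, not
necessarily what produced the gists; the extension must be installed):

$ psql -c "SELECT c.relname, count(*) AS buffers
             FROM pg_buffercache b
             JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
            WHERE b.reldatabase IN (0, (SELECT oid FROM pg_database
                                        WHERE datname = current_database()))
            GROUP BY c.relname ORDER BY buffers DESC LIMIT 10"

$ psql -c "SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
             FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 10"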
[PERFORM] Large number of short lived connections - could a connection pool help?
We have anywhere from 60-80 background worker processes connecting to
Postgres, performing a short task and then disconnecting. The lifetime
of these tasks averages 1-3 seconds.

I know that there is some connection overhead to Postgres, but I don't
know what would be the best way to measure this overhead and/or to
determine if it's currently an issue at all.

If there is substantial overhead, I would think that employing a
connection pool like pgbouncer to keep a static set of connections and
dole them out to the transient workers on demand would make sense. The
overall cumulative number of connections wouldn't change; I would just
be alleviating the rapid setup/teardown of them.

Is this something that I should look into or is it not much of an
issue? What's the best way to determine if I could benefit from using
a connection pool?

Thanks.
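One concrete way to put a number on the setup/teardown cost is
pgbench's -C flag, which opens a fresh connection for every
transaction; comparing the two runs below isolates the connection
overhead (a sketch - "mydb" and the client count are placeholders):

$ pgbench -S -c 8 -T 30 mydb       # select-only, connections reused
$ pgbench -S -c 8 -T 30 -C mydb    # same load, reconnect per transaction

The drop in tps between the two runs is roughly what a pooler would
win back.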
Re: [PERFORM] Large number of short lived connections - could a connection pool help?
On Mon, Nov 14, 2011 at 4:59 PM, Ben Chobot be...@silentmedia.com wrote:
> On Nov 14, 2011, at 4:42 PM, Cody Caughlan wrote:
>> We have anywhere from 60-80 background worker processes connecting to
>> Postgres, performing a short task and then disconnecting. The
>> lifetime of these tasks averages 1-3 seconds.
>
> [snip]
>
>> Is this something that I should look into or is it not much of an
>> issue? What's the best way to determine if I could benefit from
>> using a connection pool?
>
> Yes, this is precisely the kind of situation a connection pooler will
> help with - not only with the connection setup/teardown overhead, but
> also by using the resources on your server better. You probably don't
> actually have 60-80 cores on your server, so reducing that number down
> to just the few that are actually working lets Postgres finish each
> task faster and move on to the others. Basically, the queueing happens
> off the Postgres server, letting Postgres use the box with fewer
> interruptions.
>
> Now, is it a problem to not use a pooler? That depends on whether it's
> causing you grief or not. But if you think you'll get more connection
> churn or larger numbers of workers, then a connection pooler will only
> help more.

Thanks for your input. It's not causing me grief per se; the load on
the pg machine is small. I guess I am just wondering if I am being
stupid and leaving resources and/or performance on the table. But it
sounds like, on the whole, this is a good use case for pgbouncer and in
the long run it will prove beneficial. But no, it's not obviously
killing me right now.
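For the archives, a minimal pgbouncer config for this pattern might
look something like the following (a sketch - the database name, paths
and pool sizes are placeholders to be tuned):

; pgbouncer.ini (illustrative values only)
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; transaction pooling suits short-lived workers that keep no session
; state (temp tables, session GUCs, prepared statements, etc.)
pool_mode = transaction
max_client_conn = 200
default_pool_size = 16

Workers would then point at port 6432 instead of 5432, and pgbouncer
holds the 16 server connections open across the 60-80 transient
clients.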