Re: [PERFORM] Slow queries / commits, mis-configuration or hardware issues?

2011-11-16 Thread Tomas Vondra
On 16 November 2011, 5:27, Greg Smith wrote:
 On 11/14/2011 01:16 PM, Cody Caughlan wrote:
 We're starting to see some slow queries, especially COMMITs that are
 happening more frequently. The slow queries are against seemingly
 well-indexed tables.
 Slow commits like:

 2011-11-14 17:47:11 UTC pid:14366 (44/0-0) LOG:  duration: 3062.784 ms
   statement: COMMIT
 2011-11-14 17:47:11 UTC pid:14604 (48/0-0) LOG:  duration: 2593.351 ms
   statement: COMMIT

 These slow COMMITs are against tables that received a large number of
 UPDATEs and are growing fairly rapidly.

 Linux will fill its write cache with all of the writes coming out of
 each checkpoint.  With a 16GB instance, I would expect that 5% * 16GB ~=
 800MB of writes are batched up when your system is slow.  You should be
 able to confirm that by looking at the Dirty: line in /proc/meminfo

 With 800MB queued up and I/O that is lucky to get 50MB/s, the sync calls
 at the end of each checkpoint are sometimes blocking for multiple seconds:

 2011-11-14 17:38:48 UTC pid:3965 (-0) LOG:  checkpoint complete: wrote
 15121 buffers (3.3%); 0 transaction log file(s) added, 0 removed, 8
 recycled; write=270.101 s, sync=2.989 s, total=273.112 s; sync
 files=60, longest=1.484 s, average=0.049 s
 2011-11-14 17:39:15 UTC pid:3965 (-0) LOG:  checkpoint starting: time
 2011-11-14 17:43:49 UTC pid:3965 (-0) LOG:  checkpoint complete: wrote
 16462 buffers (3.6%); 0 transaction log file(s) added, 0 removed, 9
 recycled; write=269.978 s, sync=4.106 s, total=274.117 s; sync
 files=82, longest=2.943 s, average=0.050 s


 When an individual sync call gets stuck for that long, clients can
 easily get stuck behind it too.  There are a couple of techniques that
 might help:

The sync times I see there seem quite acceptable - 4.1s is not perfect,
but I wouldn't rate it as terrible. What actually annoys me is the amount
of data written - it's only about 17000 pages, i.e. about 130 MB per
checkpoint (spread over 5 minutes). That's only about 0.5 MB/s.

 -Switch filesystems if you're running a slow one.  ext3 has bad latency
 behavior here, XFS and ext4 are better.

He's using xfs, IIRC. That's one of the better behaving ones, when it
comes to sync.

 -Lower the dirty_* tunables like dirty_background_ratio or its bytes
 version.  This will reduce average throughput, but can lower latency.
 -Spread checkpoints out more so that less average writes are happening.
 -Decrease shared_buffers so less data is getting pushed out at
 checkpoint time.
 -Reduce your reliability expectations and turn off synchronous_commit.

The question here probably is whether those high latencies are caused or
significantly influenced by the checkpoint, or are a feature of the
storage. Because if it's a feature, then all this is a futile attempt to
fix it.

I don't think he has problems with checkpoints - he's complaining about
regular queries being slow (even plain SELECT, i.e. something that usually
does not require a sync).

No doubt this may be connected, but a regular SELECT usually does not
perform a sync, right? It may need to fetch some data, and if the I/O is
saturated by a checkpoint, that may take time. But again - those bonnie
results were collected on a running system, i.e. with checkpoints in
progress and all of that.

And I'd expect most of the SELECT queries to be handled without actually
touching the devices. Connecting the pg_buffercache view
(https://gist.github.com/53c520571290cae14613) with the pg_stat_user_tables
view (https://gist.github.com/5ac1ae7d11facd72913f), it seems that the
largest table (users) is almost completely in shared buffers, while the two
other large tables (external_user and facebook_friends) are cached by about
30%. And I'd expect the rest of those tables to be in the page cache, so
SELECT queries on those tables should be fast.

A commit obviously requires a sync on the WAL - I wonder if moving the WAL
would improve the performance here.
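Moving the WAL elsewhere is usually just a matter of a symlink - a rough
sketch (the paths are made up, and the cluster has to be shut down first):

$ pg_ctl -D /data/pgdata stop
$ mv /data/pgdata/pg_xlog /wal/pg_xlog       # /wal = a separate (e.g. EBS) volume
$ ln -s /wal/pg_xlog /data/pgdata/pg_xlog
$ pg_ctl -D /data/pgdata start

Whether that actually helps depends on how the new device behaves under
fsync, of course.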

This is obviously based on an incomplete set of stats, and maybe I'm
missing something.

 Your server is sometimes showing multi-second latency issues with
 bonnie++ too; that suggests this problem is not even specific to
 PostgreSQL.  Linux is hard to tune for low latency under all

Don't forget those data were collected on a production system, i.e. it was
actually under load. That probably skews the results a lot.

 circumstances; fighting latency down under a heavy update workload is
 hard to do even with good hardware to accelerate write performance.  In
 an EC2 environment, it may not even be possible to do without making
 trade-offs like disabling synchronous writes.  I can easily get
 transactions hung for 10 to 15 seconds on one of their servers if I try
 to make that problem bad; so far you're only seeing the middle range of
 latency issues.

Are you talking about EBS or ephemeral storage? Because all this is about
the ephemeral storage (something like virtualized local storage).

Tomas


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] Slow queries / commits, mis-configuration or hardware issues?

2011-11-16 Thread Tomas Vondra
On 16 November 2011, 2:21, Cody Caughlan wrote:
 How did you build your RAID array? Maybe I have a fundamental flaw /
 misconfiguration. I am doing it via:

 $ yes | mdadm --create /dev/md0 --level=10 -c256 --raid-devices=4
 /dev/xvdb /dev/xvdc /dev/xvdd /dev/xvde
 $ pvcreate /dev/md0
 $ vgcreate lvm-raid10 /dev/md0
 $ lvcreate -l 215021 lvm-raid10 -n lvm0
 $ blockdev --setra 65536 /dev/lvm-raid10/lvm0
 $ mkfs.xfs -f /dev/lvm-raid10/lvm0
 $ mkdir -p /data && mount -t xfs -o noatime /dev/lvm-raid10/lvm0 /data

I'm not using EC2 much, and those were my first attempts with ephemeral
storage, so this may be a stupid question, but why are you building a
RAID-10 array on ephemeral storage anyway?

You already have a standby, so if the primary instance fails you can
easily failover.

What are you going to do in case of a drive failure? With a server this is
rather easy - just put in a new drive and you're done - but can you do
that on EC2? I guess you can't do that when the instance is running, so
you'll have to switch to the standby anyway, right? Have you ever tried
this (how it affects the performance etc.)?

So what additional protection does that give you? Wouldn't a RAID-0 be a
better utilization of the resources?

Tomas


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Slow queries / commits, mis-configuration or hardware issues?

2011-11-16 Thread Cody Caughlan

On Nov 16, 2011, at 8:52 AM, Tomas Vondra wrote:

 On 16 November 2011, 2:21, Cody Caughlan wrote:
 How did you build your RAID array? Maybe I have a fundamental flaw /
 misconfiguration. I am doing it via:
 
 $ yes | mdadm --create /dev/md0 --level=10 -c256 --raid-devices=4
 /dev/xvdb /dev/xvdc /dev/xvdd /dev/xvde
 $ pvcreate /dev/md0
 $ vgcreate lvm-raid10 /dev/md0
 $ lvcreate -l 215021 lvm-raid10 -n lvm0
 $ blockdev --setra 65536 /dev/lvm-raid10/lvm0
 $ mkfs.xfs -f /dev/lvm-raid10/lvm0
 $ mkdir -p /data && mount -t xfs -o noatime /dev/lvm-raid10/lvm0 /data
 
 I'm not using EC2 much, and those were my first attempts with ephemeral
 storage, so this may be a stupid question, but why are you building a
 RAID-10 array on an ephemeral storage, anyway?
 
 You already have a standby, so if the primary instance fails you can
 easily failover.
 

Yes, the slave will become master if master goes down. We have no plan to try 
and resurrect the master in the case of failure, hence the choice of ephemeral 
vs EBS. 

We chose RAID10 over RAID0 to get the best combination of performance and 
a minimized probability of a single drive failure bringing down the house.

So, yes, RAID0 would ultimately deliver the best performance, with more risk.

 What are you going to do in case of a drive failure? With a server this is
 rather easy - just put there a new drive and you're done, but can you do
 that on EC2? I guess you can't do that when the instance is running, so
 you'll have to switch to the standby anyway, right? Have you ever tried
 this (how it affects the performance etc.)?
 

As far as I know one cannot alter the ephemeral drives in a running instance, 
so yes, the whole instance would have to be written off.

 So what additional protection does that give you? Wouldn't a RAID-0 be a
 better utilization of the resources?
 

Too much risk.

 Tomas
 


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Slow queries / commits, mis-configuration or hardware issues?

2011-11-16 Thread Tomas Vondra
On 16 November 2011, 18:31, Cody Caughlan wrote:

 On Nov 16, 2011, at 8:52 AM, Tomas Vondra wrote:

 On 16 November 2011, 2:21, Cody Caughlan wrote:
 How did you build your RAID array? Maybe I have a fundamental flaw /
 misconfiguration. I am doing it via:

 $ yes | mdadm --create /dev/md0 --level=10 -c256 --raid-devices=4
 /dev/xvdb /dev/xvdc /dev/xvdd /dev/xvde
 $ pvcreate /dev/md0
 $ vgcreate lvm-raid10 /dev/md0
 $ lvcreate -l 215021 lvm-raid10 -n lvm0
 $ blockdev --setra 65536 /dev/lvm-raid10/lvm0
 $ mkfs.xfs -f /dev/lvm-raid10/lvm0
 $ mkdir -p /data && mount -t xfs -o noatime /dev/lvm-raid10/lvm0 /data

 I'm not using EC2 much, and those were my first attempts with ephemeral
 storage, so this may be a stupid question, but why are you building a
 RAID-10 array on an ephemeral storage, anyway?

 You already have a standby, so if the primary instance fails you can
 easily failover.


 Yes, the slave will become master if master goes down. We have no plan to
 try and resurrect the master in the case of failure, hence the choice of
 ephemeral vs EBS.

 We chose RAID10 over RAID0 to get the best combination of performance and
 minimizing probability of a single drive failure bringing down the house.

 So, yes, RAID0 would ultimately deliver the best performance, with more
 risk.

 What are you going to do in case of a drive failure? With a server this
 is
 rather easy - just put there a new drive and you're done, but can you do
 that on EC2? I guess you can't do that when the instance is running, so
 you'll have to switch to the standby anyway, right? Have you ever tried
 this (how it affects the performance etc.)?


 As far as I know one cannot alter the ephemeral drives in a running
 instance, so yes, the whole instance would have to be written off.

 So what additional protection does that give you? Wouldn't a RAID-0 be a
 better utilization of the resources?


 Too much risk.

Why? If I understand it correctly, the only case where a RAID-10
actually helps is when an ephemeral drive fails but the instance itself
survives. Do you have any numbers on how often that happens, i.e. how
often a drive fails without taking the instance with it?

But you can't actually replace the failed drive, so the only option you
have is to fail over to the standby - right? Sure - with async replication
you could lose the not-yet-sent transactions. I see two possible
solutions:

a) use sync rep, available in 9.1 (you already run 9.1.1) - see the sketch below

b) place WAL on an EBS, mounted as part of the failover
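For (a), the 9.1 setup is roughly this - the names are made up, the only
requirement is that application_name on the standby matches the entry on
the master:

# master, postgresql.conf
synchronous_standby_names = 'standby1'

# standby, recovery.conf
primary_conninfo = 'host=master.example.com user=replicator application_name=standby1'

followed by a reload on the master. Keep in mind that every commit then
waits for the standby, so it trades commit latency for durability.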

EBS volumes are not exactly fast, but it seems (e.g.
http://www.mysqlperformanceblog.com/2009/08/06/ec2ebs-single-and-raid-volumes-io-bencmark/)
that the sequential performance might be acceptable.

According to the stats you've posted, you've written about 5632 MB of WAL
data per hour. That's about 1.5 MB/s on average, and that might be handled
by the EBS. Yes, if you have a peak where you need to write much more
data, this is going to be a bottleneck.

Tomas


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Slow queries / commits, mis-configuration or hardware issues?

2011-11-15 Thread Tomas Vondra
On 15.11.2011 01:13, Cody Caughlan wrote:
 The first two are what I would think would be largely read operations
 (certainly the SELECT), so it's not clear why a SELECT consumes write
 time.
 
 Here is the output of some pg_stat_bgwriter stats from the last couple of 
 hours:
 
 https://gist.github.com/41ee26caca01471a9b77

Hmm, the difference between 22:00 and 23:00 seems a bit suspicious.
There are 11 checkpoints (timed + requested), i.e. 352 segments, i.e.
about 5632 MB of WAL data. The checkpoints had to write 504,135 buffers
(i.e. about 4 GB of data) and the background writer an additional 10,494
buffers (about 80 MB).

But the backends themselves had to write 173,208 buffers. That's way too
much, I guess, and it's probably one of the reasons why the queries take
so long.

So your primary goal probably should be to write fewer buffers from the
backends. Spread checkpoints are the most effective way to achieve that,
with the background writer next in line.

Try to eliminate the requested checkpoints (increase the number of
checkpoint segments) and eliminate the buffers written by backends.
Increase shared buffers and watch whether that helps.

If that does not help, make the background writer a bit more aggressive:
increase bgwriter_lru_maxpages / decrease bgwriter_delay, and it should
write the buffers a bit more aggressively.
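Just to illustrate the knobs mentioned above - these are starting points
to experiment with, not recommendations:

checkpoint_segments = 64        # fewer 'requested' checkpoints (now 32)
checkpoint_timeout = 10min      # if you decide to stretch the checkpoints
shared_buffers = 5GB            # currently 3584MB
bgwriter_delay = 100ms          # default is 200ms
bgwriter_lru_maxpages = 200     # default is 100

Change one thing at a time and watch pg_stat_bgwriter to see whether
buffers_backend actually goes down.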

But if there was something extraordinary happening between 22:00 and
23:00 (e.g. a backup or a batch job), this may be complete nonsense.

 One thing that I might not have made very clear earlier is that this
 DB - especially a single table - receives a very large number of UPDATEs.
 However, it seems to be well cached; I have shared_buffers = 3584MB
 and a view of pg_buffercache shows:
 https://gist.github.com/53c520571290cae14613

That means almost 100% of the buffers are used. But this is not a very
interesting view - a buffer may be occupied yet rarely accessed. Can you
group the buffers by isdirty, so we can see what portion of the buffers
has been modified (and needs to be written)?
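Something like this should do (assuming the default 8kB block size):

  select isdirty, count(*) as buffers, count(*) * 8 / 1024 as size_mb
    from pg_buffercache
   group by isdirty;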

Much more interesting is the view from the other side - how many
requests were handled from shared buffers. For a database you can do
that like this:

  select datname, 100 * blks_hit / (blks_hit + blks_read + 1)
    from pg_stat_database;

The +1 is there just to avoid division by zero, and you should evaluate
it using two snapshots (the difference between them). And you can compute
the same thing (cache hit ratio) for tables/indexes.
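For tables the same thing can be computed from pg_statio_user_tables,
e.g. something like this (again, use the difference of two snapshots):

  select relname, heap_blks_read, heap_blks_hit,
         100 * heap_blks_hit / (heap_blks_hit + heap_blks_read + 1) as hit_pct
    from pg_statio_user_tables
   order by heap_blks_read desc limit 10;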

The problem here is that there's a separate cache (page cache), and
that's not included here. So a read may be a miss and yet not require
an actual I/O.

 Is it possible that we're just trying to handle too many UPDATEs and
 they are all trying to hit disk all at once - causing this I/O

I'm not sure what you mean by trying to hit disk all at once. The updates
are written to the WAL (which is mostly sequential I/O) and the actual
pages are updated in memory (shared buffers). Those shared buffers are
written only when a checkpoint happens, and that is a single write of the
whole block, not many small writes.

Sure, the clients need to grab a lock on a buffer when modifying it, and
a lock on the WAL, but that wouldn't show up as I/O utilization.

In short - I don't think this is happening here.

What actually matters here is that the dirty buffers are spread across
the drive - that's where the random I/O comes from. And the fact that
the backends need to flush the dirty buffers on their own.

 contention? Here is a view of pg_stat_user_tables that shows the
 amount of live/dead tuples:
 
 https://gist.github.com/5ac1ae7d11facd72913f

What could happen due to UPDATEs is table bloat (the table growing due to
updates), but I don't think that's happening here. The number of dead
tuples is very low - less than 1% in most cases.

For example the largest table (users) has less than 0.5% dead tuples,
and most of the updates are handled by HOT. So I don't think this is an
issue here.

kind regards
Tomas

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Slow queries / commits, mis-configuration or hardware issues?

2011-11-15 Thread Tomas Vondra
On 14.11.2011 22:58, Cody Caughlan wrote:
 I ran bonnie++ on a slave node, doing active streaming replication but
 otherwise idle:
 http://batch-files-test.s3.amazonaws.com/sql03.prod.html
 
 bonnie++ on the master node:
 http://batch-files-test.s3.amazonaws.com/sql01.prod.html
 
 If I am reading this right - this is my first time using it - the
 numbers don't look too good.

I've done some benchmarks on my own (m1.xlarge instance), and the
results are these (http://pastebin.com/T1LXHru0):

single drive

dd writes: 62 MB/s
dd reads: 110 MB/s
bonnie seq. writes: 55 MB/s
bonnie seq. rewrite: 33 MB/s
bonnie seq. reads: 91 MB/s
bonnie seeks: 370/s

raid 0 (4 devices)
-
dd writes: 220 MB/s
dd reads: 380 MB/s
bonnie seq. writes: 130 MB/s
bonnie seq. rewrite: 114 MB/s
bonnie seq. reads: 280 MB/s
bonnie seeks: 570/s

raid 10 (4 devices)
-
dd writes: 90 MB/s
dd reads: 200 MB/s
bonnie seq. writes: 49 MB/s
bonnie seq. rewrite: 56 MB/s
bonnie seq. reads: 160 MB/s
bonnie seeks: 590/s

So the results are rather different from your results (both master and
slave).

What surprises me a bit is the decrease in write performance between a
single drive and RAID 10. I've used bonnie++ 1.03e, so I'm wondering
whether 1.96 would give different results ...

All those benchmarks were performed with ext3.

Tomas

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Slow queries / commits, mis-configuration or hardware issues?

2011-11-15 Thread Cody Caughlan
On Tue, Nov 15, 2011 at 5:16 PM, Tomas Vondra t...@fuzzy.cz wrote:
 On 14.11.2011 22:58, Cody Caughlan wrote:
 I ran bonnie++ on a slave node, doing active streaming replication but
 otherwise idle:
 http://batch-files-test.s3.amazonaws.com/sql03.prod.html

 bonnie++ on the master node:
 http://batch-files-test.s3.amazonaws.com/sql01.prod.html

 If I am reading this right - this is my first time using it - the
 numbers don't look too good.

 I've done some benchmarks on my own (m1.xlarge instance), and the
 results are these (http://pastebin.com/T1LXHru0):

 single drive
 
 dd writes: 62 MB/s
 dd reads: 110 MB/s
 bonnie seq. writes: 55 MB/s
 bonnie seq. rewrite: 33 MB/s
 bonnie seq. reads: 91 MB/s
 bonnie seeks: 370/s

 raid 0 (4 devices)
 -
 dd writes: 220 MB/s
 dd reads: 380 MB/s
 bonnie seq. writes: 130 MB/s
 bonnie seq. rewrite: 114 MB/s
 bonnie seq. reads: 280 MB/s
 bonnie seeks: 570/s

 raid 10 (4 devices)
 -
 dd writes: 90 MB/s
 dd reads: 200 MB/s
 bonnie seq. writes: 49 MB/s
 bonnie seq. rewrite: 56 MB/s
 bonnie seq. reads: 160 MB/s
 bonnie seeks: 590/s


Interesting. I spun up a new m1.xlarge and did the same RAID10 config
(4 drives) except with a chunk size of 512K (instead of 256K) and the
machine was completely idle. Bonnie:

http://batch-files-test.s3.amazonaws.com/idle-512k-chunk.html

It has performance similar to yours, except for worse seeks but a bit
better sequential reads.

The other bonnies I sent over were NOT on idle systems. This one is
the master, which receives a heavy stream of writes and some reads:

http://batch-files-test.s3.amazonaws.com/sql01.prod.html

And this is the slave, which is all writes and no reads:
http://batch-files-test.s3.amazonaws.com/sql03.prod.html

How did you build your RAID array? Maybe I have a fundamental flaw /
misconfiguration. I am doing it via:

$ yes | mdadm --create /dev/md0 --level=10 -c256 --raid-devices=4
/dev/xvdb /dev/xvdc /dev/xvdd /dev/xvde
$ pvcreate /dev/md0
$ vgcreate lvm-raid10 /dev/md0
$ lvcreate -l 215021 lvm-raid10 -n lvm0
$ blockdev --setra 65536 /dev/lvm-raid10/lvm0
$ mkfs.xfs -f /dev/lvm-raid10/lvm0
$ mkdir -p /data && mount -t xfs -o noatime /dev/lvm-raid10/lvm0 /data

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Slow queries / commits, mis-configuration or hardware issues?

2011-11-15 Thread Tomas Vondra
On 16 November 2011, 2:21, Cody Caughlan wrote:
 How did you build your RAID array? Maybe I have a fundamental flaw /
 misconfiguration. I am doing it via:

 $ yes | mdadm --create /dev/md0 --level=10 -c256 --raid-devices=4
 /dev/xvdb /dev/xvdc /dev/xvdd /dev/xvde
 $ pvcreate /dev/md0
 $ vgcreate lvm-raid10 /dev/md0
 $ lvcreate -l 215021 lvm-raid10 -n lvm0
 $ blockdev --setra 65536 /dev/lvm-raid10/lvm0
 $ mkfs.xfs -f /dev/lvm-raid10/lvm0
 $ mkdir -p /data && mount -t xfs -o noatime /dev/lvm-raid10/lvm0 /data

 --

I don't think you have a flaw there. The workload probably skews the
results a bit on the master and slave, so it's difficult to compare them
to results from an idle instance. The amount of data written seems small,
but random I/O can saturate the devices quite easily.

I went with a very simple RAID config - no LVM, the default chunk size
(better seeks, worse sequential performance) and the default read-ahead
(increasing it could give better sequential performance).
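Roughly like this (reusing your device names; I used ext3 for those
benchmarks, so the filesystem differs from your setup):

$ mdadm --create /dev/md0 --level=10 --raid-devices=4 \
    /dev/xvdb /dev/xvdc /dev/xvdd /dev/xvde
$ mkfs.ext3 /dev/md0
$ mount -o noatime /dev/md0 /data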

Tomas



-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Slow queries / commits, mis-configuration or hardware issues?

2011-11-15 Thread Greg Smith

On 11/14/2011 01:16 PM, Cody Caughlan wrote:

We're starting to see some slow queries, especially COMMITs that are
happening more frequently. The slow queries are against seemingly
well-indexed tables.
Slow commits like:

2011-11-14 17:47:11 UTC pid:14366 (44/0-0) LOG:  duration: 3062.784 ms
  statement: COMMIT
2011-11-14 17:47:11 UTC pid:14604 (48/0-0) LOG:  duration: 2593.351 ms
  statement: COMMIT

These slow COMMITs are against tables that received a large number of
UPDATEs and are growing fairly rapidly.
   


Linux will fill its write cache with all of the writes coming out of 
each checkpoint.  With a 16GB instance, I would expect that 5% * 16GB ~= 
800MB of writes are batched up when your system is slow.  You should be 
able to confirm that by looking at the Dirty: line in /proc/meminfo
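For example, this shows how much dirty data is queued up at any given
moment, refreshed every few seconds:

$ watch -n 5 'grep -e Dirty: -e Writeback: /proc/meminfo'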


With 800MB queued up and I/O that is lucky to get 50MB/s, the sync calls 
at the end of each checkpoint are sometimes blocking for multiple seconds:



2011-11-14 17:38:48 UTC pid:3965 (-0) LOG:  checkpoint complete: wrote
15121 buffers (3.3%); 0 transaction log file(s) added, 0 removed, 8
recycled; write=270.101 s, sync=2.989 s, total=273.112 s; sync
files=60, longest=1.484 s, average=0.049 s
2011-11-14 17:39:15 UTC pid:3965 (-0) LOG:  checkpoint starting: time
2011-11-14 17:43:49 UTC pid:3965 (-0) LOG:  checkpoint complete: wrote
16462 buffers (3.6%); 0 transaction log file(s) added, 0 removed, 9
recycled; write=269.978 s, sync=4.106 s, total=274.117 s; sync
files=82, longest=2.943 s, average=0.050 s
   


When an individual sync call gets stuck for that long, clients can 
easily get stuck behind it too.  There are a couple of techniques that 
might help:


-Switch filesystems if you're running a slow one.  ext3 has bad latency 
behavior here, XFS and ext4 are better.
-Lower the dirty_* tunables like dirty_background_ratio or its bytes 
version (see the example after this list).  This will reduce average 
throughput, but can lower latency.

-Spread checkpoints out more so that less average writes are happening.
-Decrease shared_buffers so less data is getting pushed out at 
checkpoint time.

-Reduce your reliability expectations and turn off synchronous_commit.
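For the dirty_* route, the change could look like the following; the
values are only a starting point, and the *_bytes forms need a 2.6.29 or
newer kernel:

$ sysctl -w vm.dirty_background_bytes=67108864   # start background writeback at 64MB
$ sysctl -w vm.dirty_bytes=536870912             # hard limit at 512MB

(put the same settings into /etc/sysctl.conf so they survive a reboot)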

Your server is sometimes showing multi-second latency issues with 
bonnie++ too; that suggests this problem is not even specific to 
PostgreSQL.  Linux is hard to tune for low latency under all 
circumstances; fighting latency down under a heavy update workload is 
hard to do even with good hardware to accelerate write performance.  In 
an EC2 environment, it may not even be possible to do without making 
trade-offs like disabling synchronous writes.  I can easily get 
transactions hung for 10 to 15 seconds on one of their servers if I try 
to make that problem bad; so far you're only seeing the middle range of 
latency issues.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] Slow queries / commits, mis-configuration or hardware issues?

2011-11-14 Thread Cody Caughlan
Hi, running Postgres 9.1.1 on an EC2 m1.xlarge instance. Machine is a
dedicated master with 2 streaming replication nodes.

The machine has 16GB of RAM and 4 cores.

We're starting to see some slow queries, especially COMMITs that are
happening more frequently. The slow queries are against seemingly
well-indexed tables.

I have log_min_duration_statement = 150ms

Slow commits like:

2011-11-14 17:47:11 UTC pid:14366 (44/0-0) LOG:  duration: 3062.784 ms
 statement: COMMIT
2011-11-14 17:47:11 UTC pid:14604 (48/0-0) LOG:  duration: 2593.351 ms
 statement: COMMIT

These slow COMMITs are against tables that received a large number of
UPDATEs and are growing fairly rapidly.

And slow queries like:

2011-11-14 17:50:20 UTC pid:6519 (16/0-0) LOG:  duration: 1694.456 ms
statement: SELECT  facebook_wall_posts.* FROM facebook_wall_posts
WHERE facebook_wall_posts.token =
'984c44e75975b224b38197cf8f8fc76a' LIMIT 1

query plan: http://explain.depesz.com/s/wbm
The # of rows in facebook_wall_posts is 5841

Structure of facebook_wall_posts:
              Table "public.facebook_wall_posts"
       Column       |            Type             |                            Modifiers
--------------------+-----------------------------+-------------------------------------------------------------------
 id                 | integer                     | not null default nextval('facebook_wall_posts_id_seq'::regclass)
 album_id           | integer                     | not null
 token              | character varying(32)       | not null
 fb_recipient_id    | character varying(64)       | not null
 post_id            | character varying(100)      | not null
 fb_post_created_at | timestamp without time zone |
 data               | text                        |
 created_at         | timestamp without time zone |
 updated_at         | timestamp without time zone |
 fb_post_deleted_at | timestamp without time zone |
Indexes:
    "facebook_wall_posts_pkey" PRIMARY KEY, btree (id)
    "index_facebook_wall_posts_on_token" UNIQUE, btree (token)
    "index_facebook_wall_posts_on_album_id" btree (album_id)


And another slow query:

2011-11-14 17:52:44 UTC pid:14912 (58/0-0) LOG:  duration: 979.740 ms
statement: SELECT facebook_friends.friend_id FROM
facebook_friends  WHERE facebook_friends.user_id = 9134671

Query plan: http://explain.depesz.com/s/x1q
# of rows in facebook_friends is 27075088

Structure of facebook_friends:
            Table "public.facebook_friends"
   Column   |            Type             |                           Modifiers
------------+-----------------------------+----------------------------------------------------------------
 id         | integer                     | not null default nextval('facebook_friends_id_seq'::regclass)
 user_id    | integer                     | not null
 friend_id  | integer                     | not null
 created_at | timestamp without time zone |
Indexes:
    "facebook_friends_pkey" PRIMARY KEY, btree (id)
    "index_facebook_friends_on_user_id_and_friend_id" UNIQUE, btree (user_id, friend_id)

We have auto-vacuum enabled and running. But yesterday I manually ran
vacuum on the database. Autovacuum settings:

autovacuum  | on
autovacuum_analyze_scale_factor | 0.1
autovacuum_analyze_threshold| 50
autovacuum_freeze_max_age   | 2
autovacuum_max_workers  | 3
autovacuum_naptime  | 60
autovacuum_vacuum_cost_delay| 20
autovacuum_vacuum_cost_limit| -1
autovacuum_vacuum_scale_factor  | 0.2
autovacuum_vacuum_threshold | 50

other postgresql.conf settings:

shared_buffers = 3584MB
wal_buffers = 16MB
checkpoint_segments = 32
max_wal_senders = 10
checkpoint_completion_target = 0.9
wal_keep_segments = 1024
maintenance_work_mem = 256MB
work_mem = 88MB
shared_buffers = 3584MB
effective_cache_size = 10GB

The PGDATA dir is a RAID10 on 4 local (ephemeral in EC2 speak)
drives. I ran some dd tests and received the following output:

--- WRITING ---
root@sql03:/data# time sh -c "dd if=/dev/zero of=/data/tmp/bigfile
bs=8k count=4000000 && sync"
4000000+0 records in
4000000+0 records out
32768000000 bytes (33 GB) copied, 670.663 s, 48.9 MB/s

real11m52.199s
user0m2.720s
sys 0m45.330s


--- READING ---
root@sql03:/data# time dd of=/dev/zero if=/data/tmp/bigfile bs=8k
4000000+0 records in
4000000+0 records out
32768000000 bytes (33 GB) copied, 155.429 s, 211 MB/s

real2m35.434s
user0m2.400s
sys 0m33.160s


I have enabled log_checkpoints and here is a recent sample from the log:

2011-11-14 17:38:48 UTC pid:3965 (-0) LOG:  checkpoint complete: wrote
15121 buffers (3.3%); 0 transaction log file(s) added, 0 removed, 8
recycled; write=270.101 s, sync=2.989 s, total=273.112 s; sync
files=60, longest=1.484 s, average=0.049 s
2011-11-14 17:39:15 UTC pid:3965 (-0) LOG:  checkpoint starting: time
2011-11-14 17:43:49 UTC pid:3965 (-0) LOG:  checkpoint complete: wrote
16462 buffers (3.6%); 0 transaction log file(s) added, 0 removed, 9
recycled; write=269.978 s, sync=4.106 s, total=274.117 s; sync
files=82, longest=2.943 s, average=0.050 s

I've been collecting random samples from pg_stat_bgwriter:
https://gist.github.com/4faec2ca9a79ede281e1

So given all this information (if you need more just let me know), is
there something fundamentally wrong or mis-configured? Do I have an
I/O issue?

Re: [PERFORM] Slow queries / commits, mis-configuration or hardware issues?

2011-11-14 Thread Tomas Vondra
On 14 November 2011, 19:16, Cody Caughlan wrote:
 shared_buffers = 3584MB
 wal_buffers = 16MB
 checkpoint_segments = 32
 max_wal_senders = 10
 checkpoint_completion_target = 0.9
 wal_keep_segments = 1024
 maintenance_work_mem = 256MB
 work_mem = 88MB
 shared_buffers = 3584MB
 effective_cache_size = 10GB

Seems reasonable, although I'd bump up the checkpoint_timeout (the 5m is
usually too low).

 The PGDATA dir is a RAID10 on 4 local (ephemeral in EC2 speak)
 drives. I ran some dd tests and received the following output:

 --- WRITING ---
 root@sql03:/data# time sh -c "dd if=/dev/zero of=/data/tmp/bigfile
 bs=8k count=4000000 && sync"
 4000000+0 records in
 4000000+0 records out
 32768000000 bytes (33 GB) copied, 670.663 s, 48.9 MB/s

 real  11m52.199s
 user  0m2.720s
 sys   0m45.330s

This measures sequential write performance (and the same holds for the
read test). We need to know the random I/O performance too - use bonnie++
or a similar tool.
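For example something like this (the scratch directory is just an
example; the file size should be at least twice the RAM so the page cache
can't hide the results):

$ bonnie++ -d /data/tmp -s 32768 -n 0 -u postgres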

Based on the AWS benchmarks I've seen so far, I'd expect about 90 MB/s for
sequential read/writes, and about twice that performance for a 4-drive
RAID10. So while the reads (211 MB/s) seem perfectly OK, the writes
(50MB/s) are rather slow. Have you measured this on an idle system, or
when the db was running?

See for example this:

[1] http://victortrac.com/EC2_Ephemeral_Disks_vs_EBS_Volumes
[2]
http://www.gabrielweinberg.com/blog/2011/05/raid0-ephemeral-storage-on-aws-ec2.html

 I have enabled log_checkpoints and here is a recent sample from the log:
 2011-11-14 17:39:15 UTC pid:3965 (-0) LOG:  checkpoint starting: time
 2011-11-14 17:43:49 UTC pid:3965 (-0) LOG:  checkpoint complete: wrote
 16462 buffers (3.6%); 0 transaction log file(s) added, 0 removed, 9
 recycled; write=269.978 s, sync=4.106 s, total=274.117 s; sync
 files=82, longest=2.943 s, average=0.050 s

Nothing special here - this just says that the checkpoints were timed and
finished on time (the default checkpoint timeout is 5 minutes, with
completion target 0.9 the expected checkpoint time is about 270s). Not a
checkpoint issue, probably.

 I've been collecting random samples from pg_stat_bgwriter:
 https://gist.github.com/4faec2ca9a79ede281e1

It's a bit difficult to interpret this (please collect the data at
regular intervals - e.g. every hour - and post the differences), but it
seems reasonable.
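A trivial cron entry is enough for that - something like this (path and
connection options are just an example):

0 * * * *  psql -At -c 'select now(), * from pg_stat_bgwriter' >> /var/tmp/bgwriter.log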

 So given all this information (if you need more just let me know), is
 there something fundamentally wrong or mis-configured? Do I have an
 I/O issue?

Probably - the discrepancy between read/write performance is a bit
suspicious.

Try to watch the I/O performance when this happens, i.e. run iostat -x
and watch the output (especially %util, r_await, w_await) and post several
lines of the output.
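For example:

$ iostat -dxm 5

and watch both the md/dm device and the individual xvd* drives.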

Tomas


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Slow queries / commits, mis-configuration or hardware issues?

2011-11-14 Thread Cody Caughlan
Thanks for your response. Please see below for answers to your questions.

On Mon, Nov 14, 2011 at 11:22 AM, Tomas Vondra t...@fuzzy.cz wrote:
 On 14 November 2011, 19:16, Cody Caughlan wrote:
 shared_buffers = 3584MB
 wal_buffers = 16MB
 checkpoint_segments = 32
 max_wal_senders = 10
 checkpoint_completion_target = 0.9
 wal_keep_segments = 1024
 maintenance_work_mem = 256MB
 work_mem = 88MB
 shared_buffers = 3584MB
 effective_cache_size = 10GB

 Seems reasonable, although I'd bump up the checkpoint_timeout (the 5m is
 usually too low).

Ok, will do.


 The PGDATA dir is a RAID10 on 4 local (ephemeral in EC2 speak)
 drives. I ran some dd tests and received the following output:

 --- WRITING ---
 root@sql03:/data# time sh -c "dd if=/dev/zero of=/data/tmp/bigfile
 bs=8k count=4000000 && sync"
 4000000+0 records in
 4000000+0 records out
 32768000000 bytes (33 GB) copied, 670.663 s, 48.9 MB/s

 real  11m52.199s
 user  0m2.720s
 sys   0m45.330s

 This measures sequential write performance (and the same holds for the
 read test). We need to know the random I/O performance too - use bonnie++
 or similar tool.

 Based on the AWS benchmarks I've seen so far, I'd expect about 90 MB/s for
 sequential read/writes, and about twice that performance for a 4-drive
 RAID10. So while the reads (211 MB/s) seem perfectly OK, the writes
 (50MB/s) are rather slow. Have you measured this on an idle system, or
 when the db was running?


I ran bonnie++ on a slave node, doing active streaming replication but
otherwise idle:
http://batch-files-test.s3.amazonaws.com/sql03.prod.html

bonnie++ on the master node:
http://batch-files-test.s3.amazonaws.com/sql01.prod.html

If I am reading this right - this is my first time using it - the
numbers don't look too good.

 See for example this:

 [1] http://victortrac.com/EC2_Ephemeral_Disks_vs_EBS_Volumes
 [2]
 http://www.gabrielweinberg.com/blog/2011/05/raid0-ephemeral-storage-on-aws-ec2.html

 I have enabled log_checkpoints and here is a recent sample from the log:
 2011-11-14 17:39:15 UTC pid:3965 (-0) LOG:  checkpoint starting: time
 2011-11-14 17:43:49 UTC pid:3965 (-0) LOG:  checkpoint complete: wrote
 16462 buffers (3.6%); 0 transaction log file(s) added, 0 removed, 9
 recycled; write=269.978 s, sync=4.106 s, total=274.117 s; sync
 files=82, longest=2.943 s, average=0.050 s

 Nothing special here - this just says that the checkpoints were timed and
 finished on time (the default checkpoint timeout is 5 minutes, with
 completion target 0.9 the expected checkpoint time is about 270s). Not a
 checkpoint issue, probably.

 I've been collecting random samples from pg_stat_bgwriter:
 https://gist.github.com/4faec2ca9a79ede281e1

 It's a bit difficult to interpret this (please collect the data at
 regular intervals - e.g. every hour - and post the differences), but it
 seems reasonable.

Ok, I have a cron running every hour to grab this data. I will post
back in a few hours or tomorrow.


 So given all this information (if you need more just let me know), is
 there something fundamentally wrong or mis-configured? Do I have an
 I/O issue?

 Probably - the discrepancy between read/write performance is a bit
 suspicious.

 Try to watch the I/O performance when this happens, i.e. run iostat -x
 and watch the output (especially %util, r_await, w_await) and post several
 lines of the output.


Here's a gist of running iostat -x 3 for a few minutes:

https://gist.github.com/f94d98f2ef498a522ac2

Indeed, the %iowait and await values can spike up drastically.

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Slow queries / commits, mis-configuration or hardware issues?

2011-11-14 Thread Tomas Vondra
On 14 November 2011, 22:58, Cody Caughlan wrote:
 Seems reasonable, although I'd bump up the checkpoint_timeout (the 5m is
 usually too low).

 Ok, will do.

Yes, but find out what that means and think about the possible impact
first. It usually improves the checkpoint behaviour but increases the
recovery time and you may need more checkpoint segments. And I doubt this
will fix the issue you've described.

 I ran bonnie++ on a slave node, doing active streaming replication but
 otherwise idle:
 http://batch-files-test.s3.amazonaws.com/sql03.prod.html

 bonnie++ on the master node:
 http://batch-files-test.s3.amazonaws.com/sql01.prod.html

 If I am reading this right - this is my first time using it - the
 numbers don't look too good.

Are those instances equal, i.e. do they use the same RAID10 config etc.?
It surprises me a bit that the slave performs much better than the master -
for example, the sequential reads are much faster (210MB/s vs. 60MB/s) and
it handles about twice the number of seeks (345 vs. 170). But this may be
skewed by the workload.

 Here's a gist of running iostat -x 3 for a few minutes:

 https://gist.github.com/f94d98f2ef498a522ac2

 Indeed, the %iowait and await values can spike up drastically.

OK, so xvdb-xvde are individual drives and dm-0 is the RAID10 device,
right? According to the log_checkpoints info, you're writing about 15,000
buffers (120MB) in 270s, i.e. about 440kB/s. But according to iostat
you're writing up to 4MB/s, so it's not just about the checkpoints.

What else is going on there? How much WAL do you write?

Do you have iotop installed? That might give you a hint what processes are
writing data etc.
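For example, this shows only the processes that are actually doing I/O,
with accumulated totals, which makes the culprits easier to spot:

$ iotop -o -P -a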

I'm a bit confused that the w/s don't add up while r/s do.

Tomas


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Slow queries / commits, mis-configuration or hardware issues?

2011-11-14 Thread Cody Caughlan
On Mon, Nov 14, 2011 at 2:57 PM, Tomas Vondra t...@fuzzy.cz wrote:
 On 14 November 2011, 22:58, Cody Caughlan wrote:
 Seems reasonable, although I'd bump up the checkpoint_timeout (the 5m is
 usually too low).

 Ok, will do.

 Yes, but find out what that means and think about the possible impact
 first. It usually improves the checkpoint behaviour but increases the
 recovery time and you may need more checkpoint segments. And I doubt this
 will fix the issue you've described.


Ok, I understand the ramifications of increasing the checkpoint
timeout. But I will investigate more before I jump in.


 I ran bonnie++ on a slave node, doing active streaming replication but
 otherwise idle:
 http://batch-files-test.s3.amazonaws.com/sql03.prod.html

 bonnie++ on the master node:
 http://batch-files-test.s3.amazonaws.com/sql01.prod.html

 If I am reading this right - this is my first time using it - the
 numbers don't look too good.

 Are those instances equal, i.e. use the same RAID10 config etc.? It
 surprises me a bit that the slave performs much better than the master,
 for example the sequential reads are much faster (210MB/s vs. 60MB/s) and
 it handles about twice the number of seeks (345 vs. 170). But this may be
 skewed because of the workload.

Yes, these two machines are the same. sql01 is the master node and is
quite busy. Running bonnie++ on it during its normal workload spiked
I/O for the duration. sql03 is a pure slave and is quite idle, save
for receiving WAL segments.


 Here's a gist of running iostat -x 3 for a few minutes:

 https://gist.github.com/f94d98f2ef498a522ac2

 Indeed, the %iowait and await values can spike up drastically.

 OK, so xvdb-xvde are individual drives and dm-0 is the RAID10 device,
 right? According to the log_checkpoint info, you're writing about 15000
 (120MB) buffers in 270s, i.e. about 440kB/s. But according to the iostat
 you're writing up to 4MB/s, so it's not just about the checkpoints.

 What else is going on there? How much WAL do you write?

Yes, dm-0 is the RAID10 device. The WAL config is:

wal_buffers = 16MB
checkpoint_segments = 32
max_wal_senders = 10
checkpoint_completion_target = 0.9
checkpoint_timeout = 300
wal_keep_segments = 1024


 Do you have iotop installed? That might give you a hint what processes are
 writing data etc.

I do have iotop and have been watching it. The only I/O users are
postgres and its backends. I don't see anything else consuming any I/O.
By eyeballing iotop, the big consumers of disk writes are:

idle in transaction, SELECT, COMMIT

The first two are what I would think would be largely read operations
(certainly the SELECT), so it's not clear why a SELECT consumes write
time.

Here is the output of some pg_stat_bgwriter stats from the last couple of hours:

https://gist.github.com/41ee26caca01471a9b77

One thing that I might not have made very clear earlier is that this
DB - especially a single table - receives a very large number of UPDATEs.
However, it seems to be well cached; I have shared_buffers = 3584MB
and a view of pg_buffercache shows:
https://gist.github.com/53c520571290cae14613

Is it possible that we're just trying to handle too many UPDATEs and
they are all trying to hit disk all at once - causing this I/O
contention? Here is a view of pg_stat_user_tables that shows the
amount of live/dead tuples:

https://gist.github.com/5ac1ae7d11facd72913f

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance