Re: [PERFORM] Occasional Slow Commit

2008-11-06 Thread David Rees
On Thu, Nov 6, 2008 at 2:21 AM, Peter Schuller
[EMAIL PROTECTED] wrote:
 I also found that my write cache was set to WriteThrough instead of
 WriteBack, defeating the purpose of having a BBU and that my secondary
 server apparently doesn't have a BBU on it. :-(

 Note also that several RAID controllers will periodically drop the
 write-back mode during battery capacity tests. If you care about
 consistently/deterministically having full performance (with
 white-backed battery protected caching), you probably want to confirm
 your controller behavior here.

 (I've seen this on at least LSI based controllers in Dell 2950:s, and
 also on some 3wares.)

I can confirm that this is the case by reviewing the logs stored on
the MegaRAID controller that have a BBU and had WriteBack configured.
The controller also lets you know (using the MegaCli utility) what
setting is configured, and what setting is in effect.

In the case of the machines without a BBU on them, they are configured
to be in WriteBack, but are actually running in WriteThrough.

-Dave

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Occasional Slow Commit

2008-11-06 Thread Scott Marlowe
On Thu, Nov 6, 2008 at 8:47 AM, David Rees [EMAIL PROTECTED] wrote:

 In the case of the machines without a BBU on them, they are configured
 to be in WriteBack, but are actually running in WriteThrough.

I'm pretty sure the LSIs will refuse to actually run in writeback without a BBU.

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Occasional Slow Commit

2008-11-06 Thread David Rees
On Thu, Nov 6, 2008 at 8:07 AM, Scott Marlowe [EMAIL PROTECTED] wrote:
 On Thu, Nov 6, 2008 at 8:47 AM, David Rees [EMAIL PROTECTED] wrote:

 In the case of the machines without a BBU on them, they are configured
 to be in WriteBack, but are actually running in WriteThrough.

 I'm pretty sure the LSIs will refuse to actually run in writeback without a 
 BBU.

It's configurable (at least on the MegaRAID cards I'm using). There's
an option you can turn on that lets you run in writeback when the BBU
is offline.

-Dave

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Occasional Slow Commit

2008-11-05 Thread David Rees
On Fri, Oct 31, 2008 at 4:14 PM, David Rees [EMAIL PROTECTED] wrote:
 Well, I'm pretty sure the delays are not checkpoint related. None of
 the slow commits line up at all with the end of checkpoints.

 The period of high delays occur during the same period of time each
 week, but it's not during a particularly high load period on the
 systems.

 It really seems like there must be something running in the background
 that is not showing up on the system activity logs, like a background
 RAID scrub or something.

OK, I finally had a chance to dig at this problem some more, and after
futzing around with the MegaCli tools (major PITA, btw), I was able to
confirm that there is a feature called Patrol Read on this LSI
Megaraid SAS card which runs a weekly background read scan of the
disks looking for errors. It is during this time period that I get
lots of slow commits and transactions.

FWIW, the card identifies itself from lspci as this:

LSI Logic / Symbios Logic MegaRAID SAS
Subsystem: Intel Corporation SROMBSAS18E RAID Controller

I also found that my write cache was set to WriteThrough instead of
WriteBack, defeating the purpose of having a BBU and that my secondary
server apparently doesn't have a BBU on it. :-(

Anyway, has anyone done any benchmarking of MegaRAID SAS controllers?
I am configuring my arrays to use these settings:

Read Policy: Normal (Normal, Read ahead  Adaptive read head)
Write Policy: Writeback (Writeback, Writethrough)
Disable Writeback if bad BBU
IO Policy: Direct (Direct, Cached)
Disk Cache: Enable (Enable, Disable, Unchanged)

The only setting I'm really concerned about is the Disk Cache setting
- is it safe to assume that the controller will do the right thing
with regards to flushing the disk cache when appropriate to avoid data
loss? LSI RAID cards seem to be pretty well respected, so I'd have to
guess yes.

Thanks

Dave

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Occasional Slow Commit

2008-10-29 Thread Greg Smith

On Mon, 27 Oct 2008, David Rees wrote:


Software: CentOS 4.7, PostgreSQL 8.3.4, Slony-I 1.2.15 (the database
in question is replicated using slony)
Hardware: 2x Xeon 5130, 4GB RAM, 6-disk RAID10 15k RPM, BBU on the controller


The CentOS 4.7 kernel will happily buffer about 1.6GB of writes with that 
much RAM, and the whole thing can get slammed onto disk during the final 
fsync portion of the checkpoint.  What you should do first is confirm 
whether or not the slow commits line up with the end of the checkpoint, 
which is easy to see if you turn on log_checkpoints.  That gives you 
timings for the write and fsync phases of the checkpoint which can also be 
informative.



Reading this page[2] indicates that I may want to increase my
checkpoint_segments, checkpoint_timeout and bgwriter settings, but
looking at pg_stat_bgwriter seems to indicate that my current settings
are probably OK?

# select * from pg_stat_bgwriter;
checkpoints_timed | checkpoints_req | buffers_checkpoint |
 3834 | 105 |3,091,905 |
buffers_clean | maxwritten_clean | buffers_backend | buffers_alloc
25876 |  110 | 2,247,576 |   2,889,873


I reformatted the above to show what's happening a bit better.  Most of 
your checkpoints are the timed ones, which are unlikely to cause 
interference from a slow commit period (the writes are spread out over 5 
minutes in those cases).  It's quite possible the slow periods are coming 
only from the occasional requested checkpoints, which normally show up 
because checkpoint_segments is too low and you chew through segments too 
fast.  If you problems line up with checkpoint time, you would likely gain 
some benefit from increasing checkpoint_segments to spread out the 
checkpoint writes further; the 10 you're using now is on the low side for 
your hardware.


If the problems don't go away after that, you may be suffering from 
excessive Linux kernel buffering instead.  I've got a blog entry showing 
how I tracked down a similar long pause on a Linux server at 
http://notemagnet.blogspot.com/2008/08/linux-write-cache-mystery.html you 
may find helpful for determining if your issue is this one (which is 
pretty common on RHEL systems having relatively large amounts of RAM) or 
if it's something else, like the locking you mentioned.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Occasional Slow Commit

2008-10-29 Thread David Rees
On Wed, Oct 29, 2008 at 6:26 AM, Greg Smith [EMAIL PROTECTED] wrote:
 The CentOS 4.7 kernel will happily buffer about 1.6GB of writes with that
 much RAM, and the whole thing can get slammed onto disk during the final
 fsync portion of the checkpoint.  What you should do first is confirm
 whether or not the slow commits line up with the end of the checkpoint,
 which is easy to see if you turn on log_checkpoints.  That gives you timings
 for the write and fsync phases of the checkpoint which can also be
 informative.

OK, log_checkpoints is turned on to see if any delays correspond to
checkpoint activity...

 Reading this page[2] indicates that I may want to increase my
 checkpoint_segments, checkpoint_timeout and bgwriter settings, but
 looking at pg_stat_bgwriter seems to indicate that my current settings
 are probably OK?

 # select * from pg_stat_bgwriter;
 checkpoints_timed | checkpoints_req | buffers_checkpoint |
 3834 | 105 |3,091,905 |
 buffers_clean | maxwritten_clean | buffers_backend | buffers_alloc
 25876 |  110 | 2,247,576 |   2,889,873

 I reformatted the above to show what's happening a bit better.

Sorry, gmail killed the formatting.

 Most of your
 checkpoints are the timed ones, which are unlikely to cause interference
 from a slow commit period (the writes are spread out over 5 minutes in those
 cases).  It's quite possible the slow periods are coming only from the
 occasional requested checkpoints, which normally show up because
 checkpoint_segments is too low and you chew through segments too fast.  If
 you problems line up with checkpoint time, you would likely gain some
 benefit from increasing checkpoint_segments to spread out the checkpoint
 writes further; the 10 you're using now is on the low side for your
 hardware.

OK, I've also bumped up checkpoint_segments to 20 and
checkpoint_completion_target to 0.7 in an effort to reduce the effect
of checkpoints.

 If the problems don't go away after that, you may be suffering from
 excessive Linux kernel buffering instead.  I've got a blog entry showing how
 I tracked down a similar long pause on a Linux server at
 http://notemagnet.blogspot.com/2008/08/linux-write-cache-mystery.html you
 may find helpful for determining if your issue is this one (which is pretty
 common on RHEL systems having relatively large amounts of RAM) or if it's
 something else, like the locking you mentioned.

Ah, interesting. I've also turned down the dirty_ratio and
dirty_background_ratio as suggested, but I don't think this would be
affecting things here. The rate of IO on this server is very low
compared to what it's capable of.

Thanks for the suggestions, I'll report back with results.

-Dave

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Occasional Slow Commit

2008-10-28 Thread Merlin Moncure
On Mon, Oct 27, 2008 at 8:23 PM, David Rees [EMAIL PROTECTED] wrote:
 Hi,

 I've got an OLTP application which occasionally suffers from slow
 commit time. The process in question does something like this:

 1. Do work
 2. begin transaction
 3. insert record
 4. commit transaction
 5. Do more work
 6. begin transaction
 7. update record
 8. commit transaction
 9. Do more work

 The vast majority of the time, everything runs very quickly. The
 median processing time for the whole thing is 7ms.

 However, occasionally, processing time will jump up significantly -
 the average processing time is around 20ms with the maximum processing
 time taking 2-4 seconds for a small percentage of transactions. Ouch!

 Turning on statement logging and analyzing the logs of the application
 itself shows that step #4 is the culprit of the vast majority of the
 slow transactions.

 Software: CentOS 4.7, PostgreSQL 8.3.4, Slony-I 1.2.15 (the database
 in question is replicated using slony)

 Hardware: 2x Xeon 5130, 4GB RAM, 6-disk RAID10 15k RPM, BBU on the controller

 Notable configuration changes:
 shared_buffers = 800MB
 temp_buffers = 200MB
 work_mem = 16M
 maintenance_work_mem = 800MB
 vacuum_cost_delay = 10
 checkpoint_segments = 10
 effective_cache_size = 2500MB

 I found this post[1] from a while back which was informative:

 Both situations affect me in that I have Slony which I believe
 executes triggers upon commit, and looking at the disk IO stats, there
 is an elevated level of IO activity during this time, but it doesn't
 appear to be heavy enough to cause the type of delays I am seeing.

 Reading this page[2] indicates that I may want to increase my
 checkpoint_segments, checkpoint_timeout and bgwriter settings, but
 looking at pg_stat_bgwriter seems to indicate that my current settings
 are probably OK?

 # select * from pg_stat_bgwriter;
  checkpoints_timed | checkpoints_req | buffers_checkpoint |
 buffers_clean | maxwritten_clean | buffers_backend | buffers_alloc
 ---+-++---+--+-+---
  3834 | 105 |3091905 |
 25876 |  110 | 2247576 |   2889873

 Any suggestions on how to proceed and debug the problem from here?

 My only other guess is that there is some sort of locking issues going
 on which is slowing things down and that it may also be slony related,
 as I also see a high number of slony related queries taking longer
 than 1 second...

I bet your problem is disk syncing.  Your xlogs are on the data volume
so any type of burst activity can push back commit times.  If this is
the case, you have basically three solutions to this problem:
*) buy more disks (i's start with pushing the xlogs out to dedicated volume)
*) disable fsync (very unsafe) or synchronous commit (somewhat less unsafe)
*) checkpoint/bgwriter tuning: can provide incremental gains.  This is
not magic...at best you can smooth out bursty checkpoints.  If your
problems are really serious (yours don't seem to be), you have to look
at the previous options.

Have you temporarily disabling slony to see if the problem goes away?
(My guess is it's not slony).

merlin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Occasional Slow Commit

2008-10-28 Thread Andrew Sullivan
On Mon, Oct 27, 2008 at 05:23:37PM -0700, David Rees wrote:

 However, occasionally, processing time will jump up significantly -
 the average processing time is around 20ms with the maximum processing
 time taking 2-4 seconds for a small percentage of transactions. Ouch!
 
 Turning on statement logging and analyzing the logs of the application
 itself shows that step #4 is the culprit of the vast majority of the
 slow transactions.

My bet is that you're waiting on checkpoints.  Given that you're on
8.3, start fiddling with the checkpoint_completion_target parameter.
0.7 might help.

A

-- 
Andrew Sullivan
[EMAIL PROTECTED]
+1 503 667 4564 x104
http://www.commandprompt.com/

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] Occasional Slow Commit

2008-10-27 Thread David Rees
Hi,

I've got an OLTP application which occasionally suffers from slow
commit time. The process in question does something like this:

1. Do work
2. begin transaction
3. insert record
4. commit transaction
5. Do more work
6. begin transaction
7. update record
8. commit transaction
9. Do more work

The vast majority of the time, everything runs very quickly. The
median processing time for the whole thing is 7ms.

However, occasionally, processing time will jump up significantly -
the average processing time is around 20ms with the maximum processing
time taking 2-4 seconds for a small percentage of transactions. Ouch!

Turning on statement logging and analyzing the logs of the application
itself shows that step #4 is the culprit of the vast majority of the
slow transactions.

Software: CentOS 4.7, PostgreSQL 8.3.4, Slony-I 1.2.15 (the database
in question is replicated using slony)

Hardware: 2x Xeon 5130, 4GB RAM, 6-disk RAID10 15k RPM, BBU on the controller

Notable configuration changes:
shared_buffers = 800MB
temp_buffers = 200MB
work_mem = 16M
maintenance_work_mem = 800MB
vacuum_cost_delay = 10
checkpoint_segments = 10
effective_cache_size = 2500MB

I found this post[1] from a while back which was informative:

Both situations affect me in that I have Slony which I believe
executes triggers upon commit, and looking at the disk IO stats, there
is an elevated level of IO activity during this time, but it doesn't
appear to be heavy enough to cause the type of delays I am seeing.

Reading this page[2] indicates that I may want to increase my
checkpoint_segments, checkpoint_timeout and bgwriter settings, but
looking at pg_stat_bgwriter seems to indicate that my current settings
are probably OK?

# select * from pg_stat_bgwriter;
 checkpoints_timed | checkpoints_req | buffers_checkpoint |
buffers_clean | maxwritten_clean | buffers_backend | buffers_alloc
---+-++---+--+-+---
  3834 | 105 |3091905 |
25876 |  110 | 2247576 |   2889873

Any suggestions on how to proceed and debug the problem from here?

My only other guess is that there is some sort of locking issues going
on which is slowing things down and that it may also be slony related,
as I also see a high number of slony related queries taking longer
than 1 second...

Thanks

-Dave

[1] http://archives.postgresql.org/pgsql-performance/2008-01/msg5.php
[2] http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance