Re: [HACKERS] Controlling Load Distributed Checkpoints

2007-06-14 Thread Gregory Stark
"PFC" <[EMAIL PROTECTED]> writes:

> Anyway, seq-scan on InnoDB is very slow because, as the btree grows (just
> like postgres indexes) pages are split and scanning the pages in btree order
> becomes a mess of seeks. So, seq scan in InnoDB is very very slow unless
> periodic OPTIMIZE TABLE is applied. (caveat to the postgres TODO item
> "implement automatic table clustering"...)

Heikki already posted a patch which goes a long way towards implementing what
I think that TODO item refers to: trying to maintain the cluster ordering on
updates and inserts.

It does this without changing the basic table structure at all. On updates and
inserts it consults the indexam of the clustered index to ask it for a
suggested block. If the index's suggested block has enough free space, the
tuple is put there.
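
Schematically, the placement logic is something like the following C sketch
(the helper names are hypothetical, not the actual patch API):

typedef unsigned int BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/* Assumed helpers: ask the clustered index for a target block, query the
 * free space on a heap block, and fall back to a free-space-map search. */
extern BlockNumber index_suggest_block(void *index_rel, const void *key);
extern unsigned int block_free_space(void *heap_rel, BlockNumber blk);
extern BlockNumber fsm_search(void *heap_rel, unsigned int needed);

static BlockNumber
choose_insert_block(void *heap_rel, void *index_rel,
                    const void *key, unsigned int tuple_size)
{
    BlockNumber target = index_suggest_block(index_rel, key);

    /* Use the suggested block only if the tuple fits there... */
    if (target != InvalidBlockNumber &&
        block_free_space(heap_rel, target) >= tuple_size)
        return target;                  /* keeps the heap in cluster order */

    /* ...otherwise fall back to normal placement. */
    return fsm_search(heap_rel, tuple_size);
}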

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com




Re: [HACKERS] Controlling Load Distributed Checkpoints

2007-06-13 Thread PFC

>If we extended relations by more than 8k at a time, we would know a lot
>more about disk layout, at least on filesystems with a decent amount of
>free space.

> I doubt it makes that much difference. If there was a significant amount
> of fragmentation, we'd hear more complaints about seq scan performance.
>
> The issue here is that we don't know which relations are on which drives
> and controllers, how they're striped, mirrored etc.

> Actually, isn't pre-allocation one of the tricks that Greenplum uses to
> get its seqscan performance?


	My tests here show that, at least on reiserfs, after a few hours of
benchmark torture (several million write queries), table files become
significantly fragmented. I believe the table and index files get extended
more or less simultaneously and end up somewhat interleaved on disk, so seq
scan performance suffers. reiserfs doesn't handle fragmentation particularly
well... NTFS is worse than hell in this respect. So pre-allocation could be
a good idea. A brutal defrag (cp /var/lib/postgresql somewhere and back)
gets seq scan performance back to full disk throughput.


	Also, by the way, InnoDB uses a btree-organized table. The advantage is
that data is always clustered on the primary key (which means you have to
use something as your primary key that isn't necessarily "natural"; you
have to choose it to get good clustering, and you can't always get it
right, so in the end it can suck rather badly). Anyway, seq scan on InnoDB
is very slow because, as the btree grows (just like postgres indexes),
pages are split and scanning the pages in btree order becomes a mess of
seeks. So, seq scan in InnoDB is very, very slow unless a periodic
OPTIMIZE TABLE is applied. (A caveat for the postgres TODO item "implement
automatic table clustering"...)




Re: [HACKERS] Controlling Load Distributed Checkpoints

2007-06-13 Thread Florian G. Pflug

Heikki Linnakangas wrote:
> Jim C. Nasby wrote:
> > On Thu, Jun 07, 2007 at 10:16:25AM -0400, Tom Lane wrote:
> > > Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> > > > Thinking about this whole idea a bit more, it occurred to me that the
> > > > current approach to write all, then fsync all is really a historical
> > > > artifact of the fact that we used to use the system-wide sync call
> > > > instead of fsyncs to flush the pages to disk. That might not be the
> > > > best way to do things in the new load-distributed-checkpoint world.
> > > >
> > > > How about interleaving the writes with the fsyncs?
> > >
> > > I don't think it's a historical artifact at all: it's a valid reflection
> > > of the fact that we don't know enough about disk layout to do low-level
> > > I/O scheduling.  Issuing more fsyncs than necessary will do little
> > > except guarantee a less-than-optimal scheduling of the writes.
> >
> > If we extended relations by more than 8k at a time, we would know a lot
> > more about disk layout, at least on filesystems with a decent amount of
> > free space.
>
> I doubt it makes that much difference. If there was a significant amount
> of fragmentation, we'd hear more complaints about seq scan performance.


OTOH, extending a relation that uses N pages by something like
min(ceil(N/1024), 1024) pages might help some filesystems to avoid
fragmentation, and would hardly introduce any waste (about 0.1% in the
worst case). So if it's not too hard to do, it might be worthwhile, even
if it turns out that most filesystems deal well with the current
allocation pattern.
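
To make the waste bound concrete, here is a small stand-alone C sketch of
that growth rule (the function is mine, purely illustrative):

#include <stdio.h>

/* Extend a relation of n pages by min(ceil(n/1024), 1024) pages. The chunk
 * is at most ~0.1% of the current size, capped at 1024 pages (8 MB with
 * 8 kB pages), which bounds the worst-case waste. */
static unsigned int
extension_pages(unsigned int n_pages)
{
    unsigned int chunk = (n_pages + 1023) / 1024;   /* ceil(n/1024) */

    if (chunk < 1)
        chunk = 1;
    return chunk > 1024 ? 1024 : chunk;
}

int
main(void)
{
    unsigned int sizes[] = { 100, 10000, 1000000, 5000000 };

    for (int i = 0; i < 4; i++)
        printf("%7u pages -> extend by %4u pages\n",
               sizes[i], extension_pages(sizes[i]));
    return 0;
}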

greetings, Florian Pflug



Re: [HACKERS] Controlling Load Distributed Checkpoints

2007-06-13 Thread Jim C. Nasby
On Sun, Jun 10, 2007 at 08:49:24PM +0100, Heikki Linnakangas wrote:
> Jim C. Nasby wrote:
> >On Thu, Jun 07, 2007 at 10:16:25AM -0400, Tom Lane wrote:
> >>Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> >>>Thinking about this whole idea a bit more, it occurred to me that the 
> >>>current approach to write all, then fsync all is really a historical 
> >>>artifact of the fact that we used to use the system-wide sync call 
> >>>instead of fsyncs to flush the pages to disk. That might not be the best 
> >>>way to do things in the new load-distributed-checkpoint world.
> >>>How about interleaving the writes with the fsyncs?
> >>I don't think it's a historical artifact at all: it's a valid reflection
> >>of the fact that we don't know enough about disk layout to do low-level
> >>I/O scheduling.  Issuing more fsyncs than necessary will do little
> >>except guarantee a less-than-optimal scheduling of the writes.
> >
> >If we extended relations by more than 8k at a time, we would know a lot
> >more about disk layout, at least on filesystems with a decent amount of
> >free space.
> 
> I doubt it makes that much difference. If there was a significant amount 
> of fragmentation, we'd hear more complaints about seq scan performance.
> 
> The issue here is that we don't know which relations are on which drives 
> and controllers, how they're striped, mirrored etc.

Actually, isn't pre-allocation one of the tricks that Greenplum uses to
get its seqscan performance?
-- 
Jim Nasby  [EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)




Re: [HACKERS] Controlling Load Distributed Checkpoints

2007-06-11 Thread Heikki Linnakangas

ITAGAKI Takahiro wrote:
> Heikki Linnakangas <[EMAIL PROTECTED]> wrote:
> > True. On the other hand, if we issue writes in essentially random order,
> > we might fill the kernel buffers with random blocks and the kernel needs
> > to flush them to disk as almost random I/O. If we did the writes in
> > groups, the kernel has a better chance of coalescing them.
>
> If the kernel can treat sequential writes better than random writes,
> is it worth sorting dirty buffers in block order per file at the start
> of checkpoints? Here is the pseudo code:
>
>   buffers_to_be_written =
>       SELECT buf_id, tag FROM BufferDescriptors
>        WHERE (flags & BM_DIRTY) != 0 ORDER BY tag.rnode, tag.blockNum;
>   for { buf_id, tag } in buffers_to_be_written:
>       if BufferDescriptors[buf_id].tag == tag:
>           FlushBuffer(&BufferDescriptors[buf_id])
>
> We can also avoid writing buffers newly dirtied after the checkpoint was
> started with this method.

That's worth testing, IMO. Probably won't happen for 8.3, though.

> > I tend to agree that if the goal is to finish the checkpoint as quickly
> > as possible, the current approach is better. In the context of load
> > distributed checkpoints, however, it's unlikely the kernel can do any
> > significant overlapping since we're trickling the writes anyway.
>
> Some kernels or storage subsystems treat all I/Os too fairly, so that
> user transactions waiting for reads are blocked by checkpoint writes.
> That behavior is unavoidable, but we can split the writes into small
> batches.

That's really the heart of our problems. If the kernel had support for
prioritizing the normal backend activity and LRU cleaning over the
checkpoint I/O, we wouldn't need to throttle the I/O ourselves. The
kernel has the best knowledge of what it can and can't do, and how busy
the I/O subsystems are. Recent Linux kernels have some support for read
I/O priorities, but not for writes.

I believe the best long-term solution is to add that support to the
kernel, but it's going to take a long time until that's universally
available, and we have a lot of platforms to support.

> > I'm starting to feel we should give up on smoothing the fsyncs and
> > distribute the writes only, for 8.3. As we get more experience with
> > that and its shortcomings, we can enhance our checkpoints further in
> > 8.4.
>
> I agree with distributing only the writes for 8.3. The new parameters
> introduced by it (checkpoint_write_percent and checkpoint_write_min_rate)
> should survive without major changes in the future, but the other
> parameters seem more volatile.

I'm going to start testing with just distributing the writes. Let's see
how far that gets us.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Controlling Load Distributed Checkpoints

2007-06-11 Thread Greg Smith

On Mon, 11 Jun 2007, ITAGAKI Takahiro wrote:

> If the kernel can treat sequential writes better than random writes, is
> it worth sorting dirty buffers in block order per file at the start of
> checkpoints?


I think it has the potential to improve things.  There are three obvious 
arguments against it, and one subtle one, that I can think of:


1) Extra complexity for something that may not help.  This would need some 
good, robust benchmarking improvements to justify its use.


2) Block number ordering may not reflect actual order on disk.  While 
true, it's got to be better correlated with it than writing at random.


3) The OS disk elevator should be dealing with this issue, particularly 
because it may really know the actual disk ordering.


Here's the subtle thing:  by writing in the same order the LRU scan occurs 
in, you are writing dirty buffers in the optimal fashion to eliminate 
client backend writes during BufferAlloc.  This makes the checkpoint a 
really effective LRU-clearing mechanism.  Writing in block order will 
change that.


I spent some time trying to optimize the elevator part of this operation, 
since I knew that on the system I was using block order was actual order. 
I found that under Linux, the behavior of the pdflush daemon that manages 
dirty memory had a more serious impact on writing behavior at checkpoint 
time than playing with the elevator scheduling method did.  The way 
pdflush works actually has several interesting implications for how to 
optimize this patch.  For example, how writes get blocked when the dirty 
memory reaches certain thresholds means that you may not get the full 
benefit of the disk elevator at checkpoint time the way most would expect.


Since much of that was basically undocumented, I had to write my own 
analysis of the actual workings, which is now available at 
http://www.westnet.com/~gsmith/content/linux-pdflush.htm  Anyone who wants 
more information about how Linux kernel parameters like 
dirty_background_ratio actually work, and how they impact the writing 
strategy, should find that article helpful.


> Some kernels or storage subsystems treat all I/Os too fairly, so that
> user transactions waiting for reads are blocked by checkpoint writes.


In addition to that (which I've seen happen quite a bit), in the Linux 
case another fairness issue is that the code that handles writes allows a 
single process writing a lot of data to block writes for everyone else. 
That means that in addition to being blocked on actual reads, if a client 
backend starts a write in order to complete a buffer allocation to hold 
new information, that can grind to a halt because of the checkpoint 
process as well.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [HACKERS] Controlling Load Distributed Checkpoints

2007-06-10 Thread ITAGAKI Takahiro
Heikki Linnakangas <[EMAIL PROTECTED]> wrote:

> True. On the other hand, if we issue writes in essentially random order, 
> we might fill the kernel buffers with random blocks and the kernel needs 
> to flush them to disk as almost random I/O. If we did the writes in 
> groups, the kernel has a better chance of coalescing them.

If the kernel can treat sequential writes better than random writes, 
is it worth sorting dirty buffers in block order per file at the start
of checkpoints? Here is the pseudo code:

  buffers_to_be_written =
      SELECT buf_id, tag FROM BufferDescriptors
       WHERE (flags & BM_DIRTY) != 0 ORDER BY tag.rnode, tag.blockNum;
  for { buf_id, tag } in buffers_to_be_written:
      if BufferDescriptors[buf_id].tag == tag:
          FlushBuffer(&BufferDescriptors[buf_id])

We can also avoid writing buffers newly dirtied after the checkpoint was
started with this method.
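
For what it's worth, a self-contained C sketch of this idea, with simplified
stand-ins for the real BufferDesc and BufferTag structures rather than the
actual backend types, might look like this:

#include <stdlib.h>
#include <string.h>

typedef struct { unsigned rnode; unsigned blockNum; } BufferTag;
typedef struct { BufferTag tag; int flags; } BufferDesc;
#define BM_DIRTY 0x01

typedef struct { int buf_id; BufferTag tag; } ToWrite;

static int
cmp_tag(const void *a, const void *b)
{
    const ToWrite *x = a, *y = b;

    if (x->tag.rnode != y->tag.rnode)
        return x->tag.rnode < y->tag.rnode ? -1 : 1;
    if (x->tag.blockNum != y->tag.blockNum)
        return x->tag.blockNum < y->tag.blockNum ? -1 : 1;
    return 0;
}

/* Snapshot the dirty buffers, sort by (rnode, blockNum), then flush each
 * buffer only if it still holds the same page. Buffers dirtied after the
 * snapshot are either absent from the list or fail the tag re-check. */
static void
checkpoint_write_sorted(BufferDesc *bufs, int nbufs,
                        void (*flush)(BufferDesc *))
{
    ToWrite *list = malloc(nbufs * sizeof(ToWrite));
    int n = 0;

    if (list == NULL)
        return;
    for (int i = 0; i < nbufs; i++)
        if (bufs[i].flags & BM_DIRTY)
        {
            list[n].buf_id = i;
            list[n].tag = bufs[i].tag;
            n++;
        }
    qsort(list, n, sizeof(ToWrite), cmp_tag);
    for (int i = 0; i < n; i++)
        if (memcmp(&bufs[list[i].buf_id].tag, &list[i].tag,
                   sizeof(BufferTag)) == 0)
            flush(&bufs[list[i].buf_id]);
    free(list);
}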


> I tend to agree that if the goal is to finish the checkpoint as quickly 
> as possible, the current approach is better. In the context of load 
> distributed checkpoints, however, it's unlikely the kernel can do any 
> significant overlapping since we're trickling the writes anyway.

Some kernels or storage subsystems treat all I/Os too fairly, so that user
transactions waiting for reads are blocked by checkpoint writes. That
behavior is unavoidable, but we can split the writes into small batches.


> I'm starting to feel we should give up on smoothing the fsyncs and 
> distribute the writes only, for 8.3. As we get more experience with that 
> and its shortcomings, we can enhance our checkpoints further in 8.4.

I agree with distributing only the writes for 8.3. The new parameters
introduced by it (checkpoint_write_percent and checkpoint_write_min_rate)
should survive without major changes in the future, but the other
parameters seem more volatile.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center





Re: [HACKERS] Controlling Load Distributed Checkpoints

2007-06-10 Thread Heikki Linnakangas

Jim C. Nasby wrote:
> On Thu, Jun 07, 2007 at 10:16:25AM -0400, Tom Lane wrote:
> > Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> > > Thinking about this whole idea a bit more, it occurred to me that the
> > > current approach to write all, then fsync all is really a historical
> > > artifact of the fact that we used to use the system-wide sync call
> > > instead of fsyncs to flush the pages to disk. That might not be the
> > > best way to do things in the new load-distributed-checkpoint world.
> > >
> > > How about interleaving the writes with the fsyncs?
> >
> > I don't think it's a historical artifact at all: it's a valid reflection
> > of the fact that we don't know enough about disk layout to do low-level
> > I/O scheduling.  Issuing more fsyncs than necessary will do little
> > except guarantee a less-than-optimal scheduling of the writes.
>
> If we extended relations by more than 8k at a time, we would know a lot
> more about disk layout, at least on filesystems with a decent amount of
> free space.

I doubt it makes that much difference. If there was a significant amount 
of fragmentation, we'd hear more complaints about seq scan performance.

The issue here is that we don't know which relations are on which drives 
and controllers, how they're striped, mirrored etc.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Controlling Load Distributed Checkpoints

2007-06-09 Thread Jim C. Nasby
On Thu, Jun 07, 2007 at 10:16:25AM -0400, Tom Lane wrote:
> Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> > Thinking about this whole idea a bit more, it occurred to me that the 
> > current approach to write all, then fsync all is really a historical 
> > artifact of the fact that we used to use the system-wide sync call 
> > instead of fsyncs to flush the pages to disk. That might not be the best 
> > way to do things in the new load-distributed-checkpoint world.
> 
> > How about interleaving the writes with the fsyncs?
> 
> I don't think it's a historical artifact at all: it's a valid reflection
> of the fact that we don't know enough about disk layout to do low-level
> I/O scheduling.  Issuing more fsyncs than necessary will do little
> except guarantee a less-than-optimal scheduling of the writes.

If we extended relations by more than 8k at a time, we would know a lot
more about disk layout, at least on filesystems with a decent amount of
free space.
-- 
Jim Nasby  [EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)




Re: [HACKERS] Controlling Load Distributed Checkpoints

2007-06-08 Thread Bruce Momjian
Andrew Sullivan wrote:
> On Fri, Jun 08, 2007 at 10:33:50AM -0400, Greg Smith wrote:
> > they'd take care of that as part of routine server setup.  What wouldn't 
> > be reasonable is to expect them to tune obscure parts of the kernel just 
> > for your application.
> 
> Well, I suppose it'd depend on what kind of hosting environment
> you're in (if I'm paying for dedicated hosting, you better believe
> I'm going to insist they tune the kernel the way I want), but you're
> right that in shared hosting for $25/mo, it's not going to happen.

And consider other operating systems that don't have the same knobs.  We
should tune as best we can first without kernel knobs.

-- 
  Bruce Momjian  <[EMAIL PROTECTED]>  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] Controlling Load Distributed Checkpoints

2007-06-08 Thread Andrew Sullivan
On Fri, Jun 08, 2007 at 10:33:50AM -0400, Greg Smith wrote:
> they'd take care of that as part of routine server setup.  What wouldn't 
> be reasonable is to expect them to tune obscure parts of the kernel just 
> for your application.

Well, I suppose it'd depend on what kind of hosting environment
you're in (if I'm paying for dedicated hosting, you better believe
I'm going to insist they tune the kernel the way I want), but you're
right that in shared hosting for $25/mo, it's not going to happen.

A

-- 
Andrew Sullivan  | [EMAIL PROTECTED]
"The year's penultimate month" is not in truth a good way of saying
November.
--H.W. Fowler



Re: [HACKERS] Controlling Load Distributed Checkpoints

2007-06-08 Thread Greg Smith

On Fri, 8 Jun 2007, Andrew Sullivan wrote:


> Do you mean "change the OS settings" or something else?  (I'm not
> sure it's true in any case, because shared memory kernel settings
> have to be fiddled with in many instances, but I thought I'd ask for
> clarification.)


In a situation where a hosting provider of some sort is providing 
PostgreSQL, they should know that parameters like SHMMAX need to be 
increased before customers can create a larger installation.  You'd expect 
they'd take care of that as part of routine server setup.  What wouldn't 
be reasonable is to expect them to tune obscure parts of the kernel just 
for your application.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [HACKERS] Controlling Load Distributed Checkpoints

2007-06-08 Thread Heikki Linnakangas

Andrew Sullivan wrote:
> On Fri, Jun 08, 2007 at 09:50:49AM +0100, Heikki Linnakangas wrote:
> > dynamics change. But we must also keep in mind that the average DBA
> > doesn't change any settings, and might not even be able or allowed to.
> > That means the defaults should work reasonably well without tweaking
> > the OS settings.
>
> Do you mean "change the OS settings" or something else?  (I'm not
> sure it's true in any case, because shared memory kernel settings
> have to be fiddled with in many instances, but I thought I'd ask for
> clarification.)

Yes, that's what I meant. An average DBA is not likely to change OS 
settings.

You're right on the shmmax setting, though.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Controlling Load Distributed Checkpoints

2007-06-08 Thread Andrew Sullivan
On Fri, Jun 08, 2007 at 09:50:49AM +0100, Heikki Linnakangas wrote:

> dynamics change. But we must also keep in mind that average DBA doesn't 
> change any settings, and might not even be able or allowed to. That 
> means the defaults should work reasonably well without tweaking the OS 
> settings.

Do you mean "change the OS settings" or something else?  (I'm not
sure it's true in any case, because shared memory kernel settings
have to be fiddled with in many instances, but I thought I'd ask for
clarification.)

A

-- 
Andrew Sullivan  | [EMAIL PROTECTED]
Users never remark, "Wow, this software may be buggy and hard 
to use, but at least there is a lot of code underneath."
--Damien Katz



Re: [HACKERS] Controlling Load Distributed Checkpoints

2007-06-08 Thread Heikki Linnakangas

Greg Smith wrote:
> On Thu, 7 Jun 2007, Heikki Linnakangas wrote:
> > So there's two extreme ways you can use LDC:
> > 1. Finish the checkpoint as soon as possible, without disturbing other
> > activity too much
> > 2. Disturb other activity as little as possible, as long as the
> > checkpoint finishes in a reasonable time.
> > Are both interesting use cases, or is it enough to cater for just one
> > of them? I think 2 is easier to tune.
>
> The motivation for the (1) case is that you've got a system that's
> dirtying the buffer cache very fast in normal use, where even the
> background writer is hard pressed to keep the buffer pool clean.  The
> checkpoint is the most powerful and efficient way to clean up many dirty
> buffers out of such a buffer cache in a short period of time so that
> you're back to having room to work in again.  In that situation, since
> there are many buffers to write out, you'll also be suffering greatly
> from fsync pauses.  Being able to synchronize writes a little better
> with the underlying OS to smooth those out is a huge help.

ISTM the bgwriter just isn't working hard enough in that scenario. 
Assuming we get the lru autotuning patch in 8.3, do you think there's 
still merit in using the checkpoints that way?

> I'm completely biased because of the workloads I've been dealing with
> recently, but I consider (2) so much easier to tune for that it's barely
> worth worrying about.  If your system is so underloaded that you can let
> the checkpoints take their own sweet time, I'd ask if you have enough
> going on that you're suffering very much from checkpoint performance
> issues anyway.  I'm used to being in a situation where if you don't push
> out checkpoint data as fast as physically possible, you end up fighting
> with the client backends for write bandwidth once the LRU point moves
> past where the checkpoint has written out to already.  I'm not sure how
> much always running the LRU background writer will improve that
> situation.

I'd think it eliminates the problem. Assuming we keep the LRU cleaning 
running as usual, I don't see how writing faster during checkpoints 
could ever be beneficial for concurrent activity. The more you write, 
the less bandwidth is available for others.

Doing the checkpoint as quickly as possible might be slightly better for 
average throughput, but that's a different matter.

> On every system I've ever played with Postgres write performance on, I
> discovered that the memory-based parameters like dirty_background_ratio
> were really driving write behavior, and I almost ignore the expire
> timeout now.  Plotting the "Dirty:" value in /proc/meminfo as you're
> running tests is extremely informative for figuring out what Linux is
> really doing underneath the database writes.

Interesting. I haven't touched any of the kernel parameters yet in my 
tests. It seems we need to try different parameters and see how the 
dynamics change. But we must also keep in mind that the average DBA 
doesn't change any settings, and might not even be able or allowed to. 
That means the defaults should work reasonably well without tweaking the 
OS settings.

> The influence of the congestion code is why I made the comment about
> watching how long writes are taking to gauge how fast you can dump data
> onto the disks.  When you're suffering from one of the congestion
> mechanisms, the initial writes start blocking, even before the fsync.
> That behavior is almost undocumented outside of the relevant kernel
> source code.

Yeah, that's controlled by dirty_ratio, if I've understood the 
parameters correctly. If we spread out the writes enough, we shouldn't 
hit that limit or congestion. That's the point of the patch.

Do you have time / resources to do testing? You've clearly spent a lot 
of time on this, and I'd be very interested to see some actual numbers 
from your tests with various settings.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Controlling Load Distributed Checkpoints

2007-06-07 Thread Joshua D. Drake


> This is really a serious issue with the current design of the database,
> one that merely changes instead of going away completely if you throw
> more hardware at it.  I'm perversely glad to hear this is torturing more
> people than just me as it improves the odds the situation will improve.


It tortures pretty much any high-velocity postgresql db, of which there 
are more and more every day.


Joshua D. Drake








--

  === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
 http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/




Re: [HACKERS] Controlling Load Distributed Checkpoints

2007-06-07 Thread Greg Smith

On Thu, 7 Jun 2007, Gregory Stark wrote:


> You seem to have imagined that letting the checkpoint take longer will
> slow down transactions.


And you seem to have imagined that I have so much spare time that I'm just 
making stuff up to entertain myself and sow confusion.


I observed some situations where delaying checkpoints too long ends up 
slowing down both transaction rate and response time, using earlier 
variants of the LDC patch and code with similar principles that I wrote.  
I'm trying to keep the approach used here out of the worst of the corner 
cases I ran into, or at least to make it possible for people in those 
situations to have some ability to tune out of the bad spots.  I am 
unfortunately not free to disclose all those test results, and since that 
project is over I can't see how the current LDC compares to what I tested 
at the time.


I plainly stated I had a bias here, one that's not even close to the 
average case.  My concern here was that Heikki would end up optimizing in 
a direction where a really wide spread across the active checkpoint 
interval was strongly preferred.  I wanted to offer some suggestions on 
the type of situation where that might not be true, but where a different 
tuning of LDC would still be an improvement over the current behavior. 
There are some tuning knobs there that I don't want to see go away until 
there's been a wider range of tests to prove they aren't effective.



> Right now we're seeing tests where Postgres stops handling *any*
> transactions for up to a minute. In virtually any real world scenario
> that would simply be unacceptable.


No doubt; I've seen things get close to that bad myself, both on the high 
and low end. I collided with the issue in a situation of "maxing out your 
i/o bandwidth, couldn't buy a faster controller" at one point, which is 
what kicked off my working in this area.  It turned out there were still 
some software tunables left that pulled the worst case down to the 2-5 
second range instead.  With more checkpoint_segments to decrease the 
frequency, that was just enough to make the problem annoying rather than 
crippling.  But after that, I could easily imagine a different application 
scenario where the behavior you describe is the best case.


This is really a serious issue with the current design of the database, 
one that merely changes instead of going away completely if you throw more 
hardware at it.  I'm perversely glad to hear this is torturing more people 
than just me as it improves the odds the situation will improve.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [HACKERS] Controlling Load Distributed Checkpoints

2007-06-07 Thread Gregory Stark
"Greg Smith" <[EMAIL PROTECTED]> writes:

> I'm completely biased because of the workloads I've been dealing with 
> recently,
> but I consider (2) so much easier to tune for that it's barely worth worrying
> about.  If your system is so underloaded that you can let the checkpoints take
> their own sweet time, I'd ask if you have enough going on that you're 
> suffering
> very much from checkpoint performance issues anyway.  I'm used to being in a
> situation where if you don't push out checkpoint data as fast as physically
> possible, you end up fighting with the client backends for write bandwidth 
> once
> the LRU point moves past where the checkpoint has written out to already.  I'm
> not sure how much always running the LRU background writer will improve that
> situation.

I think you're working from a faulty premise.

There's no relationship between the volume of writes and how important the
speed of checkpoint is. In either scenario you should assume a system that is
close to the max i/o bandwidth. The only question is which task the admin
would prefer take the hit for maxing out the bandwidth, the transactions or
the checkpoint.

You seem to have imagined that letting the checkpoint take longer will slow
down transactions. In fact that's precisely the effect we're trying to avoid.
Right now we're seeing tests where Postgres stops handling *any* transactions
for up to a minute. In virtually any real world scenario that would simply be
unacceptable.

That one-minute outage is a direct consequence of trying to finish the
checkpoint as quickly as possible. If we spread it out then it might increase
the average i/o load if you sum it up over time, but then you just need a
faster i/o controller.

The only scenario where you would prefer the absolute lowest i/o rate summed
over time would be if you were close to maxing out your i/o bandwidth,
couldn't buy a faster controller, and response time was not a factor, only
sheer volume of transactions processed mattered. That's a much less common
scenario than caring about the response time.

The flip side is that when you do have to worry about response time, buying
a faster controller doesn't even help. It would shorten the duration of the
checkpoint but not eliminate it. A 30-second outage every half hour is just
as unacceptable as a 1-minute outage every half hour.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com




Re: [HACKERS] Controlling Load Distributed Checkpoints

2007-06-07 Thread Greg Smith

On Thu, 7 Jun 2007, Heikki Linnakangas wrote:


> So there's two extreme ways you can use LDC:
> 1. Finish the checkpoint as soon as possible, without disturbing other
> activity too much
> 2. Disturb other activity as little as possible, as long as the
> checkpoint finishes in a reasonable time.
> Are both interesting use cases, or is it enough to cater for just one of
> them? I think 2 is easier to tune.


The motivation for the (1) case is that you've got a system that's 
dirtying the buffer cache very fast in normal use, where even the 
background writer is hard pressed to keep the buffer pool clean.  The 
checkpoint is the most powerful and efficient way to clean up many dirty 
buffers out of such a buffer cache in a short period of time so that 
you're back to having room to work in again.  In that situation, since 
there are many buffers to write out, you'll also be suffering greatly from 
fsync pauses.  Being able to synchronize writes a little better with the 
underlying OS to smooth those out is a huge help.


I'm completely biased because of the workloads I've been dealing with 
recently, but I consider (2) so much easier to tune for that it's barely 
worth worrying about.  If your system is so underloaded that you can let 
the checkpoints take their own sweet time, I'd ask if you have enough 
going on that you're suffering very much from checkpoint performance 
issues anyway.  I'm used to being in a situation where if you don't push 
out checkpoint data as fast as physically possible, you end up fighting 
with the client backends for write bandwidth once the LRU point moves past 
where the checkpoint has written out to already.  I'm not sure how much 
always running the LRU background writer will improve that situation.


> On a Linux system, one way to model it is that the OS flushes dirty
> buffers to disk at the same rate as we write them, but delayed by
> dirty_expire_centisecs. That should hold if the writes are spread out
> enough.


If they're really spread out, sure.  There is congestion avoidance code 
inside the Linux kernel that makes dirty_expire_centisecs not quite work 
the way it is described under load.  All you can say in the general case 
is that when dirty_expire_centisecs has passed, the kernel badly wants to 
write the buffers out as quickly as possible; that could still be many 
seconds after the expiration time on a busy system, or on one with slow 
I/O.


On every system I've ever played with Postgres write performance on, I 
discovered that the memory-based parameters like dirty_background_ratio 
were really driving write behavior, and I almost ignore the expire timeout 
now.  Plotting the "Dirty:" value in /proc/meminfo as you're running tests 
is extremely informative for figuring out what Linux is really doing 
underneath the database writes.
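
For anyone who wants to watch that themselves, a trivial sampler is enough
(Linux-only throwaway test code, nothing to do with the backend; stop it
with Ctrl-C):

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    for (;;)
    {
        FILE *f = fopen("/proc/meminfo", "r");
        char line[128];

        if (f == NULL)
            return 1;
        while (fgets(line, sizeof(line), f))
            if (strncmp(line, "Dirty:", 6) == 0)
            {
                fputs(line, stdout);    /* e.g. "Dirty:   123456 kB" */
                break;
            }
        fclose(f);
        fflush(stdout);
        sleep(1);                       /* one sample per second */
    }
}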


The influence of the congestion code is why I made the comment about 
watching how long writes are taking to gauge how fast you can dump data 
onto the disks.  When you're suffering from one of the congestion 
mechanisms, the initial writes start blocking, even before the fsync. 
That behavior is almost undocumented outside of the relevant kernel source 
code.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [HACKERS] Controlling Load Distributed Checkpoints

2007-06-07 Thread Heikki Linnakangas

Tom Lane wrote:
> Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> > Tom Lane wrote:
> > > I don't think it's a historical artifact at all: it's a valid reflection
> > > of the fact that we don't know enough about disk layout to do low-level
> > > I/O scheduling.  Issuing more fsyncs than necessary will do little
> > > except guarantee a less-than-optimal scheduling of the writes.
> >
> > I'm not proposing to issue any more fsyncs. I'm proposing to change the
> > ordering so that instead of first writing all dirty buffers and then
> > fsyncing all files, we'd write all buffers belonging to a file, fsync
> > that file only, then write all buffers belonging to the next file,
> > fsync, and so forth.
>
> But that means that the I/O to different files cannot be overlapped by
> the kernel, even if it would be more efficient to do so.

True. On the other hand, if we issue writes in essentially random order, 
we might fill the kernel buffers with random blocks and the kernel needs 
to flush them to disk as almost random I/O. If we did the writes in 
groups, the kernel has a better chance of coalescing them.

I tend to agree that if the goal is to finish the checkpoint as quickly 
as possible, the current approach is better. In the context of load 
distributed checkpoints, however, it's unlikely the kernel can do any 
significant overlapping since we're trickling the writes anyway.

Do we need both strategies?

I'm starting to feel we should give up on smoothing the fsyncs and 
distribute the writes only, for 8.3. As we get more experience with that 
and its shortcomings, we can enhance our checkpoints further in 8.4.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Controlling Load Distributed Checkpoints

2007-06-07 Thread Tom Lane
Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> I don't think it's a historical artifact at all: it's a valid reflection
>> of the fact that we don't know enough about disk layout to do low-level
>> I/O scheduling.  Issuing more fsyncs than necessary will do little
>> except guarantee a less-than-optimal scheduling of the writes.

> I'm not proposing to issue any more fsyncs. I'm proposing to change the 
> ordering so that instead of first writing all dirty buffers and then 
> fsyncing all files, we'd write all buffers belonging to a file, fsync 
> that file only, then write all buffers belonging to the next file, fsync, 
> and so forth.

But that means that the I/O to different files cannot be overlapped by
the kernel, even if it would be more efficient to do so.

regards, tom lane



Re: [HACKERS] Controlling Load Distributed Checkpoints

2007-06-07 Thread Heikki Linnakangas

Tom Lane wrote:
> Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> > Thinking about this whole idea a bit more, it occurred to me that the
> > current approach to write all, then fsync all is really a historical
> > artifact of the fact that we used to use the system-wide sync call
> > instead of fsyncs to flush the pages to disk. That might not be the
> > best way to do things in the new load-distributed-checkpoint world.
> >
> > How about interleaving the writes with the fsyncs?
>
> I don't think it's a historical artifact at all: it's a valid reflection
> of the fact that we don't know enough about disk layout to do low-level
> I/O scheduling.  Issuing more fsyncs than necessary will do little
> except guarantee a less-than-optimal scheduling of the writes.

I'm not proposing to issue any more fsyncs. I'm proposing to change the 
ordering so that instead of first writing all dirty buffers and then 
fsyncing all files, we'd write all buffers belonging to a file, fsync 
that file only, then write all buffers belonging to the next file, fsync, 
and so forth.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Controlling Load Distributed Checkpoints

2007-06-07 Thread Tom Lane
Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> Thinking about this whole idea a bit more, it occurred to me that the 
> current approach to write all, then fsync all is really a historical 
> artifact of the fact that we used to use the system-wide sync call 
> instead of fsyncs to flush the pages to disk. That might not be the best 
> way to do things in the new load-distributed-checkpoint world.

> How about interleaving the writes with the fsyncs?

I don't think it's a historical artifact at all: it's a valid reflection
of the fact that we don't know enough about disk layout to do low-level
I/O scheduling.  Issuing more fsyncs than necessary will do little
except guarantee a less-than-optimal scheduling of the writes.

regards, tom lane



Re: [HACKERS] Controlling Load Distributed Checkpoints

2007-06-07 Thread Heikki Linnakangas
Thinking about this whole idea a bit more, it occurred to me that the 
current approach to write all, then fsync all is really a historical 
artifact of the fact that we used to use the system-wide sync call 
instead of fsyncs to flush the pages to disk. That might not be the best 
way to do things in the new load-distributed-checkpoint world.


How about interleaving the writes with the fsyncs?

1.
Scan all shared buffers, and build a list of all files with dirty pages, 
and buffers belonging to them


2.
foreach(file in list)
{
  foreach(buffer belonging to file)
  {
write();
sleep(); /* to throttle the I/O rate */
  }
  sleep(); /* to give the OS a chance to flush the writes at its own pace */

  fsync()
}

This would spread out the fsyncs in a natural way, making the knob to 
control the duration of the sync phase unnecessary.


At some point we'll also need to fsync all files that have been modified 
since the last checkpoint, but don't have any dirty buffers in the 
buffer cache. I think it's a reasonable assumption that fsyncing those 
files doesn't generate a lot of I/O. Since the writes have been made 
some time ago, the OS has likely already flushed them to disk.


Doing the 1st phase of just scanning the buffers to see which ones are 
dirty also effectively implements the optimization of not writing 
buffers that were dirtied after the checkpoint start. And grouping the 
writes per file gives the OS a better chance to group the physical writes.


One problem is that currently the segmentation of relations to 1GB files 
is handled at a low level inside md.c, and we don't really have any 
visibility into that in the buffer manager. ISTM that some changes to 
the smgr interfaces would be needed for this to work well, though just 
doing it on a relation per relation basis would also be better than the 
current approach.
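
For reference, the segmentation md.c implements is simple arithmetic; with
the default 8 kB block size, a 1 GB segment holds 131072 blocks. A sketch of
the mapping (not the actual md.c code):

#define RELSEG_SIZE 131072      /* blocks per 1 GB segment, default build */

static unsigned int
segment_number(unsigned int block_num)
{
    return block_num / RELSEG_SIZE;     /* which file: base, .1, .2, ... */
}

static unsigned int
segment_offset(unsigned int block_num)
{
    return block_num % RELSEG_SIZE;     /* block within that segment file */
}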


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Controlling Load Distributed Checkpoints

2007-06-07 Thread Hannu Krosing
On Wed, 2007-06-06 at 11:03, Tom Lane wrote:
> Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> > GUC summary and suggested default values
> > 
> > checkpoint_write_percent = 50      # % of checkpoint interval to spread out writes
> > checkpoint_write_min_rate = 1000   # minimum I/O rate to write dirty buffers at checkpoint (KB/s)
> > checkpoint_nap_duration = 2        # delay between write and sync phase, in seconds
> > checkpoint_fsync_period = 30       # duration of the sync phase, in seconds
> > checkpoint_fsync_delay = 500       # max. delay between fsyncs
> 
> > I don't like adding that many GUC variables, but I don't really see a 
> > way to tune them automatically.
> 
> If we don't know how to tune them, how will the users know?  

He talked about doing it _automatically_.

If the knobs are available, it will be possible to determine "good"
values even by brute-force performance testing, given enough time and
manpower.

> Having to
> add that many variables to control one feature says to me that we don't
> understand the feature.

The feature has lots of complex dependencies on things outside postgres,
so learning to understand it takes time. Having the knobs available
helps, as more people are willing to do turn-the-knobs-and-test than
recompile-and-test.

> Perhaps what we need is to think about how it can auto-tune itself.

Sure.

---
Hannu Krosing




Re: [HACKERS] Controlling Load Distributed Checkpoints

2007-06-07 Thread Heikki Linnakangas

Greg Smith wrote:
> On Wed, 6 Jun 2007, Heikki Linnakangas wrote:
> > The original patch uses bgwriter_all_max_pages to set the minimum
> > rate. I think we should have a separate variable,
> > checkpoint_write_min_rate, in KB/s, instead.
>
> Completely agreed.  There shouldn't be any coupling with the background
> writer parameters, which may be set for a completely different set of
> priorities than the checkpoint has.  I have to look at this code again
> to see why it's a min_rate instead of a max, that seems a little weird.

It's a min rate because we never write more slowly than that, and we 
write faster if the next checkpoint is due soon enough that we wouldn't 
otherwise finish before it's time to start the next one. (Or to be 
precise, we finish before the next checkpoint is closer than 
(100 - checkpoint_write_percent)% of the checkpoint interval.)
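
Schematically, that throttling rule amounts to something like the following
C sketch (the names are illustrative, not the patch's actual variables):

/* Write at least min_rate; write faster when the remaining work would not
 * otherwise fit into the time left before the deadline (the point
 * (100 - checkpoint_write_percent)% before the next checkpoint is due). */
static double
checkpoint_write_rate_kbps(double min_rate_kbps,
                           double kb_left,
                           double secs_to_deadline)
{
    double needed = kb_left / secs_to_deadline;

    return needed > min_rate_kbps ? needed : min_rate_kbps;
}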


> > Nap phase:  We should therefore give the delay as a number of seconds
> > instead of as a percentage of checkpoint interval.
>
> Again, the setting here should be completely decoupled from another GUC
> like the interval.  My main complaint with the original form of this
> patch was how much it tried to synchronize the process with the
> interval; since I don't even have a system where that value is set to
> something, because it's all segment based instead, that whole idea was
> incompatible.

checkpoint_segments is taken into account as well as checkpoint_timeout. 
I used the term "checkpoint interval" to mean the real interval at which 
the checkpoints occur, whether it's because of segments or timeout.

> The original patch tried to spread the load out as evenly as possible
> over the time available.  I much prefer thinking in terms of getting it
> done as quickly as possible while trying to bound the I/O storm.

Yeah, the checkpoint_min_rate allows you to do that.

So there's two extreme ways you can use LDC:
1. Finish the checkpoint as soon as possible, without disturbing other 
activity too much. Set checkpoint_write_percent to a high number, and 
set checkpoint_min_rate to define "too much".
2. Disturb other activity as little as possible, as long as the 
checkpoint finishes in a reasonable time. Set checkpoint_min_rate to a 
low number, and checkpoint_write_percent to define "reasonable time".

Are both interesting use cases, or is it enough to cater for just one of 
them? I think 2 is easier to tune. Defining the min_rate properly can be 
difficult and depends a lot on your hardware and application, but a 
default value of say 50% for checkpoint_write_percent to tune for use 
case 2 should work pretty well for most people.

In any case, the checkpoint had better finish before it's time to start 
another one. Or would you rather delay the next checkpoint, and let the 
checkpoint take as long as it takes to finish at the min_rate?

> > And we don't know how much work an fsync performs. The patch uses the
> > file size as a measure of that, but as we discussed that doesn't
> > necessarily have anything to do with reality. fsyncing a 1GB file with
> > one dirty block isn't any more expensive than fsyncing a file with a
> > single block.
>
> On top of that, if you have a system with a write cache, the time an
> fsync takes can greatly depend on how full it is at the time, which
> there is no way to measure or even model easily.
>
> Is there any way to track how many dirty blocks went into each file
> during the checkpoint write?  That's your best bet for guessing how long
> the fsync will take.

I suppose it's possible, but the OS has hopefully started flushing them 
to disk almost as soon as we started the writes, so even that isn't a 
very good measure.

On a Linux system, one way to model it is that the OS flushes dirty 
buffers to disk at the same rate as we write them, but delayed by 
dirty_expire_centisecs. That should hold if the writes are spread out 
enough. Then the amount of dirty buffers in the OS cache at the end of 
the write phase is roughly constant, as long as the write phase lasts 
longer than dirty_expire_centisecs. If we take a nap of 
dirty_expire_centisecs after the write phase, the fsyncs should be 
effectively no-ops, except that they will flush any other writes that 
the bgwriter lru-sweep and other backends performed during the nap.
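
To put rough numbers on that model (the figures are made up, purely
illustrative):

#include <stdio.h>

/* If writes are spread out at W KB/s and the kernel starts flushing a page
 * dirty_expire_centisecs after it was dirtied, the OS holds a roughly
 * constant backlog of W * (dirty_expire_centisecs / 100) KB. */
int
main(void)
{
    double write_rate_kbps = 1000.0;   /* e.g. checkpoint_write_min_rate */
    int    dirty_expire_cs = 3000;     /* common Linux default: 30 seconds */
    double backlog_kb = write_rate_kbps * (dirty_expire_cs / 100.0);

    printf("steady-state dirty backlog: ~%.0f KB (~%.1f MB)\n",
           backlog_kb, backlog_kb / 1024.0);
    return 0;
}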


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Controlling Load Distributed Checkpoints

2007-06-06 Thread Greg Smith

On Wed, 6 Jun 2007, Heikki Linnakangas wrote:

> The original patch uses bgwriter_all_max_pages to set the minimum rate.
> I think we should have a separate variable, checkpoint_write_min_rate,
> in KB/s, instead.


Completely agreed.  There shouldn't be any coupling with the background 
writer parameters, which may be set for a completely different set of 
priorities than the checkpoint has.  I have to look at this code again to 
see why it's a min_rate instead of a max, that seems a little weird.


> Nap phase:  We should therefore give the delay as a number of seconds
> instead of as a percentage of checkpoint interval.


Again, the setting here should be completely decoupled from another GUC 
like the interval.  My main complaint with the original form of this patch 
was how much it tried to synchronize the process with the interval; since I 
don't even have a system where that value is set to something, because 
it's all segment based instead, that whole idea was incompatible.


The original patch tried to spread the load out as evenly as possible over 
the time available.  I much prefer thinking in terms of getting it done as 
quickly as possible while trying to bound the I/O storm.


> And we don't know how much work an fsync performs. The patch uses the
> file size as a measure of that, but as we discussed that doesn't
> necessarily have anything to do with reality. fsyncing a 1GB file with
> one dirty block isn't any more expensive than fsyncing a file with a
> single block.


On top of that, if you have a system with a write cache, the time an fsync 
takes can greatly depend on how full it is at the time, which there is no 
way to measure or even model easily.


Is there any way to track how many dirty blocks went into each file during 
the checkpoint write?  That's your best bet for guessing how long the 
fsync will take.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [HACKERS] Controlling Load Distributed Checkpoints

2007-06-06 Thread Greg Smith

On Wed, 6 Jun 2007, Tom Lane wrote:


> If we don't know how to tune them, how will the users know?


I can tell you a good starting set for them on a Linux system, but you 
first have to let me know how much memory is in the OS buffer cache, the 
typical I/O rate the disks can support, how many buffers are expected to 
be written out by BGW/other backends at heaviest load, and the current 
setting for /proc/sys/vm/dirty_background_ratio.  It's not a coincidence 
that there are patches applied to 8.3 or in the queue to measure all of 
the Postgres internals involved in that computation; I've been picking 
away at the edges of this problem.


Getting this sort of tuning right takes that level of information about 
the underlying system.  If there's a way to internally auto-tune the 
values this patch operates on (which I haven't found despite months of 
trying), it would be in the form of some sort of measurement/feedback loop 
based on how fast data is being written out.  There really are way too 
many things involved to try and tune it based on anything else; the 
underlying OS/hardware mechanisms that determine how this will go are 
complicated enough that it might as well be a black box for most people.


One of the things I've been fiddling with the design of is a testing 
program that simulates database activity at checkpoint time under load. 
I think running some tests like that is the most straightforward way to 
generate useful values for these tunables; it's much harder to try and 
determine them from within the backends because there's so much going on 
to keep track of.


I view the LDC mechanism as being in the same state right now as the 
background writer:  there are a lot of complicated knobs to tweak, they 
all do *something* useful for someone, and eliminating them will require a 
data-collection process across a much wider sample of data than can be 
collected quickly.  If I had to make a guess how this will end up, I'd 
expect there to be more knobs in LDC than everyone would like for the 8.3 
release, along with fairly verbose logging of what is happening at 
checkpoint time (that's why I've been nudging development in that area, 
along with making logs easier to aggregate).  Collect up enough of that 
information, then you're in a position to talk about useful automatic 
tuning--right around the 8.4 timeframe I suspect.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [HACKERS] Controlling Load Distributed Checkpoints

2007-06-06 Thread Tom Lane
Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> GUC summary and suggested default values
> 
> checkpoint_write_percent = 50      # % of checkpoint interval to spread out writes
> checkpoint_write_min_rate = 1000   # minimum I/O rate to write dirty buffers at checkpoint (KB/s)
> checkpoint_nap_duration = 2        # delay between write and sync phase, in seconds
> checkpoint_fsync_period = 30       # duration of the sync phase, in seconds
> checkpoint_fsync_delay = 500       # max. delay between fsyncs

> I don't like adding that many GUC variables, but I don't really see a 
> way to tune them automatically.

If we don't know how to tune them, how will the users know?  Having to
add that many variables to control one feature says to me that we don't
understand the feature.

Perhaps what we need is to think about how it can auto-tune itself.

regards, tom lane



Re: [HACKERS] Controlling Load Distributed Checkpoints

2007-06-06 Thread Gregory Stark
"Heikki Linnakangas" <[EMAIL PROTECTED]> writes:

> GUC summary and suggested default values
> 
> checkpoint_write_percent = 50      # % of checkpoint interval to spread out writes
> checkpoint_write_min_rate = 1000   # minimum I/O rate to write dirty buffers at checkpoint (KB/s)

I don't understand why this is a min_rate rather than a max_rate.


> checkpoint_nap_duration = 2        # delay between write and sync phase, in seconds

Not a comment on the choice of guc parameters, but don't we expect useful
values of this to be much closer to 30 than 0? I understand it might not be
exactly 30.

Actually, it's not so much whether there's any write traffic to the data files
during the nap that matters, it's whether there's more traffic during the nap
than during the 30s or so prior to the nap. As long as it's a steady-state
condition it shouldn't matter how long we wait, should it?

> checkpoint_fsync_period = 30  # duration of the sync phase, in seconds
> checkpoint_fsync_delay = 500  # max. delay between fsyncs

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com




[HACKERS] Controlling Load Distributed Checkpoints

2007-06-06 Thread Heikki Linnakangas
I'm again looking at the way the GUC variables work in the load distributed 
checkpoints patch. We've discussed them a lot already, but I don't think 
they're quite right yet.


Write-phase
---
I like the way the write-phase is controlled in general. Writes are 
throttled so that we spend the specified percentage of checkpoint 
interval doing the writes. But we always write at a specified minimum 
rate to avoid spreading out the writes unnecessarily when there's little 
work to do.


The original patch uses bgwriter_all_max_pages to set the minimum rate. 
I think we should have a separate variable, checkpoint_write_min_rate, 
in KB/s, instead.


Nap phase
-
This is trickier. The purpose of the sleep between writes and fsyncs is 
to give the OS a chance to flush the pages to disk at its own pace, 
hopefully limiting the effect on concurrent activity. The sleep 
shouldn't last too long, because any concurrent activity can be dirtying 
and writing more pages, and we might end up fsyncing more than necessary, 
which is bad for performance. The optimal delay depends on many factors, 
but I believe it's somewhere between 0-30 seconds in any reasonable system.


In the current patch, the duration of the sleep between the write and 
sync phases is controlled as a percentage of the checkpoint interval. 
Given that the optimal delay is in the range of seconds, and 
checkpoint_timeout can be up to 60 minutes, the useful values of that 
percentage would be very small, like 0.5% or even less. Furthermore, the 
optimal value doesn't depend that much on the checkpoint interval; it 
depends more on your OS and memory configuration.


We should therefore give the delay as a number of seconds instead of as 
a percentage of checkpoint interval.


Sync phase
--
This is also tricky. As with the nap phase, we don't want to spend too 
much time fsyncing, because concurrent activity will write more dirty 
pages and we might just end up doing more work.


And we don't know how much work an fsync performs. The patch uses the 
file size as a measure of that, but as we discussed that doesn't 
necessarily have anything to do with reality. fsyncing a 1GB file with 
one dirty block isn't any more expensive than fsyncing a file with a 
single block.


Another problem is the granularity of an fsync. If we fsync a 1GB file 
that's full of dirty pages, we can't limit the effect on other activity. 
The best we can do is to sleep between fsyncs, but sleeping more than a 
few seconds is hardly going to be useful, no matter how bad an I/O storm 
each fsync causes.


Because of the above, I'm thinking we should ditch the 
checkpoint_sync_percentage variable, in favor of:

checkpoint_fsync_period # duration of the fsync phase, in seconds
checkpoint_fsync_delay  # max. sleep between fsyncs, in milliseconds


In all phases, the normal bgwriter activities are performed: 
lru-cleaning and switching xlog segments if archive_timeout expires. If 
a new checkpoint request arrives while the previous one is still in 
progress, we skip all the delays and finish the previous checkpoint as 
soon as possible.



GUC summary and suggested default values

checkpoint_write_percent = 50      # % of checkpoint interval to spread out writes
checkpoint_write_min_rate = 1000   # minimum I/O rate to write dirty buffers at checkpoint (KB/s)
checkpoint_nap_duration = 2        # delay between write and sync phase, in seconds
checkpoint_fsync_period = 30       # duration of the sync phase, in seconds
checkpoint_fsync_delay = 500       # max. delay between fsyncs, in milliseconds

I don't like adding that many GUC variables, but I don't really see a 
way to tune them automatically. Maybe we could just hard-code the last 
one; it doesn't seem that critical, but that still leaves us four variables.


Thoughts?

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
