Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-21 Thread Heikki Linnakangas

ITAGAKI Takahiro wrote:

The only thing I don't understand is the naming of 'checkpoint_smoothing'.
Will users understand that the unit of 'smoothing' is a fraction?

You explain the parameter with the word 'fraction'.
Why don't you simply name it 'checkpoint_fraction'?
| Specifies the target length of checkpoints, as a fraction of
| the checkpoint interval. The default is 0.3.


I chose checkpoint_smoothing because it tells you what the parameter is 
for. If you want more smoothing, tune it up, and if you want less, tune 
it down. checkpoint_fraction makes you wonder what you can do with it 
and why you would change it.



Sorry if I'm missing discussions about the naming.


No, I chose _smoothing on my own. People didn't like
checkpoint_write_percent either (myself included).


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-21 Thread Heikki Linnakangas

Tom Lane wrote:

Heikki Linnakangas [EMAIL PROTECTED] writes:
I added a spinlock to protect the signaling fields between bgwriter and 
backends. The current non-locking approach gets really difficult as the 
patch adds two new flags, and both are more important than the existing 
ckpt_time_warn flag.


That may be, but you could minimize the cost and notational ugliness by
not using the spinlock where you don't have to.  Put the sig_atomic_t
fields back the way they were, and many of the uses of the spinlock go
away.  All you really need it for is the places where the additional
flags are set or read.


I find it easier to understand if it's used whenever any of the fields
are accessed. You no longer need to read and write ckpt_failed and
ckpt_started/ckpt_done in a specific order, for example.
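A stand-alone sketch of that locked-access discipline, using a pthread mutex as a stand-in for PostgreSQL's slock_t/SpinLockAcquire; the struct and field names only loosely mirror the real shared-memory layout:

```c
#include <pthread.h>
#include <assert.h>

/* Simplified stand-in for the shared bgwriter request structure. */
typedef struct
{
    pthread_mutex_t lock;      /* stand-in for the new spinlock */
    int ckpt_started;          /* bumped when a checkpoint begins */
    int ckpt_done;             /* copied from ckpt_started at the end */
    int ckpt_failed;           /* error counter */
} CkptSignalShmem;

/* Backend side: read the counter under the lock, no ordering tricks. */
static int
read_ckpt_started(CkptSignalShmem *shmem)
{
    int v;

    pthread_mutex_lock(&shmem->lock);
    v = shmem->ckpt_started;
    pthread_mutex_unlock(&shmem->lock);
    return v;
}

/* Bgwriter side: all updates happen under the same lock. */
static void
checkpoint_begin(CkptSignalShmem *shmem)
{
    pthread_mutex_lock(&shmem->lock);
    shmem->ckpt_started++;
    pthread_mutex_unlock(&shmem->lock);
}

static void
checkpoint_end(CkptSignalShmem *shmem, int failed)
{
    pthread_mutex_lock(&shmem->lock);
    if (failed)
        shmem->ckpt_failed++;
    shmem->ckpt_done = shmem->ckpt_started;
    pthread_mutex_unlock(&shmem->lock);
}
```

Because every access goes through the lock, no field needs to be a sig_atomic_t and no read/write ordering has to be reasoned about.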



Some other comments:

I tend to agree with whoever said upthread that the combination of GUC
variables proposed here is confusing and ugly.  It'd make more sense to
have min and max checkpoint rates in KB/s, with the max checkpoint rate
only honored when we are predicting we'll finish before the next
checkpoint time.


Really? I thought everyone is happy with the current combination, and 
that it was just the old trio of parameters controlling the write, nap 
and sync phases that was ugly. One particularly nice thing about 
defining the duration as a fraction of checkpoint interval is that we 
can come up with a reasonable default value that doesn't depend on your 
hardware.


How would a min and max rate work?

Anyone else have an opinion on the parameters?


The flow of control and responsibility between xlog.c, bgwriter.c and
bufmgr.c seems to have reached the breaking point of unintelligibility.
Can you think of a refactoring that would simplify it?  We might have
to resign ourselves to this much messiness, but I'd rather not...


The problem we're trying to solve is doing a checkpoint while running 
the normal bgwriter activity at the same time. The normal way to do two 
things simultaneously is to have two different processes (or threads). I 
thought about having a separate checkpoint process, but I gave up on 
that thought because the communication needed between backends, bgwriter 
and the checkpointer seems like a mess. The checkpointer would need the 
pendingOpsTable so that it can do the fsyncs, and it would also need to 
receive the forget messages for that table. We could move that table
entirely to the checkpointer, but since bgwriter is presumably doing 
most of the writes, there would be a lot of chatter between bgwriter and 
the checkpointer.


The current approach is like co-operative multitasking: BufferSync
yields control to the bgwriter every now and then.
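That co-operative scheme could be sketched like this (function names and structure are illustrative, not the actual BufferSync code):

```c
#include <assert.h>

static int bgwriter_calls = 0;          /* demo hook: counts yields */
static void fake_bgwriter_work(void) { bgwriter_calls++; }

/* Illustrative sketch: the checkpoint write loop periodically hands
 * control back to normal bgwriter duties instead of running the
 * checkpoint in a separate process. */
static int
buffer_sync_cooperative(int num_dirty, int writes_per_nap,
                        void (*do_bgwriter_work)(void))
{
    int written = 0;

    while (written < num_dirty)
    {
        /* ... write one dirty buffer toward the checkpoint ... */
        written++;

        /* every writes_per_nap buffers, yield so normal LRU cleaning
         * and fsync-request absorption can run */
        if (written % writes_per_nap == 0)
            do_bgwriter_work();
    }
    return written;
}
```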


The division of labor between xlog.c and other modules is not that bad, 
IMO. CreateCheckPoint is the main entry point to create a checkpoint, 
and it calls other modules to do their stuff, including bufmgr.c.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-21 Thread Heikki Linnakangas

Tom Lane wrote:

Heikki Linnakangas [EMAIL PROTECTED] writes:

In fact, I think there's a small race condition in CVS HEAD:


Yeah, probably --- the original no-locking design didn't have any side
flags.  The reason you need the lock is for a backend to be sure that
a newly-started checkpoint is using its requested flags.  But the
detection of checkpoint termination is still the same.


Actually, the race condition I outlined isn't related to the flags. It's
possible because RequestCheckpoint doesn't guarantee that a checkpoint
is performed when there's been no WAL activity since the last one.


I did use a new force-flag to fix it, but I'm sure there are other ways.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-21 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 I tend to agree with whoever said upthread that the combination of GUC
 variables proposed here is confusing and ugly.  It'd make more sense to
 have min and max checkpoint rates in KB/s, with the max checkpoint rate
 only honored when we are predicting we'll finish before the next
 checkpoint time.

 Really? I thought everyone was happy with the current combination, and
 that it was just the old trio of parameters controlling the write, nap
 and sync phases that was ugly. One particularly nice thing about
 defining the duration as a fraction of the checkpoint interval is that
 we can come up with a reasonable default value that doesn't depend on
 your hardware.

That argument would hold some water if you weren't introducing a
hardware-dependent min rate in the same patch.  Do we need the min rate
at all?  If so, why can't it be in the same units as the max (ie, a
fraction of checkpoint)?

 How would a min and max rate work?

Pretty much the same as the code does now, no?  You either delay, or not.

regards, tom lane



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-21 Thread Heikki Linnakangas

Tom Lane wrote:

Heikki Linnakangas [EMAIL PROTECTED] writes:

Tom Lane wrote:

I tend to agree with whoever said upthread that the combination of GUC
variables proposed here is confusing and ugly.  It'd make more sense to
have min and max checkpoint rates in KB/s, with the max checkpoint rate
only honored when we are predicting we'll finish before the next
checkpoint time.


Really? I thought everyone was happy with the current combination, and
that it was just the old trio of parameters controlling the write, nap
and sync phases that was ugly. One particularly nice thing about
defining the duration as a fraction of the checkpoint interval is that
we can come up with a reasonable default value that doesn't depend on
your hardware.


That argument would hold some water if you weren't introducing a
hardware-dependent min rate in the same patch.  Do we need the min rate
at all?  If so, why can't it be in the same units as the max (ie, a
fraction of checkpoint)?


How would a min and max rate work?


Pretty much the same as the code does now, no?  You either delay, or not.


I don't think you understand how the settings work. Did you read the 
documentation? If you did, it's apparently not adequate.


The main tuning knob is checkpoint_smoothing, which is defined as a 
fraction of the checkpoint interval (both checkpoint_timeout and 
checkpoint_segments are taken into account). Normally, the write phase 
of a checkpoint takes exactly that much time.


So the length of a checkpoint stays the same regardless of how many
dirty buffers there are (assuming you don't exceed the bandwidth of your
hardware), but the I/O rate varies.
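As a hypothetical worked example of that behavior (all numbers invented for illustration, not the patch's defaults):

```c
#include <assert.h>
#include <math.h>

/* With a fixed-duration write phase of (smoothing * interval) seconds,
 * the implied write rate scales with the number of dirty pages while
 * the checkpoint length stays constant. */
static double
required_pages_per_sec(int dirty_pages, double interval_sec,
                       double smoothing)
{
    double write_phase_sec = smoothing * interval_sec;  /* e.g. 0.3 * 300s = 90s */

    return dirty_pages / write_phase_sec;
}
```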


There's another possible strategy: keep the I/O rate constant, but vary 
the length of the checkpoint. checkpoint_rate allows you to do that.


I'm envisioning we set the defaults so that checkpoint_smoothing is the 
effective control in a relatively busy system, and checkpoint_rate 
ensures that we don't unnecessarily prolong checkpoints on an idle system.


Now how would you replace checkpoint_smoothing with a max I/O rate? The 
only real alternative way of defining it is directly as a time and/or 
segments, similar to checkpoint_timeout and checkpoint_segments, but 
then we'd really need two knobs instead of one.


Though maybe we could just hard-code it to 0.8, for example, and tell 
people to tune checkpoint_rate if they want more aggressive checkpoints.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-21 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 I don't think you understand how the settings work. Did you read the 
 documentation? If you did, it's apparently not adequate.

I did read the documentation, and I'm not complaining that I don't
understand it.  I'm complaining that I don't like the presented API
because it's self-inconsistent.  You've got two parameters that are in
effect upper and lower bounds for the checkpoint write rate, but they
are named inconsistently and not even measured in the same kind of unit.
Nor do I agree that the inconsistency buys any ease of use.

 The main tuning knob is checkpoint_smoothing, which is defined as a 
 fraction of the checkpoint interval (both checkpoint_timeout and 
 checkpoint_segments are taken into account). Normally, the write phase 
 of a checkpoint takes exactly that much time.

So the question is, why in the heck would anyone want the behavior that
checkpoints take exactly X time??  The useful part of this whole patch
is to cap the write rate at something that doesn't interfere too much
with foreground queries.  I don't see why people wouldn't prefer
"checkpoints can take any amount of time up to the checkpoint interval,
but we do our best not to exceed Y writes/second".

Basically I don't see what useful values checkpoint_smoothing would have
other than 0 and 1.  You might as well make it a bool.

 There's another possible strategy: keep the I/O rate constant, but vary 
 the length of the checkpoint. checkpoint_rate allows you to do that.

But only from the lower side.

 Now how would you replace checkpoint_smoothing with a max I/O rate?

I don't see why you think that's hard.  It looks to me like the
components of the decision are the same numbers in any case: you have to
estimate your progress towards checkpoint completion, your available
time till next checkpoint, and your write rate.  Then you either delay
or not.

regards, tom lane



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-21 Thread Heikki Linnakangas

Tom Lane wrote:

Heikki Linnakangas [EMAIL PROTECTED] writes:
The main tuning knob is checkpoint_smoothing, which is defined as a 
fraction of the checkpoint interval (both checkpoint_timeout and 
checkpoint_segments are taken into account). Normally, the write phase 
of a checkpoint takes exactly that much time


So the question is, why in the heck would anyone want the behavior that
checkpoints take exactly X time??  The useful part of this whole patch
is to cap the write rate at something that doesn't interfere too much
with foreground queries.  I don't see why people wouldn't prefer
"checkpoints can take any amount of time up to the checkpoint interval,
but we do our best not to exceed Y writes/second".


Because it's easier to tune. You don't need to know how much checkpoint 
I/O you can tolerate. The system will use just enough I/O bandwidth to 
meet the deadline, but not more than that.



Basically I don't see what useful values checkpoint_smoothing would have
other than 0 and 1.  You might as well make it a bool.


Well, that's one option. It feels like a good thing to be able to
control how much headroom you have until the next checkpoint, but maybe
we can just hardcode it close to 1. It's also good to avoid spreading
the checkpoints unnecessarily, to keep recovery times lower, but you can
control that with the min rate setting as well.


There's another possible strategy: keep the I/O rate constant, but vary 
the length of the checkpoint. checkpoint_rate allows you to do that.


But only from the lower side.


Now how would you replace checkpoint_smoothing with a max I/O rate?


I don't see why you think that's hard.  It looks to me like the
components of the decision are the same numbers in any case: you have to
estimate your progress towards checkpoint completion, your available
time till next checkpoint, and your write rate.  Then you either delay
or not.


Let me put it this way: if you define a min and a max I/O rate, when
would the max I/O rate limit take effect? If there are few enough dirty
buffers in the pool that writing at the min rate will finish the
checkpoint before the next one is due, that's the rate you'll use. If
there are more, you'll need to write fast enough to finish the
checkpoint in time, regardless of the max rate. Or would you let the
next checkpoint slip and keep writing at the max rate? That seems like
a footgun if someone sets it too low.


Or are you thinking that we'd have just one setting, checkpoint_rate?
You describe it as a maximum, but I've been thinking of it as a minimum,
because you *will* exceed it if the next checkpoint is due soon.
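The minimum-rate semantics described here might be sketched as follows (names and units are illustrative, not the patch's API):

```c
#include <assert.h>

/* checkpoint_rate acts as a floor; the deadline-driven rate overrides
 * it when the next checkpoint is due soon. */
static double
effective_write_rate(double min_rate,                 /* pages/s floor */
                     int dirty_pages,
                     double sec_until_next_checkpoint)
{
    double deadline_rate = dirty_pages / sec_until_next_checkpoint;

    /* write at least min_rate, faster if needed to finish in time */
    return deadline_rate > min_rate ? deadline_rate : min_rate;
}
```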


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] Transaction Guarantee, updated version

2007-06-21 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 Completed all of the agreed changes for TG:

I've just realized that there's a fatal problem with this design.
We've now got tqual.c setting page LSN when it holds only share lock
on the buffer.  That will absolutely not work, eg two backends might
concurrently set different values and end up with garbage (since it's
unlikely that an LSN store is atomic).

Can we fix it to be a read test instead of a write test, that is, if
we know WAL has been flushed through the target LSN, it's safe to set
the hint bit, else not?
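The proposed read test could look roughly like this, with stand-in types (the real code would compare the commit LSN against how far WAL is known to be flushed):

```c
#include <stdint.h>
#include <assert.h>

typedef uint64_t XLogRecPtr;   /* stand-in for PostgreSQL's LSN type */

/* Set the hint bit only if WAL is already flushed past the
 * transaction's commit record, so no flush (and no page-LSN write
 * under shared lock) is ever needed. */
static int
can_set_hint_bit(XLogRecPtr commit_lsn, XLogRecPtr flushed_upto)
{
    return commit_lsn <= flushed_upto;
}
```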

In general, I think a transaction abort should not need to flush
anything, since the default assumption is that it crashed anyway.
Hence for instance recording a transaction abort needn't advance
the LSN of the clog page.  (You seem to have it flushing through
the last xlog record written by the backend, which is exactly what
it doesn't need to do.)  By extension, it should be OK to set INVALID
(aborted) hint bits in a tuple header without any concerns about
flushing.

Also, I'm sort of wondering if we really need a separate walwriter
process; that code seems awfully duplicative.  Is there a reason
not to have the bgwriter include this functionality?


In lesser news:

The caching logic in TransactionGetCommitLSN is obviously broken.

Is there really a use-case for adding a pgstat counter for guaranteed
transactions?  That adds pgstat overhead, and bloats the patch
noticeably, and I don't entirely see the value of it.

There's some padding junk inserted in XLogCtlData, which as far as I
recall was never discussed, and is certainly not an integral part of the
delayed-commit feature.  If you want that you should submit and defend
it separately.

regards, tom lane



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-21 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 So the question is, why in the heck would anyone want the behavior that
 checkpoints take exactly X time??

 Because it's easier to tune. You don't need to know how much checkpoint 
 I/O you can tolerate. The system will use just enough I/O bandwidth to 
 meet the deadline, but not more than that.

Uh, not really.  After consuming more caffeine and reading the patch
more carefully, I think there are several problems here:

1. checkpoint_rate is used thusly:

writes_per_nap = Max(1, checkpoint_rate / BgWriterDelay);

where writes_per_nap is the max number of dirty blocks to write before
taking a bgwriter nap.  Now surely this is completely backward: if
BgWriterDelay is increased, the number of writes to allow per nap
decreases?  If you think checkpoint_rate is expressed in some kind of
physical bytes/sec unit, that cannot be right; the number of blocks
per nap has to increase if the naps get longer.  (BTW, the patch seems
a bit schizoid about whether checkpoint_rate is int or float.)

2. checkpoint_smoothing is used thusly:

/* scale progress according to CheckPointSmoothing */
progress *= CheckPointSmoothing;

where the progress value being scaled is the fraction so far completed
of the total number of dirty pages we expect to have to write.  This
is then compared against estimates of the total fraction of the time-
between-checkpoints that has elapsed; if less, we are behind schedule
and should not nap, if more, we are ahead of schedule and may nap.
(This is a bit odd, but I guess it's all right because it's equivalent
to dividing the elapsed-time estimate by CheckPointSmoothing, which
seems a more natural way of thinking about what needs to happen.)
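In stand-alone form, the scaled comparison amounts to this (a sketch with an invented signature):

```c
#include <assert.h>

/* Nap when scaled write progress is ahead of the elapsed fraction of
 * the checkpoint interval; otherwise keep writing at full speed. */
static int
should_nap(double written_fraction,   /* dirty pages written / total */
           double elapsed_fraction,   /* time or WAL used / interval */
           double smoothing)
{
    double progress = written_fraction * smoothing;

    /* equivalent to comparing written_fraction against
     * elapsed_fraction / smoothing */
    return progress > elapsed_fraction;
}
```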

What's bugging me about this is that we are either going to be writing
at checkpoint_rate if ahead of schedule, or max possible rate if behind;
that's not smoothing to me, that's introducing some pretty bursty
behavior.  ISTM that actual smoothing would involve adjusting
writes_per_nap up or down according to whether we are ahead or behind
schedule, so as to have a finer degree of control over the I/O rate.
(I'd also consider saving the last writes_per_nap value across
checkpoints so as to have a more nearly accurate starting value next
time.)
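A minimal sketch of that kind of feedback adjustment (details invented for illustration):

```c
#include <assert.h>

/* Nudge writes_per_nap up when behind schedule and down when ahead,
 * instead of flipping between "nap" and "write at full speed". */
static int
adjust_writes_per_nap(int writes_per_nap, double progress,
                      double elapsed_fraction)
{
    if (progress < elapsed_fraction)
        writes_per_nap++;                 /* behind schedule: write more */
    else if (progress > elapsed_fraction && writes_per_nap > 1)
        writes_per_nap--;                 /* ahead: back off, floor at 1 */
    return writes_per_nap;
}
```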

In any case I still concur with Takahiro-san that smoothing doesn't
seem the most appropriate name for the parameter.  Something along the
lines of checkpoint_completion_target would convey more about what it
does, I think.

And checkpoint_rate really needs to be named checkpoint_min_rate, if
it's going to be a minimum.  However, I question whether we need it at
all, because as the code stands, with the default BgWriterDelay you
would have to increase checkpoint_rate to 4x its proposed default before
writes_per_nap moves off its minimum of 1.  This says to me that the
system's tested behavior has been so insensitive to checkpoint_rate
that we probably need not expose such a parameter at all; just hardwire
the minimum writes_per_nap at 1.

regards, tom lane



Re: [PATCHES] psql: flush output in cursor-fetch mode

2007-06-21 Thread Neil Conway
On Wed, 2007-06-20 at 15:51 -0700, Neil Conway wrote:
 Attached is a patch that fixes this, by calling fflush() on the psql
 output stream after each call to printQuery() in ExecQueryUsingCursor().
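A stand-alone illustration of the pattern (printQuery and ExecQueryUsingCursor are psql internals; this demo just prints one batch of rows to a stream and flushes it, so a reader on the other end sees the rows as each FETCH completes rather than only at exit):

```c
#include <stdio.h>
#include <string.h>
#include <assert.h>

static void
print_batch(FILE *fout, const char *rows)
{
    fputs(rows, fout);   /* stand-in for printQuery() */
    fflush(fout);        /* the added call: push this batch out now */
}
```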

Applied to HEAD.

-Neil





Re: [PATCHES] psql: flush output in cursor-fetch mode

2007-06-21 Thread Tom Lane
Neil Conway [EMAIL PROTECTED] writes:
 On Wed, 2007-06-20 at 15:51 -0700, Neil Conway wrote:
 Attached is a patch that fixes this, by calling fflush() on the psql
 output stream after each call to printQuery() in ExecQueryUsingCursor().

 Applied to HEAD.

Seems reasonable to back-patch into 8.2 too...

regards, tom lane



Re: [PATCHES] Cancel autovacuum conflicting with DROP TABLE

2007-06-21 Thread ITAGAKI Takahiro
ITAGAKI Takahiro [EMAIL PROTECTED] wrote:

 Here is a patch that cancels autovacuum workers conflicting with
 DROP TABLE, TRUNCATE and CLUSTER. It was discussed here:
 http://archives.postgresql.org/pgsql-hackers/2007-06/msg00556.php

I made an adjustment for the latest 'more autovacuum fixes' patch.
http://archives.postgresql.org/pgsql-patches/2007-06/msg00217.php

Now, autovacuum workers handle ProcDie signals using ERROR
instead of FATAL. The exception is caught by the worker and
converted to the following log messages:

  SIGINT -- Cancel the current job.
  LOG: autovacuum on db.schema.table is canceled
  SIGTERM -- Cancel all jobs.
  LOG: autovacuum on db is canceled

We are planning to ship 8.3 with autovacuum=on, so users will be
more likely to see conflicts between autovacuum and their commands.
This makes autovacuum more gentle.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center



cancel_autovac_on_drop-2.patch
Description: Binary data



Re: [PATCHES] psql: flush output in cursor-fetch mode

2007-06-21 Thread Neil Conway
On Thu, 2007-06-21 at 22:09 -0400, Tom Lane wrote:
 Seems reasonable to back-patch into 8.2 too...

Okay, done.

-Neil


