Re: [PATCHES] Load Distributed Checkpoints, take 3
ITAGAKI Takahiro wrote:
> The only thing I don't understand is the naming of 'checkpoint_smoothing'.
> Can users imagine the unit of 'smoothing' is a fraction? You explain the
> parameter with the word 'fraction'. Why don't you simply name it
> 'checkpoint_fraction'?
>
> | Specifies the target length of checkpoints, as a fraction of
> | the checkpoint interval. The default is 0.3.

I chose checkpoint_smoothing because it tells you what the parameter is
for: if you want more smoothing, tune it up, and if you want less, tune it
down. checkpoint_fraction makes you wonder what you can do with it and why
you would change it.

> Sorry if I'm missing discussions about the naming.

No, I chose _smoothing on my own. People didn't like
checkpoint_write_percent either.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster
Re: [PATCHES] Load Distributed Checkpoints, take 3
Tom Lane wrote:
> Heikki Linnakangas [EMAIL PROTECTED] writes:
>> I added a spinlock to protect the signaling fields between bgwriter and
>> backends. The current non-locking approach gets really difficult as the
>> patch adds two new flags, and both are more important than the existing
>> ckpt_time_warn flag.
>
> That may be, but you could minimize the cost and notational ugliness by
> not using the spinlock where you don't have to. Put the sig_atomic_t
> fields back the way they were, and many of the uses of the spinlock go
> away. All you really need it for is the places where the additional
> flags are set or read.

I find it easier to understand if it's used whenever any of the fields are
accessed. You don't need to read and write ckpt_failed and
ckpt_started/ckpt_done in a specific order anymore, for example.

> Some other comments:
>
> I tend to agree with whoever said upthread that the combination of GUC
> variables proposed here is confusing and ugly. It'd make more sense to
> have min and max checkpoint rates in KB/s, with the max checkpoint rate
> only honored when we are predicting we'll finish before the next
> checkpoint time.

Really? I thought everyone was happy with the current combination, and
that it was just the old trio of parameters controlling the write, nap and
sync phases that was ugly. One particularly nice thing about defining the
duration as a fraction of the checkpoint interval is that we can come up
with a reasonable default value that doesn't depend on your hardware. How
would a min and max rate work? Anyone else have an opinion on the
parameters?

> The flow of control and responsibility between xlog.c, bgwriter.c and
> bufmgr.c seems to have reached the breaking point of unintelligibility.
> Can you think of a refactoring that would simplify it? We might have to
> resign ourselves to this much messiness, but I'd rather not...

The problem we're trying to solve is doing a checkpoint while running the
normal bgwriter activity at the same time.
The normal way to do two things simultaneously is to have two different
processes (or threads). I thought about having a separate checkpoint
process, but I gave up on that thought because the communication needed
between backends, bgwriter and the checkpointer seems like a mess. The
checkpointer would need the pendingOpsTable so that it can do the fsyncs,
and it would also need to receive the forget-messages to that table. We
could move that table entirely to the checkpointer, but since bgwriter is
presumably doing most of the writes, there would be a lot of chatter
between bgwriter and the checkpointer. The current approach is like
co-operative multitasking: BufferSync yields control to bgwriter every now
and then.

The division of labor between xlog.c and other modules is not that bad,
IMO. CreateCheckPoint is the main entry point to create a checkpoint, and
it calls other modules to do their stuff, including bufmgr.c.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
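[Editor's note] The locking style Heikki describes — every access to the signaling fields goes through one lock, so callers need no field-ordering rules — can be sketched outside PostgreSQL like this. This is a toy model with hypothetical names, not the patch's actual shared-memory structs; a pthread mutex stands in for PostgreSQL's spinlock:

```c
#include <pthread.h>

/* Toy model of the shared signaling fields between backends and the
 * bgwriter. Every access to the counters goes through the lock, so no
 * rules are needed about the order in which ckpt_failed and
 * ckpt_started/ckpt_done are read and written. */
typedef struct
{
    pthread_mutex_t lock;
    int             ckpt_started;  /* advanced when a checkpoint begins */
    int             ckpt_done;     /* advanced when one completes */
    int             ckpt_failed;   /* advanced when one fails */
    int             ckpt_flags;    /* pending request flags, e.g. "force" */
} CkptShared;

static void
ckpt_request(CkptShared *s, int flags)
{
    pthread_mutex_lock(&s->lock);
    s->ckpt_flags |= flags;        /* merge with any pending request */
    pthread_mutex_unlock(&s->lock);
}

/* Did the checkpoint we waited for finish successfully? Both counters
 * are read under the same lock, so no ordering subtleties arise. */
static int
ckpt_finished_ok(CkptShared *s, int started, int failed_before)
{
    int ok;

    pthread_mutex_lock(&s->lock);
    ok = (s->ckpt_done >= started && s->ckpt_failed == failed_before);
    pthread_mutex_unlock(&s->lock);
    return ok;
}
```

The non-locking sig_atomic_t alternative Tom prefers would instead rely on each flag being written by only one side, with a defined read/write order between the counters.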
Re: [PATCHES] Load Distributed Checkpoints, take 3
Tom Lane wrote:
> Heikki Linnakangas [EMAIL PROTECTED] writes:
>> In fact, I think there's a small race condition in CVS HEAD:
>
> Yeah, probably --- the original no-locking design didn't have any side
> flags. The reason you need the lock is for a backend to be sure that a
> newly-started checkpoint is using its requested flags. But the detection
> of checkpoint termination is still the same.

Actually, the race condition I outlined isn't related to the flags. It's
possible because RequestCheckpoint doesn't guarantee that a checkpoint is
performed when there's been no WAL activity since the last one. I did use
a new force-flag to fix it, but I'm sure there are other ways.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Re: [PATCHES] Load Distributed Checkpoints, take 3
Heikki Linnakangas [EMAIL PROTECTED] writes:
> Tom Lane wrote:
>> I tend to agree with whoever said upthread that the combination of GUC
>> variables proposed here is confusing and ugly. It'd make more sense to
>> have min and max checkpoint rates in KB/s, with the max checkpoint rate
>> only honored when we are predicting we'll finish before the next
>> checkpoint time.
>
> Really? I thought everyone was happy with the current combination, and
> that it was just the old trio of parameters controlling the write, nap
> and sync phases that was ugly. One particularly nice thing about
> defining the duration as a fraction of checkpoint interval is that we
> can come up with a reasonable default value that doesn't depend on your
> hardware.

That argument would hold some water if you weren't introducing a
hardware-dependent min rate in the same patch. Do we need the min rate at
all? If so, why can't it be in the same units as the max (ie, a fraction
of checkpoint)?

> How would a min and max rate work?

Pretty much the same as the code does now, no? You either delay, or not.

			regards, tom lane
Re: [PATCHES] Load Distributed Checkpoints, take 3
Tom Lane wrote:
> Heikki Linnakangas [EMAIL PROTECTED] writes:
>> Tom Lane wrote:
>>> I tend to agree with whoever said upthread that the combination of GUC
>>> variables proposed here is confusing and ugly. It'd make more sense to
>>> have min and max checkpoint rates in KB/s, with the max checkpoint
>>> rate only honored when we are predicting we'll finish before the next
>>> checkpoint time.
>>
>> Really? I thought everyone was happy with the current combination, and
>> that it was just the old trio of parameters controlling the write, nap
>> and sync phases that was ugly. One particularly nice thing about
>> defining the duration as a fraction of checkpoint interval is that we
>> can come up with a reasonable default value that doesn't depend on your
>> hardware.
>
> That argument would hold some water if you weren't introducing a
> hardware-dependent min rate in the same patch. Do we need the min rate
> at all? If so, why can't it be in the same units as the max (ie, a
> fraction of checkpoint)?
>
>> How would a min and max rate work?
>
> Pretty much the same as the code does now, no? You either delay, or not.

I don't think you understand how the settings work. Did you read the
documentation? If you did, it's apparently not adequate.

The main tuning knob is checkpoint_smoothing, which is defined as a
fraction of the checkpoint interval (both checkpoint_timeout and
checkpoint_segments are taken into account). Normally, the write phase of
a checkpoint takes exactly that much time. So the length of a checkpoint
stays the same regardless of how many dirty buffers there are (assuming
you don't exceed the bandwidth of your hardware), but the I/O rate varies.

There's another possible strategy: keep the I/O rate constant, but vary
the length of the checkpoint. checkpoint_rate allows you to do that. I'm
envisioning we set the defaults so that checkpoint_smoothing is the
effective control in a relatively busy system, and checkpoint_rate ensures
that we don't unnecessarily prolong checkpoints on an idle system.
Now how would you replace checkpoint_smoothing with a max I/O rate? The
only real alternative way of defining it is directly as a time and/or
segments, similar to checkpoint_timeout and checkpoint_segments, but then
we'd really need two knobs instead of one. Though maybe we could just
hard-code it to 0.8, for example, and tell people to tune checkpoint_rate
if they want more aggressive checkpoints.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
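[Editor's note] Heikki's two pacing strategies can be sketched as a toy scheduler. The function and parameter names here are hypothetical, not the patch's actual bufmgr.c code: written_fraction is the fraction of expected dirty pages already written, elapsed_fraction the fraction of the checkpoint interval elapsed, smoothing plays the role of checkpoint_smoothing, and min_rate that of checkpoint_rate:

```c
/* Deadline-based pacing (checkpoint_smoothing): we are on schedule if
 * the fraction written so far keeps pace with elapsed time, scaled so
 * that the write phase finishes when `smoothing` of the interval has
 * elapsed (e.g. 30% of the interval for smoothing = 0.3). */
static int
on_schedule(double written_fraction, double elapsed_fraction,
            double smoothing)
{
    return written_fraction >= elapsed_fraction / smoothing;
}

/* Combined decision: nap only when ahead of the deadline schedule AND
 * napping would not drop the write rate below the minimum, so idle
 * systems don't drag checkpoints out unnecessarily. */
static int
should_nap(double written_fraction, double elapsed_fraction,
           double smoothing, double current_rate, double min_rate)
{
    if (!on_schedule(written_fraction, elapsed_fraction, smoothing))
        return 0;                        /* behind deadline: keep writing */
    return current_rate >= min_rate;     /* ahead: nap unless too slow */
}
```

In this model, checkpoint_smoothing is the effective control when there are many dirty buffers (the deadline dominates), while checkpoint_rate takes over on a mostly idle system — matching the division of labor described above.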
Re: [PATCHES] Load Distributed Checkpoints, take 3
Heikki Linnakangas [EMAIL PROTECTED] writes:
> I don't think you understand how the settings work. Did you read the
> documentation? If you did, it's apparently not adequate.

I did read the documentation, and I'm not complaining that I don't
understand it. I'm complaining that I don't like the presented API,
because it's self-inconsistent. You've got two parameters that are in
effect upper and lower bounds for the checkpoint write rate, but they are
named inconsistently and not even measured in the same kind of unit. Nor
do I agree that the inconsistency buys any ease of use.

> The main tuning knob is checkpoint_smoothing, which is defined as a
> fraction of the checkpoint interval (both checkpoint_timeout and
> checkpoint_segments are taken into account). Normally, the write phase
> of a checkpoint takes exactly that much time.

So the question is, why in the heck would anyone want the behavior that
"checkpoints take exactly X time"? The useful part of this whole patch is
to cap the write rate at something that doesn't interfere too much with
foreground queries. I don't see why people wouldn't prefer "checkpoints
can take any amount of time up to the checkpoint interval, but we do our
best not to exceed Y writes/second". Basically I don't see what useful
values checkpoint_smoothing would have other than 0 and 1. You might as
well make it a bool.

> There's another possible strategy: keep the I/O rate constant, but vary
> the length of the checkpoint. checkpoint_rate allows you to do that.

But only from the lower side.

> Now how would you replace checkpoint_smoothing with a max I/O rate?

I don't see why you think that's hard. It looks to me like the components
of the decision are the same numbers in any case: you have to estimate
your progress towards checkpoint completion, your available time till
next checkpoint, and your write rate. Then you either delay or not.

			regards, tom lane
Re: [PATCHES] Load Distributed Checkpoints, take 3
Tom Lane wrote:
> Heikki Linnakangas [EMAIL PROTECTED] writes:
>> The main tuning knob is checkpoint_smoothing, which is defined as a
>> fraction of the checkpoint interval (both checkpoint_timeout and
>> checkpoint_segments are taken into account). Normally, the write phase
>> of a checkpoint takes exactly that much time.
>
> So the question is, why in the heck would anyone want the behavior that
> "checkpoints take exactly X time"? The useful part of this whole patch
> is to cap the write rate at something that doesn't interfere too much
> with foreground queries. I don't see why people wouldn't prefer
> "checkpoints can take any amount of time up to the checkpoint interval,
> but we do our best not to exceed Y writes/second".

Because it's easier to tune. You don't need to know how much checkpoint
I/O you can tolerate. The system will use just enough I/O bandwidth to
meet the deadline, but not more than that.

> Basically I don't see what useful values checkpoint_smoothing would
> have other than 0 and 1. You might as well make it a bool.

Well, that's one option. It feels like a good thing to be able to control
how much headroom you have until the next checkpoint, but maybe we can
just hardcode it close to 1. It's also good to avoid spreading the
checkpoints unnecessarily, to keep recovery times lower, but you can
control that with the min rate setting as well.

>> There's another possible strategy: keep the I/O rate constant, but
>> vary the length of the checkpoint. checkpoint_rate allows you to do
>> that.
>
> But only from the lower side.
>
>> Now how would you replace checkpoint_smoothing with a max I/O rate?
>
> I don't see why you think that's hard. It looks to me like the
> components of the decision are the same numbers in any case: you have
> to estimate your progress towards checkpoint completion, your available
> time till next checkpoint, and your write rate. Then you either delay
> or not.

Let me put it this way: if you define a min and a max I/O rate, when would
the max I/O rate limit take effect?
If there are few dirty buffers in the pool, so that writing at the min
rate will finish the checkpoint in time before the next one is due, that's
the rate you'll use. If there are more, you'll need to write fast enough
to finish the checkpoint in time, regardless of the max rate. Or would you
let the next checkpoint slip and keep writing at the max rate? That seems
like a footgun if someone sets it to too low a value.

Or are you thinking that we have just one setting, checkpoint_rate? You
describe it as a maximum, but I've been thinking of it as a minimum,
because you *will* exceed it if the next checkpoint is due soon.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Re: [PATCHES] Transaction Guarantee, updated version
Simon Riggs [EMAIL PROTECTED] writes:
> Completed all of the agreed changes for TG:

I've just realized that there's a fatal problem with this design. We've
now got tqual.c setting page LSN when it holds only share lock on the
buffer. That will absolutely not work; eg, two backends might concurrently
set different values and end up with garbage (since it's unlikely that an
LSN store is atomic). Can we fix it to be a read test instead of a write
test? That is, if we know WAL has been flushed through the target LSN,
it's safe to set the hint bit, else not.

In general, I think a transaction abort should not need to flush anything,
since the default assumption is that it crashed anyway. Hence, for
instance, recording a transaction abort needn't advance the LSN of the
clog page. (You seem to have it flushing through the last xlog record
written by the backend, which is exactly what it doesn't need to do.) By
extension, it should be OK to set INVALID (aborted) hint bits in a tuple
header without any concerns about flushing.

Also, I'm sort of wondering if we really need a separate walwriter
process; that code seems awfully duplicative. Is there a reason not to
have the bgwriter include this functionality?

In lesser news:

The caching logic in TransactionGetCommitLSN is obviously broken.

Is there really a use-case for adding a pgstat counter for guaranteed
transactions? That adds pgstat overhead, and bloats the patch noticeably,
and I don't entirely see the value of it.

There's some padding junk inserted in XLogCtlData, which as far as I
recall was never discussed, and is certainly not an integral part of the
delayed-commit feature. If you want that you should submit and defend it
separately.

			regards, tom lane
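[Editor's note] The "read test instead of a write test" Tom proposes can be sketched as follows. These are toy types and a hypothetical helper, not the actual tqual.c or XLOG interfaces — the point is only the decision rule:

```c
#include <stdint.h>

typedef uint64_t XLogRecPtr;      /* stand-in for PostgreSQL's LSN type */

#define HINT_COMMITTED  0x01
#define HINT_INVALID    0x02      /* transaction aborted */

/* May we set a hint bit without stamping the page LSN (and hence with
 * only a share lock on the buffer)? A COMMITTED hint is safe only if WAL
 * has already been flushed through the commit record's LSN -- a pure
 * read test against the known flush point. An INVALID (aborted) hint is
 * always safe: an unflushed abort just looks like a crash after
 * recovery, which is the default assumption anyway. */
static int
hint_bit_is_safe(int hint, XLogRecPtr commit_lsn, XLogRecPtr flushed_lsn)
{
    if (hint == HINT_INVALID)
        return 1;                        /* no flush requirement */
    return flushed_lsn >= commit_lsn;    /* read test, no LSN write */
}
```

Because nothing is written to the page header, two backends applying this test concurrently under share lock cannot produce the garbage-LSN race described above.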
Re: [PATCHES] Load Distributed Checkpoints, take 3
Heikki Linnakangas [EMAIL PROTECTED] writes:
> Tom Lane wrote:
>> So the question is, why in the heck would anyone want the behavior
>> that "checkpoints take exactly X time"?
>
> Because it's easier to tune. You don't need to know how much checkpoint
> I/O you can tolerate. The system will use just enough I/O bandwidth to
> meet the deadline, but not more than that.

Uh, not really. After consuming more caffeine and reading the patch more
carefully, I think there are several problems here:

1. checkpoint_rate is used thusly:

	writes_per_nap = Min(1, checkpoint_rate / BgWriterDelay);

where writes_per_nap is the max number of dirty blocks to write before
taking a bgwriter nap. Now surely this is completely backward: if
BgWriterDelay is increased, the number of writes to allow per nap
decreases? If you think checkpoint_rate is expressed in some kind of
physical bytes/sec unit, that cannot be right; the number of blocks per
nap has to increase if the naps get longer. (BTW, the patch seems a bit
schizoid about whether checkpoint_rate is int or float.)

2. checkpoint_smoothing is used thusly:

	/* scale progress according to CheckPointSmoothing */
	progress *= CheckPointSmoothing;

where the progress value being scaled is the fraction so far completed of
the total number of dirty pages we expect to have to write. This is then
compared against estimates of the total fraction of the
time-between-checkpoints that has elapsed; if less, we are behind schedule
and should not nap, if more, we are ahead of schedule and may nap. (This
is a bit odd, but I guess it's all right because it's equivalent to
dividing the elapsed-time estimate by CheckPointSmoothing, which seems a
more natural way of thinking about what needs to happen.)

What's bugging me about this is that we are either going to be writing at
checkpoint_rate if ahead of schedule, or max possible rate if behind;
that's not "smoothing" to me, that's introducing some pretty bursty
behavior.
ISTM that actual "smoothing" would involve adjusting writes_per_nap up or
down according to whether we are ahead or behind schedule, so as to have a
finer degree of control over the I/O rate. (I'd also consider saving the
last writes_per_nap value across checkpoints so as to have a more nearly
accurate starting value next time.)

In any case, I still concur with Takahiro-san that "smoothing" doesn't
seem the most appropriate name for the parameter. Something along the
lines of "checkpoint_completion_target" would convey more about what it
does, I think.

And checkpoint_rate really needs to be named checkpoint_min_rate, if it's
going to be a minimum. However, I question whether we need it at all,
because as the code stands, with the default BgWriterDelay you would have
to increase checkpoint_rate to 4x its proposed default before
writes_per_nap moves off its minimum of 1. This says to me that the
system's tested behavior has been so insensitive to checkpoint_rate that
we probably need not expose such a parameter at all; just hardwire the
minimum writes_per_nap at 1.

			regards, tom lane
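[Editor's note] Tom's dimensional-analysis point can be checked with a toy calculation. This is a hypothetical helper, not the patch's code; it assumes checkpoint_rate is in pages/second and BgWriterDelay in milliseconds:

```c
/* Pages to write per nap so that the long-run rate is approximately
 * checkpoint_rate. If naps get longer, MORE pages must be written per
 * nap, so the delay belongs in the numerator -- the opposite of the
 * patch's Min(1, checkpoint_rate / BgWriterDelay). The Min() is
 * backward as well: a floor of one page per nap calls for Max(). */
static int
writes_per_nap(double rate_pages_per_sec, int bgwriter_delay_ms)
{
    double pages = rate_pages_per_sec * bgwriter_delay_ms / 1000.0;

    return pages < 1.0 ? 1 : (int) pages;   /* i.e. Max(1, pages) */
}
```

With the corrected formula, doubling the nap length doubles the pages written per nap, keeping the overall write rate constant — which is what a rate expressed in physical units requires.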
Re: [PATCHES] psql: flush output in cursor-fetch mode
On Wed, 2007-06-20 at 15:51 -0700, Neil Conway wrote:
> Attached is a patch that fixes this, by calling fflush() on the psql
> output stream after each call to printQuery() in ExecQueryUsingCursor().

Applied to HEAD.

-Neil
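[Editor's note] The general pattern being fixed — flushing stdio buffers after each incremental batch of output, so a consumer such as a pager or pipe sees rows as they are fetched — looks like this. This is a generic sketch, not psql's actual ExecQueryUsingCursor code:

```c
#include <stdio.h>

/* Print one fetched batch and push it out immediately. Without the
 * fflush(), a fully-buffered stream (e.g. stdout redirected to a pipe)
 * would hold rows back until the buffer fills or the query ends. */
static void
print_batch(FILE *out, const char **rows, int nrows)
{
    for (int i = 0; i < nrows; i++)
        fprintf(out, "%s\n", rows[i]);
    fflush(out);    /* the one-line fix under discussion */
}
```

When output goes to a terminal, stdout is typically line-buffered and the problem is masked; it shows up once the stream is fully buffered, which is why the fix flushes explicitly after each batch.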
Re: [PATCHES] psql: flush output in cursor-fetch mode
Neil Conway [EMAIL PROTECTED] writes:
> On Wed, 2007-06-20 at 15:51 -0700, Neil Conway wrote:
>> Attached is a patch that fixes this, by calling fflush() on the psql
>> output stream after each call to printQuery() in
>> ExecQueryUsingCursor().
>
> Applied to HEAD.

Seems reasonable to back-patch into 8.2 too...

			regards, tom lane
Re: [PATCHES] Cancel autovacuum conflicting with DROP TABLE
ITAGAKI Takahiro [EMAIL PROTECTED] wrote:
> Here is a patch that cancels autovacuum workers conflicting with DROP
> TABLE, TRUNCATE and CLUSTER. It was discussed here:
> http://archives.postgresql.org/pgsql-hackers/2007-06/msg00556.php

I made an adjustment for the latest 'more autovacuum fixes' patch:
http://archives.postgresql.org/pgsql-patches/2007-06/msg00217.php

Now autovacuum workers handle ProcDie signals using ERROR instead of
FATAL. The exception is caught by the worker and converted to the
following logs:

SIGINT -- Cancel the current job.
    LOG: autovacuum on db.schema.table is canceled
SIGTERM -- Cancel all jobs.
    LOG: autovacuum on db is canceled

We are planning to ship 8.3 with autovacuum=on, so users will be more
likely to see conflicts between autovacuum and their commands. This makes
autovacuum more gentle.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

cancel_autovac_on_drop-2.patch
Description: Binary data
Re: [PATCHES] psql: flush output in cursor-fetch mode
On Thu, 2007-21-06 at 22:09 -0400, Tom Lane wrote:
> Seems reasonable to back-patch into 8.2 too...

Okay, done.

-Neil