Re: [PATCHES] Load distributed checkpoint V3
On Fri, 6 Apr 2007, Takayuki Tsunakawa wrote:

> could anyone evaluate O_SYNC approach again that commercial databases use and tell me if and why PostgreSQL's fsync() approach is better than theirs?

I noticed a big improvement switching the WAL to use O_SYNC (+O_DIRECT) instead of fsync on both my big and my little servers with battery-backed cache, so I know sync writes perform reasonably well on my hardware. Since I've had problems with the fsync at checkpoint time, I did a similar test to yours recently, adding O_SYNC to the open calls and pulling the fsyncs out to get a rough idea how things would work. Performance was reasonable most of the time, but when I hit a checkpoint with a lot of the buffer cache dirty it was incredibly bad. It took minutes to write everything out, compared with a few seconds in the current case, and the background writer was too sluggish to help. This appears to match your data.

If you compare how Oracle handles their writes and checkpoints to the Postgres code, it's obvious they have a different architecture that enables them to support sync writing usefully. I'd recommend the Database Writer Process section of http://www.lc.leidenuniv.nl/awcourse/oracle/server.920/a96524/c09procs.htm as an introduction for those not familiar with that; it's interesting reading for anyone tinkering with the background writer code.

It would be great to compare the performance of the current PostgreSQL code with a fancy multiple background writer version using the latest sync methods or AIO; there have actually been multiple updates to improve O_SYNC writes within Linux during the 2.6 kernel series that make this more practical than ever on that platform. But as you've already seen, the performance hurdle to overcome is significant, and it would have to be optional as a result.
When you add all this up--having to keep the current non-sync writes around as well, needing to redesign the whole background writer/checkpoint approach around the idea of sync writes, and the OS-specific parts that would come from things like AIO--it gets real messy. Good luck drumming up support for all that when the initial benchmarks suggest it's going to be a big step back.

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings
Re: [PATCHES] Load distributed checkpoint V3
From: Greg Smith [EMAIL PROTECTED]

> If you compare how Oracle handles their writes and checkpoints to the Postgres code, it's obvious they have a different architecture that enables them to support sync writing usefully. I'd recommend the Database Writer Process section of http://www.lc.leidenuniv.nl/awcourse/oracle/server.920/a96524/c09procs.htm as an introduction for those not familiar with that; it's interesting reading for anyone tinkering with the background writer code.

Hmm... what makes you think that sync writes are useful for Oracle and not for PostgreSQL? The process architecture is similar; bgwriter performs most of the writes in PostgreSQL, while DBWn performs all writes in Oracle. The difference is that Oracle can bound crash recovery time by writing dirty buffers periodically in the order of their LSN.

> It would be great to compare performance of the current PostgreSQL code with a fancy multiple background writer version using the latest sync methods or AIO; [...] But as you've already seen, the performance hurdle to overcome is significant, and it would have to be optional as a result. When you add all this up [...] it gets real messy. Good luck drumming up support for all that when the initial benchmarks suggest it's going to be a big step back.

I agree with you that the write method has to be optional until there's enough data from the field to help determine which is better. ... It's a pity not to utilize async I/O and Josh-san's offer. I hope it will be used some day. I think OS developers have evolved async I/O for databases.
Re: [PATCHES] Load distributed checkpoint V3
On Fri, 2007-04-06 at 02:53 -0400, Greg Smith wrote:

> If you compare how Oracle handles their writes and checkpoints to the Postgres code, it's obvious they have a different architecture that enables them to support sync writing usefully. I'd recommend the Database Writer Process section of http://www.lc.leidenuniv.nl/awcourse/oracle/server.920/a96524/c09procs.htm as an introduction for those not familiar with that; it's interesting reading for anyone tinkering with the background writer code.

Oracle does have a different checkpointing technique, and we know it is patented, so we need to go carefully there, especially when directly referencing documentation.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] Load distributed checkpoint V3
On Fri, 6 Apr 2007, Takayuki Tsunakawa wrote:

> Hmm... what makes you think that sync writes are useful for Oracle and not for PostgreSQL?

They do more to push checkpoint-time work in advance, batch writes up more efficiently, and never let clients do the writing. All of which makes for a different type of checkpoint. As Simon points out, even if it were conceivable to mimic their design, it might not even be legally feasible.

The point I was trying to make is this: you've been saying that Oracle's writing technology has better performance in this area, which is probably true, and suggesting the cause of that was their use of O_SYNC writes. I wanted to believe that and even tested out a prototype. The reality here appears to be that their checkpoints go smoother *despite* using the slower sync writes, because they've built their design around the limitations of that write method. I suspect it would take a similar scale of redesign to move Postgres in that direction; the issues you identified (the same ones I ran into) are not so easy to resolve. You're certainly not going to move anybody in that direction by throwing a random comment into a discussion on the patches list about a feature useful *right now* in this area.

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [PATCHES] Load distributed checkpoint V3
ITAGAKI Takahiro wrote:

> Here is the latest version of Load distributed checkpoint patch.

Unfortunately, because of the recent instrumentation and CheckpointStartLock patches, this patch doesn't apply cleanly to CVS HEAD anymore. Could you fix the bitrot and send an updated patch, please?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] Load distributed checkpoint V3
On Thu, 5 Apr 2007, Heikki Linnakangas wrote:

> Unfortunately because of the recent instrumentation and CheckpointStartLock patches this patch doesn't apply cleanly to CVS HEAD anymore. Could you fix the bitrot and send an updated patch, please?

The "Logging checkpoints and other slowdown causes" patch I submitted touches some of the same code as well, so that's another possible merge coming, depending on what order this all gets committed in. Running into what I dubbed perpetual checkpoints was one of the reasons I started logging timing information for the various portions of the checkpoint, to tell when it was bogged down with slow writes versus being held up in sync for various (possibly fixed with your CheckpointStartLock) issues.

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [PATCHES] Load distributed checkpoint V3
Greg Smith wrote:
> On Thu, 5 Apr 2007, Heikki Linnakangas wrote:
>> Bgwriter has two goals: 1. keep enough buffers clean that normal backends never need to do a write, and 2. smooth checkpoints by writing buffers ahead of time. Load distributed checkpoints will do 2. in a much better way than the bgwriter_all_* guc options. I think we should remove that aspect of bgwriter in favor of this patch.
> ...
> Let me suggest a different way of looking at this problem. At any moment, some percentage of your buffer pool is dirty. Whether it's 0% or 100% dramatically changes what the background writer should be doing. Whether most of the data is usage_count=0 or not also makes a difference. None of the current code has any idea what type of buffer pool it's working with, and therefore it doesn't have enough information to make a well-informed prediction about what is going to happen in the near future.

The purpose of the bgwriter_all_* settings is to shorten the duration of the eventual checkpoint. The reason to shorten the checkpoint duration is to limit the damage it causes to other I/O activity. My thinking is that, assuming the LDC patch is effective (agreed, needs more testing) at smoothing the checkpoint, the duration doesn't matter anymore. Do you want to argue there are other reasons to shorten the checkpoint duration?

> I'll tell you what I did to the all-scan. I ran a few hundred hours' worth of background writer tests to collect data on what it does wrong, then wrote a prototype automatic background writer that resets the all-scan parameters based on what I found. It keeps a running estimate of how dirty the pool at large is using a weighted average of the most recent scan with the past history. From there, I have a simple model that predicts how much of the buffer pool we can scan in any interval, and intends to enforce a maximum bound on the amount of physical I/O you're willing to stream out.
> The beta code is sitting at http://www.westnet.com/~gsmith/content/postgresql/bufmgr.c if you want to see what I've done so far. The parts that are done work fine--as long as you give it a reasonable % to scan by default, it will correct all_max_pages and the interval in real-time to meet the scan rate you requested, given how much is currently dirty; the I/O rate is computed but doesn't limit properly yet.

Nice. Enforcing a max bound on the I/O seems reasonable, if we accept that shortening the checkpoint is a goal.

> Why haven't I brought this all up yet? Two reasons. The first is that it doesn't work on my system; checkpoints and overall throughput get worse when you try to shorten them by running the background writer at optimal aggressiveness. Under really heavy load, the writes slow down as all the disk caches fill, the background writer fights with reads on the data that isn't in the mostly dirty cache (introducing massive seek delays), it stops cleaning effectively, and it's better for it to not even try. My next generation of code was going to start with the LRU flush and then only move onto the all-scan if there's time left over. The second is that I just started to get useful results here in the last few weeks, and I assumed it was too big of a topic to start suggesting major redesigns to the background writer mechanism at this point (from me at least!). I was waiting for 8.3 to freeze before even trying. If you want to push through a redesign there, maybe you can get away with it at this late moment. But I ask that you please don't remove anything from the current design until you have significant test results to back up that change.

Point taken. I need to start testing the LDC patch.

Since we're discussing this, let me tell you what I've been thinking about the lru cleaning behavior of bgwriter. ISTM that that's more straightforward to tune automatically.
Bgwriter basically needs to ensure that the next X buffers with usage_count=0 in the clock sweep are clean, where X is the predicted number of buffers backends will evict before the next bgwriter round. The number of buffers evicted by normal backends in a bgwriter_delay period is simple to keep track of: just increase a counter in StrategyGetBuffer and reset it when bgwriter wakes up. We can use that as an estimate of X, with some safety margin.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
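To make the earlier part of this discussion concrete, here is a rough sketch of the kind of model Greg's prototype describes--tracking the pool's dirty fraction with a weighted average and bounding the all-scan by a physical-write budget. This is not Greg's actual bufmgr.c code; all names and constants here are invented for illustration.

```c
/* Sketch only: estimate pool dirtiness from each scan's observation and
 * derive how many buffers the all-scan may cover this round without
 * exceeding a configured write budget. Names/constants are made up. */

#define SMOOTHING_WEIGHT 0.25   /* weight given to the newest observation */

static double dirty_fraction = 0.10;    /* running estimate of pool dirtiness */

static int
buffers_to_scan(int scanned, int found_dirty, int max_writes_per_round)
{
    double observed = (scanned > 0) ? (double) found_dirty / scanned : 0.0;

    /* blend the latest scan's result with past history */
    dirty_fraction += SMOOTHING_WEIGHT * (observed - dirty_fraction);

    if (dirty_fraction <= 0.0)
        return max_writes_per_round;    /* pool looks clean; scan freely */

    /* scanning N buffers writes about N * dirty_fraction of them, so cap
     * N at budget / dirty_fraction (rounded to nearest) */
    return (int) (max_writes_per_round / dirty_fraction + 0.5);
}
```

The dirtier the estimate says the pool is, the fewer buffers the scan is allowed to cover per round, which is one way to keep the background writer's physical I/O roughly constant.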
Re: [PATCHES] Load distributed checkpoint V3
Heikki Linnakangas [EMAIL PROTECTED] writes:

> The number of buffers evicted by normal backends in a bgwriter_delay period is simple to keep track of, just increase a counter in StrategyGetBuffer and reset it when bgwriter wakes up. We can use that as an estimate of X with some safety margin.

You'd want some kind of moving-average smoothing in there, probably with a lot shorter ramp-up than ramp-down time constant, but this seems reasonable enough to try.

regards, tom lane
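A toy sketch of the asymmetric smoothing Tom suggests (the constants and names are invented for illustration, not from any patch): the eviction-rate estimate ramps up quickly when backends suddenly evict more buffers, but decays slowly, so the bgwriter keeps cleaning aggressively for a while after a burst.

```c
/* Sketch only: exponential moving average with a short ramp-up and a
 * long ramp-down time constant, per Tom's suggestion. */

#define RAMP_UP_WEIGHT   0.5    /* short time constant: react fast to spikes */
#define RAMP_DOWN_WEIGHT 0.05   /* long time constant: forget slowly */

static double smoothed_evictions = 0.0; /* estimated evictions per bgwriter_delay */

static double
update_eviction_estimate(int evicted_this_round)
{
    double alpha = (evicted_this_round > smoothed_evictions)
        ? RAMP_UP_WEIGHT        /* observed rate above estimate: jump up */
        : RAMP_DOWN_WEIGHT;     /* below estimate: decay gently */

    smoothed_evictions += alpha * (evicted_this_round - smoothed_evictions);
    return smoothed_evictions;
}
```

The bgwriter would then aim to keep roughly `smoothed_evictions` (plus a safety margin) clean buffers ahead of the clock sweep each round.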
Re: [PATCHES] Load distributed checkpoint V3
Tom Lane wrote:
> Heikki Linnakangas [EMAIL PROTECTED] writes:
>> The number of buffers evicted by normal backends in a bgwriter_delay period is simple to keep track of, just increase a counter in StrategyGetBuffer and reset it when bgwriter wakes up. We can use that as an estimate of X with some safety margin.
> You'd want some kind of moving-average smoothing in there, probably with a lot shorter ramp-up than ramp-down time constant, but this seems reasonable enough to try.

Ironically, I just noticed that we already have a patch in the patch queue that implements exactly that, again by Itagaki. I need to start paying more attention :-). Keep up the good work!

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] Load distributed checkpoint V3
Hello, long time no see. I'm sorry to interrupt your discussion.

I'm afraid the code is getting more complicated in order to continue using fsync(). Though I don't intend to say the current approach is wrong, could anyone evaluate again the O_SYNC approach that commercial databases use, and tell me if and why PostgreSQL's fsync() approach is better than theirs? This January, I got a good result with O_SYNC, which I haven't reported here. I'll show it briefly. Please forgive me for my abrupt email, because I don't have enough time.
# Personally, I want to work in the community, if I'm allowed.

And sorry again. I reported that O_SYNC resulted in very bad performance last year, but that was wrong. The PC server I borrowed was configured with all the disks forming one RAID5 device. So the disks for data and WAL (/dev/sdd and /dev/sde) came from the same RAID5 device, resulting in I/O contention.

What I modified is md.c only. I just added O_SYNC to the open flags in mdopen() and _mdfd_openseg() if am_bgwriter is true. I didn't want backends to use O_SYNC, because mdextend() does not have to transfer data to disk.

My evaluation environment was:

CPU: Intel Xeon 3.2GHz * 2 (HT on)
Memory: 4GB
Disk: Ultra320 SCSI (perhaps configured as write back)
OS: RHEL3.0 Update 6
Kernel: 2.4.21-37.ELsmp
PostgreSQL: 8.2.1

The relevant settings of PostgreSQL are:

shared_buffers = 2GB
wal_buffers = 1MB
wal_sync_method = open_sync

checkpoint_* and bgwriter_* parameters are left at their defaults. I used pgbench, with the data of scaling factor 50.
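The md.c change described above amounts to roughly the following sketch (this is not the actual patch; the function name is a stand-in, though am_bgwriter is the flag mentioned above): only the background writer opens data files with O_SYNC, so its write()s are durable on return, while backends keep plain buffered opens so mdextend() stays cheap.

```c
#include <fcntl.h>

/* Sketch only: conditional O_SYNC open for the bgwriter's data files. */
static int
open_data_file(const char *path, int am_bgwriter)
{
    int flags = O_RDWR;

    if (am_bgwriter)
        flags |= O_SYNC;        /* synchronous writes for the bgwriter only */

    return open(path, flags, 0600);
}
```

With this in place, the checkpoint no longer needs a separate fsync() pass over the files the bgwriter wrote, which is the behavior being benchmarked below.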
[without O_SYNC, original behavior]

- pgbench -c1 -t16000
  best response: 1ms, worst response: 6314ms, 10th worst response: 427ms, tps: 318
- pgbench -c32 -t500
  best response: 1ms, worst response: 8690ms, 10th worst response: 8668ms, tps: 330

[with O_SYNC]

- pgbench -c1 -t16000
  best response: 1ms, worst response: 350ms, 10th worst response: 91ms, tps: 427
- pgbench -c32 -t500
  best response: 1ms, worst response: 496ms, 10th worst response: 435ms, tps: 1117

If the write-back cache were disabled, the difference would be smaller. The Windows version showed similar improvements.

However, this approach has two big problems.

(1) Slows down bulk updates

Updates of large amounts of data get much slower, because bgwriter seeks and writes dirty buffers synchronously page-by-page. For example:

- COPY of accounts (5m records) and CHECKPOINT command after COPY
  without O_SYNC: 100sec, with O_SYNC: 1046sec
- UPDATE of all records of accounts
  without O_SYNC: 139sec, with O_SYNC: 639sec
- CHECKPOINT command for flushing 1.6GB of dirty buffers
  without O_SYNC: 24sec, with O_SYNC: 126sec

To mitigate this problem, I sorted dirty buffers by their relfilenode and block numbers and wrote multiple pages that are adjacent both in memory and on disk. The result was:

- COPY of accounts (5m records) and CHECKPOINT command after COPY: 227sec
- UPDATE of all records of accounts: 569sec
- CHECKPOINT command for flushing 1.6GB of dirty buffers: 71sec

Still bad...

(2) Can't utilize tablespaces

Though I didn't evaluate it, update activity would be much less efficient with O_SYNC than with fsync() when using multiple tablespaces, because there is only one bgwriter.

Can anyone solve these problems? One of my ideas is to use scattered I/O. I hear that readv()/writev() became able to do real scattered I/O as of kernel 2.6 (RHEL4.0); with kernels before 2.6, readv()/writev() just performed the I/Os sequentially. Windows has provided reliable scattered I/O for years.
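The sorting mitigation described above can be sketched as follows (the struct and names are invented for illustration; the real change was inside the buffer manager): order the dirty buffers by (relfilenode, block number) so pages that are adjacent on disk get written in one sequential pass instead of one random synchronous write per page.

```c
#include <stdlib.h>

/* Sketch only: sort key and comparator for coalescing adjacent writes. */
typedef struct DirtyBuf
{
    unsigned relfilenode;       /* which relation file the page lives in */
    unsigned blocknum;          /* block offset within that file */
} DirtyBuf;

static int
dirtybuf_cmp(const void *a, const void *b)
{
    const DirtyBuf *x = (const DirtyBuf *) a;
    const DirtyBuf *y = (const DirtyBuf *) b;

    if (x->relfilenode != y->relfilenode)
        return (x->relfilenode < y->relfilenode) ? -1 : 1;
    if (x->blocknum != y->blocknum)
        return (x->blocknum < y->blocknum) ? -1 : 1;
    return 0;
}

/* After qsort(bufs, n, sizeof(DirtyBuf), dirtybuf_cmp), runs of
 * consecutive blocknums within one relfilenode can be combined into a
 * single larger write. */
```

Even with this coalescing, the numbers above show the synchronous-write checkpoint still losing badly to buffered writes plus fsync.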
Another idea is to use async I/O, possibly combined with the multiple-bgwriter approach on platforms where async I/O is not available. How about the chance Josh-san has brought?
Re: [PATCHES] Load distributed checkpoint V3
On Thu, 5 Apr 2007, Heikki Linnakangas wrote:

> The purpose of the bgwriter_all_* settings is to shorten the duration of the eventual checkpoint. The reason to shorten the checkpoint duration is to limit the damage it causes to other I/O activity. My thinking is that assuming the LDC patch is effective (agreed, needs more testing) at smoothing the checkpoint, the duration doesn't matter anymore. Do you want to argue there's other reasons to shorten the checkpoint duration?

My testing results suggest that LDC doesn't smooth the checkpoint usefully when under a high (30 clients here) load, because (on Linux at least) the way the OS caches writes clashes badly with how buffers end up being evicted if the buffer pool fills back up before the checkpoint is done. In that context, anything that slows down the checkpoint is going to make the problem worse rather than better, because it makes it more likely that the tail end of the checkpoint will have to fight with the clients for write bandwidth, at which point they both suffer. If you just get the checkpoint done fast, the clients can't fill the pool as fast as BufferSync is writing it out, and things are as happy as they can be without a major rewrite of all this code.

I can get a tiny improvement in some respects by delaying 2-5 seconds between finishing the writes and calling fsync, because that gives Linux a moment to usefully spool some of the data to the disk controller's cache; beyond that, any additional delay is a problem. Since it's only the high-load cases I'm having trouble dealing with, this basically makes it a non-starter for me. The focus on checkpoint_timeout while ignoring checkpoint_segments in the patch is also a big issue for me.

At the same time, I recognize that the approach taken in LDC probably is a big improvement for many systems; it's just a step backwards for my highest-throughput one. I'd really enjoy hearing some results from someone else.
> The number of buffers evicted by normal backends in a bgwriter_delay period is simple to keep track of, just increase a counter in StrategyGetBuffer and reset it when bgwriter wakes up.

I see you've already found the other helpful Itagaki patch in this area. I know I would like to see his code for tracking evictions committed; then I'd like that to be added as another counter in pg_stat_bgwriter (I mentioned that to Magnus in passing when he was setting the stats up, but didn't press it because of the patch dependency). Ideally, and this idea was also in Itagaki's patch with the writtenByBgWriter/ByBackEnds debug hook, I think it's important that you know how every buffer written to disk got there--was it a background writer, a checkpoint, or an eviction that wrote it out? Track all those and you can really learn something about your write performance, data that's impossible to collect right now.

However, as Itagaki himself points out, doing something useful with bgwriter_lru_maxpages is only one piece of automatically tuning the background writer. I hate to join in on chopping his patches up, but without some additional work I don't think the exact auto-tuning logic he then applies will work in all cases, which could make it more of a problem than the current crude yet predictable method. The whole way bgwriter_lru_maxpages and num_to_clean play off each other in his code currently has a number of failure modes I'm concerned about. I'm not sure if a rewrite using a moving-average approach (as I did in my auto-tuning writer prototype and as Tom just suggested here) will be sufficient to fix all of them. It was already on my to-do list to investigate that further.

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
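The per-source accounting discussed above could be sketched like this (the enum and counter names are invented; the actual patch used a writtenByBgWriter/ByBackEnds debug hook): tag every buffer write with who performed it, so a pg_stat_bgwriter-style view can report where write traffic really comes from.

```c
/* Sketch only: classify each buffer write by the path that issued it. */
typedef enum WriteSource
{
    WS_BGWRITER = 0,            /* background writer cleaning scan */
    WS_CHECKPOINT,              /* checkpoint's BufferSync() */
    WS_BACKEND,                 /* backend evicting a dirty buffer itself */
    WS_NUM_SOURCES
} WriteSource;

static long write_counts[WS_NUM_SOURCES];

static void
count_buffer_write(WriteSource src)
{
    /* would be called from each code path that flushes a buffer */
    write_counts[src]++;
}
```

A high WS_BACKEND count would be exactly the signal that the bgwriter isn't keeping far enough ahead of evictions.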
Re: [PATCHES] Load distributed checkpoint V3
On Fri, 23 Mar 2007, ITAGAKI Takahiro wrote:

> Here is the latest version of Load distributed checkpoint patch.

Couple of questions for you:

- Is it still possible to get the original behavior by adjusting your tunables? It would be nice to do a before/after without having to recompile, and I know I'd be concerned about something so different becoming the new default behavior.

- Can you suggest a current test case to demonstrate the performance improvement here? I've tried several variations on stretching out checkpoints like you're doing here, and they all made slow checkpoint issues even worse on my Linux system. I'm trying to evaluate this fairly.

- This code operates on the assumption you have a good value for the checkpoint timeout. Have you tested its behavior when checkpoints are being triggered by checkpoint_segments being reached instead?

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [PATCHES] Load distributed checkpoint V3
Greg Smith [EMAIL PROTECTED] wrote:

> - Is it still possible to get the original behavior by adjusting your tunables? It would be nice to do a before/after without having to recompile, and I know I'd be concerned about something so different becoming the new default behavior.

Yes, if you want the original behavior, please set all of checkpoint_[write|nap|sync]_percent to zero. They can be changed at SIGHUP timing (pg_ctl reload). The new default configuration is write/nap/sync = 50%/10%/20%. There might be room for discussion in the choice of those values.

> - Can you suggest a current test case to demonstrate the performance improvement here? I've tried several variations on stretching out checkpoints like you're doing here, and they all made slow checkpoint issues even worse on my Linux system. I'm trying to evaluate this fairly.

You might need to increase checkpoint_segments and checkpoint_timeout. Here are the results on my machine: http://archives.postgresql.org/pgsql-hackers/2007-02/msg01613.php I had set the values to 32 segments and 15 min to take advantage of it in the case of pgbench -s100 then.

> - This code operates on the assumption you have a good value for the checkpoint timeout. Have you tested its behavior when checkpoints are being triggered by checkpoint_segments being reached instead?

This patch does not work fully when checkpoints are triggered by segments. The write phase still works, because it refers to the consumption of segments, but the nap and fsync phases only check the amount of time. I'm assuming checkpoints are triggered by timeout in normal use -- and that's my recommended configuration whether the patch is installed or not.
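A sketch of how the checkpoint_[write|nap|sync]_percent settings described above divide up the checkpoint interval (the struct and function are invented for illustration, not taken from the patch): each phase gets a fraction of the time until the next checkpoint is due.

```c
/* Sketch only: phase time budgets from the LDC percent settings. */
typedef struct CheckpointBudget
{
    double write_secs;          /* budget for spreading the buffer writes */
    double nap_secs;            /* pause between writes and fsyncs */
    double sync_secs;           /* budget for the fsync phase */
} CheckpointBudget;

static CheckpointBudget
compute_budget(double secs_until_next_checkpoint,
               double write_pct, double nap_pct, double sync_pct)
{
    CheckpointBudget b;

    b.write_secs = secs_until_next_checkpoint * write_pct / 100.0;
    b.nap_secs   = secs_until_next_checkpoint * nap_pct / 100.0;
    b.sync_secs  = secs_until_next_checkpoint * sync_pct / 100.0;
    return b;
}
```

With checkpoint_timeout = 300s and the default 50%/10%/20%, that would pace roughly 150s of writes, a 30s nap, and 60s of syncs -- which also shows why a segment-triggered checkpoint, arriving with no time-based deadline, doesn't fit the nap and sync phases as described.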
Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
Re: [PATCHES] Load distributed checkpoint V3
On Mon, 26 Mar 2007, ITAGAKI Takahiro wrote:

> I'm assuming checkpoints are triggered by timeout in normal use -- and it's my recommended configuration whether the patch is installed or not.

I'm curious what other people running fairly serious hardware do in this area for write-heavy loads: whether it's timeout or segment limits that normally trigger their checkpoints. I'm testing on a slightly different class of machine than your sample results, something that is in the 1500 TPS range running the pgbench test you describe. Running that test, I always hit the checkpoint_segments wall well before any reasonable timeout. With 64 segments, I get a checkpoint every two minutes or so.

There's something I'm working on this week that may help out other people trying to test your patch. I've put together some simple scripts that graph (patched) pgbench results, which make it very easy to see what changes when you alter the checkpoint behavior. The edges are still rough, but the scripts work for me; I'll be polishing and testing over the next few days: http://www.westnet.com/~gsmith/content/postgresql/pgbench.htm

(Note that the example graphs there aren't from the production system I mentioned above; they're from my server at home, which is similar to the system your results came from.)

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [PATCHES] Load distributed checkpoint V3
Your patch has been added to the PostgreSQL unapplied patches list at:

http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews and approves it.

---

ITAGAKI Takahiro wrote:

> Folks, Here is the latest version of Load distributed checkpoint patch. I've fixed some bugs, including in cases of missing file errors and overlapping of asynchronous checkpoint requests.
>
> [ Attachment, skipping... ]

--
Bruce Momjian [EMAIL PROTECTED] http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +