I spent a significant chunk of my time last week, and also a whole lot of machine time, trying to evaluate the effectiveness of flushing CLOG pages to disk in the background. Simon made the last effort in this area:
http://archives.postgresql.org/pgsql-hackers/2012-01/msg00571.php

...but we weren't able to demonstrate that it improved performance. However, commit 3ae5133b1cf478d516666f2003bc68ba0edb84c7 improved the SLRU eviction logic in a way that seems to give background writing a better chance of actually helping, so I thought it worthwhile to revisit the topic.

That commit conflicted heavily with Simon's patch, so I drummed up a couple of patches of my own. They differ from his in one detail: his patch cleaned the next buffer to be evicted, while mine will clean any dirty buffer provided that it's "old enough". In the attached background-write-clog.patch, "old enough" means "before RecentXmin" - that is, we clean pages as soon as we know they won't be written again. Unfortunately, that approach can't work during recovery (I think), so I tried another approach in the attached background-write-clog-2p.patch, which cleans any page that is more than two pages before the page where nextXid lives.

I then benchmarked these using pgbench at scale factor 300. I thought this approach would be better than cleaning only the oldest buffer, because it's fairly easy to thrash the cache, so even a recently-used buffer may get evicted within a very short period of time. But it turns out that it's not: even if you aggressively clean the CLOG pool on every background-writer tick, the backends still end up doing all the dirty-page eviction. There's so much cache pressure that things get booted out of the cache more or less randomly, and a background writing process that comes along every 200ms is far too slow. I didn't try cranking down the bgwriter delay, but I doubt it would help much.

Now, potentially, the fix here is to tweak the buffer replacement algorithm so that it prefers a newer clean buffer over an older dirty buffer.
But there's danger lurking in the weeds there, because then you really need background writing for *all* of the SLRUs, not just CLOG. Otherwise, you can get really pathological situations where, say, all the pages but one are dirty, and you sit there and replace the last remaining non-dirty buffer over and over again. Or, alternatively, all the buffers become dirty, and suddenly every backend in the system starts a buffer I/O, producing a system-wide stall of exactly the type we're trying to avoid by doing background writing in the first place.

And it's not enough to have just *some* kind of background writing for every SLRU - it's got to be aggressive enough to keep up. That's probably not too hard for CLOG but may be trickier for some of the others: you only need to clean a CLOG buffer every couple of seconds at current peak transaction rates, but you need to clean pg_subtrans buffers 16 times as fast (pg_subtrans stores a 4-byte parent XID per transaction versus CLOG's 2 status bits), which is starting to push the limits of what we can expect the bgwriter to keep up with as a side task. Also, if you clean *too* aggressively, you'll end up increasing the total write volume, which isn't good either.

We could add another background task just to do background cleaning of SLRU buffers of all sorts, but I think it might be time to consider whether there are other reasonable approaches to the problem. I have a couple of thoughts in mind.

1. Instrumentation reveals that ExtendCLOG() causes much longer stalls than ExtendSUBTRANS(), and it appears that those stalls happen mostly as a result of ExtendCLOG() needing to evict a dirty buffer. ExtendSUBTRANS() also evicts dirty buffers, yet the stalls that it causes are much less severe. Of course, this is because clog is fsync'd and subtrans is not. I previously suggested the idea of passing off fsync requests for SLRU buffers just as we do for shared_buffers, and I think that's one angle we should investigate here.
Backends might still end up writing dirty pages, but not having to fsync them would ease the pain quite a bit. And even if we figure out a way to make background writing safe and useful, off-loading the fsyncs is still a good back-stop against the possibility that a backend might somehow end up writing a dirty page anyway. So I'm going to see if I can work something up for this.

2. ExtendSUBTRANS() seems ripe for optimization. Many pg_subtrans pages will never contain anything but zeros, so ExtendSUBTRANS() is mostly guarding against XID wraparound: we need to make sure we clobber any pg_subtrans data left over from a previous use of the XID space. But if we kept track of the age of the oldest pg_subtrans page, we could know when there's no such data - as will normally be the case. We could then arrange to create pg_subtrans pages lazily, rather than repeatedly writing out a dirty page of all zeros to make room for a new dirty page of all zeros. I'm not quite sure of all the details here yet but will look further.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
background-write-clog-2p.patch
Description: Binary data
background-write-clog.patch
Description: Binary data
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers