I spent a significant chunk of my time last week, and also a whole lot of machine time, trying to evaluate the effectiveness of flushing CLOG pages to disk in the background. Simon made the last effort in this area:
http://archives.postgresql.org/pgsql-hackers/2012-01/msg00571.php

...but we weren't able to demonstrate that it improved performance. However, commit 3ae5133b1cf478d516666f2003bc68ba0edb84c7 improved the SLRU eviction logic in a way that seems to give background writing a better chance of actually helping, so I thought it worthwhile to revisit the topic.

That commit conflicted heavily with Simon's patch, so I drummed up a couple of patches of my own. They differ from his in one detail: his patch cleaned the next buffer to be evicted, while mine will clean any dirty buffer provided that it's "old enough". In the attached background-write-clog.patch, "old enough" means "before RecentXmin" - that is, we clean pages as soon as we know they won't be written again. Unfortunately, that approach can't work during recovery (I think), so I tried another approach in the attached background-write-clog-2p.patch, which cleans any page that is more than two pages before the page where nextXid lives.

I then benchmarked these using pgbench at scale factor 300. I thought this approach would be better than cleaning only the oldest buffer, because it's fairly easy to thrash the cache, so even a recently-used buffer may get evicted within a very short period of time. But it turns out that it's not: even if you aggressively clean the CLOG pool on every background-writer tick, the backends still end up doing all the dirty-page eviction. There's so much cache pressure that things get booted out of the cache more or less randomly, and a background writing process that comes along every 200ms is far too slow. I didn't try cranking down the bgwriter delay, but I doubt it would help much.

Now, potentially, the fix here is to tweak the buffer replacement algorithm so that it prefers a newer clean buffer over an older dirty buffer.
But there's danger lurking in the weeds there, because then you really need background writing for *all* of the SLRUs, not just CLOG. Otherwise, you can get really pathological situations where, say, all the pages but one are dirty, and you sit there and replace the last remaining non-dirty buffer over and over again. Or, alternatively, all the buffers become dirty, and suddenly every backend in the system starts a buffer I/O, producing a system-wide stall of exactly the type we're trying to avoid by doing background writing in the first place.

And it's not enough to have just *some* kind of background writing for every SLRU - it's got to be aggressive enough to keep up. That's probably not too hard for CLOG but may be trickier for some of the others: you only need to clean a CLOG buffer every couple of seconds at current peak transaction rates, but you need to clean pg_subtrans buffers 16 times as fast (pg_subtrans stores a 4-byte parent XID per transaction versus CLOG's 2 status bits), which is starting to push the limits of what we can expect the bgwriter to keep up with as a side task. Also, if you clean *too* aggressively, you'll end up increasing the total write volume, which isn't good either.

We could add another background task just to do background cleaning of SLRU buffers of all sorts, but I think it might be time to consider whether there are other reasonable approaches to the problem. I have a couple of thoughts in mind.

1. Instrumentation reveals that ExtendCLOG() causes much longer stalls than ExtendSUBTRANS(), and it appears that those stalls happen mostly as a result of ExtendCLOG() needing to evict a dirty buffer. ExtendSUBTRANS() also evicts dirty buffers, yet the stalls that it causes are much less severe. Of course, this is because clog is fsync'd and subtrans is not. I previously suggested the idea of passing off fsync requests for SLRU buffers just as we do for shared_buffers, and I think that's one angle we should investigate here.
Backends might still end up writing dirty pages, but not having to fsync them would ease the pain quite a bit. And even if we figure out a way to make background writing safe and useful, off-loading the fsyncs is still a good back-stop against the possibility that a backend might somehow end up writing a dirty page anyway. So I'm going to see if I can work something up for this.

2. ExtendSUBTRANS() seems ripe for optimization. Many pg_subtrans pages will never contain anything but zeros, so ExtendSUBTRANS() is mostly guarding against XID wraparound: we need to make sure we clobber any pg_subtrans data left over from a previous use of the XID space. But if we kept track of the age of the oldest pg_subtrans page, we could know when there's no such data - as will normally be the case. We could then arrange to create pg_subtrans pages lazily, rather than repeatedly writing out a dirty page of all zeros to make room for a new dirty page of all zeros. I'm not quite sure of all the details here yet but will look further.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
background-write-clog-2p.patch
Description: Binary data
background-write-clog.patch
Description: Binary data
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers