Re: [HACKERS] Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance (summary v2 2014-1-17)

2014-01-21 Thread Bruce Momjian
On Fri, Jan 17, 2014 at 04:31:48PM +, Mel Gorman wrote:
 NUMA Optimisations
 The primary one that showed up was zone_reclaim_mode. Enabling that parameter
 is a disaster for many workloads and apparently Postgres is one. It might
 be time to revisit leaving that thing disabled by default and explicitly
 requiring that NUMA-aware workloads that are correctly partitioned enable it.
 Otherwise NUMA considerations are not that much of a concern right now.
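
As a quick way to see whether a given box has the setting discussed above turned on, something like the following could be used. It is only a sketch: it assumes a Linux system exposing the sysctl at the usual procfs path, and the helper names `zone_reclaim_mode` and `describe` are made up for illustration.

```python
from pathlib import Path

# Usual procfs location of the sysctl on Linux (absent on other systems).
ZONE_RECLAIM_PATH = Path("/proc/sys/vm/zone_reclaim_mode")


def zone_reclaim_mode(path=ZONE_RECLAIM_PATH):
    """Return the current zone_reclaim_mode as an int, or None if unavailable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        # Not Linux, no such sysctl on this kernel, or unreadable contents.
        return None


def describe(mode):
    """Turn the raw value into a short human-readable message."""
    if mode is None:
        return "zone_reclaim_mode is not available on this system"
    if mode == 0:
        return "zone_reclaim_mode is disabled"
    return f"zone_reclaim_mode={mode} is enabled; consider vm.zone_reclaim_mode=0"


if __name__ == "__main__":
    print(describe(zone_reclaim_mode()))
```

Any non-zero value means some form of zone reclaim is active; this check only reports it and does not change the setting.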

Here is a blog post about our zone_reclaim_mode-disable recommendations:

 Direct IO, buffered IO, double buffering and wishlists
6. Only writeback pages if explicitly synced. Postgres has strict write
   ordering requirements. In the words of Tom Lane -- As things currently
   stand, we dirty the page in our internal buffers, and we don't write
   it to the kernel until we've written and fsync'd the WAL data that
   needs to get to disk first. mmap() would avoid double buffering but
   it has no control about the write ordering which is a show-stopper.
   As Andres Freund described;

What was not explicitly stated here is that the Postgres design is
taking advantage of the double-buffering feature here and writing to a
memory copy of the page while there is still an unmodified copy in the
kernel cache, or on disk.  In the case of a crash, we rely on the fact
that the disk page is unchanged.  Certainly any design that requires the
kernel to manage two different copies of the same page is going to be

One larger question is how many of these things that Postgres needs are
needed by other applications?  I doubt Postgres is large enough to
warrant changes on its own.
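
For readers less familiar with the write ordering quoted from Tom Lane above, here is a minimal sketch of the rule: the page is dirtied only in a user-space buffer, and it is handed to the kernel only after the WAL covering the change has been written and fsync'd. This is not PostgreSQL's code; `TinyBufferPool` and its methods are hypothetical names, and the 8192-byte page size is PostgreSQL's default block size.

```python
import os

PAGE_SIZE = 8192  # PostgreSQL's default block size


class TinyBufferPool:
    """Toy buffer pool that enforces WAL-before-data ordering (illustration only)."""

    def __init__(self, wal_fd, data_fd):
        self.wal_fd = wal_fd      # should be opened with O_APPEND
        self.data_fd = data_fd
        self.dirty = {}           # block number -> page bytes held only in user space
        self.wal_flushed = True

    def modify(self, blockno, new_page, wal_record):
        """Log the change and dirty the in-memory copy; the data file is untouched."""
        assert len(new_page) == PAGE_SIZE
        os.write(self.wal_fd, wal_record)  # handed to the kernel, not yet durable
        self.wal_flushed = False
        self.dirty[blockno] = new_page

    def flush_page(self, blockno):
        """Write a dirty page out, but only after the WAL is on stable storage."""
        if not self.wal_flushed:
            os.fsync(self.wal_fd)
            self.wal_flushed = True
        os.pwrite(self.data_fd, self.dirty.pop(blockno), blockno * PAGE_SIZE)
```

Until `flush_page()` runs, the only modified copy lives in user space, so the copy in the kernel cache or on disk stays unchanged, which is the property relied on after a crash.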

  Bruce Momjian  br...@momjian.us

  + Everyone has their own god. +



2014-01-17 Thread Andres Freund
Hi Mel,

On 2014-01-17 16:31:48 +, Mel Gorman wrote:
 Direct IO, buffered IO, double buffering and wishlists
3. Hint that a page should be dropped immediately when IO completes.
   There is already something like this buried in the kernel internals
   and sometimes called immediate reclaim which comes into play when
   pages are bgin invalidated. It should just be a case of investigating
   if that is visible to userspace, if not why not and do it in a
   semi-sensible fashion.

bgin invalidated?

Generally, +1 on the capability to achieve such a behaviour from

7. Allow userspace process to insert data into the kernel page cache
   without marking the page dirty. This would allow the application
   to request that the OS use the application copy of data as page
   cache if it does not have a copy already. The difficulty here
   is that the application has no way of knowing if something else
   has altered the underlying file in the meantime via something like
   direct IO. Granted, such activity has probably corrupted the database
   already but initial reactions are that this is not a safe interface
   and there are coherency concerns.

I was one of the people suggesting that capability in this thread (after
pondering it in the back of my mind for quite some time), and I
first thought it would never be acceptable for pretty much those reasons.
But on second thought I don't think that line of argument makes too much
sense. If such an API would require write permissions on the file -
which it surely would - it wouldn't allow an application to do anything
it previously wasn't able to.
And I don't see the dangers of concurrent direct IO as anything
new. Right now the page's contents reside in userspace memory and aren't
synced in any way with either the page cache or the actual on disk
state. And afaik there are already several data races if a file is
modified and read both via the page cache and direct io.

The scheme that'd allow us is the following:
When postgres reads a data page, it will continue to first look up the
page in its shared buffers, if it's not there, it will perform a page
cache backed read, but instruct that read to immediately remove from the
page cache afterwards (new API, posix_fadvise(), or whatever). As long
as it's in shared_buffers, postgres will not need to issue new reads, so
there's no benefit keeping it in the page cache.
If the page is dirtied, it will be written out normally telling the
kernel to forget about caching the page (using 3) or possibly direct
When a page in postgres's buffers (which wouldn't be set to very large
values) isn't needed anymore and *not* dirty, it will seed the kernel
page cache with the current data.
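
The read side of the scheme above can be roughly approximated today with posix_fadvise(). The sketch below is an assumption-laden illustration, not the proposed interface: `read_page_dropping_cache` and `evict_clean_page` are hypothetical names, `shared_buffers` is just a dict standing in for the real buffer pool, and POSIX_FADV_DONTNEED is only advice that the kernel may ignore. The final step, seeding the page cache with a clean page, has no existing API, so it appears only as a comment.

```python
import os

PAGE_SIZE = 8192  # PostgreSQL's default block size


def read_page_dropping_cache(fd, blockno, shared_buffers):
    """Return a page, reading through the page cache only on a buffer miss."""
    page = shared_buffers.get(blockno)
    if page is not None:
        return page  # hit in the application's own buffers, no read issued

    offset = blockno * PAGE_SIZE
    page = os.pread(fd, PAGE_SIZE, offset)  # ordinary page-cache-backed read

    # Ask the kernel to drop its copy now that we hold one ourselves.
    # Purely advisory, and not available on every platform.
    if hasattr(os, "posix_fadvise"):
        try:
            os.posix_fadvise(fd, offset, PAGE_SIZE, os.POSIX_FADV_DONTNEED)
        except OSError:
            pass  # degrade gracefully when the hint is not supported

    shared_buffers[blockno] = page
    return page


def evict_clean_page(blockno, shared_buffers):
    """Drop a clean page from the application's buffers."""
    # Hypothetical: this is where the kernel page cache would be seeded with
    # the page contents without marking it dirty. No such call exists today.
    return shared_buffers.pop(blockno, None)
```

On kernels or platforms without the hint, the read still works and the page simply stays double-buffered.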

Now, such a scheme wouldn't likely be zero-copy, but it would avoid
double buffering. I think the cost of buffer copying has been overstated
in this thread... The major advantage is that all of that could easily be
implemented in a very localized manner, without hurting other OSs, and it
could easily degrade on kernels not providing that capability, which
would surely be the majority of installations for the next couple of

So, I think such an interface would be hugely beneficial - and I'd be
surprised if other applications couldn't reuse it. And I don't think
it'd be all that hard to implement on the kernel side?

   Dave Chinner asked "why, exactly, do you even need the kernel page
   cache here?" when Postgres already knows how and when data should
   be written back to disk. The answer boiled down to "To let kernel do
   the job that it is good at, namely managing the write-back of dirty
   buffers to disk and to manage (possible) read-ahead pages". Postgres
   has some ordering requirements but it does not want to be responsible
   for all cache replacement and IO scheduling. Hannu Krosing summarised
   it best as

The other part is that using the page cache for the majority of warm,
but not burning hot pages, allows the kernel to much more sensibly adapt
to concurrent workloads requiring memory in some form or other (possibly
giving it to other VMs when mostly idle and such).

8. Allow copy-on-write of page-cache pages to anonymous. This would limit
   the double ram usage to some extent. It's not as simple as having a
   MAP_PRIVATE mapping of a file-backed page because presumably they want
   this data in a shared buffer shared between Postgres processes. The
   implementation details of something like this are hairy because it's
   mmap()-like but not mmap() as it does not have the same writeback
   semantics due to the write ordering requirements Postgres has for
   database integrity.

9. Hint that a page in an anonymous buffer is a copy of a page cache
   page and invalidate the page cache page on COW. This limits the