Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-24 Thread Andreas Pflug
Am 23.01.14 02:14, schrieb Jim Nasby:
 On 1/19/14, 5:51 PM, Dave Chinner wrote:
 Postgres is far from being the only application that wants this; many
 people resort to tmpfs because of this:
 https://lwn.net/Articles/499410/
 Yes, we covered the possibility of using tmpfs much earlier in the
 thread, and came to the conclusion that temp files can be larger
 than memory so tmpfs isn't the solution here. :)

 Although... instead of inventing new APIs and foisting this work onto
 applications, perhaps it would be better to modify tmpfs such that it
 can handle a temp space that's larger than memory... possibly backing
 it with X amount of real disk and allowing it/the kernel to decide
 when to passively move files out of the in-memory tmpfs and onto disk.

This is exactly what I'd expect from a file system that's suitable for
tmp purposes. The current tmpfs would better have been named memfs or
the like, since it lacks dedicated disk backing storage.

Regards,
Andreas




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-23 Thread Gregory Smith

On 1/20/14 9:46 AM, Mel Gorman wrote:
They could potentially be used to evaluate any IO scheduler changes.
For example -- deadline scheduler with these parameters has X
transactions/sec throughput with average latency of Y milliseconds
and a maximum fsync latency of Z seconds. Evaluate how well the 
out-of-box behaviour compares against it with and without some set of 
patches. At the very least it would be useful for tracking historical 
kernel performance over time and bisecting any regressions that got 
introduced. Once we have a test I think many kernel developers (me at 
least) can run automated bisections once a test case exists. 


That's the long term goal.  What we used to get out of pgbench were 
things like 60 second latencies when a checkpoint hit with GBs of dirty 
memory.  That does happen in the real world, but that's not a realistic 
case you can tune for very well.  In fact, tuning for it can easily 
degrade performance on more realistic workloads.


The main complexity I don't have a clear view of yet is how much 
unavoidable storage level latency there is in all of the common 
deployment types.  For example, I can take a server with a 256MB 
battery-backed write cache and set dirty_background_bytes to be smaller 
than that.  So checkpoint spikes go away, right?  No. Eventually you 
will see dirty_background_bytes of data going into an already full 256MB 
cache.  And when that happens, the latency will be based on how long it 
takes to write the cached 256MB out to the disks.  If you have a single 
disk or RAID-1 pair, that random I/O could easily happen at 5MB/s or 
less, and that makes for a 51 second cache clearing time.  This is a lot 
better now than it used to be, because fsync hasn't flushed the whole 
cache in many years now.  (Only RHEL5 systems still in the field suffer 
much from that era of code.)  But you do need to look at the distribution 
of latency a bit because of how the cache impacts things; you can't just 
consider min/max values.


Take the BBWC out of the equation, and you'll see latency proportional 
to how long it takes to clear the disk's cache out. It's fun upgrading 
from a disk with 32MB of cache to 64MB only to watch worst case latency 
double.  At least the kernel does the right thing now, using that cache 
when it can while forcing data out when fsync calls arrive.  (That's 
another important kernel optimization we'll never be able to teach the 
database)


--
Greg Smith greg.sm...@crunchydatasolutions.com
Chief PostgreSQL Evangelist - http://crunchydatasolutions.com/




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-22 Thread Robert Haas
On Tue, Jan 21, 2014 at 3:20 PM, Jan Kara j...@suse.cz wrote:
 But that still doesn't work out very well, because now the guy who
 does the write() has to wait for it to finish before he can do
 anything else.  That's not always what we want, because WAL gets
 written out from our internal buffers for multiple different reasons.
   Well, you can always use AIO (io_submit) to submit direct IO without
 waiting for it to finish. But then you might need to track the outstanding
 IO so that you can watch with io_getevents() when it is finished.

Yeah.  That wouldn't work well for us; the process that did the
io_submit() would want to move on to other things, and how would it,
or any other process, know that the I/O had completed?
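
For the record, a minimal sketch of the pattern Jan is describing, using
the libaio wrappers. The file name, sizes and queue depth here are
illustrative, not anything PostgreSQL actually does; build with -laio:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <libaio.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        io_context_t ctx = 0;
        if (io_setup(32, &ctx) < 0) { perror("io_setup"); return 1; }

        int fd = open("walfile", O_WRONLY | O_CREAT | O_DIRECT, 0600);
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        posix_memalign(&buf, 4096, 8192);   /* O_DIRECT wants aligned I/O */
        memset(buf, 'x', 8192);

        struct iocb cb, *cbs[1] = { &cb };
        io_prep_pwrite(&cb, fd, buf, 8192, 0);
        if (io_submit(ctx, 1, cbs) != 1) { perror("io_submit"); return 1; }

        /* ... the caller is free to do other work here; the write
         * proceeds in the background ... */

        struct io_event ev;
        io_getevents(ctx, 1, 1, &ev, NULL);  /* reap: blocks until done */
        printf("write completed, res=%ld\n", (long)ev.res);

        io_destroy(ctx);
        close(fd);
        return 0;
    }

The tracking problem is exactly the bookkeeping between io_submit() and
io_getevents(): some process has to remember the iocb and come back for
the completion.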

   As I wrote in some other email in this thread, using IO priorities for
 data file checkpoint might be actually the right answer. They will work for
 IO submitted by fsync(). The downside is that currently IO priorities / IO
 scheduling classes work only with CFQ IO scheduler.

IMHO, the problem is simpler than that: no single process should be
allowed to completely screw over every other process on the system.
When the checkpointer process starts calling fsync(), the system
begins writing out the data that needs to be fsync()'d so aggressively
that service times for I/O requests from other processes go through the
roof.  It's difficult for me to imagine that any application on any
I/O scheduler is ever happy with that behavior.  We shouldn't need to
sprinkle our fsync() calls with special magic juju sauce that says
"hey, when you do this, could you try to avoid causing the rest of the
system to COMPLETELY GRIND TO A HALT?".  That should be the *default*
behavior, if not the *only* behavior.

Now, that is not to say that we're unwilling to sprinkle magic juju
sauce if that's what it takes to solve this problem.  If calling
fadvise() or sync_file_range() or some new API that you invent at some
point prior to calling fsync() helps the kernel do the right thing,
we're willing to do that.  Or if you/the Linux community wants to
invent a new API fsync_but_do_not_crush_system() and have us call that
instead of the regular fsync(), we're willing to do that, too.  But I
think there's an excellent case to be made, at least as far as
checkpoint I/O spikes are concerned, that the API is just fine as it
is and Linux's implementation is simply naive.  We'd be perfectly
happy to wait longer for fsync() to complete in exchange for not
starving the rest of the system - and really, who wouldn't?  Linux is
a multi-user system, and apportioning resources among multiple tasks
is a basic function of a multi-user kernel.

/rant
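
To make the sync_file_range() option mentioned above concrete: one thing
an application can already do is push dirty pages into writeback before
issuing the final fsync(). A hedged sketch -- the chunk size and loop
shape are arbitrary, and this is not something PostgreSQL does by
default today:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    static void gentle_fsync(int fd, off_t file_size)
    {
        const off_t chunk = 1 << 20;    /* 1MB at a time (arbitrary) */
        off_t       off;

        for (off = 0; off < file_size; off += chunk)
            /* Start writeback of this range without waiting for it. */
            sync_file_range(fd, off, chunk, SYNC_FILE_RANGE_WRITE);

        /* The durable barrier: most of the data should already be in
         * flight, so this stall is (hopefully) shorter and less
         * disruptive to everyone else's I/O. */
        fsync(fd);
    }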

Anyway, if CFQ or any other Linux I/O scheduler gets an option to
lower the priority of the fsyncs, I'm sure somebody here will test it
out and see whether it solves this problem.  AFAICT, experiments to
date have pretty much universally shown CFQ to be worse than not-CFQ
and everything else to be more or less equivalent - but if that
changes, I'm sure many PostgreSQL DBAs will be more than happy to flip
CFQ back on.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-22 Thread Dave Chinner
On Tue, Jan 21, 2014 at 09:20:52PM +0100, Jan Kara wrote:
 On Fri 17-01-14 08:57:25, Robert Haas wrote:
  On Fri, Jan 17, 2014 at 7:34 AM, Jeff Layton jlay...@redhat.com wrote:
   So this says to me that the WAL is a place where DIO should really be
   reconsidered. It's mostly sequential writes that need to hit the disk
   ASAP, and you need to know that they have hit the disk before you can
   proceed with other operations.
  
  Ironically enough, we actually *have* an option to use O_DIRECT here.
  But it doesn't work well.  See below.
  
   Also, is the WAL actually ever read under normal (non-recovery)
   conditions or is it write-only under normal operation? If it's seldom
   read, then using DIO for them also avoids some double buffering since
   they wouldn't go through pagecache.
  
  This is the first problem: if replication is in use, then the WAL gets
  read shortly after it gets written.  Using O_DIRECT bypasses the
  kernel cache for the writes, but then the reads stink.
   OK, yes, this is hard to fix with direct IO.

Actually, it's not. Block level caching is the time-honoured answer
to this problem, and it's been used very successfully on a large
scale by many organisations. e.g. facebook with MySQL, O_DIRECT, XFS
and flashcache sitting on an SSD in front of rotating storage.
There are multiple choices for this now - bcache, dm-cache,
flashcache, etc, and they all solve this same problem. And in many
cases do it better than using the page cache because you can
independently scale the size of the block level cache...

And given the size of SSDs these days, being able to put half a TB
of flash cache in front of spinning disks is a pretty inexpensive
way of solving such IO problems...

  If we're forcing the WAL out to disk because of transaction commit or
  because we need to write the buffer protected by a certain WAL record
  only after the WAL hits the platter, then it's fine.  But sometimes
  we're writing WAL just because we've run out of internal buffer space,
  and we don't want to block waiting for the write to complete.  Opening
  the file with O_SYNC deprives us of the ability to control the timing
  of the sync relative to the timing of the write.
   O_SYNC has a heavy performance penalty. For ext4 it means an extra fs
 transaction commit whenever there's any metadata changed on the filesystem.
 Since mtime & ctime of files will be changed often, this will be the case
 very often.

Therefore: O_DATASYNC.

  Maybe it'll be useful to have hints that say "always write this file
  to disk as quick as you can" and "always postpone writing this file to
  disk for as long as you can" for WAL and temp files respectively.  But
  the rule for the data files, which are the really important case, is
  not so simple.  fsync() is actually a fine API except that it tends to
  destroy system throughput.  Maybe what we need is just for fsync() to
  be less aggressive, or a less aggressive version of it.  We wouldn't
  mind waiting an almost arbitrarily long time for fsync to complete if
  other processes could still get their I/O requests serviced in a
  reasonable amount of time in the meanwhile.
   As I wrote in some other email in this thread, using IO priorities for
 data file checkpoint might be actually the right answer. They will work for
 IO submitted by fsync(). The downside is that currently IO priorities / IO
 scheduling classes work only with CFQ IO scheduler.

And I don't see it being implemented anywhere else because it's the
priority aware scheduling infrastructure in CFQ that causes all the
problems with IO concurrency and scalability...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-22 Thread Jan Kara
On Wed 22-01-14 09:07:19, Dave Chinner wrote:
 On Tue, Jan 21, 2014 at 09:20:52PM +0100, Jan Kara wrote:
   If we're forcing the WAL out to disk because of transaction commit or
   because we need to write the buffer protected by a certain WAL record
   only after the WAL hits the platter, then it's fine.  But sometimes
   we're writing WAL just because we've run out of internal buffer space,
   and we don't want to block waiting for the write to complete.  Opening
   the file with O_SYNC deprives us of the ability to control the timing
   of the sync relative to the timing of the write.
O_SYNC has a heavy performance penalty. For ext4 it means an extra fs
  transaction commit whenever there's any metadata changed on the filesystem.
  Since mtime & ctime of files will be changed often, this will be the case
  very often.
 
 Therefore: O_DATASYNC.
  O_DSYNC to be exact.
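
For reference, the difference being discussed is a single flag at open
time. A minimal sketch, with an illustrative WAL segment path:

    #include <fcntl.h>

    /* Open a WAL segment for data-synchronous writes: each write()
     * returns once the data (but not mtime/ctime metadata) is stable,
     * avoiding the extra journal commit O_SYNC incurs on ext4. */
    int open_wal_dsync(const char *path)
    {
        return open(path, O_WRONLY | O_DSYNC);
    }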

   Maybe it'll be useful to have hints that say "always write this file
   to disk as quick as you can" and "always postpone writing this file to
   disk for as long as you can" for WAL and temp files respectively.  But
   the rule for the data files, which are the really important case, is
   not so simple.  fsync() is actually a fine API except that it tends to
   destroy system throughput.  Maybe what we need is just for fsync() to
   be less aggressive, or a less aggressive version of it.  We wouldn't
   mind waiting an almost arbitrarily long time for fsync to complete if
   other processes could still get their I/O requests serviced in a
   reasonable amount of time in the meanwhile.
As I wrote in some other email in this thread, using IO priorities for
  data file checkpoint might be actually the right answer. They will work for
  IO submitted by fsync(). The downside is that currently IO priorities / IO
  scheduling classes work only with CFQ IO scheduler.
 
 And I don't see it being implemented anywhere else because it's the
 priority aware scheduling infrastructure in CFQ that causes all the
 problems with IO concurrency and scalability...
  So CFQ has all sorts of problems but I never had the impression that
priority aware scheduling is the culprit. It is all just complex - sync
idling, seeky writer detection, cooperating threads detection, sometimes
even sync vs async distinction isn't exactly what one would want. And I'm
not speaking about the cgroup stuff... So it doesn't seem to me that some
other IO scheduler couldn't reasonably efficiently implement stuff like IO
scheduling classes.

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-22 Thread Jan Kara
On Fri 17-01-14 08:57:25, Robert Haas wrote:
 On Fri, Jan 17, 2014 at 7:34 AM, Jeff Layton jlay...@redhat.com wrote:
  So this says to me that the WAL is a place where DIO should really be
  reconsidered. It's mostly sequential writes that need to hit the disk
  ASAP, and you need to know that they have hit the disk before you can
  proceed with other operations.
 
 Ironically enough, we actually *have* an option to use O_DIRECT here.
 But it doesn't work well.  See below.
 
  Also, is the WAL actually ever read under normal (non-recovery)
  conditions or is it write-only under normal operation? If it's seldom
  read, then using DIO for them also avoids some double buffering since
  they wouldn't go through pagecache.
 
 This is the first problem: if replication is in use, then the WAL gets
 read shortly after it gets written.  Using O_DIRECT bypasses the
 kernel cache for the writes, but then the reads stink.
  OK, yes, this is hard to fix with direct IO.

 However, if you configure wal_sync_method=open_sync and disable
 replication, then you will in fact get O_DIRECT|O_SYNC behavior.
 
 But that still doesn't work out very well, because now the guy who
 does the write() has to wait for it to finish before he can do
 anything else.  That's not always what we want, because WAL gets
 written out from our internal buffers for multiple different reasons.
  Well, you can always use AIO (io_submit) to submit direct IO without
waiting for it to finish. But then you might need to track the outstanding
IO so that you can watch with io_getevents() when it is finished.

 If we're forcing the WAL out to disk because of transaction commit or
 because we need to write the buffer protected by a certain WAL record
 only after the WAL hits the platter, then it's fine.  But sometimes
 we're writing WAL just because we've run out of internal buffer space,
 and we don't want to block waiting for the write to complete.  Opening
 the file with O_SYNC deprives us of the ability to control the timing
 of the sync relative to the timing of the write.
  O_SYNC has a heavy performance penalty. For ext4 it means an extra fs
transaction commit whenever there's any metadata changed on the filesystem.
Since mtime & ctime of files will be changed often, this will be the case
very often.

  Again, I think this discussion would really benefit from an outline of
  the different files used by pgsql, and what sort of data access
  patterns you expect with them.
 
 I think I more or less did that in my previous email, but here it is
 again in briefer form:
 
 - WAL files are written (and sometimes read) sequentially and fsync'd
 very frequently and it's always good to write the data out to disk as
 soon as possible
 - Temp files are written and read sequentially and never fsync'd.
 They should only be written to disk when memory pressure demands it
 (but are a good candidate when that situation comes up)
 - Data files are read and written randomly.  They are fsync'd at
 checkpoint time; between checkpoints, it's best not to write them
 sooner than necessary, but when the checkpoint arrives, they all need
 to get out to the disk without bringing the system to a standstill
 
 We have other kinds of files, but off-hand I'm not thinking of any
 that are really very interesting, apart from those.
 
 Maybe it'll be useful to have hints that say "always write this file
 to disk as quick as you can" and "always postpone writing this file to
 disk for as long as you can" for WAL and temp files respectively.  But
 the rule for the data files, which are the really important case, is
 not so simple.  fsync() is actually a fine API except that it tends to
 destroy system throughput.  Maybe what we need is just for fsync() to
 be less aggressive, or a less aggressive version of it.  We wouldn't
 mind waiting an almost arbitrarily long time for fsync to complete if
 other processes could still get their I/O requests serviced in a
 reasonable amount of time in the meanwhile.
  As I wrote in some other email in this thread, using IO priorities for
data file checkpoint might be actually the right answer. They will work for
IO submitted by fsync(). The downside is that currently IO priorities / IO
scheduling classes work only with CFQ IO scheduler.

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-22 Thread Bruce Momjian
On Tue, Jan 21, 2014 at 09:20:52PM +0100, Jan Kara wrote:
  If we're forcing the WAL out to disk because of transaction commit or
  because we need to write the buffer protected by a certain WAL record
  only after the WAL hits the platter, then it's fine.  But sometimes
  we're writing WAL just because we've run out of internal buffer space,
  and we don't want to block waiting for the write to complete.  Opening
  the file with O_SYNC deprives us of the ability to control the timing
  of the sync relative to the timing of the write.
   O_SYNC has a heavy performance penalty. For ext4 it means an extra fs
 transaction commit whenever there's any metadata changed on the filesystem.
 Since mtime & ctime of files will be changed often, this will be the case
 very often.

Also, there is the issue of writes that don't need syncing being synced
because sync is set on the file descriptor.  Here is output from our
pg_test_fsync tool when run on an SSD with a BBU:

$ pg_test_fsync
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
        open_datasync                                n/a
        fdatasync                  8424.785 ops/sec     119 usecs/op
        fsync                      7127.072 ops/sec     140 usecs/op
        fsync_writethrough                           n/a
        open_sync                 10548.469 ops/sec      95 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
        open_datasync                                n/a
        fdatasync                  4367.375 ops/sec     229 usecs/op
        fsync                      4427.761 ops/sec     226 usecs/op
        fsync_writethrough                           n/a
        open_sync                  4303.564 ops/sec     232 usecs/op

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB
in different write open_sync sizes.)
         1 * 16kB open_sync write  4938.711 ops/sec     202 usecs/op
         2 *  8kB open_sync writes 4233.897 ops/sec     236 usecs/op
         4 *  4kB open_sync writes 2904.710 ops/sec     344 usecs/op
         8 *  2kB open_sync writes 1736.720 ops/sec     576 usecs/op
        16 *  1kB open_sync writes  935.917 ops/sec    1068 usecs/op

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
        write, fsync, close        7626.783 ops/sec     131 usecs/op
        write, close, fsync        6492.697 ops/sec     154 usecs/op

Non-Sync'ed 8kB writes:
        write                    351517.178 ops/sec       3 usecs/op
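
The essence of what pg_test_fsync measures can be sketched in a few
lines of C: time a loop of 8kB write-plus-sync calls and report ops/sec.
The scratch file name and the fixed loop count below are assumptions for
illustration (the real tool runs each test for the configured number of
seconds):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/time.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[8192];
        struct timeval t0, t1;
        int  i, ops = 2000;
        int  fd = open("sync_test.dat", O_WRONLY | O_CREAT, 0600);

        memset(buf, 'x', sizeof buf);
        gettimeofday(&t0, NULL);
        for (i = 0; i < ops; i++) {
            pwrite(fd, buf, sizeof buf, 0);  /* rewrite the same block */
            fdatasync(fd);                   /* swap in fsync() to compare */
        }
        gettimeofday(&t1, NULL);

        double secs = (t1.tv_sec - t0.tv_sec)
                    + (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("fdatasync: %.3f ops/sec\n", ops / secs);
        close(fd);
        return 0;
    }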

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-22 Thread Jim Nasby

On 1/17/14, 7:57 AM, Robert Haas wrote:

- WAL files are written (and sometimes read) sequentially and fsync'd
very frequently and it's always good to write the data out to disk as
soon as possible
- Temp files are written and read sequentially and never fsync'd.
They should only be written to disk when memory pressure demands it
(but are a good candidate when that situation comes up)
- Data files are read and written randomly.  They are fsync'd at
checkpoint time; between checkpoints, it's best not to write them
sooner than necessary, but when the checkpoint arrives, they all need
to get out to the disk without bringing the system to a standstill


For sake of completeness... there are also data files that are temporary and 
don't need to be written to disk unless the kernel thinks there's better things 
to use that memory for. AFAIK those files are never fsync'd.

In other words, these are the same as the temp files Robert describes except 
they also have random access. Dunno if that matters.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-22 Thread Jim Nasby

On 1/17/14, 2:24 PM, Gregory Smith wrote:

I am skeptical that the database will take over very much of this work and perform better
than the Linux kernel does.  My take is that our most useful role would be providing test
cases kernel developers can add to a performance regression suite.  Ugly "we never
thought that would happen" situations seem to be at the root of many of the kernel
performance regressions people here get nailed by.


FWIW, there are some scenarios where we could potentially provide additional
info to the kernel scheduler; stuff that we know that it never will know.

For example, if we have a limit clause we can (sometimes) provide a rough 
estimate of how many pages we'll need to read from a relation.

Probably more useful is the case of index scans; if we pre-read more data from 
the index we could hand the kernel a list of base relation blocks that we know 
we'll need.
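
As a sketch of that idea (assuming the block numbers have already been
collected from the index; BLCKSZ and the helper are illustrative, not
PostgreSQL code):

    #include <fcntl.h>

    #define BLCKSZ 8192     /* PostgreSQL's default page size */

    /* Hint the kernel to start readahead for a list of heap pages. */
    static void prefetch_blocks(int fd, const long *blocks, int nblocks)
    {
        int i;

        for (i = 0; i < nblocks; i++)
            posix_fadvise(fd, (off_t)blocks[i] * BLCKSZ, BLCKSZ,
                          POSIX_FADV_WILLNEED);
    }

POSIX_FADV_WILLNEED is non-blocking; the reads are issued in the
background, so by the time the executor touches those pages they are
(ideally) already in cache.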

There's some other things that have been mentioned, such as cases where files 
will only be accessed sequentially.

Outside of that though, the kernel is going to be in a way better position to 
schedule IO than we will ever be. Not only does it understand the underlying 
hardware, it can also see everything else that's going on.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-22 Thread Jim Nasby

On 1/19/14, 5:51 PM, Dave Chinner wrote:

Postgres is far from being the only application that wants this; many
people resort to tmpfs because of this:
https://lwn.net/Articles/499410/

Yes, we covered the possibility of using tmpfs much earlier in the
thread, and came to the conclusion that temp files can be larger
than memory so tmpfs isn't the solution here. :)


Although... instead of inventing new APIs and foisting this work onto 
applications, perhaps it would be better to modify tmpfs such that it can 
handle a temp space that's larger than memory... possibly backing it with X 
amount of real disk and allowing it/the kernel to decide when to passively move 
files out of the in-memory tmpfs and onto disk.

Of course that's theoretically what swapping is supposed to do, but if that's 
not up to the job...
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-22 Thread Claudio Freire
On Wed, Jan 22, 2014 at 10:08 PM, Jim Nasby j...@nasby.net wrote:

 Probably more useful is the case of index scans; if we pre-read more data
 from the index we could hand the kernel a list of base relation blocks that
 we know we'll need.


Actually, I've already tried this. The most important part is fetching
heap pages, not index pages. Tried that too.

Currently, fadvising those pages works *to the detriment* of physically
correlated scans. That's a kernel bug I've reported to LKML, and I
could probably come up with a patch. I've just never had time to set
up the testing machinery to test the patch myself.




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-20 Thread Marti Raudsepp
On Mon, Jan 20, 2014 at 1:51 AM, Dave Chinner da...@fromorbit.com wrote:
 Postgres is far from being the only application that wants this; many
 people resort to tmpfs because of this:
 https://lwn.net/Articles/499410/

 Yes, we covered the possibility of using tmpfs much earlier in the
 thread, and came to the conclusion that temp files can be larger
 than memory so tmpfs isn't the solution here. :)

What I meant is: lots of applications want this behavior. If Linux
filesystems had support for delaying writeback for temporary files,
then there would be no point in mounting tmpfs on /tmp at all and we'd
get the best of both worlds.

Right now people resort to tmpfs because of this missing feature. And
then have their machines end up in swap hell if they overuse it.

Regards,
Marti




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-20 Thread Dave Chinner
On Sun, Jan 19, 2014 at 03:37:37AM +0200, Marti Raudsepp wrote:
 On Wed, Jan 15, 2014 at 5:34 AM, Jim Nasby j...@nasby.net wrote:
  it's very common to create temporary file data that will never, ever, ever
  actually NEED to hit disk. Where I work being able to tell the kernel to
  avoid flushing those files unless the kernel thinks it's got better things
  to do with that memory would be EXTREMELY valuable
 
 Windows has the FILE_ATTRIBUTE_TEMPORARY flag for this purpose.
 
 ISTR that there was discussion about implementing something analogous
 in Linux when ext4 got delayed allocation support, but I don't think
 it got anywhere and I can't find the discussion now. I think the
 proposed interface was to create and then unlink the file immediately,
 which serves as a hint that the application doesn't care about
 persistence.

You're thinking about O_TMPFILE, which is for making temp files that
can't be seen in the filesystem namespace, not for preventing them
from being written to disk.
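
To illustrate the distinction (directory path is illustrative):

    #define _GNU_SOURCE
    #include <fcntl.h>

    /* Creates an unnamed file on the filesystem backing /tmp.  It has
     * no directory entry and vanishes on close -- but its dirty pages
     * are still subject to normal writeback policy. */
    int make_anon_tempfile(void)
    {
        return open("/tmp", O_TMPFILE | O_RDWR, 0600);
    }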

I don't really like the idea of overloading a namespace directive to
have special writeback connotations. What we are getting into the
realm of here is generic user controlled allocation and writeback
policy...

 Postgres is far from being the only application that wants this; many
 people resort to tmpfs because of this:
 https://lwn.net/Articles/499410/

Yes, we covered the possibility of using tmpfs much earlier in the
thread, and came to the conclusion that temp files can be larger
than memory so tmpfs isn't the solution here. :)

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-20 Thread Mel Gorman
On Mon, Jan 20, 2014 at 10:51:41AM +1100, Dave Chinner wrote:
 On Sun, Jan 19, 2014 at 03:37:37AM +0200, Marti Raudsepp wrote:
  On Wed, Jan 15, 2014 at 5:34 AM, Jim Nasby j...@nasby.net wrote:
   it's very common to create temporary file data that will never, ever, ever
   actually NEED to hit disk. Where I work being able to tell the kernel to
   avoid flushing those files unless the kernel thinks it's got better things
   to do with that memory would be EXTREMELY valuable
  
  Windows has the FILE_ATTRIBUTE_TEMPORARY flag for this purpose.
  
  ISTR that there was discussion about implementing something analogous
  in Linux when ext4 got delayed allocation support, but I don't think
  it got anywhere and I can't find the discussion now. I think the
  proposed interface was to create and then unlink the file immediately,
  which serves as a hint that the application doesn't care about
  persistence.
 
 You're thinking about O_TMPFILE, which is for making temp files that
 can't be seen in the filesystem namespace, not for preventing them
 from being written to disk.
 
 I don't really like the idea of overloading a namespace directive to
 have special writeback connotations. What we are getting into the
 realm of here is generic user controlled allocation and writeback
 policy...
 

Such overloading would be unwelcome. FWIW, I assumed this would be an
fadvise thing. Initially something that controlled writeback on an inode
rather than an fd context, ignoring the offset and length parameters.
Granted, someone will probably throw a fit about adding a Linux-specific
flag to the fadvise64 syscall. POSIX_FADV_NOREUSE is currently unimplemented
and it could be argued that it could be used to flag temporary files that
have a different writeback policy, but it's not clear if that matches the
original intent of the posix flag.
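
Speculatively, that reuse might look like this from userspace -- note
that today the call below is accepted but is a no-op on Linux, and the
temp-file writeback semantics are purely hypothetical:

    #include <fcntl.h>

    /* Hypothetical semantics: ignore offset/len and flag the whole
     * inode as "write back only under memory pressure". */
    static void mark_temp_writeback_policy(int fd)
    {
        posix_fadvise(fd, 0, 0, POSIX_FADV_NOREUSE);
    }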

  Postgres is far from being the only application that wants this; many
  people resort to tmpfs because of this:
  https://lwn.net/Articles/499410/
 
 Yes, we covered the possibility of using tmpfs much earlier in the
 thread, and came to the conclusion that temp files can be larger
 than memory so tmpfs isn't the solution here. :)
 

And swap IO patterns blow chunks because people rarely want to touch
that area of the code with a 50 foot pole. It gets filed under "if you're
swapping, you already lost".

-- 
Mel Gorman
SUSE Labs




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-20 Thread Mel Gorman
On Fri, Jan 17, 2014 at 03:24:01PM -0500, Gregory Smith wrote:
 On 1/17/14 10:37 AM, Mel Gorman wrote:
 There is not an easy way to tell. To be 100% certain, it would require an
 instrumentation patch or a systemtap script to detect when a
 particular page is being written back and track the context. There
 are approximations though. Monitor nr_dirty pages over time.
 
 I have a benchmarking wrapper for the pgbench testing program called
 pgbench-tools:  https://github.com/gregs1104/pgbench-tools  As of
 October, on Linux it now plots the Dirty value from /proc/meminfo
 over time.
 SNIP

Cheers for pointing that out, I was not previously aware of its
existence. While I have some support for running pgbench via another kernel
testing framework (mmtests) the postgres-based tests are miserable. Right
now for me, pgbench is only set up to reproduce a workload that detected a
scheduler regression in the past so that it does not get reintroduced. I'd
like to have it running IO-based tests even though I typically do not
do proper regression testing for IO. I have used sysbench as a workload
generator before but it's not great for a number of reasons.

 I've been working on the problem of how we can make a benchmark test
 case that acts enough like real busy PostgreSQL servers that we can
 share it with kernel developers, and then everyone has an objective
 way to measure changes.  These rate limited tests are working much
 better for that than anything I came up with before.
 

This would be very welcome and thanks for the other observations on IO
scheduler parameter tuning. They could potentially be used to evaluate any IO
scheduler changes. For example -- deadline scheduler with these parameters
has X transactions/sec throughput with average latency of Y milliseconds
and a maximum fsync latency of Z seconds. Evaluate how well the out-of-box
behaviour compares against it with and without some set of patches.  At the
very least it would be useful for tracking historical kernel performance
over time and bisecting any regressions that got introduced. Once we have
a test I think many kernel developers (me at least) can run automated
bisections once a test case exists.

-- 
Mel Gorman
SUSE Labs




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-20 Thread Jeff Layton
On Mon, 20 Jan 2014 10:51:41 +1100
Dave Chinner da...@fromorbit.com wrote:

 On Sun, Jan 19, 2014 at 03:37:37AM +0200, Marti Raudsepp wrote:
  On Wed, Jan 15, 2014 at 5:34 AM, Jim Nasby j...@nasby.net wrote:
   it's very common to create temporary file data that will never, ever, ever
   actually NEED to hit disk. Where I work being able to tell the kernel to
   avoid flushing those files unless the kernel thinks it's got better things
   to do with that memory would be EXTREMELY valuable
  
  Windows has the FILE_ATTRIBUTE_TEMPORARY flag for this purpose.
  
  ISTR that there was discussion about implementing something analogous
  in Linux when ext4 got delayed allocation support, but I don't think
  it got anywhere and I can't find the discussion now. I think the
  proposed interface was to create and then unlink the file immediately,
  which serves as a hint that the application doesn't care about
  persistence.
 
 You're thinking about O_TMPFILE, which is for making temp files that
 can't be seen in the filesystem namespace, not for preventing them
 from being written to disk.
 
 I don't really like the idea of overloading a namespace directive to
 have special writeback connotations. What we are getting into the
 realm of here is generic user controlled allocation and writeback
 policy...
 

Agreed -- O_TMPFILE semantics are a different beast entirely.

Perhaps what might be reasonable though is a fadvise POSIX_FADV_TMPFILE
hint that tells the kernel: "Don't write out this data unless it's
necessary due to memory pressure."

If the inode is only open with file descriptors that have that hint
set on them, then we could exempt it from dirty_expire_interval and
dirty_writeback_interval?

Tracking that desire on an inode open multiple times might be
interesting though. We'd have to be quite careful not to allow that
to open an attack vector.
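
From the application side the proposed hint might look as follows.  To
be clear, POSIX_FADV_TMPFILE does not exist; the constant is invented
here purely to illustrate the interface, and a current kernel would
reject the advice value with EINVAL:

    #include <fcntl.h>

    #define POSIX_FADV_TMPFILE 42   /* hypothetical, not a real flag */

    /* Proposed meaning: don't write this file's dirty pages out unless
     * memory pressure demands it; exempt it from dirty_expire_interval
     * and dirty_writeback_interval while every opener has the hint. */
    static void hint_tempfile(int fd)
    {
        posix_fadvise(fd, 0, 0, POSIX_FADV_TMPFILE);
    }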

  Postgres is far from being the only application that wants this; many
  people resort to tmpfs because of this:
  https://lwn.net/Articles/499410/
 
 Yes, we covered the possibility of using tmpfs much earlier in the
 thread, and came to the conclusion that temp files can be larger
 than memory so tmpfs isn't the solution here. :)
 

-- 
Jeff Layton jlay...@redhat.com




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-20 Thread Bruce Momjian
On Wed, Jan 15, 2014 at 11:49:09AM +, Mel Gorman wrote:
 It may be the case that mmap/madvise is still required to handle a double
 buffering problem but it's far from being a free lunch and it has costs
 that read/write does not have to deal with. Maybe some of these problems
 can be fixed or mitigated but it is a case where a test case demonstrates
 the problem even if that requires patching PostgreSQL.

We suspected trying to use mmap would have costs, but it is nice to hear
actual details about it.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-17 Thread Dave Chinner
On Thu, Jan 16, 2014 at 03:58:56PM -0800, Jeff Janes wrote:
 On Thu, Jan 16, 2014 at 3:23 PM, Dave Chinner da...@fromorbit.com wrote:
 
  On Wed, Jan 15, 2014 at 06:14:18PM -0600, Jim Nasby wrote:
   On 1/15/14, 12:00 AM, Claudio Freire wrote:
   My completely unproven theory is that swapping is overwhelmed by
   near-misses. Ie: a process touches a page, and before it's
   actually swapped in, another process touches it too, blocking on
   the other process' read. But the second process doesn't account
   for that page when evaluating predictive models (ie: read-ahead),
   so the next I/O by process 2 is unexpected to the kernel. Then
   the same with 1. Etc... In essence, swap, by a fluke of its
   implementation, fails utterly to predict the I/O pattern, and
   results in far sub-optimal reads.
   
   Explicit I/O is free from that effect, all read calls are
   accountable, and that makes a difference.
   
   Maybe, if the kernel could be fixed in that respect, you could
   consider mmap'd files as a suitable form of temporary storage.
   But that would depend on the success and availability of such a
   fix/patch.
  
   Another option is to consider some of the more radical ideas in
   this thread, but only for temporary data. Our write sequencing and
   other needs are far less stringent for this stuff.  -- Jim C.
 
  I suspect that a lot of the temporary data issues can be solved by
  using tmpfs for temporary files
 
 
 Temp files can collectively reach hundreds of gigs.

So unless you have terabytes of RAM you're going to have to write
them back to disk.

But there's something here that I'm not getting - you're talking
about a data set that you want to keep cache resident that is at
least an order of magnitude larger than the cyclic 5-15 minute WAL
dataset that ongoing operations need to manage to avoid IO storms.
Where do these temporary files fit into this picture, how fast do
they grow and why do they need to be so large in comparison to
the ongoing modifications being made to the database?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-17 Thread Dave Chinner
On Wed, Jan 15, 2014 at 06:14:18PM -0600, Jim Nasby wrote:
 On 1/15/14, 12:00 AM, Claudio Freire wrote:
 My completely unproven theory is that swapping is overwhelmed by
 near-misses. Ie: a process touches a page, and before it's
 actually swapped in, another process touches it too, blocking on
 the other process' read. But the second process doesn't account
 for that page when evaluating predictive models (ie: read-ahead),
 so the next I/O by process 2 is unexpected to the kernel. Then
 the same with 1. Etc... In essence, swap, by a fluke of its
 implementation, fails utterly to predict the I/O pattern, and
 results in far sub-optimal reads.
 
 Explicit I/O is free from that effect, all read calls are
 accountable, and that makes a difference.
 
 Maybe, if the kernel could be fixed in that respect, you could
 consider mmap'd files as a suitable form of temporary storage.
 But that would depend on the success and availability of such a
 fix/patch.
 
 Another option is to consider some of the more radical ideas in
 this thread, but only for temporary data. Our write sequencing and
 other needs are far less stringent for this stuff.  -- Jim C.

I suspect that a lot of the temporary data issues can be solved by
using tmpfs for temporary files

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-17 Thread Dave Chinner
On Thu, Jan 16, 2014 at 08:48:24PM -0500, Robert Haas wrote:
 On Thu, Jan 16, 2014 at 7:31 PM, Dave Chinner da...@fromorbit.com wrote:
  But there's something here that I'm not getting - you're talking
  about a data set that you want to keep cache resident that is at
  least an order of magnitude larger than the cyclic 5-15 minute WAL
  dataset that ongoing operations need to manage to avoid IO storms.
  Where do these temporary files fit into this picture, how fast do
  they grow and why do they need to be so large in comparison to
  the ongoing modifications being made to the database?

[ snip ]

 Temp files are something else again.  If PostgreSQL needs to sort a
 small amount of data, like a kilobyte, it'll use quicksort.  But if it
 needs to sort a large amount of data, like a terabyte, it'll use a
 merge sort.[1] 

IOWs the temp files contain data that requires transformation as
part of a query operation. So, temp file size is bound by the
dataset, growth determined by data retrieval and transformation
rate.

IOWs, there are two very different IO and caching requirements in
play here and tuning the kernel for one actively degrades the
performance of the other. Right, got it now.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-17 Thread Jeff Layton
On Thu, 16 Jan 2014 20:48:24 -0500
Robert Haas robertmh...@gmail.com wrote:

 On Thu, Jan 16, 2014 at 7:31 PM, Dave Chinner da...@fromorbit.com wrote:
  But there's something here that I'm not getting - you're talking
  about a data set that you want to keep cache resident that is at
  least an order of magnitude larger than the cyclic 5-15 minute WAL
  dataset that ongoing operations need to manage to avoid IO storms.
  Where do these temporary files fit into this picture, how fast do
  they grow and why do they need to be so large in comparison to
  the ongoing modifications being made to the database?
 
 I'm not sure you've got that quite right.  WAL is fsync'd very
 frequently - on every commit, at the very least, and multiple times
 per second even when there are no commits going on, just to make sure we get
 it all down to the platter as fast as possible.  The thing that causes
 the I/O storm is the data file writes, which are performed either when
 we need to free up space in PostgreSQL's internal buffer pool (aka
 shared_buffers) or once per checkpoint interval (5-60 minutes) in any
 event.  The point of this system is that if we crash, we're going to
 need to replay all of the WAL to recover the data files to the proper
 state; but we don't want to keep WAL around forever, so we checkpoint
 periodically.  By writing all the data back to the underlying data
 files, checkpoints render older WAL segments irrelevant, at which
 point we can recycle those files before the disk fills up.
 

So this says to me that the WAL is a place where DIO should really be
reconsidered. It's mostly sequential writes that need to hit the disk
ASAP, and you need to know that they have hit the disk before you can
proceed with other operations.

Also, is the WAL actually ever read under normal (non-recovery)
conditions or is it write-only under normal operation? If it's seldom
read, then using DIO for them also avoids some double buffering since
they wouldn't go through pagecache.

Again, I think this discussion would really benefit from an outline of
the different files used by pgsql, and what sort of data access
patterns you expect with them.

 Temp files are something else again.  If PostgreSQL needs to sort a
 small amount of data, like a kilobyte, it'll use quicksort.  But if it
 needs to sort a large amount of data, like a terabyte, it'll use a
 merge sort.[1]  The reason is of course that quicksort requires random
 access to work well; if parts of quicksort's working memory get paged
 out during the sort, your life sucks.  Merge sort (or at least our
 implementation of it) is slower overall, but it only accesses the data
 sequentially.  When we do a merge sort, we use files to simulate the
 tapes that Knuth had in mind when he wrote down the algorithm.  If the
 OS runs short of memory - because the sort is really big or just
 because of other memory pressure - it can page out the parts of the
 file we're not actively using without totally destroying performance.
 It'll be slow, of course, because disks always are, but not like
 quicksort would be if it started swapping.
 
 I haven't actually experienced (or heard mentioned) the problem Jeff
 Janes is mentioning where temp files get written out to disk too
 aggressively; as mentioned before, the problems I've seen are usually
 the other way - stuff not getting written out aggressively enough.
 But it sounds plausible.  The OS only lets you set one policy, and if
 you make that policy right for permanent data files that get
 checkpointed, it could well be wrong for temp files that get thrown
 out.  Just stuffing the data on RAMFS will work for some
 installations, but might not be good if you actually do want to
 perform sorts whose size exceeds RAM.
 
 BTW, I haven't heard anyone on pgsql-hackers say they'd be interested
 in attending Collab on behalf of the PostgreSQL community.  Although
 the prospect of a cross-country flight is a somewhat depressing
 thought, it does sound pretty cool, so I'm potentially interested.  I
 have no idea what the procedure is here for moving forward though,
 especially since it sounds like there might be only one seat available
 and I don't know who else may wish to sit in it.
 


-- 
Jeff Layton jlay...@redhat.com




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-17 Thread Robert Haas
On Fri, Jan 17, 2014 at 7:34 AM, Jeff Layton jlay...@redhat.com wrote:
 So this says to me that the WAL is a place where DIO should really be
 reconsidered. It's mostly sequential writes that need to hit the disk
 ASAP, and you need to know that they have hit the disk before you can
 proceed with other operations.

Ironically enough, we actually *have* an option to use O_DIRECT here.
But it doesn't work well.  See below.

 Also, is the WAL actually ever read under normal (non-recovery)
 conditions or is it write-only under normal operation? If it's seldom
 read, then using DIO for them also avoids some double buffering since
 they wouldn't go through pagecache.

This is the first problem: if replication is in use, then the WAL gets
read shortly after it gets written.  Using O_DIRECT bypasses the
kernel cache for the writes, but then the reads stink.  However, if
you configure wal_sync_method=open_sync and disable replication, then
you will in fact get O_DIRECT|O_SYNC behavior.

But that still doesn't work out very well, because now the guy who
does the write() has to wait for it to finish before he can do
anything else.  That's not always what we want, because WAL gets
written out from our internal buffers for multiple different reasons.
If we're forcing the WAL out to disk because of transaction commit or
because we need to write the buffer protected by a certain WAL record
only after the WAL hits the platter, then it's fine.  But sometimes
we're writing WAL just because we've run out of internal buffer space,
and we don't want to block waiting for the write to complete.  Opening
the file with O_SYNC deprives us of the ability to control the timing
of the sync relative to the timing of the write.

 Again, I think this discussion would really benefit from an outline of
 the different files used by pgsql, and what sort of data access
 patterns you expect with them.

I think I more or less did that in my previous email, but here it is
again in briefer form:

- WAL files are written (and sometimes read) sequentially and fsync'd
very frequently and it's always good to write the data out to disk as
soon as possible
- Temp files are written and read sequentially and never fsync'd.
They should only be written to disk when memory pressure demands it
(but are a good candidate when that situation comes up)
- Data files are read and written randomly.  They are fsync'd at
checkpoint time; between checkpoints, it's best not to write them
sooner than necessary, but when the checkpoint arrives, they all need
to get out to the disk without bringing the system to a standstill

We have other kinds of files, but off-hand I'm not thinking of any
that are really very interesting, apart from those.

Maybe it'll be useful to have hints that say "always write this file
to disk as quick as you can" and "always postpone writing this file to
disk for as long as you can" for WAL and temp files respectively.  But
the rule for the data files, which are the really important case, is
not so simple.  fsync() is actually a fine API except that it tends to
destroy system throughput.  Maybe what we need is just for fsync() to
be less aggressive, or a less aggressive version of it.  We wouldn't
mind waiting an almost arbitrarily long time for fsync to complete if
other processes could still get their I/O requests serviced in a
reasonable amount of time in the meanwhile.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-17 Thread Mel Gorman
On Thu, Jan 16, 2014 at 04:30:59PM -0800, Jeff Janes wrote:
 On Wed, Jan 15, 2014 at 2:08 AM, Mel Gorman mgor...@suse.de wrote:
 
  On Tue, Jan 14, 2014 at 09:30:19AM -0800, Jeff Janes wrote:
   
That could be something we look at. There are cases buried deep in the
VM where pages get shuffled to the end of the LRU and get tagged for
reclaim as soon as possible. Maybe you need access to something like
that via posix_fadvise to say "reclaim this page if you need memory but
leave it resident if there is no memory pressure" or something similar.
Not exactly sure what that interface would look like or offhand how it
could be reliably implemented.
   
  
   I think the "reclaim this page if you need memory but leave it resident
   if there is no memory pressure" hint would be more useful for temporary
   working files than for what was being discussed above (shared buffers).
When I do work that needs large temporary files, I often see physical
   write IO spike but physical read IO does not.  I interpret that to mean
   that the temporary data is being written to disk to satisfy either
   dirty_expire_centisecs or dirty_*bytes, but the data remains in the FS
   cache and so disk reads are not needed to satisfy it.  So a hint that
   says "this file will never be fsynced so please ignore dirty_*bytes and
   dirty_expire_centisecs".
 
  It would be good to know if dirty_expire_centisecs or dirty ratio|bytes
  were the problem here.
 
 
 Is there an easy way to tell?  I would guess it has to be at least
 dirty_expire_centisecs, if not both, as a very large sort operation takes a
 lot more than 30 seconds to complete.
 

There is not an easy way to tell. To be 100% certain, it would require an
instrumentation patch or a systemtap script to detect when a particular page
is being written back and track the context. There are approximations though.
Monitor nr_dirty pages over time. If at the time of the stall there are fewer
dirty pages than allowed by dirty_ratio then the dirty_expire_centisecs
kicked in. That or monitor the process for stalls, when it stalls check
/proc/PID/stack and see if it's stuck in balance_dirty_pages or something
similar which would indicate the process hit dirty_ratio.
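
For the first approximation, a tiny monitor is enough -- sample the
Dirty: line of /proc/meminfo once per second and compare the value seen
at stall time against the dirty limits (sampling interval and output
format here are arbitrary):

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char line[128];

        for (;;) {
            FILE *f = fopen("/proc/meminfo", "r");
            if (!f)
                return 1;
            while (fgets(line, sizeof line, f))
                if (strncmp(line, "Dirty:", 6) == 0)
                    fputs(line, stdout);   /* e.g. "Dirty:  123456 kB" */
            fclose(f);
            sleep(1);
        }
    }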

  An interface that forces a dirty page to stay dirty
  regardless of the global system would be a major hazard. It potentially
  allows the creator of the temporary file to stall all other processes
  dirtying pages for an unbounded period of time.
 
 Are the dirty ratio/bytes limits the mechanisms by which adequate clean
 memory is maintained? 

Yes, for file-backed pages.

 I thought those were there just to put a limit on how
 long it would take to execute a sync call should one be issued, and there
 were other settings which said how much clean memory to maintain.  It should
 definitely write out the pages if it needs the memory for other things,
 just not write them out due to fear of how long it would take to sync it if
 a sync was called.  (And if it needs the memory, it should be able to write
 it out quickly as the writes would be mostly sequential, not
 random--although how the kernel can believe me that that will always be the
 case could be a problem)
 

It has been suggested on more than one occasion that a more sensible
interface would be to not allow more dirty data than it takes N seconds
to write back. The details of how to implement this are tricky and no one
has taken up the challenge yet.

  I proposed in another part
  of the thread a hint for open inodes to have the background writer thread
  ignore dirty pages belonging to that inode. Dirty limits and fsync would
  still be obeyed. It might also be workable for temporary files but the
  proposal could be full of holes.
 
 
 If calling fsync would fail with an error, would that lower the risk of DoS?
 

I do not understand the proposal. If there are pages that must remain
dirty and the kernel cannot touch then there will be the risk that
dirty_ratio number of pages are all untouchable and the system livelocks
until userspace takes an action.

That still leaves the possibility of flagging temp pages that should
only be written to disk if the kernel really needs to.

-- 
Mel Gorman
SUSE Labs




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-17 Thread Hannu Krosing
On 01/17/2014 06:40 AM, Dave Chinner wrote:
 On Thu, Jan 16, 2014 at 08:48:24PM -0500, Robert Haas wrote:
 On Thu, Jan 16, 2014 at 7:31 PM, Dave Chinner da...@fromorbit.com wrote:
 But there's something here that I'm not getting - you're talking
  about a data set that you want to keep cache resident that is at
 least an order of magnitude larger than the cyclic 5-15 minute WAL
 dataset that ongoing operations need to manage to avoid IO storms.
 Where do these temporary files fit into this picture, how fast do
  they grow and why do they need to be so large in comparison to
 the ongoing modifications being made to the database?
 [ snip ]

 Temp files are something else again.  If PostgreSQL needs to sort a
 small amount of data, like a kilobyte, it'll use quicksort.  But if it
 needs to sort a large amount of data, like a terabyte, it'll use a
 merge sort.[1] 
 IOWs the temp files contain data that requires transformation as
 part of a query operation. So, temp file size is bound by the
 dataset, 
Basically yes, though the size of the dataset can be orders of
magnitude bigger than the database in case of some queries.
 growth determined by data retrieval and transformation
 rate.

 IOWs, there are two very different IO and caching requirements in
 play here and tuning the kernel for one actively degrades the
 performance of the other. Right, got it now.
Yes. A step in the right direction would be some way to tune this
on a per-device basis, but as a large part of this in Linux seems
to be driven from the keeping-the-VM-clean side, I guess it will
be far from simple.

 Cheers,

 Dave.


-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-17 Thread Gregory Smith

On 1/17/14 10:37 AM, Mel Gorman wrote:
There is not an easy way to tell. To be 100% certain, it would require an 
instrumentation patch or a systemtap script to detect when a 
particular page is being written back and track the context. There are 
approximations though. Monitor nr_dirty pages over time.


I have a benchmarking wrapper for the pgbench testing program called 
pgbench-tools:  https://github.com/gregs1104/pgbench-tools  As of 
October, on Linux it now plots the Dirty value from /proc/meminfo over 
time.  You get that on the same time axis as the transaction latency 
data.  The report at the end includes things like the maximum amount of 
dirty memory observed during the test sampling. That doesn't tell you 
exactly what's happening to the level someone reworking the kernel logic 
might want, but you can easily see things like the database's checkpoint 
cycle reflected by watching the dirty memory total.  This works really 
well for monitoring production servers too.  I have a lot of data from a 
plugin for the Munin monitoring system that plots the same way.  Once 
you have some history about what's normal, it's easy to see when systems 
fall behind in a way that's ruining writes, and the high water mark 
often correlates with bad responsiveness periods.
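
The sampling core of that is trivial; it amounts to something like this
(sketch):

while sleep 1; do
    echo "$(date +%s) $(awk '/^Dirty:/ {print $2}' /proc/meminfo)"
done >> dirty-kb.log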


Another recent change is that pgbench for the upcoming PostgreSQL 9.4 
now allows you to specify a target transaction rate.  Seeing the write 
latency behavior with that in place is far more interesting than 
anything we were able to watch with pgbench before.  The pgbench write 
tests we've been doing for years mainly told you the throughput rate 
when all of the caches were always as full as the database could make 
them, and tuning for that is not very useful. Turns out it's far more 
interesting to run at 50% of what the storage is capable of, then watch 
what happens to latency when you adjust things like the dirty_* parameters.
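
For example (a sketch; -R is the new rate-limit option in the 9.4 pgbench,
and the numbers are made up):

pgbench -c 16 -j 4 -T 600 -R 400 -l bench   # hold ~400 tps, log per-transaction latency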


I've been working on the problem of how we can make a benchmark test 
case that acts enough like real busy PostgreSQL servers that we can 
share it with kernel developers, and then everyone has an objective way 
to measure changes.  These rate limited tests are working much better 
for that than anything I came up with before.


I am skeptical that the database will take over very much of this work 
and perform better than the Linux kernel does.  My take is that our most 
useful role would be providing test cases kernel developers can add to a 
performance regression suite.  Ugly "we never thought that would happen" 
situations seem to be at the root of many of the kernel performance 
regressions people here get nailed by.


Effective I/O scheduling is very hard, and we are unlikely to ever out 
innovate the kernel hacking community by pulling more of that into the 
database.  It's already possible to experiment with moving in that 
direction with tuning changes.  Use a larger database shared_buffers 
value, tweak checkpoints to spread I/O out, and reduce things like 
dirty_ratio.  I do some of that, but I've learned it's dangerous to 
wander too far that way.


If instead you let Linux do even more work--give it a lot of memory to 
manage and room to re-order I/O--that can work out quite well. For 
example, I've seen a lot of people try to keep latency down by using the 
deadline scheduler and very low settings for the expire times.  The theory 
is great, but it never works out in the real world for me.  
Here's the sort of deadline tuning I deploy instead now:


echo 500 > ${DEV}/queue/iosched/read_expire
echo 300000 > ${DEV}/queue/iosched/write_expire
echo 1048576 > ${DEV}/queue/iosched/writes_starved

These numbers look insane compared to the defaults, but I assure you 
they're from a server that's happily chugging through 5 to 10K 
transactions/second around the clock.  PostgreSQL forces writes out with 
fsync when they must go out, but this sort of tuning is basically giving 
up on it managing writes beyond that.  We really have no idea what order 
they should go out in.  I just let the kernel have a large pile of work 
queued up, and trust things like the kernel's block elevator and 
congestion code are smarter than the database can possibly be.
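
For completeness: ${DEV} above is the device's sysfs directory, and the
deadline scheduler has to be selected before those knobs exist at all,
e.g. (illustrative):

DEV=/sys/block/sda
echo deadline > ${DEV}/queue/scheduler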


--
Greg Smith greg.sm...@crunchydatasolutions.com
Chief PostgreSQL Evangelist - http://crunchydatasolutions.com/


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-16 Thread knizhnik
I wonder if the kernel could provide a weaker version of fsync() which 
does not force all pending data to be written immediately but just 
serves as a write barrier, guaranteeing
that all write operations preceding the fsync() will be completed before any 
of the subsequent operations.


It would allow implementation of weaker transaction models which do not 
satisfy all ACID requirements (results of a committed transaction can 
be lost in case of power failure or OS crash) but still preserve database 
consistency. That is acceptable for many applications and can provide much 
better performance.


Right now it is possible to implement something like this at the application 
level using an asynchronous write process, with all write/sync operations 
redirected to that process.
But such a process can become a bottleneck reducing the scalability of the 
system, and the communication channels with it can cause 
significant memory/CPU overhead.


In most DBMSes, including PostgreSQL, the transaction log and database data 
are located in separate files. So such a write barrier should be 
associated not with one file, but with a set of files or maybe the whole 
file system.  I wonder if there are any fundamental problems in 
implementing or using such a file system write barrier?




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-16 Thread Jeremy Harris

On 14/01/14 22:23, Dave Chinner wrote:

On Tue, Jan 14, 2014 at 11:40:38AM -0800, Kevin Grittner wrote:

To quantify that, in a production setting we were seeing pauses of
up to two minutes with shared_buffers set to 8GB and default dirty

^

page settings for Linux, on a machine with 256GB RAM and 512MB

   ^
There's your problem.

By default, background writeback doesn't start until 10% of memory
is dirtied, and on your machine that's 25GB of RAM. That's way too
high for your workload.

It appears to me that we are seeing large memory machines much more
commonly in data centers - a couple of years ago 256GB RAM was only
seen in supercomputers. Hence machines of this size are moving from
the "tweaking settings for supercomputers is OK" class to the
"tweaking settings for enterprise servers is not OK" class.

Perhaps what we need to do is deprecate dirty_ratio and
dirty_background_ratio as the defaults and move to the byte
based values as the defaults, capped appropriately, e.g.
10/20% of RAM for small machines down to a couple of GB for large
machines.


<whisper>  Perhaps the kernel needs a dirty-amount control measured
in time units rather than pages (it being up to the kernel to
measure the achievable write rate)...
--
Cheers,
   Jeremy


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-16 Thread Jan Kara
On Wed 15-01-14 21:37:16, Robert Haas wrote:
 On Wed, Jan 15, 2014 at 8:41 PM, Jan Kara j...@suse.cz wrote:
  On Wed 15-01-14 10:12:38, Robert Haas wrote:
  On Wed, Jan 15, 2014 at 4:35 AM, Jan Kara j...@suse.cz wrote:
   Filesystems could in theory provide facility like atomic write (at least 
   up
   to a certain size say in MB range) but it's not so easy and when there 
   are
   no strong usecases fs people are reluctant to make their code more 
   complex
   unnecessarily. OTOH without widespread atomic write support I understand
   application developers have similar stance. So it's kind of chicken and 
   egg
   problem. BTW, e.g. ext3/4 has quite a bit of the infrastructure in place
   due to its data=journal mode so if someone on the PostgreSQL side wanted 
   to
   research on this, knitting some experimental ext4 patches should be 
   doable.
 
  Atomic 8kB writes would improve performance for us quite a lot.  Full
  page writes to WAL are very expensive.  I don't remember what
  percentage of write-ahead log traffic that accounts for, but it's not
  small.
OK, and do you need atomic writes on a per-IO basis or is per-file enough?
  It basically boils down to - is all or most of the IO to a file going to be
  atomic, or is it a smaller fraction?
 
 The write-ahead log wouldn't need it, but data files writes would.  So
 we'd need it a lot, but not for absolutely everything.
 
 For any given file, we'd either care about writes being atomic, or we
 wouldn't.
  OK, when you say that either all writes to a file should be atomic or
none of them should be, then can you try the following:
chattr +j file

  will turn on data journalling for file on ext3/ext4 filesystem.
Currently it *won't* guarantee atomicity in all cases but the
performance will be very similar to as if it did. You might also want to
increase filesystem journal size with 'tune2fs -J size=XXX /dev/yyy' where
XXX is desired journal size in MB. Default is 128 MB I think but with
intensive data journalling you might want to have that in GB range. I'd be
interested in hearing what impact does turning 'atomic write' support
in PostgreSQL and using data journalling on ext4 have.
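
Putting the above together, the experiment is roughly this (the device and
file names are examples only; resizing the journal needs the filesystem
unmounted):

umount /dev/sdb1
tune2fs -J size=2048 /dev/sdb1                  # 2GB journal, per the above
mount /dev/sdb1 /var/lib/pgsql
chattr +j /var/lib/pgsql/data/base/16384/16385  # data journalling for one file
lsattr /var/lib/pgsql/data/base/16384/16385     # verify the 'j' flag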

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-16 Thread Jan Kara
On Wed 15-01-14 10:12:38, Robert Haas wrote:
 On Wed, Jan 15, 2014 at 4:35 AM, Jan Kara j...@suse.cz wrote:
  Filesystems could in theory provide facility like atomic write (at least up
  to a certain size say in MB range) but it's not so easy and when there are
  no strong usecases fs people are reluctant to make their code more complex
  unnecessarily. OTOH without widespread atomic write support I understand
  application developers have similar stance. So it's kind of chicken and egg
  problem. BTW, e.g. ext3/4 has quite a bit of the infrastructure in place
  due to its data=journal mode so if someone on the PostgreSQL side wanted to
  research on this, knitting some experimental ext4 patches should be doable.
 
 Atomic 8kB writes would improve performance for us quite a lot.  Full
 page writes to WAL are very expensive.  I don't remember what
 percentage of write-ahead log traffic that accounts for, but it's not
 small.
  OK, and do you need atomic writes on a per-IO basis or is per-file enough?
It basically boils down to - is all or most of the IO to a file going to be
atomic, or is it a smaller fraction?

As Dave notes, unless there is HW support (which is coming with newest
solid state drives), ext4/xfs will have to implement this by writing data
to a filesystem journal and after transaction commit checkpointing them to
a final location. Which is exactly what you do with your WAL logs so
it's not clear it will be a performance win. But it is easy enough to code
for ext4 that I'm willing to try...

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-16 Thread Dave Chinner
On Wed, Jan 15, 2014 at 07:31:15PM -0500, Tom Lane wrote:
 Dave Chinner da...@fromorbit.com writes:
  On Wed, Jan 15, 2014 at 07:08:18PM -0500, Tom Lane wrote:
  No, we'd be happy to re-request it during each checkpoint cycle, as
  long as that wasn't an unduly expensive call to make.  I'm not quite
  sure where such requests ought to live though.  One idea is to tie
  them to file descriptors; but the data to be written might be spread
  across more files than we really want to keep open at one time.
 
  It would be a property of the inode, as that is how writeback is
  tracked and timed. Set and queried through a file descriptor,
  though - it's basically the same context that fadvise works
  through.
 
 Ah, got it.  That would be fine on our end, I think.
 
  We could probably live with serially checkpointing data
  in sets of however-many-files-we-can-have-open, if file descriptors are
  the place to keep the requests.
 
  Inodes live longer than file descriptors, but there's no guarantee
  that they live from one fd context to another. Hence my question
  about persistence ;)
 
 I plead ignorance about what an fd context is.

open-to-close life time.

fd = open("some/file", ...);
.....
close(fd);

is a single context. If multiple fd contexts of the same file
overlap in lifetime, then the inode is constantly referenced and the
inode won't get reclaimed so the value won't get lost. However, if
there is no open fd context, there are no external references to the
inode so it can get reclaimed. Hence there's no guarantee that the
inode is present and the writeback property maintained across
close-to-open timeframes.

 We're ahead of the game as long as it usually works.

*nod*

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-16 Thread Jeff Layton
On Wed, 15 Jan 2014 21:37:16 -0500
Robert Haas robertmh...@gmail.com wrote:

 On Wed, Jan 15, 2014 at 8:41 PM, Jan Kara j...@suse.cz wrote:
  On Wed 15-01-14 10:12:38, Robert Haas wrote:
  On Wed, Jan 15, 2014 at 4:35 AM, Jan Kara j...@suse.cz wrote:
   Filesystems could in theory provide facility like atomic write (at least 
   up
   to a certain size say in MB range) but it's not so easy and when there 
   are
   no strong usecases fs people are reluctant to make their code more 
   complex
   unnecessarily. OTOH without widespread atomic write support I understand
   application developers have similar stance. So it's kind of chicken and 
   egg
   problem. BTW, e.g. ext3/4 has quite a bit of the infrastructure in place
   due to its data=journal mode so if someone on the PostgreSQL side wanted 
   to
   research on this, knitting some experimental ext4 patches should be 
   doable.
 
  Atomic 8kB writes would improve performance for us quite a lot.  Full
  page writes to WAL are very expensive.  I don't remember what
  percentage of write-ahead log traffic that accounts for, but it's not
  small.
OK, and do you need atomic writes on a per-IO basis or is per-file enough?
  It basically boils down to - is all or most of the IO to a file going to be
  atomic, or is it a smaller fraction?
 
 The write-ahead log wouldn't need it, but data files writes would.  So
 we'd need it a lot, but not for absolutely everything.
 
 For any given file, we'd either care about writes being atomic, or we 
 wouldn't.
 

Just getting caught up on this thread. One thing that you're just now
getting to here is that the different types of files in the DB have
different needs.

It might be good to outline each type of file (WAL, data files, tmp
files), what sort of I/O patterns are typically done to them, and what
sort of special needs they have (atomicity or whatever). Then we
could treat each file type as a separate problem, which may make some
of these problems easier to solve.

For instance, typically a WAL would be fairly sequential I/O, whereas
the data files are almost certainly random. It may make sense to
consider DIO for some of these use-cases, even if it's not suitable
everywhere.

For tempfiles, it may make sense to consider housing those on tmpfs.
They wouldn't go to disk at all that way, but if there is mem pressure
they could get swapped out (maybe this is standard practice already --
I don't know).
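
For example (a sketch; the size and paths are site-specific guesses):

mount -t tmpfs -o size=32g tmpfs /srv/pg_temp
chown postgres /srv/pg_temp
psql -c "CREATE TABLESPACE temp_mem LOCATION '/srv/pg_temp'"
# then: temp_tablespaces = 'temp_mem' in postgresql.conf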

  As Dave notes, unless there is HW support (which is coming with newest
  solid state drives), ext4/xfs will have to implement this by writing data
  to a filesystem journal and after transaction commit checkpointing them to
  a final location. Which is exactly what you do with your WAL logs so
  it's not clear it will be a performance win. But it is easy enough to code
  for ext4 that I'm willing to try...
 
 Yeah, hardware support would be great.
 


-- 
Jeff Layton jlay...@redhat.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-16 Thread Theodore Ts'o
On Wed, Jan 15, 2014 at 10:35:44AM +0100, Jan Kara wrote:
 Filesystems could in theory provide facility like atomic write (at least up
 to a certain size say in MB range) but it's not so easy and when there are
 no strong usecases fs people are reluctant to make their code more complex
 unnecessarily. OTOH without widespread atomic write support I understand
 application developers have similar stance. So it's kind of chicken and egg
 problem. BTW, e.g. ext3/4 has quite a bit of the infrastructure in place
 due to its data=journal mode so if someone on the PostgreSQL side wanted to
 research on this, knitting some experimental ext4 patches should be doable.

For the record, a researcher (plus his PhD student) at HP Labs actually
implemented a prototype based on ext3 which created an atomic write
facility.  It was good up to about 25% of the ext3 journal size (so, a
couple of MB), and it was used to research persistent memory by
creating a persistent heap using standard in-memory data structures as
a replacement for using a database.

The result of their research was that ext3 plus
atomic write plus standard Java associative arrays beat using SQLite.

It was a research prototype, so they didn't handle OOM kill
conditions, and they also didn't try benchmarking against a real
database instead of a toy database such as SQLite, but if someone
wants to experiment with atomic write, there are patches against ext3
that we can probably get from HP Labs.

  - Ted


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-16 Thread Jeff Janes
On Thu, Jan 16, 2014 at 3:23 PM, Dave Chinner da...@fromorbit.com wrote:

 On Wed, Jan 15, 2014 at 06:14:18PM -0600, Jim Nasby wrote:
  On 1/15/14, 12:00 AM, Claudio Freire wrote:
  My completely unproven theory is that swapping is overwhelmed by
  near-misses. Ie: a process touches a page, and before it's
  actually swapped in, another process touches it too, blocking on
  the other process' read. But the second process doesn't account
  for that page when evaluating predictive models (ie: read-ahead),
  so the next I/O by process 2 is unexpected to the kernel. Then
  the same with 1. Etc... In essence, swap, by a fluke of its
  implementation, fails utterly to predict the I/O pattern, and
  results in far sub-optimal reads.
  
  Explicit I/O is free from that effect, all read calls are
  accountable, and that makes a difference.
  
  Maybe, if the kernel could be fixed in that respect, you could
  consider mmap'd files as a suitable form of temporary storage.
  But that would depend on the success and availability of such a
  fix/patch.
 
  Another option is to consider some of the more radical ideas in
  this thread, but only for temporary data. Our write sequencing and
  other needs are far less stringent for this stuff.  -- Jim C.

 I suspect that a lot of the temporary data issues can be solved by
 using tmpfs for temporary files


Temp files can collectively reach hundreds of gigs.  So I would have to set
up two temporary tablespaces, one in tmpfs and one in regular storage, and
then remember to choose between them based on my estimate of how much temp
space is going to be used in each connection (and hope I don't mess up the
estimation and so either get errors, or render the server unresponsive).

So I just use regular storage, and pay the insurance premium of having
some extraneous write IO.  It would be nice if the insurance premium were
cheaper, though.  I think the IO storms during checkpoint syncs are
definitely the more critical issue; this is just something nice to have
which seemed to align with one of the comments.

Cheers,

Jeff


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-16 Thread Robert Haas
On Thu, Jan 16, 2014 at 7:31 PM, Dave Chinner da...@fromorbit.com wrote:
 But there's something here that I'm not getting - you're talking
 about a data set that you want to keep cache resident that is at
 least an order of magnitude larger than the cyclic 5-15 minute WAL
 dataset that ongoing operations need to manage to avoid IO storms.
 Where do these temporary files fit into this picture, how fast do
 they grow and why do they need to be so large in comparison to
 the ongoing modifications being made to the database?

I'm not sure you've got that quite right.  WAL is fsync'd very
frequently - on every commit, at the very least, and multiple times
per second even when there are no commits going on, just to make sure we get
it all down to the platter as fast as possible.  The thing that causes
the I/O storm is the data file writes, which are performed either when
we need to free up space in PostgreSQL's internal buffer pool (aka
shared_buffers) or once per checkpoint interval (5-60 minutes) in any
event.  The point of this system is that if we crash, we're going to
need to replay all of the WAL to recover the data files to the proper
state; but we don't want to keep WAL around forever, so we checkpoint
periodically.  By writing all the data back to the underlying data
files, checkpoints render older WAL segments irrelevant, at which
point we can recycle those files before the disk fills up.

Temp files are something else again.  If PostgreSQL needs to sort a
small amount of data, like a kilobyte, it'll use quicksort.  But if it
needs to sort a large amount of data, like a terabyte, it'll use a
merge sort.[1]  The reason is of course that quicksort requires random
access to work well; if parts of quicksort's working memory get paged
out during the sort, your life sucks.  Merge sort (or at least our
implementation of it) is slower overall, but it only accesses the data
sequentially.  When we do a merge sort, we use files to simulate the
tapes that Knuth had in mind when he wrote down the algorithm.  If the
OS runs short of memory - because the sort is really big or just
because of other memory pressure - it can page out the parts of the
file we're not actively using without totally destroying performance.
It'll be slow, of course, because disks always are, but not like
quicksort would be if it started swapping.

I haven't actually experienced (or heard mentioned) the problem Jeff
Janes is mentioning where temp files get written out to disk too
aggressively; as mentioned before, the problems I've seen are usually
the other way - stuff not getting written out aggressively enough.
But it sounds plausible.  The OS only lets you set one policy, and if
you make that policy right for permanent data files that get
checkpointed it could well be wrong for temp files that get thrown
out.  Just stuffing the data on RAMFS will work for some
installations, but might not be good if you actually do want to
perform sorts whose size exceeds RAM.

BTW, I haven't heard anyone on pgsql-hackers say they'd be interested
in attending Collab on behalf of the PostgreSQL community.  Although
the prospect of a cross-country flight is a somewhat depressing
thought, it does sound pretty cool, so I'm potentially interested.  I
have no idea what the procedure is here for moving forward though,
especially since it sounds like there might be only one seat available
and I don't know who else may wish to sit in it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] The threshold where we switch from quicksort to merge sort is a
configurable parameter.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Heikki Linnakangas

On 01/15/2014 06:01 AM, Jim Nasby wrote:

For the sake of completeness... it's theoretically silly that Postgres
is doing all this stuff with WAL when the filesystem is doing something
very similar with its journal. And an SSD drive (and next generation
spinning rust) is doing the same thing *again* in its own journal.

If all 3 communities (or even just 2 of them!) could agree on the
necessary interface a tremendous amount of this duplicated technology
could be eliminated.

That said, I rather doubt the Postgres community would go this route,
not so much because of the presumably massive changes needed, but more
because our community is not a fan of restricting our users to things
like "Thou shalt use a journaled FS or risk all thy data!"


The WAL is also used for continuous archiving and replication, not just 
crash recovery. We could skip full-page-writes, though, if we knew that 
the underlying filesystem/storage is guaranteeing that a write() is atomic.


It might be useful for PostgreSQL to somehow tell the filesystem that we're 
taking care of WAL-logging, so that the filesystem doesn't need to.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Hannu Krosing
On 01/14/2014 06:12 PM, Robert Haas wrote:
 This would be pretty similar to copy-on-write, except
 without the copying. It would just be
 forget-from-the-buffer-pool-on-write. 

+1

A version of this could probably already be implemented using MADV_DONTNEED
and MADV_WILLNEED

That is, just after reading the page in, use MADV_DONTNEED on it. When
evicting
a clean page, check that it is still in cache and if it is, then
MADV_WILLNEED it.

Another nice thing to do would be dynamically adjusting kernel
dirty_background_ratio
and other related knobs in real time based on how many buffers are dirty
inside postgresql.
Maybe in background writer.

Question to LKML folks - will the kernel react well to frequent changes to
/proc/sys/vm/dirty_* ?
How frequent can they be (every few seconds? every second? 100 Hz?)

Cheers

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Hannu Krosing
On 01/15/2014 12:16 PM, Hannu Krosing wrote:
 On 01/14/2014 06:12 PM, Robert Haas wrote:
 This would be pretty similar to copy-on-write, except
 without the copying. It would just be
 forget-from-the-buffer-pool-on-write. 
 +1

 A version of this could probably already be implemented using MADV_DONTNEED
 and MADV_WILLNEED

 That is, just after reading the page in, use MADV_DONTNEED on it. When
 evicting
 a clean page, check that it is still in cache and if it is, then
 MADV_WILLNEED it.

 Another nice thing to do would be dynamically adjusting kernel
 dirty_background_ratio
 and other related knobs in real time based on how many buffers are dirty
 inside postgresql.
 Maybe in background writer.

 Question to LKML folks - will the kernel react well to frequent changes to
 /proc/sys/vm/dirty_* ?
 How frequent can they be (every few seconds? every second? 100 Hz?)
One obvious use case of this would be changing dirty_background_bytes
linearly to almost zero during a checkpoint to make final fsync fast.
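
Roughly like this (an untested sketch; the base value and step timing are
invented):

# ramp dirty_background_bytes down as the checkpoint deadline nears
for frac in 1 2 4 8 16; do
    echo $((256 * 1024 * 1024 / frac)) > /proc/sys/vm/dirty_background_bytes
    sleep 30
done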

Cheers

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Mel Gorman
On Mon, Jan 13, 2014 at 02:19:56PM -0800, James Bottomley wrote:
 On Mon, 2014-01-13 at 22:12 +0100, Andres Freund wrote:
  On 2014-01-13 12:34:35 -0800, James Bottomley wrote:
   On Mon, 2014-01-13 at 14:32 -0600, Jim Nasby wrote:
Well, if we were to collaborate with the kernel community on this then
presumably we can do better than that for eviction... even to the
extent of "here's some data from this range in this file. It's (clean|
dirty). Put it in your cache. Just trust me on this."
   
   This should be the madvise() interface (with MADV_WILLNEED and
   MADV_DONTNEED) is there something in that interface that is
   insufficient?
  
  For one, postgres doesn't use mmap for files (and can't without major
  new interfaces).
 
 I understand, that's why you get double buffering: because we can't
 replace a page in the range you give us on read/write.  However, you
 don't have to switch entirely to mmap: you can use mmap/madvise
 exclusively for cache control and still use read/write (and still pay
 the double buffer penalty, of course).  It's only read/write with
 directio that would cause problems here (unless you're planning to
 switch to DIO?).
 

There are hazards with using mmap/madvise that may or may not be a problem
for them. I think these are well known but just in case;

mmap/munmap intensive workloads may get hammered on taking mmap_sem for
write. The greatest costs are incurred if the application is threaded
and the parallel threads are fault-intensive. I do not think this is the
case for PostgreSQL as it is process based but it is a concern. Even if it's
a single-threaded process, the cost of the mmap_sem cache line bouncing
can be a concern. Outside of that, the mmap/munmap paths are just really
costly and take a lot of work.

madvise has different hazards but lets take DONTNEED as an example because
it's the most likely candidate for use. A DONTNEED hint has three potential
downsides. The first is that mmap_sem taken for read can be very costly
for threaded applications as the cache line bounces. On NUMA machines it
can be a major problem for madvise-intensive workloads. The second is that
the page table teardown frees the pages with the associated costs but most
importantly, an IPI is required afterwards to flush the TLB. If that process
has been running on a lot of different CPUs then the IPI cost can be very
high. The third hazard is that a madvise(DONTNEED) region will incur page
faults on the next accesses again hammering into mmap_sem and all the faults
associated with faulting (allocating the same pages again, zeroing etc)

It may be the case that mmap/madvise is still required to handle the double
buffering problem but it's far from being a free lunch and it has costs
that read/write does not have to deal with. Maybe some of these problems
can be fixed or mitigated, but it is a case where a test case demonstrating
the problem would help, even if that requires patching PostgreSQL.

-- 
Mel Gorman
SUSE Labs


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Dave Chinner
On Tue, Jan 14, 2014 at 09:54:20PM -0600, Jim Nasby wrote:
 On 1/14/14, 3:41 PM, Dave Chinner wrote:
 On Tue, Jan 14, 2014 at 09:40:48AM -0500, Robert Haas wrote:
 On Mon, Jan 13, 2014 at 5:26 PM, Mel Gorman mgor...@suse.de
 wrote: Whether the problem is with the system call or the
 programmer is harder to determine.  I think the problem is in
 part that it's not exactly clear when we should call it.  So
 suppose we want to do a checkpoint.  What we used to do a long
 time ago is write everything, and then fsync it all, and then
 call it good.  But that produced horrible I/O storms.  So what
 we do now is do the writes over a period of time, with sleeps in
 between, and then fsync it all at the end, hoping that the
 kernel will write some of it before the fsyncs arrive so that we
 don't get a huge I/O spike.  And that sorta works, and it's
 definitely better than doing it all at full speed, but it's
 pretty imprecise.  If the kernel doesn't write enough of the
 data out in advance, then there's still a huge I/O storm when we
 do the fsyncs and everything grinds to a halt.  If it writes out
 more data than needed in advance, it increases the total number
 of physical writes because we get less write-combining, and that
 hurts performance, too.
 
 I think there's a pretty important bit that Robert didn't mention:
 we have a specific *time* target for when we want all the fsync's
 to complete. People that have problems here tend to tune
 checkpoints to complete every 5-15 minutes, and they want the
 write traffic for the checkpoint spread out over 90% of that time
 interval. To put it another way, fsync's should be done when 90%
 of the time to the next checkpoint hits, but preferably not a lot
 before then.

I think that is pretty much understood. I don't recall anyone
mentioning a typical checkpoint period, though, so knowing the
typical timeframe of IO storms and how much data is typically
written in a checkpoint helps us understand the scale of the
problem.

 It sounds to me like you want the kernel to start background
 writeback earlier so that it doesn't build up as much dirty data
 before you require a flush. There are several ways to do this by
 tweaking writeback knobs. The simplest is probably just to set
 /proc/sys/vm/dirty_background_bytes to an appropriate threshold
 (say 50MB) and dirty_expire_centiseconds to a few seconds so that
 background writeback starts and walks all dirty inodes almost
 immediately. This will keep a steady stream of low level
 background IO going, and fsync should then not take very long.
 
 Except that still won't throttle writes, right? That's the big
 issue here: our users often can't tolerate big spikes in IO
 latency. They want user requests to always happen within a
 specific amount of time.

Right, but that's a different problem and one that io scheduling
tweaks can have a major effect on. e.g. the deadline scheduler
should be able to provide a maximum upper bound on read IO latency
even while writes are in progress, though how successful it is is
dependent on the nature of the write load and the architecture of
the underlying storage.

However, the first problem is dealing with the IO storm problem on
fsync. Then we can measure the effect of spreading those writes out
in time and determine what triggers read starvations (if they are
apparent). The we can look at whether IO scheduling tweaks or
whether blk-io throttling solves those problems. Or whether
something else needs to be done to make it work in environments
where problems are manifesting.

FWIW [and I know you're probably sick of hearing this by now], but
the blk-io throttling works almost perfectly with applications that
use direct IO.

 So while delaying writes potentially reduces the total amount of
 data you're writing, users that run into problems here ultimately
 care more about ensuring that their foreground IO completes in a
 timely fashion.

Understood. Applications that crunch randomly through large data
sets are almost always read IO latency bound.

 Fundamentally, though, we need bug reports from people seeing
 these problems when they see them so we can diagnose them on
 their systems. Trying to discuss/diagnose these problems without
 knowing anything about the storage, the kernel version, writeback
 thresholds, etc really doesn't work because we can't easily
 determine a root cause.
 
 So is lsf...@linux-foundation.org the best way to accomplish that?

No. That is just the list for organising the LFSMM summit. ;)

For general pagecache and writeback issues, discussions, etc,
linux-fsde...@vger.kernel.org is the list to use. LKML simple has
too much noise to be useful these days, so I'd avoid it. Otherwise
the filesystem specific lists are are good place to get help for
specific problems (e.g. linux-e...@vger.kernel.org and
x...@oss.sgi.com). We tend to cross-post to other relevant lists as
triage moves into different areas of the storage stack.

 Also, along the lines of 

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Jan Kara
On Wed 15-01-14 10:27:26, Heikki Linnakangas wrote:
 On 01/15/2014 06:01 AM, Jim Nasby wrote:
 For the sake of completeness... it's theoretically silly that Postgres
 is doing all this stuff with WAL when the filesystem is doing something
 very similar with its journal. And an SSD drive (and next generation
 spinning rust) is doing the same thing *again* in its own journal.
 
 If all 3 communities (or even just 2 of them!) could agree on the
 necessary interface a tremendous amount of this duplicated technology
 could be eliminated.
 
 That said, I rather doubt the Postgres community would go this route,
 not so much because of the presumably massive changes needed, but more
 because our community is not a fan of restricting our users to things
 like "Thou shalt use a journaled FS or risk all thy data!"
 
 The WAL is also used for continuous archiving and replication, not
 just crash recovery. We could skip full-page-writes, though, if we
 knew that the underlying filesystem/storage is guaranteeing that a
 write() is atomic.
 
 It might be useful for PostgreSQL to somehow tell the filesystem that
 we're taking care of WAL-logging, so that the filesystem doesn't
 need to.
  Well, journalling fs generally cares about its metadata consistency. We
have much weaker guarantees regarding file data because those guarantees
come at a cost most people don't want to pay.

Filesystems could in theory provide facility like atomic write (at least up
to a certain size say in MB range) but it's not so easy and when there are
no strong usecases fs people are reluctant to make their code more complex
unnecessarily. OTOH without widespread atomic write support I understand
application developers have similar stance. So it's kind of chicken and egg
problem. BTW, e.g. ext3/4 has quite a bit of the infrastructure in place
due to its data=journal mode so if someone on the PostgreSQL side wanted to
research on this, knitting some experimental ext4 patches should be doable.

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Jan Kara
On Wed 15-01-14 12:16:50, Hannu Krosing wrote:
 On 01/14/2014 06:12 PM, Robert Haas wrote:
  This would be pretty similar to copy-on-write, except
  without the copying. It would just be
  forget-from-the-buffer-pool-on-write. 
 
 +1
 
 A version of this could probably already be implemented using MADV_DONTNEED
 and MADV_WILLNEED
 
 That is, just after reading the page in, use MADV_DONTNEED on it. When
 evicting
 a clean page, check that it is still in cache and if it is, then
 MADV_WILLNEED it.
 
 Another nice thing to do would be dynamically adjusting kernel
 dirty_background_ratio
 and other related knobs in real time based on how many buffers are dirty
 inside postgresql.
 Maybe in background writer.
 
 Question to LKML folks - will the kernel react well to frequent changes to
 /proc/sys/vm/dirty_* ?
 How frequent can they be (every few seconds? every second? 100 Hz?)
  So the question is what do you mean by 'react'. We check whether we
should start background writeback every dirty_writeback_centisecs (5s). We
will also check whether we didn't exceed the background dirty limit (and
wake writeback thread) when dirtying pages. However this check happens once
per several dirtied MB (unless we are close to dirty_bytes).

When writeback is running we check roughly once per second (the logic is
more complex there but I don't think explaining details would be useful
here) whether we are below dirty_background_bytes and stop writeback in
that case.

So changing dirty_background_bytes every few seconds should work
reasonably, once a second is pushing it and 100 Hz - no way. But I'd also
note that you have conflicting requirements on the kernel writeback. On one
hand you want checkpoint data to steadily trickle to disk (well, trickle
isn't exactly the proper word since if you need to checkpoint 16 GB every 5
minutes then you need a steady throughput of ~50 MB/s just for
checkpointing) so you want to set dirty_background_bytes low, on the other
hand you don't want temporary files to get to disk so you want to set
dirty_background_bytes high. And also that changes of
dirty_background_bytes probably will not take into account other events
happening on the system (maybe a DB backup is running...). So I'm somewhat
skeptical you will be able to tune dirty_background_bytes frequently in a
useful way.
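
For anyone experimenting with this, all the knobs discussed here can be
inspected in one go with:

sysctl vm.dirty_background_bytes vm.dirty_background_ratio \
       vm.dirty_bytes vm.dirty_ratio \
       vm.dirty_expire_centisecs vm.dirty_writeback_centisecs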

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Hannu Krosing
On 01/15/2014 02:01 PM, Jan Kara wrote:
 On Wed 15-01-14 12:16:50, Hannu Krosing wrote:
 On 01/14/2014 06:12 PM, Robert Haas wrote:
 This would be pretty similar to copy-on-write, except
 without the copying. It would just be
 forget-from-the-buffer-pool-on-write. 
 +1

 A version of this could probably already be implemented using MADV_DONTNEED
 and MADV_WILLNEED

 That is, just after reading the page in, use MADV_DONTNEED on it. When
 evicting
 a clean page, check that it is still in cache and if it is, then
 MADV_WILLNEED it.

 Another nice thing to do would be dynamically adjusting kernel
 dirty_background_ratio
 and other related knobs in real time based on how many buffers are dirty
 inside postgresql.
 Maybe in background writer.

 Question to LKML folks - will the kernel react well to frequent changes to
 /proc/sys/vm/dirty_* ?
 How frequent can they be (every few seconds? every second? 100 Hz?)
   So the question is what do you mean by 'react'. We check whether we
 should start background writeback every dirty_writeback_centisecs (5s). We
 will also check whether we didn't exceed the background dirty limit (and
 wake writeback thread) when dirtying pages. However this check happens once
 per several dirtied MB (unless we are close to dirty_bytes).

 When writeback is running we check roughly once per second (the logic is
 more complex there but I don't think explaining details would be useful
 here) whether we are below dirty_background_bytes and stop writeback in
 that case.

 So changing dirty_background_bytes every few seconds should work
 reasonably, once a second is pushing it and 100 Hz - no way. But I'd also
 note that you have conflicting requirements on the kernel writeback. On one
 hand you want checkpoint data to steadily trickle to disk (well, trickle
 isn't exactly the proper word since if you need to checkpoing 16 GB every 5
 minutes than you need a steady throughput of ~50 MB/s just for
 checkpointing) so you want to set dirty_background_bytes low, on the other
 hand you don't want temporary files to get to disk so you want to set
 dirty_background_bytes high. 
Is it possible to have more fine-grained control over writeback, like
configuring dirty_background_bytes per file system / device (or even
a file or a group of files) ?

If not, then how hard would it be to provide this ?

This is a bit backwards from the keeping-the-cache-clean perspective,
but would help a lot with hinting the writer that a big sync is coming.
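
The closest existing thing seems to be the per-BDI sysfs knobs, which allow
only ratios of the global limit rather than bytes (sketch; 8:0 is sda's
major:minor):

ls /sys/class/bdi/                     # one directory per backing device
echo 5 > /sys/class/bdi/8:0/max_ratio  # cap that device at 5% of the dirty limit

Nothing exists at per-file granularity, though.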

 And also that changes of
 dirty_background_bytes probably will not take into account other events
 happening on the system (maybe a DB backup is running...). So I'm somewhat
 skeptical you will be able to tune dirty_background_bytes frequently in a
 useful way.



Cheers

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Robert Haas
On Tue, Jan 14, 2014 at 4:23 PM, James Bottomley
james.bottom...@hansenpartnership.com wrote:
 Yes, that's what I was thinking: it's a cache.  About how many files
 comprise this cache?  Are you thinking it's too difficult for every
 process to map the files?

No, I'm thinking that would throw cache coherency out the window.
Separate mappings are all well and good until somebody decides to
modify the page, but after that point the database processes need to
see the modified version of the page (which is, further, hedged about
with locks) yet the operating system MUST NOT see the modified version
of the page until the write-ahead log entry for the page modification
has been flushed to disk.  There's really no way to do that without
having our own private cache.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Robert Haas
On Tue, Jan 14, 2014 at 5:23 PM, Dave Chinner da...@fromorbit.com wrote:
 By default, background writeback doesn't start until 10% of memory
 is dirtied, and on your machine that's 25GB of RAM. That's way too
 high for your workload.

 It appears to me that we are seeing large memory machines much more
 commonly in data centers - a couple of years ago 256GB RAM was only
 seen in supercomputers. Hence machines of this size are moving from
 the "tweaking settings for supercomputers is OK" class to the
 "tweaking settings for enterprise servers is not OK" class.

 Perhaps what we need to do is deprecate dirty_ratio and
 dirty_background_ratio as the defaults and move to the byte
 based values as the defaults, capped appropriately, e.g.
 10/20% of RAM for small machines down to a couple of GB for large
 machines.

I think that's right.  In our case we know we're going to call fsync()
eventually and that's going to produce a torrent of I/O.  If that
torrent fits in downstream caches or can be satisfied quickly without
disrupting the rest of the system too much, then life is good.  But
the downstream caches don't typically grow proportionately to the size
of system memory.  Maybe a machine with 16GB has 1GB of battery-backed
write cache, but it doesn't follow that 256GB machine has 16GB of
battery-backed write cache.

 Essentially, changing dirty_background_bytes, dirty_bytes and
 dirty_expire_centiseconds to be much smaller should make the kernel
 start writeback much sooner and so you shouldn't have to limit the
 amount of buffers the application has to prevent major fsync
 triggered stalls...

I think this has been tried with some success, but I don't know the
details.  I think the bytes values are clearly more useful than the
percentages, because you can set them smaller and with better
granularity.
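
For example, on a machine whose controller has a 1GB battery-backed write
cache, something like this (illustrative numbers only) keeps the kernel's
dirty backlog within what the cache can absorb:

sysctl -w vm.dirty_background_bytes=$((256 * 1024 * 1024))
sysctl -w vm.dirty_bytes=$((1024 * 1024 * 1024))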

One thought that occurs to me is that it might be useful to have
PostgreSQL tell the system when we expect to perform an fsync.
Imagine fsync_is_coming(int fd, time_t).  We know long in advance
(minutes) when we're gonna do it, so in some sense what we'd like to
tell the kernel is: we're not in a hurry to get this data on disk
right now, but when the indicated time arrives, we are going to do
fsyncs of a bunch of files in rapid succession, so please arrange to
flush the data as close to that time as possible (to maximize
write-combining) while still finishing by that time (so that the
fsyncs are fast and more importantly so that they don't cause a
system-wide stall).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Heikki Linnakangas

On 01/15/2014 07:50 AM, Dave Chinner wrote:

However, the first problem is dealing with the IO storm problem on
fsync. Then we can measure the effect of spreading those writes out
in time and determine what triggers read starvations (if they are
apparent). The we can look at whether IO scheduling tweaks or
whether blk-io throttling solves those problems. Or whether
something else needs to be done to make it work in environments
where problems are manifesting.

FWIW [and I know you're probably sick of hearing this by now], but
the blk-io throttling works almost perfectly with applications that
use direct IO.


For checkpoint writes, direct I/O actually would be reasonable. 
Bypassing the OS cache is a good thing in that case - we don't want the 
written pages to evict other pages from the OS cache, as we already have 
them in the PostgreSQL buffer cache.


Writing one page at a time with O_DIRECT from a single process might be 
quite slow, so we'd probably need to use writev() or asynchronous I/O to 
work around that.


We'd still need to issue an fsync() to flush any already-written pages 
from the OS cache to disk, though.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Tom Lane
Heikki Linnakangas hlinnakan...@vmware.com writes:
 On 01/15/2014 07:50 AM, Dave Chinner wrote:
 FWIW [and I know you're probably sick of hearing this by now], but
 the blk-io throttling works almost perfectly with applications that
 use direct IO.

 For checkpoint writes, direct I/O actually would be reasonable. 
 Bypassing the OS cache is a good thing in that case - we don't want the 
 written pages to evict other pages from the OS cache, as we already have 
 them in the PostgreSQL buffer cache.

But in exchange for that, we'd have to deal with selecting an order to
write pages that's appropriate depending on the filesystem layout,
other things happening in the system, etc etc.  We don't want to build
an I/O scheduler, IMO, but we'd have to.

 Writing one page at a time with O_DIRECT from a single process might be 
 quite slow, so we'd probably need to use writev() or asynchronous I/O to 
 work around that.

Yeah, and if the system has multiple spindles, we'd need to be issuing
multiple O_DIRECT writes concurrently, no?

What we'd really like for checkpointing is to hand the kernel a boatload
(several GB) of dirty pages and say "how about you push all this to disk
over the next few minutes, in whatever way seems optimal given the storage
hardware and system situation.  Let us know when you're done."  Right now,
because there's no way to negotiate such behavior, we're reduced to having
to dribble out the pages (in what's very likely a non-optimal order) and
hope that the kernel is neither too lazy nor too aggressive about cleaning
dirty pages in its caches.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Robert Haas
On Wed, Jan 15, 2014 at 4:35 AM, Jan Kara j...@suse.cz wrote:
 Filesystems could in theory provide facility like atomic write (at least up
 to a certain size say in MB range) but it's not so easy and when there are
 no strong usecases fs people are reluctant to make their code more complex
 unnecessarily. OTOH without widespread atomic write support I understand
 application developers have similar stance. So it's kind of chicken and egg
 problem. BTW, e.g. ext3/4 has quite a bit of the infrastructure in place
 due to its data=journal mode so if someone on the PostgreSQL side wanted to
 research on this, knitting some experimental ext4 patches should be doable.

Atomic 8kB writes would improve performance for us quite a lot.  Full
page writes to WAL are very expensive.  I don't remember what
percentage of write-ahead log traffic that accounts for, but it's not
small.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Stephen Frost
* Claudio Freire (klaussfre...@gmail.com) wrote:
 Yes, that's basically zero-copy reads.
 
 It could be done. The kernel can remap the page to the physical page
 holding the shared buffer and mark it read-only, then expire the
 buffer and transfer ownership of the page if any page fault happens.
 
 But that incurrs:
  - Page faults, lots
  - Hugely bloated mappings, unless KSM is somehow leveraged for this

The page faults might be a problem but might be worth it.  Bloated
mappings sounds like a real issue though.

 And there's a nice bingo. Had forgotten about KSM. KSM could help lots.
 
 I could try to see of madvising shared_buffers as mergeable helps. But
 this should be an automatic case of KSM - ie, when reading into a
 page-aligned address, the kernel should summarily apply KSM-style
 sharing without hinting. The current madvise interface puts the burden
 of figuring out what duplicates what on the kernel, but postgres
 already knows.

I'm certainly curious as to if KSM could help here, but on Ubuntu 12.04
with 3.5.0-23-generic, it's not doing anything with just PG running.
The page here: http://www.linux-kvm.org/page/KSM seems to indicate why:


KSM is a memory-saving de-duplication feature, that merges anonymous
(private) pages (not pagecache ones).


Looks like it won't merge between pagecache and private/application
memory?  Or is it just that we're not madvise()'ing the shared buffers
region?  I'd be happy to test doing that, if there's a chance it'll
actually work..

Thanks,

Stephen




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Claudio Freire
On Wed, Jan 15, 2014 at 1:35 PM, Stephen Frost sfr...@snowman.net wrote:
 And there's a nice bingo. Had forgotten about KSM. KSM could help lots.

 I could try to see if madvising shared_buffers as mergeable helps. But
 this should be an automatic case of KSM - ie, when reading into a
 page-aligned address, the kernel should summarily apply KSM-style
 sharing without hinting. The current madvise interface puts the burden
 of figuring out what duplicates what on the kernel, but postgres
 already knows.

 I'm certainly curious as to whether KSM could help here, but on Ubuntu 12.04
 with 3.5.0-23-generic, it's not doing anything with just PG running.
 The page here: http://www.linux-kvm.org/page/KSM seems to indicate why:

 
 KSM is a memory-saving de-duplication feature, that merges anonymous
 (private) pages (not pagecache ones).
 

 Looks like it won't merge between pagecache and private/application
 memory?  Or is it just that we're not madvise()'ing the shared buffers
 region?  I'd be happy to test doing that, if there's a chance it'll
 actually work..


Yes, it's only *intended* for merging private memory.

But, still, the implementation is very similar to what postgres needs:
sharing a physical page for two distinct logical pages, efficiently,
with efficient copy-on-write.

So it'd be just a matter of removing that limitation regarding page
cache and shared pages.

If you asked me, I'd implement it as copy-on-write on the page cache
(not the user page). That ought to be low-overhead.




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Stephen Frost
* Claudio Freire (klaussfre...@gmail.com) wrote:
 But, still, the implementation is very similar to what postgres needs:
 sharing a physical page for two distinct logical pages, efficiently,
 with efficient copy-on-write.

Agreed, except that KSM seems like it'd be slow/lazy about it and I'm
guessing there's a reason the pagecache isn't included normally..

 So it'd be just a matter of removing that limitation regarding page
 cache and shared pages.

Any idea why that limitation is there?

 If you asked me, I'd implement it as copy-on-write on the page cache
 (not the user page). That ought to be low-overhead.

Not entirely sure I'm following this- if it's a shared page, it doesn't
matter who starts writing to it; as soon as that happens, it needs to get
copied.  Perhaps you mean that the application should keep the
original and that the page-cache should get the copy (or, really,
perhaps just forget about the page existing at that point- we won't want
it again...).

Would that be a way to go, perhaps?  This does go back to the make it
act like mmap, but not *be* mmap, but the idea would be:

open(..., O_ZEROCOPY_READ)
read() -> goes to PG's shared buffers; pagecache and PG share the page
page fault (PG writes to it) -> pagecache forgets about the page
write() / fsync() -> operate as normal

The differences here from O_DIRECT are that the pagecache will keep the
page while clean (absolutely valuable from PG's perspective- we might
have to evict the page from shared buffers sooner than the kernel does),
and the write()'s happen at the kernel's pace, allowing for
write-combining, etc, until an fsync() happens, of course.
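
A purely hypothetical sketch of that flow; O_ZEROCOPY_READ names the proposed
semantics and exists in no kernel, so it is stubbed to 0 here (making this
plain buffered IO):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    #ifndef O_ZEROCOPY_READ
    #define O_ZEROCOPY_READ 0       /* placeholder for the proposed flag */
    #endif

    void touch_one_page(const char *path, char *buf, off_t off)
    {
        int fd = open(path, O_RDWR | O_ZEROCOPY_READ);
        if (fd < 0)
            return;
        /* read(): under the proposal, the pagecache and PG would share the
         * physical page backing buf, read-only. */
        pread(fd, buf, 8192, off);
        /* The first store faults; the pagecache would forget its copy
         * rather than copy-on-write. */
        buf[0] ^= 1;
        /* write()/fsync() then operate exactly as they do today. */
        pwrite(fd, buf, 8192, off);
        fsync(fd);
        close(fd);
    }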

This isn't the big win of dealing with I/O issues during checkpoints
that we'd like to see, but it certainly feels like it'd be an
improvement over the current double-buffering situation at least.

Thanks,

Stephen




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Claudio Freire
On Wed, Jan 15, 2014 at 3:41 PM, Stephen Frost sfr...@snowman.net wrote:
 * Claudio Freire (klaussfre...@gmail.com) wrote:
 But, still, the implementation is very similar to what postgres needs:
 sharing a physical page for two distinct logical pages, efficiently,
 with efficient copy-on-write.

 Agreed, except that KSM seems like it'd be slow/lazy about it and I'm
 guessing there's a reason the pagecache isn't included normally..

KSM does an active de-duplication. That's slow. This would be
leveraging KSM structures in the kernel (page sharing) but without all
the de-duplication logic.


 So it'd be just a matter of removing that limitation regarding page
 cache and shared pages.

 Any idea why that limitation is there?

No, but I'm guessing it's because nobody bothered to implement the
required copy-on-write in the page cache, which would be a PITA to
write - think of all the complexities with privilege checks and
everything - even though the benefits for many kinds of applications
would be important.

 If you asked me, I'd implement it as copy-on-write on the page cache
 (not the user page). That ought to be low-overhead.

 Not entirely sure I'm following this- if it's a shared page, it doesn't
 matter who starts writing to it; as soon as that happens, it needs to get
 copied.  Perhaps you mean that the application should keep the
 original and that the page-cache should get the copy (or, really,
 perhaps just forget about the page existing at that point- we won't want
 it again...).

 Would that be a way to go, perhaps?  This does go back to the make it
 act like mmap, but not *be* mmap, but the idea would be:
 open(..., O_ZEROCOPY_READ)
 read() -> goes to PG's shared buffers; pagecache and PG share the page
 page fault (PG writes to it) -> pagecache forgets about the page
 write() / fsync() -> operate as normal

Yep.




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Jan Kara
On Wed 15-01-14 14:38:44, Hannu Krosing wrote:
 On 01/15/2014 02:01 PM, Jan Kara wrote:
  On Wed 15-01-14 12:16:50, Hannu Krosing wrote:
  On 01/14/2014 06:12 PM, Robert Haas wrote:
  This would be pretty similar to copy-on-write, except
  without the copying. It would just be
  forget-from-the-buffer-pool-on-write. 
  +1
 
  A version of this could probably already be implemented using MADV_DONTNEED
  and MADV_WILLNEED
 
  That is, just after reading the page in, use MADV_DONTNEED on it. When
  evicting
  a clean page, check that it is still in cache and if it is, then
  MADV_WILLNEED it.
 
  Another nice thing to do would be dynamically adjusting kernel
  dirty_background_ratio
  and other related knobs in real time based on how many buffers are dirty
  inside postgresql.
  Maybe in background writer.
 
  Question to LKM folks - will kernel react well to frequent changes to
  /proc/sys/vm/dirty_*  ?
  How frequent can they be (every few seconds? every second? 100 Hz?)
So the question is what do you mean by 'react'. We check whether we
  should start background writeback every dirty_writeback_centisecs (5s). We
  will also check whether we didn't exceed the background dirty limit (and
  wake writeback thread) when dirtying pages. However this check happens once
  per several dirtied MB (unless we are close to dirty_bytes).
 
  When writeback is running we check roughly once per second (the logic is
  more complex there but I don't think explaining details would be useful
  here) whether we are below dirty_background_bytes and stop writeback in
  that case.
 
  So changing dirty_background_bytes every few seconds should work
  reasonably, once a second is pushing it and 100 Hz - no way. But I'd also
  note that you have conflicting requirements on the kernel writeback. On one
  hand you want checkpoint data to steadily trickle to disk (well, trickle
  isn't exactly the proper word since if you need to checkpoint 16 GB every 5
  minutes then you need a steady throughput of ~50 MB/s just for
  checkpointing) so you want to set dirty_background_bytes low, on the other
  hand you don't want temporary files to get to disk so you want to set
  dirty_background_bytes high. 
 Is it possible to have more fine-grained control over writeback, like
 configuring dirty_background_bytes per file system / device (or even
 a file or a group of files) ?
  Currently it isn't possible to tune dirty_background_bytes per device
directly. However see below.

 If not, then how hard would it be to provide this ?
  We do track amount of dirty pages per device and the thread doing the
flushing is also per device. The thing is that currently we compute the
per-device background limit as dirty_background_bytes * p, where p is a
proportion of writeback happening on this device to total writeback in the
system (computed as floating average with exponential time-based backoff).
BTW, similarly maximum per-device dirty limit is derived from global
dirty_bytes in the same way. And you can also set bounds on the proportion
'p' in /sys/block/sda/bdi/{min,max}_ratio so in theory you should be able
to set fixed background limit for a device by setting matching min and max
proportions.
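
A small sketch of that tuning, with illustrative values and assuming the data
directory sits on sda; writing these knobs needs root, which ties into the
permissions point raised elsewhere in the thread:

    #include <stdio.h>

    static int write_knob(const char *path, const char *val)
    {
        FILE *f = fopen(path, "w");
        if (f == NULL)
            return -1;
        int ok = fputs(val, f) >= 0;
        return (fclose(f) == 0 && ok) ? 0 : -1;
    }

    int main(void)
    {
        /* Pin the device's proportion 'p' by making min and max match. */
        write_knob("/sys/block/sda/bdi/min_ratio", "20");
        write_knob("/sys/block/sda/bdi/max_ratio", "20");
        /* Global backstop; per the cadence above, retuning this every few
         * seconds is reasonable, 100 Hz is not. */
        write_knob("/proc/sys/vm/dirty_background_bytes", "268435456");
        return 0;
    }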

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Jeff Janes
On Wed, Jan 15, 2014 at 7:12 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 Heikki Linnakangas hlinnakan...@vmware.com writes:
  On 01/15/2014 07:50 AM, Dave Chinner wrote:
  FWIW [and I know you're probably sick of hearing this by now], but
  the blk-io throttling works almost perfectly with applications that
  use direct IO.

  For checkpoint writes, direct I/O actually would be reasonable.
  Bypassing the OS cache is a good thing in that case - we don't want the
  written pages to evict other pages from the OS cache, as we already have
  them in the PostgreSQL buffer cache.

 But in exchange for that, we'd have to deal with selecting an order to
 write pages that's appropriate depending on the filesystem layout,
 other things happening in the system, etc etc.  We don't want to build
 an I/O scheduler, IMO, but we'd have to.

  Writing one page at a time with O_DIRECT from a single process might be
  quite slow, so we'd probably need to use writev() or asynchronous I/O to
  work around that.

 Yeah, and if the system has multiple spindles, we'd need to be issuing
 multiple O_DIRECT writes concurrently, no?


writev effectively does do that, doesn't it?  But they do have to be on the
same file handle, so that could be a problem.  I think we need something
like sorted checkpoints sooner or later, anyway.
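
A hedged sketch of what sorted checkpoints might look like, with illustrative
structures: order the dirty buffers by file and block before writing, so each
file sees ascending offsets and gets fsync'd once at the end.

    #include <stdint.h>
    #include <stdlib.h>

    typedef struct { uint32_t relfile; uint32_t block; } BufferTag;

    static int tag_cmp(const void *a, const void *b)
    {
        const BufferTag *x = a, *y = b;
        if (x->relfile != y->relfile)
            return x->relfile < y->relfile ? -1 : 1;
        if (x->block != y->block)
            return x->block < y->block ? -1 : 1;
        return 0;
    }

    void sort_checkpoint_writes(BufferTag *dirty, size_t n)
    {
        qsort(dirty, n, sizeof(BufferTag), tag_cmp);
        /* ...then write in this order, fsyncing each file once at the end. */
    }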



 What we'd really like for checkpointing is to hand the kernel a boatload
 (several GB) of dirty pages and say how about you push all this to disk
 over the next few minutes, in whatever way seems optimal given the storage
 hardware and system situation.  Let us know when you're done.


And most importantly, "Also, please don't freeze up everything else in the
process".

Cheers,

Jeff


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Tom Lane
Dave Chinner da...@fromorbit.com writes:
 On Wed, Jan 15, 2014 at 10:12:38AM -0500, Tom Lane wrote:
 What we'd really like for checkpointing is to hand the kernel a boatload
 (several GB) of dirty pages and say how about you push all this to disk
 over the next few minutes, in whatever way seems optimal given the storage
 hardware and system situation.  Let us know when you're done.

 The issue there is that the kernel has other triggers for needing to
 clean data. We have no infrastructure to handle variable writeback
 deadlines at the moment, nor do we have any infrastructure to do
 roughly metered writeback of such files to disk. I think we could
 add it to the infrastructure without too much perturbation of the
 code, but as you've pointed out that still leaves the fact there's
 no obvious interface to configure such behaviour. Would it need to
 be persistent?

No, we'd be happy to re-request it during each checkpoint cycle, as
long as that wasn't an unduly expensive call to make.  I'm not quite
sure where such requests ought to live though.  One idea is to tie
them to file descriptors; but the data to be written might be spread
across more files than we really want to keep open at one time.
But the only other idea that comes to mind is some kind of global sysctl,
which would probably have security and permissions issues.  (One thing
that hasn't been mentioned yet in this thread, but maybe is worth pointing
out now, is that Postgres does not run as root, and definitely doesn't
want to.  So we don't want a knob that would require root permissions
to twiddle.)  We could probably live with serially checkpointing data
in sets of however-many-files-we-can-have-open, if file descriptors are
the place to keep the requests.

regards, tom lane




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Tom Lane
Dave Chinner da...@fromorbit.com writes:
 On Wed, Jan 15, 2014 at 02:29:40PM -0800, Jeff Janes wrote:
 And most importantly, "Also, please don't freeze up everything else in the
 process".

 If you hand writeback off to the kernel, then writeback for memory
 reclaim needs to take precedence over metered writeback. If we are
 low on memory, then cleaning dirty memory quickly to avoid ongoing
 allocation stalls, failures and potentially OOM conditions is far more
 important than anything else.

I think you're in violent agreement, actually.  Jeff's point is exactly
that we'd rather the checkpoint deadline slid than that the system goes
to hell in a handbasket for lack of I/O cycles.  Here "metered" really
means "do it as a low-priority task".

regards, tom lane




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Robert Haas
On Wed, Jan 15, 2014 at 7:22 PM, Dave Chinner da...@fromorbit.com wrote:
 No, I meant the opposite - in low memory situations, the system is
 going to go to hell in a handbasket because we are going to cause a
 writeback IO storm cleaning memory regardless of these IO
 priorities. i.e. there is no way we'll let low priority writeback
 to avoid IO storms cause OOM conditions to occur. That is, in OOM
 conditions, cleaning dirty pages becomes one of the highest priority
 tasks of the system

I don't see that as a problem.  What we're struggling with today is
that, until we fsync(), the system is too lazy about writing back
dirty pages.  And then when we fsync(), it becomes very aggressive and
system-wide throughput goes into the tank.  What we're aiming to do
here is to start the writeback sooner than it would otherwise
start, so that it is spread out over a longer period of time.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Tom Lane
Dave Chinner da...@fromorbit.com writes:
 On Wed, Jan 15, 2014 at 07:08:18PM -0500, Tom Lane wrote:
 No, we'd be happy to re-request it during each checkpoint cycle, as
 long as that wasn't an unduly expensive call to make.  I'm not quite
 sure where such requests ought to live though.  One idea is to tie
 them to file descriptors; but the data to be written might be spread
 across more files than we really want to keep open at one time.

 It would be a property of the inode, as that is how writeback is
 tracked and timed. Set and queried through a file descriptor,
 though - it's basically the same context that fadvise works
 through.

Ah, got it.  That would be fine on our end, I think.

 We could probably live with serially checkpointing data
 in sets of however-many-files-we-can-have-open, if file descriptors are
 the place to keep the requests.

 Inodes live longer than file descriptors, but there's no guarantee
 that they live from one fd context to another. Hence my question
 about persistence ;)

I plead ignorance about what an fd context is.  However, if what you're
saying is that there's a small chance of the kernel forgetting the request
during normal system operation, I think we could probably tolerate that,
if the API is designed so that we ultimately do an fsync on the file
anyway.  The point of the hint would be to try to ensure that the later
fsync had little to do.  If sometimes it didn't work, well, that's life.
We're ahead of the game as long as it usually works.

regards, tom lane




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 I don't see that as a problem.  What we're struggling with today is
 that, until we fsync(), the system is too lazy about writing back
 dirty pages.  And then when we fsync(), it becomes very aggressive and
 system-wide throughput goes into the tank.  What we're aiming to do
 here is to start the writeback sooner than it would otherwise
 start, so that it is spread out over a longer period of time.

Yeah.  It's sounding more and more like the right semantics are to
give the kernel a hint that we're going to fsync these files later,
so it ought to get on with writing them anytime the disk has nothing
better to do.  I'm not sure if there's value in being specific about
how much later; that would probably depend on details of the scheduler
that I don't know.
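
For comparison, Linux does already expose a weaker form of this hint:
sync_file_range(2) (Linux 2.6.17 and later) starts writeback of a range
without waiting for it, leaving the durability guarantee to the later
fsync(). A minimal sketch:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    void hint_then_fsync(int fd, off_t offset, off_t nbytes)
    {
        /* Begin writeback now, asynchronously: "get on with it". */
        sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);

        /* ... other checkpoint work happens here ... */

        fsync(fd);      /* should now find little left to do */
    }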

regards, tom lane




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Dave Chinner
On Wed, Jan 15, 2014 at 10:12:38AM -0500, Tom Lane wrote:
 Heikki Linnakangas hlinnakan...@vmware.com writes:
  On 01/15/2014 07:50 AM, Dave Chinner wrote:
  FWIW [and I know you're probably sick of hearing this by now], but
  the blk-io throttling works almost perfectly with applications that
  use direct IO.
 
  For checkpoint writes, direct I/O actually would be reasonable. 
  Bypassing the OS cache is a good thing in that case - we don't want the 
  written pages to evict other pages from the OS cache, as we already have 
  them in the PostgreSQL buffer cache.
 
 But in exchange for that, we'd have to deal with selecting an order to
 write pages that's appropriate depending on the filesystem layout,
 other things happening in the system, etc etc.  We don't want to build
 an I/O scheduler, IMO, but we'd have to.

I don't see that as necessary - nobody else needs to do this with
direct IO. Indeed, if the application does ascending offset order
writeback from within a file, then it's replicating exactly what the
kernel page cache writeback does. If what the kernel does is good
enough for you, then I can't see how doing the same thing with
a background thread doing direct IO is going to need any special
help

  Writing one page at a time with O_DIRECT from a single process might be 
  quite slow, so we'd probably need to use writev() or asynchronous I/O to 
  work around that.
 
 Yeah, and if the system has multiple spindles, we'd need to be issuing
 multiple O_DIRECT writes concurrently, no?
 
 What we'd really like for checkpointing is to hand the kernel a boatload
 (several GB) of dirty pages and say how about you push all this to disk
 over the next few minutes, in whatever way seems optimal given the storage
 hardware and system situation.  Let us know when you're done.

The issue there is that the kernel has other triggers for needing to
clean data. We have no infrastructure to handle variable writeback
deadlines at the moment, nor do we have any infrastructure to do
roughly metered writeback of such files to disk. I think we could
add it to the infrastructure without too much perturbation of the
code, but as you've pointed out that still leaves the fact there's
no obvious interface to configure such behaviour. Would it need to
be persistent?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Dave Chinner
On Wed, Jan 15, 2014 at 02:29:40PM -0800, Jeff Janes wrote:
 On Wed, Jan 15, 2014 at 7:12 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 
  Heikki Linnakangas hlinnakan...@vmware.com writes:
   On 01/15/2014 07:50 AM, Dave Chinner wrote:
   FWIW [and I know you're probably sick of hearing this by now], but
   the blk-io throttling works almost perfectly with applications that
   use direct IO.
 
   For checkpoint writes, direct I/O actually would be reasonable.
   Bypassing the OS cache is a good thing in that case - we don't want the
   written pages to evict other pages from the OS cache, as we already have
   them in the PostgreSQL buffer cache.
 
  But in exchange for that, we'd have to deal with selecting an order to
  write pages that's appropriate depending on the filesystem layout,
  other things happening in the system, etc etc.  We don't want to build
  an I/O scheduler, IMO, but we'd have to.
 
   Writing one page at a time with O_DIRECT from a single process might be
   quite slow, so we'd probably need to use writev() or asynchronous I/O to
   work around that.
 
  Yeah, and if the system has multiple spindles, we'd need to be issuing
  multiple O_DIRECT writes concurrently, no?
 
 
 writev effectively does do that, doesn't it?  But they do have to be on the
 same file handle, so that could be a problem.  I think we need something
 like sorted checkpoints sooner or later, anyway.

No, it doesn't. writev() allows you to supply multiple user buffers
for a single IO at a fixed offset. If the file is contiguous, then it
will be issued as a single IO. If you want concurrent DIO, then you
need to use multiple threads or AIO.
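
A hedged sketch of the AIO route (Linux native AIO via libaio, build with
-laio): several O_DIRECT writes in flight at once. O_DIRECT needs
block-aligned buffers; error handling is trimmed for brevity.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <libaio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define NWRITES 4
    #define BLKSZ   8192

    int concurrent_dio_writes(const char *path)
    {
        io_context_t ctx = 0;
        struct iocb cbs[NWRITES], *cbp[NWRITES];
        struct io_event ev[NWRITES];
        void *buf[NWRITES];

        int fd = open(path, O_WRONLY | O_DIRECT);
        if (fd < 0 || io_setup(NWRITES, &ctx) < 0)
            return -1;
        for (int i = 0; i < NWRITES; i++) {
            posix_memalign(&buf[i], BLKSZ, BLKSZ);   /* O_DIRECT alignment */
            io_prep_pwrite(&cbs[i], fd, buf[i], BLKSZ, (long long)i * BLKSZ);
            cbp[i] = &cbs[i];
        }
        if (io_submit(ctx, NWRITES, cbp) != NWRITES) /* all in flight at once */
            return -1;
        io_getevents(ctx, NWRITES, NWRITES, ev, NULL); /* wait for completion */
        io_destroy(ctx);
        close(fd);
        return 0;
    }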

  What we'd really like for checkpointing is to hand the kernel a boatload
  (several GB) of dirty pages and say how about you push all this to disk
  over the next few minutes, in whatever way seems optimal given the storage
  hardware and system situation.  Let us know when you're done.
 
 And most importantly, "Also, please don't freeze up everything else in the
 process".

If you hand writeback off to the kernel, then writeback for memory
reclaim needs to take precedence over metered writeback. If we are
low on memory, then cleaning dirty memory quickly to avoid ongoing
allocation stalls, failures and potentially OOM conditions is far more
important than anything else.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Dave Chinner
On Wed, Jan 15, 2014 at 10:12:38AM -0500, Robert Haas wrote:
 On Wed, Jan 15, 2014 at 4:35 AM, Jan Kara j...@suse.cz wrote:
  Filesystems could in theory provide facility like atomic write (at least up
  to a certain size say in MB range) but it's not so easy and when there are
  no strong usecases fs people are reluctant to make their code more complex
  unnecessarily. OTOH without widespread atomic write support I understand
  application developers have similar stance. So it's kind of chicken and egg
  problem. BTW, e.g. ext3/4 has quite a bit of the infrastructure in place
  due to its data=journal mode so if someone on the PostgreSQL side wanted to
  research on this, knitting some experimental ext4 patches should be doable.
 
 Atomic 8kB writes would improve performance for us quite a lot.  Full
 page writes to WAL are very expensive.  I don't remember what
 percentage of write-ahead log traffic that accounts for, but it's not
 small.

Essentially, the atomic writes will be journalled data,
so initially there is not going to be any difference in performance
between journalling the data in userspace and journalling it in the
filesystem journal. Indeed, it could be worse because the filesystem
journal is typically much smaller than a database WAL file, and it
will flush much more frequently and without the database having any
say in when that occurs.

AFAICT, we're stuck with sucky WAL until block layer and hardware
support atomic writes.

FWIW, I've certainly considered adding per-file data journalling
capabilities to XFS in the past. If we decide that this is the way
to proceed (i.e. as a stepping stone towards hardware atomic write
support), then I can go back to my notes from a few years ago and
see what still needs to be done to support it

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Dave Chinner
On Wed, Jan 15, 2014 at 07:13:27PM -0500, Tom Lane wrote:
 Dave Chinner da...@fromorbit.com writes:
  On Wed, Jan 15, 2014 at 02:29:40PM -0800, Jeff Janes wrote:
  And most importantly, "Also, please don't freeze up everything else in the
  process".
 
  If you hand writeback off to the kernel, then writeback for memory
  reclaim needs to take precedence over metered writeback. If we are
  low on memory, then cleaning dirty memory quickly to avoid ongoing
  allocation stalls, failures and potentially OOM conditions is far more
  important than anything else.
 
 I think you're in violent agreement, actually.  Jeff's point is exactly
 that we'd rather the checkpoint deadline slid than that the system goes
 to hell in a handbasket for lack of I/O cycles.  Here "metered" really
 means "do it as a low-priority task".

No, I meant the opposite - in low memory situations, the system is
going to go to hell in a handbasket because we are going to cause a
writeback IO storm cleaning memory regardless of these IO
priorities. i.e. there is no way we'll let low priority writeback
to avoid IO storms cause OOM conditions to occur. That is, in OOM
conditions, cleaning dirty pages becomes one of the highest priority
tasks of the system

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Dave Chinner
On Wed, Jan 15, 2014 at 07:08:18PM -0500, Tom Lane wrote:
 Dave Chinner da...@fromorbit.com writes:
  On Wed, Jan 15, 2014 at 10:12:38AM -0500, Tom Lane wrote:
  What we'd really like for checkpointing is to hand the kernel a boatload
  (several GB) of dirty pages and say how about you push all this to disk
  over the next few minutes, in whatever way seems optimal given the storage
  hardware and system situation.  Let us know when you're done.
 
  The issue there is that the kernel has other triggers for needing to
  clean data. We have no infrastructure to handle variable writeback
  deadlines at the moment, nor do we have any infrastructure to do
  roughly metered writeback of such files to disk. I think we could
  add it to the infrastructure without too much perturbation of the
  code, but as you've pointed out that still leaves the fact there's
  no obvious interface to configure such behaviour. Would it need to
  be persistent?
 
 No, we'd be happy to re-request it during each checkpoint cycle, as
 long as that wasn't an unduly expensive call to make.  I'm not quite
 sure where such requests ought to live though.  One idea is to tie
 them to file descriptors; but the data to be written might be spread
 across more files than we really want to keep open at one time.

It would be a property of the inode, as that is how writeback is
tracked and timed. Set and queried through a file descriptor,
though - it's basically the same context that fadvise works
through.

 But the only other idea that comes to mind is some kind of global sysctl,
 which would probably have security and permissions issues.  (One thing
 that hasn't been mentioned yet in this thread, but maybe is worth pointing
 out now, is that Postgres does not run as root, and definitely doesn't
 want to.  So we don't want a knob that would require root permissions
 to twiddle.)

I have assumed all along that requiring root to do stuff would be a
bad thing. :)

 We could probably live with serially checkpointing data
 in sets of however-many-files-we-can-have-open, if file descriptors are
 the place to keep the requests.

Inodes live longer than file descriptors, but there's no guarantee
that they live from one fd context to another. Hence my question
about persistence ;)

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Robert Haas
On Wed, Jan 15, 2014 at 8:41 PM, Jan Kara j...@suse.cz wrote:
 On Wed 15-01-14 10:12:38, Robert Haas wrote:
 On Wed, Jan 15, 2014 at 4:35 AM, Jan Kara j...@suse.cz wrote:
  Filesystems could in theory provide facility like atomic write (at least up
  to a certain size say in MB range) but it's not so easy and when there are
  no strong usecases fs people are reluctant to make their code more complex
  unnecessarily. OTOH without widespread atomic write support I understand
  application developers have similar stance. So it's kind of chicken and egg
  problem. BTW, e.g. ext3/4 has quite a bit of the infrastructure in place
  due to its data=journal mode so if someone on the PostgreSQL side wanted to
  research on this, knitting some experimental ext4 patches should be doable.

 Atomic 8kB writes would improve performance for us quite a lot.  Full
 page writes to WAL are very expensive.  I don't remember what
 percentage of write-ahead log traffic that accounts for, but it's not
 small.
   OK, and do you need atomic writes on a per-IO basis, or is per-file enough?
  It basically boils down to: is all or most of the IO to a file going to be
  atomic, or is it a smaller fraction?

The write-ahead log wouldn't need it, but data file writes would.  So
we'd need it a lot, but not for absolutely everything.

For any given file, we'd either care about writes being atomic, or we wouldn't.

 As Dave notes, unless there is HW support (which is coming with newest
 solid state drives), ext4/xfs will have to implement this by writing data
 to a filesystem journal and after transaction commit checkpointing them to
 a final location. Which is exactly what you do with your WAL logs so
 it's not clear it will be a performance win. But it is easy enough to code
 for ext4 that I'm willing to try...

Yeah, hardware support would be great.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Hannu Krosing
On 01/14/2014 03:44 AM, Dave Chinner wrote:
 On Tue, Jan 14, 2014 at 02:26:25AM +0100, Andres Freund wrote:
 On 2014-01-13 17:13:51 -0800, James Bottomley wrote:
 a file into a user provided buffer, thus obtaining a page cache entry
 and a copy in their userspace buffer, then insert the page of the user
 buffer back into the page cache as the page cache page ... that's right,
 isn't it postgress people?
 Pretty much, yes. We'd probably hint (*advise(DONTNEED)) that the page
 isn't needed anymore when reading. And we'd normally write if the page
 is dirty.
 So why, exactly, do you even need the kernel page cache here? You've
 got direct access to the copy of data read into userspace, and you
 want direct control of when and how the data in that buffer is
 written and reclaimed. Why push that data buffer back into the
 kernel and then have to add all sorts of kernel interfaces to
 control the page you already have control of?
To let the kernel do the job that it is good at, namely managing the
write-back of dirty buffers to disk and managing (possible) read-ahead pages.

While we do have control of the page, we do not (and really don't want to)
have control of the complex and varied business of efficiently reading from
and writing to various file-systems with possibly very different disk
configurations.

We much prefer the kernel to take care of it, and generally like how the
kernel manages it.

We have a few suggestions about giving the kernel extra info about the
application's usage patterns of the data.

 Effectively you end up with buffered read/write that's also mapped into
 the page cache.  It's a pretty awful way to hack around mmap.
 Well, the problem is that you can't really use mmap() for the things we
 do. Postgres' durability works by guaranteeing that our journal entries
 (called WAL := Write Ahead Log) are written  synced to disk before the
 corresponding entries of tables and indexes reach the disk. That also
 allows to group together many random-writes into a few contiguous writes
 fdatasync()ed at once. Only during a checkpointing phase the big bulk of
 the data is then (slowly, in the background) synced to disk.
 Which is the exact algorithm most journalling filesystems use for
 ensuring durability of their metadata updates.  Indeed, here's an
 interesting piece of architecture that you might like to consider:

 * Neither XFS and BTRFS use the kernel page cache to back their
   metadata transaction engines.
But file system code is supposed to know much more about the
underlying disk than a mere application program like postgresql.

We do not want to start duplicating OS if we can avoid it.

What we would like is to have a way to tell the kernel

1) here is the modified copy of file page, it is now safe to write
it back - the current 'lazy' write

2) here is the page, write it back now, before returning success
to me - unbuffered write or write + sync

but we also would like to have

3) here is the page as it is currently on disk, I may need it soon,
so keep it together with your other clean pages accessed at time X
- this is the non-dirtying write discussed
   
the page may be in buffer cache, in which case just update its LRU
position (to either current time or time provided by postgresql), or
it may not be there, in which case put it there if reasonable by its
LRU position.

And we would like all this to work together with other current linux
kernel goodness of managing the whole disk-side interaction of
efficient reading and writing and managing the buffers :)
 Why not? Because the page cache is too simplistic to adequately
 represent the complex object hierarchies that the filesystems have
 and so its flat LRU reclaim algorithms and writeback control
 mechanisms are a terrible fit and cause lots of performance issues
 under memory pressure.
The same is true for postgresql - if we just used direct writes
and reads from disk, the performance would be terrible.

We would need to duplicate all the complicated algorithms that file
systems use for good performance if we were to start implementing
that part of the file system ourselves.
 
 IOWs, the two most complex high performance transaction engines in
 the Linux kernel have moved to fully customised cache and (direct)
 IO implementations because the requirements for scalability and
 performance are far more complex than the kernel page cache
 infrastructure can provide.
And we would like to avoid implementing all this again, by delegating
this part of the work to said complex high performance transaction
engines in the Linux kernel.

We do not want to abandon all work on postgresql business code
and go into file system development mode for the next few years.

Again, as said above the linux file system is doing fine. What we
want is a few ways to interact with it to let it do even better when
working with postgresql by telling it some stuff it otherwise would
have to second guess and by sometimes giving it back some cache
pages which were copied away for potential modifying but ended
up clean in the end.

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Claudio Freire
On Tue, Jan 14, 2014 at 5:08 AM, Hannu Krosing ha...@2ndquadrant.com wrote:
 Again, as said above the linux file system is doing fine. What we
 want is a few ways to interact with it to let it do even better when
 working with postgresql by telling it some stuff it otherwise would
 have to second guess and by sometimes giving it back some cache
 pages which were copied away for potential modifying but ended
 up clean in the end.

You don't need new interfaces. Only a slight modification of what
fadvise DONTNEED does.

This insistence on injecting pages from postgres into the kernel is just a
bad idea. At the very least, it still needs postgres to know too much
of the filesystem (block layout) to properly work. Ie: pg must be
required to put entire filesystem-level blocks into the page cache,
since that's how the page cache works. At the very worst, it may
introduce serious security and reliability implications, when
applications can destroy the consistency of the page cache (even if
full access rights are checked, there's still the possibility this
inconsistency might be exploitable).

Simply making fadvise DONTNEED move pages to the head of the LRU (ie:
discard next if you need) should work as expected without all the
complication of the above proposal.
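
A minimal sketch of the call site: when PG evicts a clean buffer it already
knows the kernel copy is expendable. Note that today POSIX_FADV_DONTNEED
drops the pagecache copy outright; the proposal above would soften it into
"evict this first".

    #define _POSIX_C_SOURCE 200112L
    #include <fcntl.h>

    void dropped_clean_buffer(int fd, off_t blkoff, off_t blklen)
    {
        /* Hint that the kernel copy of this block is the cheapest to lose. */
        (void) posix_fadvise(fd, blkoff, blklen, POSIX_FADV_DONTNEED);
    }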




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Hannu Krosing
On 01/14/2014 09:39 AM, Claudio Freire wrote:
 On Tue, Jan 14, 2014 at 5:08 AM, Hannu Krosing ha...@2ndquadrant.com wrote:
 Again, as said above the linux file system is doing fine. What we
 want is a few ways to interact with it to let it do even better when
 working with postgresql by telling it some stuff it otherwise would
 have to second guess and by sometimes giving it back some cache
 pages which were copied away for potential modifying but ended
 up clean in the end.
 You don't need new interfaces. Only a slight modification of what
 fadvise DONTNEED does.

 This insistence on injecting pages from postgres into the kernel is just a
 bad idea.
Do you think it would be possible to map copy-on-write pages
from linux cache to postgresql cache ?

This would be a step in the direction of solving the double-ram-usage
of pages which have not been read from syscache to postgresql
cache, without sacrificing linux read-ahead (which I assume does
not happen when reads bypass system cache).

And we could write back the copy at the point when it is safe (from
postgresql's perspective) to let the system write it back?

Do you think it is possible to make it work with good performance
for a few million 8kb pages ?

 At the very least, it still needs postgres to know too much
 of the filesystem (block layout) to properly work. Ie: pg must be
 required to put entire filesystem-level blocks into the page cache,
 since that's how the page cache works. 
I was thinking more of a simple write() interface with extra
flags/sysctls to tell the kernel that we already have this on disk
 At the very worst, it may
 introduce serious security and reliability implications, when
 applications can destroy the consistency of the page cache (even if
 full access rights are checked, there's still the possibility this
 inconsistency might be exploitable).
If you allow write() which just writes clean pages, I can not see
where the extra security concerns are beyond what normal
write can do.


Cheers

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ





Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Kevin Grittner
First off, I want to give a +1 on everything in the recent posts
from Heikki and Hannu.

Jan Kara j...@suse.cz wrote:

 Now the aging of pages marked as volatile as it is currently
 implemented needn't be perfect for your needs but you still have
 time to influence what gets implemented... Actually developers of
 the vrange() syscall were specifically looking for some ideas
 what to base aging on. Currently I think it is first marked -
 first evicted.

The first marked - first evicted seems like what we would want. 
The ability to unmark and have the page no longer be considered
preferred for eviction would be very nice.  That seems to me like
it would cover the multiple layers of buffering *clean* pages very
nicely (although I know nothing more about vrange() than what has
been said on this thread, so I could be missing something).

The other side of that is related avoiding multiple writes of the
same page as much as possible, while avoid write gluts.  The issue
here is that PostgreSQL tries to hang on to dirty pages for as long
as possible before writing them to the OS cache, while the OS
tries to avoid writing them to storage for as long as possible
until they reach a (configurable) threshold or are fsync'd.  The
problem is that under various conditions PostgreSQL may need to
write and fsync a lot of dirty pages it has accumulated in a short
time.  That has an avalanche effect, creating a write glut
which can stall all I/O for a period of many seconds up to a few
minutes.  If the OS was aware of the dirty pages pending write in
the application, and counted those for purposes of calculating when
and how much to write, the glut could be avoided.  Currently,
people configure the PostgreSQL background writer to be very
aggressive, configure a small PostgreSQL shared_buffers setting,
and/or set the OS thresholds low enough to minimize the problem;
but all of these mitigation strategies have their own costs.

A new hint that the application has dirtied a page could be used by
the OS to improve things this way:  When the OS is notified that a
page is dirty, it takes action depending on whether the page is
considered dirty by the OS.  If it is not dirty, the page is
immediately discarded from the OS cache.  It is known that the
application has a modified version of the page that it intends to
write, so the version in the OS cache has no value.  We don't want
this page forcing eviction of vrange()-flagged pages.  If it is
dirty, any write ordering to storage by the OS based on when the
page was written to the OS would be pushed back as far as possible
without crossing any write barriers, in hopes that the writes could
be combined.  Either way, this page is counted toward dirty pages
for purposes of calculating how much to write from the OS to
storage, and the later write of the page doesn't redundantly add to
this number.

The combination of these two changes could boost PostgreSQL
performance quite a bit, at least for some common workloads.

The MMAP approach always seems tempting on first blush, but the
need to pin pages and the need to assure that dirty pages are not
written ahead of the WAL-logging of those pages makes it hard to
see how we can use it.  The pin means that we need to ensure that
a particular 8KB page remains available for direct reference by all
PostgreSQL processes until it is unpinned.  The other thing we
would need is the ability to modify a page with a solid assurance
that the modified page would *not* be written to disk until we
authorize it.  The page would remain pinned until we do authorize
write, at which point the changes are available to be written, but
can wait for an fsync or accumulations of sufficient dirty pages to
cross the write threshold.  Next comes the hard part.  The page may
or may not be unpinned after that, and if it remains pinned or is
pinned again, there may be further changes to the page.  While the
prior changes can be written (and *must* be written for an fsync),
these new changes must *not* be until we authorize it.  If MMAP can
be made to handle that, we could probably use it (and some of the
previously-discussed techniques might not be needed), but my
understanding is that there is currently no way to do so.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Claudio Freire
On Tue, Jan 14, 2014 at 11:39 AM, Hannu Krosing ha...@2ndquadrant.com wrote:
 On 01/14/2014 09:39 AM, Claudio Freire wrote:
 On Tue, Jan 14, 2014 at 5:08 AM, Hannu Krosing ha...@2ndquadrant.com wrote:
 Again, as said above the linux file system is doing fine. What we
 want is a few ways to interact with it to let it do even better when
 working with postgresql by telling it some stuff it otherwise would
 have to second guess and by sometimes giving it back some cache
 pages which were copied away for potential modifying but ended
 up clean in the end.
 You don't need new interfaces. Only a slight modification of what
 fadvise DONTNEED does.

  This insistence on injecting pages from postgres into the kernel is just a
  bad idea.
 Do you think it would be possible to map copy-on-write pages
 from linux cache to postgresql cache ?

 this would be a step in direction of solving the double-ram-usage
 of pages which have not been read from syscache to postgresql
 cache without sacrificing linux read-ahead (which I assume does
 not happen when reads bypass system cache).

 and we can write back the copy at the point when it is safe (from
 postgresql perspective)  to let the system write them back ?

 Do you think it is possible to make it work with good performance
 for a few million 8kb pages ?

I don't think so. The kernel would need to walk the page mapping on
each page fault, which would incurr the cost of a read cache hit on
each page fault.

A cache hit is still orders of magnitude slower than a regular page
fault, because the process page map is compact and efficient. But if
you bloat it, or if you make the kernel go read the buffer cache, it
would mean bad performance for RAM access, which I'd venture isn't
really a net gain.

That's probably the reason there is no zero-copy read mechanism.
Because you always have to copy from/to the buffer cache anyway.

Of course, this is just OTOMH. Without actually benchmarking, this is
all blabber.

 At the very worst, it may
 introduce serious security and reliability implications, when
 applications can destroy the consistency of the page cache (even if
 full access rights are checked, there's still the possibility this
 inconsistency might be exploitable).
 If you allow write() which just writes clean pages, I can not see
 where the extra security concerns are beyond what normal
 write can do.

I've been working on security enough to never dismiss any kind of
system-level inconsistency.

The fact that you can make user-land applications see different data
than kernel-land code has over-reaching consequences that are hard to
ponder.




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Robert Haas
On Tue, Jan 14, 2014 at 3:39 AM, Claudio Freire klaussfre...@gmail.com wrote:
 On Tue, Jan 14, 2014 at 5:08 AM, Hannu Krosing ha...@2ndquadrant.com wrote:
 Again, as said above the linux file system is doing fine. What we
 want is a few ways to interact with it to let it do even better when
 working with postgresql by telling it some stuff it otherwise would
 have to second guess and by sometimes giving it back some cache
 pages which were copied away for potential modifying but ended
 up clean in the end.

 You don't need new interfaces. Only a slight modification of what
 fadvise DONTNEED does.

Yeah.  DONTREALLYNEEDALLTHATTERRIBLYMUCH.

 This insistence on injecting pages from postgres into the kernel is just a
 bad idea. At the very least, it still needs postgres to know too much
 of the filesystem (block layout) to properly work. Ie: pg must be
 required to put entire filesystem-level blocks into the page cache,
 since that's how the page cache works. At the very worst, it may
 introduce serious security and reliability implications, when
 applications can destroy the consistency of the page cache (even if
 full access rights are checked, there's still the possibility this
 inconsistency might be exploitable).

I agree with all that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Robert Haas
On Tue, Jan 14, 2014 at 5:00 AM, Jan Kara j...@suse.cz wrote:
 I thought that instead of injecting pages into pagecache for aging as you
 describe in 3), you would mark pages as volatile (i.e. for reclaim by
 kernel) through vrange() syscall. Next time you need the page, you check
 whether the kernel reclaimed the page or not. If yes, you reload it from
 disk, if not, you unmark it and use it.

 Now the aging of pages marked as volatile as it is currently implemented
 needn't be perfect for your needs but you still have time to influence what
 gets implemented... Actually developers of the vrange() syscall were
 specifically looking for some ideas what to base aging on. Currently I
 think it is first marked - first evicted.

This is an interesting idea but it stinks of impracticality.
Essentially when the last buffer pin on a page is dropped we'd have to
mark it as discardable, and then the next person wanting to pin it
would have to check whether it's still there.  But the system call
overhead of calling vrange() every time the last pin on a page was
dropped would probably hose us.

*thinks*

Well, I guess it could be done lazily: make periodic sweeps through
shared_buffers, looking for pages that haven't been touched in a
while, and vrange() them.  That's quite a bit of new mechanism, but in
theory it could work out to a win.  vrange() would have to scale well
to millions of separate ranges, though.  Will it?  And a lot depends
on whether the kernel makes the right decision about whether to chuck
data from our vrange() vs. any other page it could have reclaimed.
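
A hedged sketch of that sweep; vrange() is only a proposed syscall here, so
its prototype and the VRANGE_VOLATILE mode below are placeholders, as are the
buffer-pool fields.

    #include <stddef.h>
    #include <time.h>

    extern int vrange(void *start, size_t len, int mode);   /* hypothetical */
    #define VRANGE_VOLATILE 1                                /* hypothetical */

    typedef struct { void *addr; time_t last_used; int pin_count; } Buffer;

    void sweep_cold_buffers(Buffer *pool, size_t n, time_t now, time_t idle)
    {
        for (size_t i = 0; i < n; i++) {
            /* Unpinned and untouched for a while: offer it for reclaim. */
            if (pool[i].pin_count == 0 && now - pool[i].last_used > idle)
                vrange(pool[i].addr, 8192, VRANGE_VOLATILE);
        }
    }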

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Tom Lane
James Bottomley james.bottom...@hansenpartnership.com writes:
 The current mechanism for coherency between a userspace cache and the
 in-kernel page cache is mmap ... that's the only way you get the same
 page in both currently.

Right.

 glibc used to have an implementation of read/write in terms of mmap, so
 it should be possible to insert it into your current implementation
 without a major rewrite.  The problem I think this brings you is
 uncontrolled writeback: you don't want dirty pages to go to disk until
 you issue a write()

Exactly.

 I think we could fix this with another madvise():
 something like MADV_WILLUPDATE telling the page cache we expect to alter
 the pages again, so don't be aggressive about cleaning them.

"Don't be aggressive" isn't good enough.  The prohibition on early write
has to be absolute, because writing a dirty page before we've done
whatever else we need to do results in a corrupt database.  It has to
be treated like a write barrier.

 The problem is we can't give you absolute control of when pages are
 written back because that interface can be used to DoS the system: once
 we get too many dirty uncleanable pages, we'll thrash looking for memory
 and the system will livelock.

Understood, but that makes this direction a dead end.  We can't use
it if the kernel might decide to write anyway.

regards, tom lane




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Claudio Freire
On Tue, Jan 14, 2014 at 12:42 PM, Trond Myklebust tron...@gmail.com wrote:
 James Bottomley james.bottom...@hansenpartnership.com writes:
 The current mechanism for coherency between a userspace cache and the
 in-kernel page cache is mmap ... that's the only way you get the same
 page in both currently.

 Right.

 glibc used to have an implementation of read/write in terms of mmap, so
 it should be possible to insert it into your current implementation
 without a major rewrite.  The problem I think this brings you is
 uncontrolled writeback: you don't want dirty pages to go to disk until
 you issue a write()

 Exactly.

 I think we could fix this with another madvise():
 something like MADV_WILLUPDATE telling the page cache we expect to alter
 the pages again, so don't be aggressive about cleaning them.

 "Don't be aggressive" isn't good enough.  The prohibition on early write
 has to be absolute, because writing a dirty page before we've done
 whatever else we need to do results in a corrupt database.  It has to
 be treated like a write barrier.

 Then why are you dirtying the page at all? It makes no sense to tell the 
 kernel “we’re changing this page in the page cache, but we don’t want you to 
 change it on disk”: that’s not consistent with the function of a page cache.


PG doesn't currently.

All that dirtying happens in anonymous shared memory, in pg-specific buffers.

The proposal is to use mmap instead of anonymous shared memory as
pg-specific buffers to avoid the extra copy (mmap would share the page
with both kernel and user space). But that would dirty the page when
written to, because now the kernel has the correspondence between that
specific memory region and the file, and that's forbidden for PG's
usage.

I believe the only option here is for the kernel to implement
zero-copy reads. But that implementation is doomed for the performance
reasons I outlined in an earlier mail. So...




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Tom Lane
Trond Myklebust tron...@gmail.com writes:
 On Jan 14, 2014, at 10:39, Tom Lane t...@sss.pgh.pa.us wrote:
  "Don't be aggressive" isn't good enough.  The prohibition on early write
 has to be absolute, because writing a dirty page before we've done
 whatever else we need to do results in a corrupt database.  It has to
 be treated like a write barrier.

 Then why are you dirtying the page at all? It makes no sense to tell the 
 kernel “we’re changing this page in the page cache, but we don’t want you to 
 change it on disk”: that’s not consistent with the function of a page cache.

As things currently stand, we dirty the page in our internal buffers,
and we don't write it to the kernel until we've written and fsync'd the
WAL data that needs to get to disk first.  The discussion here is about
whether we could somehow avoid double-buffering between our internal
buffers and the kernel page cache.

I personally think there is no chance of using mmap for that; the
semantics of mmap are pretty much dictated by POSIX and they don't work
for this.  However, disregarding the fact that the two communities
speaking here don't control the POSIX spec, you could maybe imagine
making it work if *both* pending WAL file contents and data file
contents were mmap'd, and there were kernel APIs allowing us to say
you can write this mmap'd page if you want, but not till you've written
that mmap'd data over there.  That'd provide the necessary
write-barrier semantics, and avoid the cache coherency question because
all the data visible to the kernel could be thought of as the current
filesystem contents, it just might not all have reached disk yet; which
is the behavior of the kernel disk cache already.

I'm dubious that this sketch is implementable with adequate efficiency,
though, because in a live system the kernel would be forced to deal with
a whole lot of active barrier restrictions.  Within Postgres we can
reduce write-ordering tests to a very simple comparison: don't write
this page until WAL is flushed to disk at least as far as WAL sequence
number XYZ.  I think any kernel API would have to be a great deal more
general and thus harder to optimize.

Another difficulty with merging our internal buffers with the kernel
cache is that when we're in the process of applying a change to a page,
there are intermediate states of the page data that should under no
circumstances reach disk (eg, we might need to shuffle records around
within the page).  We can deal with that fairly easily right now by not
issuing a write() while a page change is in progress.  I don't see that
it's even theoretically possible in an mmap'd world; there are no atomic
updates to an mmap'd page that are larger than whatever is an atomic
update for the CPU.

regards, tom lane
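
To make Tom's sketch concrete, the kind of ordering API in question
might look like this. It is entirely hypothetical; nothing like it
exists in Linux:

    /* Hypothetical interface -- a sketch of "you can write this mmap'd
     * page if you want, but not till you've written that one". */
    #include <stddef.h>

    int mwrite_after(const void *dep_addr, size_t dep_len,    /* must hit disk first  */
                     const void *page_addr, size_t page_len); /* may be cleaned after */

    /* Usage sketch: before exposing a dirtied data page to writeback,
     * declare its dependency on the WAL range covering the change:
     *
     *     mwrite_after(wal_buf + rec_off, rec_len, data_page, 8192);
     *
     * As Tom notes, a live system would accumulate a very large set of
     * such constraints, which is exactly the efficiency worry. */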




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Jan Kara
On Tue 14-01-14 09:08:40, Hannu Krosing wrote:
  Effectively you end up with buffered read/write that's also mapped into
  the page cache.  It's a pretty awful way to hack around mmap.
  Well, the problem is that you can't really use mmap() for the things we
  do. Postgres' durability works by guaranteeing that our journal entries
   (called WAL := Write Ahead Log) are written & synced to disk before the
   corresponding entries of tables and indexes reach the disk. That also
   allows us to group together many random writes into a few contiguous writes
   fdatasync()ed at once. Only during a checkpointing phase is the big bulk of
   the data then (slowly, in the background) synced to disk.
  Which is the exact algorithm most journalling filesystems use for
  ensuring durability of their metadata updates.  Indeed, here's an
  interesting piece of architecture that you might like to consider:
 
  * Neither XFS nor BTRFS uses the kernel page cache to back its
metadata transaction engine.
 But file system code is supposed to know much more about the
 underlying disk than a mere application program like postgresql.
 
 We do not want to start duplicating OS if we can avoid it.
 
 What we would like is to have a way to tell the kernel
 
 1) here is the modified copy of file page, it is now safe to write
 it back - the current 'lazy' write
 
 2) here is the page, write it back now, before returning success
 to me - unbuffered write or write + sync
 
 but we also would like to have
 
 3) here is the page as it is currently on disk, I may need it soon,
 so keep it together with your other clean pages accessed at time X
 - this is the non-dirtying write discussed

 the page may be in buffer cache, in which case just update its LRU
 position (to either current time or time provided by postgresql), or
 it may not be there, in which case put it there if reasonable by its
 LRU position.
 
 And we would like all this to work together with other current linux
 kernel goodness of managing the whole disk-side interaction of
 efficient reading and writing and managing the buffers :)
  So when I was speaking about the proposed vrange() syscall in this thread,
I thought that instead of injecting pages into pagecache for aging as you
describe in 3), you would mark pages as volatile (i.e. for reclaim by
kernel) through vrange() syscall. Next time you need the page, you check
whether the kernel reclaimed the page or not. If yes, you reload it from
disk; if not, you unmark it and use it.

Now, the aging of pages marked as volatile, as it is currently implemented,
needn't be perfect for your needs, but you still have time to influence what
gets implemented... Actually, the developers of the vrange() syscall were
specifically looking for ideas on what to base aging on. Currently I
think it is first marked, first evicted.

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR
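
For concreteness, a usage sketch of the volatile-range idea Jan
describes. The vrange() work was out of tree at the time and its
interface was not settled, so the prototype and flags below are assumed
purely for illustration:

    #include <stddef.h>

    /* Prototype and flag values assumed for illustration only. */
    int vrange(void *start, size_t len, int mode, int *purged);

    #define VRANGE_VOLATILE    0   /* assumed: kernel may reclaim the range */
    #define VRANGE_NONVOLATILE 1   /* assumed: pin the range again          */

    /* Evicting a clean, cold 8kB buffer: let the kernel reclaim it under
     * memory pressure instead of copying it back into the page cache. */
    void evict(void *buf)
    {
        int purged;
        vrange(buf, 8192, VRANGE_VOLATILE, &purged);
    }

    /* Reusing the buffer later: unmark it and check whether the contents
     * survived.  Nonzero means the caller must re-read the page from disk. */
    int reuse(void *buf)
    {
        int purged = 0;
        vrange(buf, 8192, VRANGE_NONVOLATILE, &purged);
        return purged;
    }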




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Jan Kara
On Tue 14-01-14 11:11:28, Heikki Linnakangas wrote:
 On 01/14/2014 12:26 AM, Mel Gorman wrote:
 On Mon, Jan 13, 2014 at 03:15:16PM -0500, Robert Haas wrote:
 The other thing that comes to mind is the kernel's caching behavior.
 We've talked a lot over the years about the difficulties of getting
 the kernel to write data out when we want it to and to not write data
 out when we don't want it to.
 
 Is sync_file_range() broke?
 
 When it writes data back to disk too
 aggressively, we get lousy throughput because the same page can get
 written more than once when caching it for longer would have allowed
 write-combining.
 
 Do you think that is related to dirty_ratio or dirty_writeback_centisecs?
 If it's dirty_writeback_centisecs then that would be particularly tricky
 because poor interactions there would come down to luck basically.
 
 When it doesn't write data to disk aggressively
 enough, we get huge latency spikes at checkpoint time when we call
 fsync() and the kernel says uh, what? you wanted that data *on the
 disk*? sorry boss! and then proceeds to destroy the world by starving
 the rest of the system for I/O for many seconds or minutes at a time.
 
 Ok, parts of that are somewhat expected. It *may* depend on the
 underlying filesystem. Some of them handle fsync better than others. If
 you are syncing the whole file though when you call fsync then you are
 potentially burned by having to writeback dirty_ratio amounts of memory
 which could take a substantial amount of time.
 
 We've made some desultory attempts to use sync_file_range() to improve
 things here, but I'm not sure that's really the right tool, and if it
 is we don't know how to use it well enough to obtain consistent
 positive results.
 
 That implies that either sync_file_range() is broken in some fashion we
 (or at least I) are not aware of and that needs kicking.
 
 Let me try to explain the problem: Checkpoints can cause an I/O
 spike, which slows down other processes.
 
 When it's time to perform a checkpoint, PostgreSQL will write() all
 dirty buffers from the PostgreSQL buffer cache, and finally perform
 an fsync() to flush the writes to disk. After that, we know the data
 is safely on disk.
 
 In older PostgreSQL versions, the write() calls would cause an I/O
 storm as the OS cache quickly fills up with dirty pages, up to
 dirty_ratio, and after that all subsequent write()s block. That's OK
 as far as the checkpoint is concerned, but it significantly slows
 down queries running at the same time. Even a read-only query often
 needs to write(), to evict a dirty page from the buffer cache to
 make room for a different page. We made that less painful by adding
 sleeps between the write() calls, so that they are trickled over a
 long period of time and hopefully stay below dirty_ratio at all
 times.
  Hum, I wonder whether you see any difference with reasonably recent
kernels (say newer than 3.2). Because those have IO-less dirty throttling.
That means that:
  a) checkpointing thread (or other threads blocked due to dirty limit)
won't issue IO on their own but rather wait for flusher thread to do the
work.
  b) there should be more noticeable difference between the delay imposed
on heavily dirtying thread (i.e. the checkpointing thread) and the delay
imposed on lightly dirtying thread (that's what I would expect from those
threads having to do occasional page eviction to make room for other page).

 However, we still have to perform the fsync()s after the
 writes(), and sometimes that still causes a similar I/O storm.
  Because there is still quite some dirty data in the page cache or because
e.g. ext3 has to flush a lot of unrelated dirty data?

 The checkpointer is not in a hurry. A checkpoint typically has 10-30
 minutes to finish, before it's time to start the next checkpoint,
 and even if it misses that deadline that's not too serious either.
 But the OS doesn't know that, and we have no way of telling it.
 
 As a quick fix, some sort of a lazy fsync() call would be nice. It
 would behave just like fsync() but it would not change the I/O
 scheduling at all. Instead, it would sleep until all the pages have
 been flushed to disk, at the speed they would've been without the
 fsync() call.
 
 Another approach would be to give the I/O that the checkpointer
 process initiates a lower priority. This would be slightly
 preferable, because PostgreSQL could then issue the writes() as fast
 as it can, and have the checkpoint finish earlier when there's not
 much other load. Last I looked into this (which was a long time
 ago), there was no suitable priority system for writes, only reads.
  Well, IO priority works for writes in principle, the trouble is it
doesn't work for writes which end up just in the page cache. Then writeback
of page cache is usually done by flusher thread so that's completely
disconnected from whoever created the dirty data (now I know this is dumb
and long term we want to do something about it so that IO cgroups 
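
To ground Mel's sync_file_range() question, here is a sketch (not
PostgreSQL source; chunk size and pacing are illustrative) of the
trickle-then-fsync idea using the real sync_file_range(2) interface:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    /* Spread checkpoint writeback out over time so that the final
     * fsync() finds little dirty data left to flush. */
    void checkpoint_file(int fd, off_t file_size)
    {
        const off_t chunk = 1 << 20;        /* 1MB per step, illustrative */
        off_t off;

        for (off = 0; off < file_size; off += chunk) {
            /* Start asynchronous writeback of this range only. */
            sync_file_range(fd, off, chunk, SYNC_FILE_RANGE_WRITE);
            usleep(10000);                  /* pace the I/O; tunable */
        }
        /* sync_file_range() gives no durability guarantee by itself:
         * the final fsync() is still required to flush metadata and
         * the drive write cache. */
        fsync(fd);
    }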

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Dave Chinner
On Mon, Jan 13, 2014 at 03:24:38PM -0800, Josh Berkus wrote:
 On 01/13/2014 02:26 PM, Mel Gorman wrote:
  Really?
  
  zone_reclaim_mode is often a complete disaster unless the workload is
  partitioned to fit within NUMA nodes. On older kernels enabling it would
  sometimes cause massive stalls. I'm actually very surprised to hear it
  fixes anything and would be interested in hearing more about what sort
  of circumstances would convince you to enable that thing.
 
 So the problem with the default setting is that it pretty much isolates
 all FS cache for PostgreSQL to whichever socket the postmaster is
 running on, and makes the other FS cache unavailable.  This means that,
 for example, if you have two memory banks, then only one of them is
 available for PostgreSQL filesystem caching ... essentially cutting your
 available cache in half.

No matter what default NUMA allocation policy we set, there will be
an application for which that behaviour is wrong. As such, we've had
tools for setting application specific NUMA policies for quite a few
years now. e.g:

$ man 8 numactl

   --interleave=nodes, -i nodes
  Set a memory interleave policy. Memory will be
  allocated using round robin on nodes.  When memory
  cannot be allocated on the current interleave target
  fall back to other nodes.  Multiple nodes may be
  specified on --interleave, --membind and
  --cpunodebind.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
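
For a host dedicated to PostgreSQL, the variant of this that is usually
suggested is to interleave the postmaster's allocations across all
nodes at startup, along the lines of (illustrative invocation; assumes
PGDATA is set and should be adapted to the local init scripts):

    numactl --interleave=all pg_ctl start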




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Dave Chinner
On Tue, Jan 14, 2014 at 02:26:25AM +0100, Andres Freund wrote:
 On 2014-01-13 17:13:51 -0800, James Bottomley wrote:
  a file into a user provided buffer, thus obtaining a page cache entry
  and a copy in their userspace buffer, then insert the page of the user
  buffer back into the page cache as the page cache page ... that's right,
  isn't it, postgres people?
 
 Pretty much, yes. We'd probably hint (*advise(DONTNEED)) that the page
 isn't needed anymore when reading. And we'd normally write if the page
 is dirty.

So why, exactly, do you even need the kernel page cache here? You've
got direct access to the copy of data read into userspace, and you
want direct control of when and how the data in that buffer is
written and reclaimed. Why push that data buffer back into the
kernel and then have to add all sorts of kernel interfaces to
control the page you already have control of?

  Effectively you end up with buffered read/write that's also mapped into
  the page cache.  It's a pretty awful way to hack around mmap.
 
 Well, the problem is that you can't really use mmap() for the things we
 do. Postgres' durability works by guaranteeing that our journal entries
 (called WAL := Write Ahead Log) are written & synced to disk before the
 corresponding entries of tables and indexes reach the disk. That also
 allows us to group together many random writes into a few contiguous writes
 fdatasync()ed at once. Only during a checkpointing phase is the big bulk of
 the data then (slowly, in the background) synced to disk.

Which is the exact algorithm most journalling filesystems use for
ensuring durability of their metadata updates.  Indeed, here's an
interesting piece of architecture that you might like to consider:

* Neither XFS nor BTRFS uses the kernel page cache to back its
  metadata transaction engine.

Why not? Because the page cache is too simplistic to adequately
represent the complex object hierarchies that the filesystems have
and so its flat LRU reclaim algorithms and writeback control
mechanisms are a terrible fit and cause lots of performance issues
under memory pressure.

IOWs, the two most complex high performance transaction engines in
the Linux kernel have moved to fully customised cache and (direct)
IO implementations because the requirements for scalability and
performance are far more complex than the kernel page cache
infrastructure can provide.

Just food for thought

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
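
For contrast, the fully-custom-cache direction Dave points at starts
with direct IO. A sketch using real interfaces (the 4096-byte alignment
is a common but device-dependent assumption):

    #define _GNU_SOURCE   /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Read one 8kB page with O_DIRECT: no page cache copy is created,
     * so the application's own cache is the only cache. */
    void *read_page_direct(const char *path, off_t offset)
    {
        void *buf = NULL;
        int fd = open(path, O_RDONLY | O_DIRECT);

        if (fd < 0)
            return NULL;
        /* O_DIRECT requires aligned buffer, offset and length. */
        if (posix_memalign(&buf, 4096, 8192) != 0 ||
            pread(fd, buf, 8192, offset) != 8192) {
            free(buf);
            buf = NULL;
        }
        close(fd);
        return buf;
    }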




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread James Bottomley
On Mon, 2014-01-13 at 19:48 -0500, Trond Myklebust wrote:
 On Jan 13, 2014, at 19:03, Hannu Krosing ha...@2ndquadrant.com wrote:
 
  On 01/13/2014 09:53 PM, Trond Myklebust wrote:
  On Jan 13, 2014, at 15:40, Andres Freund and...@2ndquadrant.com wrote:
  
  On 2014-01-13 15:15:16 -0500, Robert Haas wrote:
  On Mon, Jan 13, 2014 at 1:51 PM, Kevin Grittner kgri...@ymail.com 
  wrote:
  I notice, Josh, that you didn't mention the problems many people
  have run into with Transparent Huge Page defrag and with NUMA
  access.
  Amen to that.  Actually, I think NUMA can be (mostly?) fixed by
  setting zone_reclaim_mode; is there some other problem besides that?
  I think that fixes some of the worst instances, but I've seen machines
   spending horrible amounts of CPU (& BUS) time in page reclaim
   nonetheless. If I analyzed it correctly it's in RAM > working set
  workloads where RAM is pretty large and most of it is used as page
  cache. The kernel ends up spending a huge percentage of time finding and
  potentially defragmenting pages when looking for victim buffers.
  
  On a related note, there's also the problem of double-buffering.  When
  we read a page into shared_buffers, we leave a copy behind in the OS
  buffers, and similarly on write-out.  It's very unclear what to do
  about this, since the kernel and PostgreSQL don't have intimate
  knowledge of what each other are doing, but it would be nice to solve
  somehow.
  I've wondered before if there wouldn't be a chance for postgres to say
  my dear OS, that the file range 0-8192 of file x contains y, no need to
  reread and do that when we evict a page from s_b but I never dared to
  actually propose that to kernel people...
  O_DIRECT was specifically designed to solve the problem of double 
  buffering 
  between applications and the kernel. Why are you not able to use that in 
  these situations?
  What is asked is the opposite of O_DIRECT - the write from a buffer inside
  postgresql to linux *buffercache* and telling linux that it is the same
  as what
  is currently on disk, so don't bother to write it back ever.
 
 I don’t understand. Are we talking about mmap()ed files here? Why
 would the kernel be trying to write back pages that aren’t dirty?

No ... if I have it right, it's pretty awful: they want to do a read of
a file into a user provided buffer, thus obtaining a page cache entry
and a copy in their userspace buffer, then insert the page of the user
buffer back into the page cache as the page cache page ... that's right,
isn't it, postgres people?

Effectively you end up with buffered read/write that's also mapped into
the page cache.  It's a pretty awful way to hack around mmap.

James






Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread James Bottomley
On Tue, 2014-01-14 at 15:39 +0100, Hannu Krosing wrote:
 On 01/14/2014 09:39 AM, Claudio Freire wrote:
  On Tue, Jan 14, 2014 at 5:08 AM, Hannu Krosing ha...@2ndquadrant.com 
  wrote:
  Again, as said above the linux file system is doing fine. What we
  want is a few ways to interact with it to let it do even better when
  working with postgresql by telling it some stuff it otherwise would
  have to second guess and by sometimes giving it back some cache
  pages which were copied away for potential modifying but ended
  up clean in the end.
  You don't need new interfaces. Only a slight modification of what
  fadvise DONTNEED does.
 
  This insistence in injecting pages from postgres to kernel is just a
  bad idea. 
 Do you think it would be possible to map copy-on-write pages
 from linux cache to postgresql cache?
 
 this would be a step in the direction of solving the double-ram-usage
 of pages which have not been read from syscache to postgresql
 cache without sacrificing linux read-ahead (which I assume does
 not happen when reads bypass system cache).

The current mechanism for coherency between a userspace cache and the
in-kernel page cache is mmap ... that's the only way you get the same
page in both currently.

glibc used to have an implementation of read/write in terms of mmap, so
it should be possible to insert it into your current implementation
without a major rewrite.  The problem I think this brings you is
uncontrolled writeback: you don't want dirty pages to go to disk until
you issue a write()  I think we could fix this with another madvise():
something like MADV_WILLUPDATE telling the page cache we expect to alter
the pages again, so don't be aggressive about cleaning them.  Plus all
the other issues with mmap() ... but if you can detail those, we might
be able to fix them.

 and we can write back the copy at the point when it is safe (from
 the postgresql perspective) to let the system write them back?

Using MADV_WILLUPDATE, possibly ... you're still not going to have
absolute control.  The kernel will write back the pages if the dirty
limits are exceeded, for instance, but we could tune it to be useful.

 Do you think it is possible to make it work with good performance
 for a few million 8kb pages?
 
  At the very least, it still needs postgres to know too much
  of the filesystem (block layout) to properly work. Ie: pg must be
  required to put entire filesystem-level blocks into the page cache,
  since that's how the page cache works. 
 I was thinking more of a simple write() interface with extra
 flags/sysctls to tell kernel that we already have this on disk
  At the very worst, it may
  introduce serious security and reliability implications, when
  applications can destroy the consistency of the page cache (even if
  full access rights are checked, there's still the possibility this
  inconsistency might be exploitable).
 If you allow a write() which just writes clean pages, I cannot see
 where the extra security concerns are beyond what normal
 write can do.

The problem is we can't give you absolute control of when pages are
written back because that interface can be used to DoS the system: once
we get too many dirty uncleanable pages, we'll thrash looking for memory
and the system will livelock.

James
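
To pin down what is being proposed: MADV_WILLUPDATE does not exist in
any kernel; the value below is invented purely so the usage sketch
reads as C:

    #include <sys/mman.h>

    #ifndef MADV_WILLUPDATE
    #define MADV_WILLUPDATE 100    /* hypothetical flag, invented here */
    #endif

    /* Before modifying an mmap'd, file-backed page: hint that we will
     * dirty it again soon, so the kernel should not be eager to clean
     * it.  As the replies note, this is advice, not a barrier --
     * writeback can still happen under memory pressure. */
    void about_to_update(char *page)
    {
        madvise(page, 8192, MADV_WILLUPDATE);
    }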






Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Trond Myklebust

On Jan 14, 2014, at 10:39, Tom Lane t...@sss.pgh.pa.us wrote:

 James Bottomley james.bottom...@hansenpartnership.com writes:
 The current mechanism for coherency between a userspace cache and the
 in-kernel page cache is mmap ... that's the only way you get the same
 page in both currently.
 
 Right.
 
 glibc used to have an implementation of read/write in terms of mmap, so
 it should be possible to insert it into your current implementation
 without a major rewrite.  The problem I think this brings you is
 uncontrolled writeback: you don't want dirty pages to go to disk until
 you issue a write()
 
 Exactly.
 
 I think we could fix this with another madvise():
 something like MADV_WILLUPDATE telling the page cache we expect to alter
 the pages again, so don't be aggressive about cleaning them.
 
 "Don't be aggressive" isn't good enough.  The prohibition on early write
 has to be absolute, because writing a dirty page before we've done
 whatever else we need to do results in a corrupt database.  It has to
 be treated like a write barrier.

Then why are you dirtying the page at all? It makes no sense to tell the kernel 
“we’re changing this page in the page cache, but we don’t want you to change it 
on disk”: that’s not consistent with the function of a page cache.

 The problem is we can't give you absolute control of when pages are
 written back because that interface can be used to DoS the system: once
 we get too many dirty uncleanable pages, we'll thrash looking for memory
 and the system will livelock.
 
 Understood, but that makes this direction a dead end.  We can't use
 it if the kernel might decide to write anyway.
 
   regards, tom lane





Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Robert Haas
On Tue, Jan 14, 2014 at 11:44 AM, James Bottomley
james.bottom...@hansenpartnership.com wrote:
 No, I'm sorry, that's never going to be possible.  No user space
 application has all the facts.  If we give you an interface to force
 unconditional holding of dirty pages in core you'll livelock the system
 eventually because you made a wrong decision to hold too many dirty
 pages.   I don't understand why this has to be absolute: if you advise
 us to hold the pages dirty and we do up until it becomes a choice to
 hold on to the pages or to thrash the system into a livelock, why would
 you ever choose the latter?  And if, as I'm assuming, you never would,
 why don't you want the kernel to make that choice for you?

If you don't understand how write-ahead logging works, this
conversation is going nowhere.  Suffice it to say that the word
"ahead" is not optional.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Claudio Freire
On Tue, Jan 14, 2014 at 1:48 PM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Jan 14, 2014 at 11:44 AM, James Bottomley
 james.bottom...@hansenpartnership.com wrote:
 No, I'm sorry, that's never going to be possible.  No user space
 application has all the facts.  If we give you an interface to force
 unconditional holding of dirty pages in core you'll livelock the system
 eventually because you made a wrong decision to hold too many dirty
 pages.   I don't understand why this has to be absolute: if you advise
 us to hold the pages dirty and we do up until it becomes a choice to
 hold on to the pages or to thrash the system into a livelock, why would
 you ever choose the latter?  And if, as I'm assuming, you never would,
 why don't you want the kernel to make that choice for you?

 If you don't understand how write-ahead logging works, this
 conversation is going nowhere.  Suffice it to say that the word
 "ahead" is not optional.


In essence, if you do flush when you shouldn't, and there is a
hardware failure, or kernel panic, or anything that stops the rest of
the writes from succeeding, your database is kaputt, and you've got to
restore a backup.

Ie: very very bad.




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Heikki Linnakangas

On 01/14/2014 06:08 PM, Tom Lane wrote:

Trond Myklebust tron...@gmail.com writes:

On Jan 14, 2014, at 10:39, Tom Lane t...@sss.pgh.pa.us wrote:

"Don't be aggressive" isn't good enough.  The prohibition on early write
has to be absolute, because writing a dirty page before we've done
whatever else we need to do results in a corrupt database.  It has to
be treated like a write barrier.



Then why are you dirtying the page at all? It makes no sense to tell the kernel 
“we’re changing this page in the page cache, but we don’t want you to change it 
on disk”: that’s not consistent with the function of a page cache.


As things currently stand, we dirty the page in our internal buffers,
and we don't write it to the kernel until we've written and fsync'd the
WAL data that needs to get to disk first.  The discussion here is about
whether we could somehow avoid double-buffering between our internal
buffers and the kernel page cache.


To be honest, I think the impact of double buffering in real-life 
applications is greatly exaggerated. If you follow the usual guideline 
and configure shared_buffers to 25% of available RAM, at worst you're 
wasting 25% of RAM to double buffering. That's significant, but it's not 
the end of the world, and it's a problem that can be compensated by 
simply buying more RAM.


Of course, if someone can come up with an easy way to solve that, that'd 
be great, but if it means giving up other advantages that we get from 
relying on the OS page cache, then -1 from me. The usual response to
"why don't you just use O_DIRECT?" is that it'd require reimplementing a
lot of I/O infrastructure, but that misses an IMHO more important point: it
would require setting shared_buffers a lot higher to get the same level 
of performance you get today. That has a number of problems:


1. It becomes a lot more important to tune shared_buffers correctly. Set 
it too low, and you're not taking advantage of all the RAM available. 
Set it too high, and you'll start swapping, totally killing performance. 
I can already hear consultants rubbing their hands, waiting for the rush 
of customers that will need expert help to determine the optimal 
shared_buffers setting.


2. Memory spent on the buffer cache can't be used for other things. For 
example, an index build can temporarily allocate several gigabytes of 
memory; if that memory is allocated to the shared buffer cache, it can't 
be used for that purpose. Yeah, we could change that, and allow 
borrowing pages from the shared buffer cache for other purposes, but 
that means more work and more code.


3. Memory used for the shared buffer cache can't be used by other 
processes (without swapping). It becomes a lot harder to be a good 
citizen on a system that's not entirely dedicated to PostgreSQL.


So not only would we need to re-implement I/O infrastructure, we'd also 
need to make memory management a lot smarter and a lot more flexible. 
We'd need a lot more information on what else is running on the system 
and how badly they need memory.



I personally think there is no chance of using mmap for that; the
semantics of mmap are pretty much dictated by POSIX and they don't work
for this.


Agreed. It would be possible to use mmap() for pages that are not 
modified, though. When you're not modifying, you could mmap() the data 
you need, and bypass the PostgreSQL buffer cache that way. The 
interaction with the buffer cache becomes complicated, because you 
couldn't use the buffer cache's locks etc., and some pages might have a 
newer version in the buffer cache than on disk, but it might be doable.


- Heikki
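
For reference, the guideline Heikki cites, in postgresql.conf terms
with illustrative numbers for a 32GB machine:

    shared_buffers = 8GB          # ~25% of RAM, the usual starting point
    effective_cache_size = 24GB   # what we expect the OS page cache to
                                  # add on top (a planner hint only)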




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Robert Haas
On Tue, Jan 14, 2014 at 11:57 AM, James Bottomley
james.bottom...@hansenpartnership.com wrote:
 On Tue, 2014-01-14 at 11:48 -0500, Robert Haas wrote:
 On Tue, Jan 14, 2014 at 11:44 AM, James Bottomley
 james.bottom...@hansenpartnership.com wrote:
  No, I'm sorry, that's never going to be possible.  No user space
  application has all the facts.  If we give you an interface to force
  unconditional holding of dirty pages in core you'll livelock the system
  eventually because you made a wrong decision to hold too many dirty
  pages.   I don't understand why this has to be absolute: if you advise
  us to hold the pages dirty and we do up until it becomes a choice to
  hold on to the pages or to thrash the system into a livelock, why would
  you ever choose the latter?  And if, as I'm assuming, you never would,
  why don't you want the kernel to make that choice for you?

 If you don't understand how write-ahead logging works, this
 conversation is going nowhere.  Suffice it to say that the word
 "ahead" is not optional.

 No, I do ... you mean the order of write out, if we have to do it, is
 important.  In the rest of the kernel, we do this with barriers which
 causes ordered grouping of I/O chunks.  If we could force a similar
 ordering in the writeout code, is that enough?

Probably not.  There are a whole raft of problems here.  For that to
be of any use, we'd have to move to mmap()ing each buffer instead
of read()ing them in, and apparently mmap() doesn't scale well to
millions of mappings.  And even if it did, then we'd have a solution
that only works on Linux.  Plus, as Tom pointed out, there are
critical sections where it's not just a question of ordering but in
fact you need to completely hold off writes.

In terms of avoiding double-buffering, here's my thought after reading
what's been written so far.  Suppose we read a page into our buffer
pool.  Until the page is clean, it would be ideal for the mapping to
be shared between the buffer cache and our pool, sort of like
copy-on-write.  That way, if we decide to evict the page, it will
still be in the OS cache if we end up needing it again (remember, the
OS cache is typically much larger than our buffer pool).  But if the
page is dirtied, then instead of copying it, just have the buffer pool
forget about it, because at that point we know we're going to write
the page back out anyway before evicting it.

This would be pretty similar to copy-on-write, except without the
copying.  It would just be forget-from-the-buffer-pool-on-write.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
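
Spelled out as an interface, what Robert describes might look like
this; the call and its name are hypothetical:

    #include <sys/types.h>

    /* Hypothetical call -- nothing like this exists.  A read() variant
     * that, when buf is page-aligned and len is a multiple of the page
     * size, maps the page cache pages read-only into the caller instead
     * of copying them. */
    ssize_t read_shared(int fd, void *buf, size_t len);

    /* Intended semantics of the mapping:
     *   - the first store to such a page takes a write-protection fault;
     *   - the kernel converts the page to ordinary anonymous memory and
     *     the page cache drops its reference to it;
     *   - no copy is ever made: "forget-on-write" rather than
     *     copy-on-write. */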




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Robert Haas
On Tue, Jan 14, 2014 at 12:12 PM, Robert Haas robertmh...@gmail.com wrote:
 In terms of avoiding double-buffering, here's my thought after reading
 what's been written so far.  Suppose we read a page into our buffer
 pool.  Until the page is clean, it would be ideal for the mapping to

Correction: For so long as the page is clean...

 be shared between the buffer cache and our pool, sort of like
 copy-on-write.  That way, if we decide to evict the page, it will
 still be in the OS cache if we end up needing it again (remember, the
 OS cache is typically much larger than our buffer pool).  But if the
 page is dirtied, then instead of copying it, just have the buffer pool
 forget about it, because at that point we know we're going to write
 the page back out anyway before evicting it.

 This would be pretty similar to copy-on-write, except without the
 copying.  It would just be forget-from-the-buffer-pool-on-write.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Claudio Freire
On Tue, Jan 14, 2014 at 2:12 PM, Robert Haas robertmh...@gmail.com wrote:

 In terms of avoiding double-buffering, here's my thought after reading
 what's been written so far.  Suppose we read a page into our buffer
 pool.  Until the page is clean, it would be ideal for the mapping to
 be shared between the buffer cache and our pool, sort of like
 copy-on-write.  That way, if we decide to evict the page, it will
 still be in the OS cache if we end up needing it again (remember, the
 OS cache is typically much larger than our buffer pool).  But if the
 page is dirtied, then instead of copying it, just have the buffer pool
 forget about it, because at that point we know we're going to write
 the page back out anyway before evicting it.

 This would be pretty similar to copy-on-write, except without the
 copying.  It would just be forget-from-the-buffer-pool-on-write.


But... either copy-on-write or forget-on-write needs a page fault, and
thus a page mapping.

Is a page fault more expensive than copying 8k?

(I really don't know).




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Robert Haas
On Tue, Jan 14, 2014 at 12:15 PM, Claudio Freire klaussfre...@gmail.com wrote:
 On Tue, Jan 14, 2014 at 2:12 PM, Robert Haas robertmh...@gmail.com wrote:
 In terms of avoiding double-buffering, here's my thought after reading
 what's been written so far.  Suppose we read a page into our buffer
 pool.  Until the page is clean, it would be ideal for the mapping to
 be shared between the buffer cache and our pool, sort of like
 copy-on-write.  That way, if we decide to evict the page, it will
 still be in the OS cache if we end up needing it again (remember, the
 OS cache is typically much larger than our buffer pool).  But if the
 page is dirtied, then instead of copying it, just have the buffer pool
 forget about it, because at that point we know we're going to write
 the page back out anyway before evicting it.

 This would be pretty similar to copy-on-write, except without the
 copying.  It would just be forget-from-the-buffer-pool-on-write.

 But... either copy-on-write or forget-on-write needs a page fault, and
 thus a page mapping.

 Is a page fault more expensive than copying 8k?

I don't know either.  I wasn't thinking so much that it would save CPU
time as that it would save memory.  Consider a system with 32GB of
RAM.  If you set shared_buffers=8GB, then in the worst case you've got
25% of your RAM wasted storing pages that already exist, dirtied, in
shared_buffers.  It's easy to imagine scenarios in which that results
in lots of extra I/O, so that the CPU required to do the accounting
comes to seem cheap by comparison.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Kevin Grittner
Claudio Freire klaussfre...@gmail.com wrote:
 Robert Haas robertmh...@gmail.com wrote:
 James Bottomley james.bottom...@hansenpartnership.com wrote:

 I don't understand why this has to be absolute: if you advise
 us to hold the pages dirty and we do up until it becomes a
 choice to hold on to the pages or to thrash the system into a
 livelock, why would you ever choose the latter?

Because the former creates database corruption and the latter does
not.

 And if, as I'm assuming, you never would,

That assumption is totally wrong.

 why don't you want the kernel to make that choice for you?

 If you don't understand how write-ahead logging works, this
 conversation is going nowhere.  Suffice it to say that the word
 "ahead" is not optional.

 In essence, if you do flush when you shouldn't, and there is a
 hardware failure, or kernel panic, or anything that stops the
 rest of the writes from succeeding, your database is kaputt, and
 you've got to restore a backup.

 Ie: very very bad.

Yup.  And when that's a few terabytes, you will certainly find
yourself wishing that you had been able to do a recovery up to the
end of the last successfully committed transaction rather than a
restore from backup.

Now, as Tom said, if there were an API to create write boundaries
between particular dirty pages we could leave it to the OS.  Each
WAL record's write would be conditional on the previous one and
each data page write would be conditional on the WAL record for the
last update to the page.  But nobody seems to think that would
yield acceptable performance.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Hannu Krosing
On 01/14/2014 05:44 PM, James Bottomley wrote:
 On Tue, 2014-01-14 at 10:39 -0500, Tom Lane wrote:
 James Bottomley james.bottom...@hansenpartnership.com writes:
 The current mechanism for coherency between a userspace cache and the
 in-kernel page cache is mmap ... that's the only way you get the same
 page in both currently.
 Right.

 glibc used to have an implementation of read/write in terms of mmap, so
 it should be possible to insert it into your current implementation
 without a major rewrite.  The problem I think this brings you is
 uncontrolled writeback: you don't want dirty pages to go to disk until
 you issue a write()
 Exactly.

 I think we could fix this with another madvise():
 something like MADV_WILLUPDATE telling the page cache we expect to alter
 the pages again, so don't be aggressive about cleaning them.
 "Don't be aggressive" isn't good enough.  The prohibition on early write
 has to be absolute, because writing a dirty page before we've done
 whatever else we need to do results in a corrupt database.  It has to
 be treated like a write barrier.

 The problem is we can't give you absolute control of when pages are
 written back because that interface can be used to DoS the system: once
 we get too many dirty uncleanable pages, we'll thrash looking for memory
 and the system will livelock.
 Understood, but that makes this direction a dead end.  We can't use
 it if the kernel might decide to write anyway.
 No, I'm sorry, that's never going to be possible.  No user space
 application has all the facts.  If we give you an interface to force
 unconditional holding of dirty pages in core you'll livelock the system
 eventually because you made a wrong decision to hold too many dirty
 pages.   I don't understand why this has to be absolute: if you advise
 us to hold the pages dirty and we do up until it becomes a choice to
 hold on to the pages or to thrash the system into a livelock, why would
 you ever choose the latter?  And if, as I'm assuming, you never would,
 why don't you want the kernel to make that choice for you?
The short answer is crash safety.

A database system worth its name must make sure that all data
reported as stored to clients is there even after a crash.

The write-ahead log is the means for that, and writing WAL files and
data pages has to happen in a certain order to guarantee consistent
recovery after a crash.

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ





Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Kevin Grittner
James Bottomley james.bottom...@hansenpartnership.com wrote:

 you mean the order of write out, if we have to do it, is
 important.  In the rest of the kernel, we do this with barriers
 which causes ordered grouping of I/O chunks.  If we could force a
 similar ordering in the writeout code, is that enough?

Unless it can be between particular pairs of pages, I don't think
performance could be at all acceptable.  Each data page has an
associated Log Sequence Number reflecting the last Write-Ahead Log
record which records a change to that page, and the referenced WAL
record must be safely persisted before the data page is allowed to
be written.  Currently, when we need to write a dirty page to the
OS, we must ensure that the WAL record is written and fsync'd
first.  We also write a WAL record for each transaction commit and
fsync it at each COMMIT, before telling the client that the COMMIT
request was successful.  (Well, at least by default; they can
choose to set synchronous_commit to off for some or all
transactions.)  If a write barrier to control this applied to
everything on the filesystem, performance would be horrible.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
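
For concreteness, a much-simplified sketch of that rule as the
PostgreSQL buffer manager applies it (names abbreviated and simplified;
not the actual source):

    #include <sys/types.h>
    #include <unistd.h>

    typedef unsigned long long XLogRecPtr;   /* WAL position (LSN) */

    extern XLogRecPtr wal_flushed_to;        /* WAL durable up to here */
    extern void XLogFlush(XLogRecPtr upto);  /* write + fsync WAL      */

    /* WAL first: the record covering the last change to this page must
     * be durable before the page itself may be handed to the kernel. */
    void write_data_page(int fd, const char *page, off_t off,
                         XLogRecPtr page_lsn)
    {
        if (page_lsn > wal_flushed_to)
            XLogFlush(page_lsn);
        pwrite(fd, page, 8192, off);
    }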




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Robert Haas
On Tue, Jan 14, 2014 at 12:20 PM, James Bottomley
james.bottom...@hansenpartnership.com wrote:
 On Tue, 2014-01-14 at 15:15 -0200, Claudio Freire wrote:
 On Tue, Jan 14, 2014 at 2:12 PM, Robert Haas robertmh...@gmail.com wrote:
  In terms of avoiding double-buffering, here's my thought after reading
  what's been written so far.  Suppose we read a page into our buffer
  pool.  Until the page is clean, it would be ideal for the mapping to
  be shared between the buffer cache and our pool, sort of like
  copy-on-write.  That way, if we decide to evict the page, it will
  still be in the OS cache if we end up needing it again (remember, the
  OS cache is typically much larger than our buffer pool).  But if the
  page is dirtied, then instead of copying it, just have the buffer pool
  forget about it, because at that point we know we're going to write
  the page back out anyway before evicting it.
 
  This would be pretty similar to copy-on-write, except without the
  copying.  It would just be forget-from-the-buffer-pool-on-write.

 But... either copy-on-write or forget-on-write needs a page fault, and
 thus a page mapping.

 Is a page fault more expensive than copying 8k?

 (I really don't know).

 A page fault can be expensive, yes ... but perhaps you don't need one.

 What you want is a range of memory that's read from a file but treated
 as anonymous for writeout (i.e. written to swap if we need to reclaim
 it). Then at some time later, you want to designate it as written back
 to the file instead so you control the writeout order.  I'm not sure we
 can do this: the separation between file backed and anonymous pages is
 pretty deeply ingrained into the OS, but if it were possible, is that
 what you want?

Doesn't sound exactly like what I had in mind.  What I was suggesting
is an analogue of read() that, if it reads full pages of data to a
page-aligned address, shares the data with the buffer cache until it's
first written instead of actually copying the data.  The pages are
write-protected so that an attempt to write the address range causes a
page fault.  In response to such a fault, the pages become anonymous
memory and the buffer cache no longer holds a reference to the page.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Tue, Jan 14, 2014 at 11:57 AM, James Bottomley
 james.bottom...@hansenpartnership.com wrote:
 No, I do ... you mean the order of write out, if we have to do it, is
 important.  In the rest of the kernel, we do this with barriers which
 causes ordered grouping of I/O chunks.  If we could force a similar
 ordering in the writeout code, is that enough?

 Probably not.  There are a whole raft of problems here.  For that to
  be of any use, we'd have to move to mmap()ing each buffer instead
 of read()ing them in, and apparently mmap() doesn't scale well to
 millions of mappings.

We would presumably mmap whole files, not individual pages (at least
on 64-bit machines; else address space size is going to be a problem).
However, without a fix for the critical-section/atomic-update problem,
the idea's still going nowhere.

 This would be pretty similar to copy-on-write, except without the
 copying.  It would just be forget-from-the-buffer-pool-on-write.

That might possibly work.

regards, tom lane




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread James Bottomley
On Tue, 2014-01-14 at 11:48 -0500, Robert Haas wrote:
 On Tue, Jan 14, 2014 at 11:44 AM, James Bottomley
 james.bottom...@hansenpartnership.com wrote:
  No, I'm sorry, that's never going to be possible.  No user space
  application has all the facts.  If we give you an interface to force
  unconditional holding of dirty pages in core you'll livelock the system
  eventually because you made a wrong decision to hold too many dirty
  pages.   I don't understand why this has to be absolute: if you advise
  us to hold the pages dirty and we do up until it becomes a choice to
  hold on to the pages or to thrash the system into a livelock, why would
  you ever choose the latter?  And if, as I'm assuming, you never would,
  why don't you want the kernel to make that choice for you?
 
 If you don't understand how write-ahead logging works, this
 conversation is going nowhere.  Suffice it to say that the word
 "ahead" is not optional.

No, I do ... you mean the order of write out, if we have to do it, is
important.  In the rest of the kernel, we do this with barriers which
causes ordered grouping of I/O chunks.  If we could force a similar
ordering in the writeout code, is that enough?

James






Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread James Bottomley
On Tue, 2014-01-14 at 15:15 -0200, Claudio Freire wrote:
 On Tue, Jan 14, 2014 at 2:12 PM, Robert Haas robertmh...@gmail.com wrote:
 
  In terms of avoiding double-buffering, here's my thought after reading
  what's been written so far.  Suppose we read a page into our buffer
  pool.  Until the page is clean, it would be ideal for the mapping to
  be shared between the buffer cache and our pool, sort of like
  copy-on-write.  That way, if we decide to evict the page, it will
  still be in the OS cache if we end up needing it again (remember, the
  OS cache is typically much larger than our buffer pool).  But if the
  page is dirtied, then instead of copying it, just have the buffer pool
  forget about it, because at that point we know we're going to write
  the page back out anyway before evicting it.
 
  This would be pretty similar to copy-on-write, except without the
  copying.  It would just be forget-from-the-buffer-pool-on-write.
 
 
 But... either copy-on-write or forget-on-write needs a page fault, and
 thus a page mapping.
 
 Is a page fault more expensive than copying 8k?
 
 (I really don't know).

A page fault can be expensive, yes ... but perhaps you don't need one. 

What you want is a range of memory that's read from a file but treated
as anonymous for writeout (i.e. written to swap if we need to reclaim
it).  Then at some time later, you want to designate it as written back
to the file instead so you control the writeout order.  I'm not sure we
can do this: the separation between file backed and anonymous pages is
pretty deeply ingrained into the OS, but if it were possible, is that
what you want?

James






Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread James Bottomley
On Tue, 2014-01-14 at 10:39 -0500, Tom Lane wrote:
 James Bottomley james.bottom...@hansenpartnership.com writes:
  The current mechanism for coherency between a userspace cache and the
  in-kernel page cache is mmap ... that's the only way you get the same
  page in both currently.
 
 Right.
 
  glibc used to have an implementation of read/write in terms of mmap, so
  it should be possible to insert it into your current implementation
  without a major rewrite.  The problem I think this brings you is
  uncontrolled writeback: you don't want dirty pages to go to disk until
  you issue a write()
 
 Exactly.
 
  I think we could fix this with another madvise():
  something like MADV_WILLUPDATE telling the page cache we expect to alter
  the pages again, so don't be aggressive about cleaning them.
 
 "Don't be aggressive" isn't good enough.  The prohibition on early write
 has to be absolute, because writing a dirty page before we've done
 whatever else we need to do results in a corrupt database.  It has to
 be treated like a write barrier.
 
  The problem is we can't give you absolute control of when pages are
  written back because that interface can be used to DoS the system: once
  we get too many dirty uncleanable pages, we'll thrash looking for memory
  and the system will livelock.
 
 Understood, but that makes this direction a dead end.  We can't use
 it if the kernel might decide to write anyway.

No, I'm sorry, that's never going to be possible.  No user space
application has all the facts.  If we give you an interface to force
unconditional holding of dirty pages in core you'll livelock the system
eventually because you made a wrong decision to hold too many dirty
pages.   I don't understand why this has to be absolute: if you advise
us to hold the pages dirty and we do up until it becomes a choice to
hold on to the pages or to thrash the system into a livelock, why would
you ever choose the latter?  And if, as I'm assuming, you never would,
why don't you want the kernel to make that choice for you?

James





Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Jeff Janes
On Mon, Jan 13, 2014 at 6:44 PM, Dave Chinner da...@fromorbit.com wrote:

 On Tue, Jan 14, 2014 at 02:26:25AM +0100, Andres Freund wrote:
  On 2014-01-13 17:13:51 -0800, James Bottomley wrote:
   a file into a user provided buffer, thus obtaining a page cache entry
   and a copy in their userspace buffer, then insert the page of the user
   buffer back into the page cache as the page cache page ... that's right,
   isn't it, postgres people?
 
  Pretty much, yes. We'd probably hint (*advise(DONTNEED)) that the page
  isn't needed anymore when reading. And we'd normally write if the page
  is dirty.

 So why, exactly, do you even need the kernel page cache here?


We don't need it, but it would be nice.


 You've
 got direct access to the copy of data read into userspace, and you
 want direct control of when and how the data in that buffer is
 written and reclaimed. Why push that data buffer back into the
 kernel and then have to add all sorts of kernel interfaces to
 control the page you already have control of?


Say 25% of the RAM is dedicated to the database's shared buffers, and 75%
is left to the kernel's judgement.  It sure would be nice if the kernel had
the capability of using some of that 75% for database pages, if it thought
that that was the best use for it.

Which is what we do get now, at the expense of quite a lot of double
buffering (by which I mean, a lot of pages are both in the kernel cache and
the database cache--not just transiently during the copy process, but for
quite a while).  If we had the ability to re-inject clean pages into the
kernel's cache, we would get that benefit without the double buffering.

Cheers,

Jeff


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Claudio Freire
On Tue, Jan 14, 2014 at 2:17 PM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Jan 14, 2014 at 12:15 PM, Claudio Freire klaussfre...@gmail.com 
 wrote:
 On Tue, Jan 14, 2014 at 2:12 PM, Robert Haas robertmh...@gmail.com wrote:
 In terms of avoiding double-buffering, here's my thought after reading
 what's been written so far.  Suppose we read a page into our buffer
 pool.  Until the page is clean, it would be ideal for the mapping to
 be shared between the buffer cache and our pool, sort of like
 copy-on-write.  That way, if we decide to evict the page, it will
 still be in the OS cache if we end up needing it again (remember, the
 OS cache is typically much larger than our buffer pool).  But if the
 page is dirtied, then instead of copying it, just have the buffer pool
 forget about it, because at that point we know we're going to write
 the page back out anyway before evicting it.

 This would be pretty similar to copy-on-write, except without the
 copying.  It would just be forget-from-the-buffer-pool-on-write.

 But... either copy-on-write or forget-on-write needs a page fault, and
 thus a page mapping.

 Is a page fault more expensive than copying 8k?

 I don't know either.  I wasn't thinking so much that it would save CPU
 time as that it would save memory.  Consider a system with 32GB of
 RAM.  If you set shared_buffers=8GB, then in the worst case you've got
 25% of your RAM wasted storing pages that already exist, dirtied, in
 shared_buffers.  It's easy to imagine scenarios in which that results
 in lots of extra I/O, so that the CPU required to do the accounting
 comes to seem cheap by comparison.

Not necessarily: you pay the CPU cost on each page fault (i.e. on the first
write to the buffer at least), each time the page checks into the
shared-buffers level.

It's like a tiered cache.

When promoting is expensive, one must be careful. The traffic to/from
the L0 (shared buffers) and L1 (page cache) will be considerable, even
if everything fits in RAM.

I guess it's the constant battle between inclusive and exclusive caches.




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Claudio Freire
On Tue, Jan 14, 2014 at 2:39 PM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Jan 14, 2014 at 12:20 PM, James Bottomley
 james.bottom...@hansenpartnership.com wrote:
 On Tue, 2014-01-14 at 15:15 -0200, Claudio Freire wrote:
 On Tue, Jan 14, 2014 at 2:12 PM, Robert Haas robertmh...@gmail.com wrote:
  In terms of avoiding double-buffering, here's my thought after reading
  what's been written so far.  Suppose we read a page into our buffer
  pool.  Until the page is clean, it would be ideal for the mapping to
  be shared between the buffer cache and our pool, sort of like
  copy-on-write.  That way, if we decide to evict the page, it will
  still be in the OS cache if we end up needing it again (remember, the
  OS cache is typically much larger than our buffer pool).  But if the
  page is dirtied, then instead of copying it, just have the buffer pool
  forget about it, because at that point we know we're going to write
  the page back out anyway before evicting it.
 
  This would be pretty similar to copy-on-write, except without the
  copying.  It would just be forget-from-the-buffer-pool-on-write.

 But... either copy-on-write or forget-on-write needs a page fault, and
 thus a page mapping.

 Is a page fault more expensive than copying 8k?

 (I really don't know).

 A page fault can be expensive, yes ... but perhaps you don't need one.

 What you want is a range of memory that's read from a file but treated
 as anonymous for writeout (i.e. written to swap if we need to reclaim
 it). Then at some time later, you want to designate it as written back
 to the file instead so you control the writeout order.  I'm not sure we
 can do this: the separation between file backed and anonymous pages is
 pretty deeply ingrained into the OS, but if it were possible, is that
 what you want?

 Doesn't sound exactly like what I had in mind.  What I was suggesting
 is an analogue of read() that, if it reads full pages of data to a
 page-aligned address, shares the data with the buffer cache until it's
 first written instead of actually copying the data.  The pages are
 write-protected so that an attempt to write the address range causes a
 page fault.  In response to such a fault, the pages become anonymous
 memory and the buffer cache no longer holds a reference to the page.


Yes, that's basically zero-copy reads.

It could be done. The kernel can remap the page to the physical page
holding the shared buffer and mark it read-only, then expire the
buffer and transfer ownership of the page if any page fault happens.

But that incurs:
 - lots of page faults
 - hugely bloated mappings, unless KSM is somehow leveraged for this

And there's a nice bingo: I had forgotten about KSM. KSM could help a lot.

I could try to see if madvising shared_buffers as mergeable helps. But
this should be an automatic case of KSM: when reading into a
page-aligned address, the kernel should summarily apply KSM-style
sharing without hinting. The current madvise interface puts the burden
of figuring out what duplicates what on the kernel, but Postgres
already knows.
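
For what it's worth, the madvise experiment itself is nearly a
one-liner; a minimal sketch (my own, with an anonymous arena standing
in for shared_buffers; requires CONFIG_KSM and
echo 1 > /sys/kernel/mm/ksm/run). One caveat up front: KSM as
implemented only merges anonymous pages with one another, never with
page cache pages, so this can only fold duplicates within the arena:

#define _GNU_SOURCE             /* for MADV_MERGEABLE */
#include <stdio.h>
#include <sys/mman.h>

#define NBUFFERS 1024
#define BLCKSZ   8192

int main(void)
{
    size_t len = (size_t) NBUFFERS * BLCKSZ;

    /* Page-aligned anonymous arena standing in for shared_buffers. */
    void *arena = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (arena == MAP_FAILED) { perror("mmap"); return 1; }

    /* Opt the arena in to KSM scanning. */
    if (madvise(arena, len, MADV_MERGEABLE) != 0)
        perror("madvise(MADV_MERGEABLE)");  /* EINVAL without CONFIG_KSM */

    /* ... fill buffers; ksmd merges identical pages in the background.
     * Progress is visible in /sys/kernel/mm/ksm/pages_sharing. */
    return 0;
}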




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Stephen Frost
* Claudio Freire (klaussfre...@gmail.com) wrote:
 On Tue, Jan 14, 2014 at 2:17 PM, Robert Haas robertmh...@gmail.com wrote:
  I don't know either.  I wasn't thinking so much that it would save CPU
  time as that it would save memory.  Consider a system with 32GB of
  RAM.  If you set shared_buffers=8GB, then in the worst case you've got
  25% of your RAM wasted storing pages that already exist, dirtied, in
  shared_buffers.  It's easy to imagine scenarios in which that results
  in lots of extra I/O, so that the CPU required to do the accounting
  comes to seem cheap by comparison.
 
 Not necessarily, you pay the CPU cost on each page fault (ie: first
 write to the buffer at least), each time the page checks into the
 shared buffers level.

I'm really not sure that this is a real issue for us, but if it is,
perhaps having this as an option for each read() call would work?
That is to say, rather than making this an open() flag or similar, it
would be a normal read() with a flags field, where we could decide when
we want pages to be write-protected this way and when we don't (perhaps
because we know we're about to write to them).

I'm not 100% sure it'd be easy for us to manage that flag perfectly,
but it's our issue, and it'd be on us to deal with, as the kernel can't
possibly guess our intentions.
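
To make the shape of that concrete, a purely hypothetical sketch --
neither pread_flags() nor RWF_SHARED exists; this just illustrates a
per-call (rather than per-fd) flag of the kind being suggested:

#include <sys/types.h>

#define RWF_SHARED 0x01   /* hypothetical: zero-copy, write-protected read */

/* Hypothetical syscall: like pread(2), plus a per-call flags word. */
ssize_t pread_flags(int fd, void *buf, size_t count, off_t offset, int flags);

/* The buffer manager could then choose per read: pages we expect to
 * dirty immediately get a plain copying read, avoiding the later
 * protection fault; read-mostly pages opt in to the shared mapping. */
static ssize_t
read_block(int fd, void *page, off_t off, int will_dirty)
{
    return pread_flags(fd, page, 8192, off, will_dirty ? 0 : RWF_SHARED);
}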

Concerns were raised earlier that such a zero-copy read option wouldn't
perform well, though, and I'm curious to hear more about those concerns
and whether we could avoid the performance issues by managing the
zero-copy-read case ourselves, as Robert suggests.

Thanks,

Stephen

