Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-22 Thread Jan Kara
On Wed 22-01-14 09:07:19, Dave Chinner wrote:
 On Tue, Jan 21, 2014 at 09:20:52PM +0100, Jan Kara wrote:
   If we're forcing the WAL out to disk because of transaction commit or
   because we need to write the buffer protected by a certain WAL record
   only after the WAL hits the platter, then it's fine.  But sometimes
   we're writing WAL just because we've run out of internal buffer space,
   and we don't want to block waiting for the write to complete.  Opening
   the file with O_SYNC deprives us of the ability to control the timing
   of the sync relative to the timing of the write.
O_SYNC has a heavy performance penalty. For ext4 it means an extra fs
  transaction commit whenever any metadata is changed on the filesystem.
  Since mtime & ctime of files will be changed often, this will be the case
  very often.
 
 Therefore: O_DATASYNC.
  O_DSYNC to be exact.

   Maybe it'll be useful to have hints that say "always write this file
   to disk as quick as you can" and "always postpone writing this file to
   disk for as long as you can" for WAL and temp files respectively.  But
   the rule for the data files, which are the really important case, is
   not so simple.  fsync() is actually a fine API except that it tends to
   destroy system throughput.  Maybe what we need is just for fsync() to
   be less aggressive, or a less aggressive version of it.  We wouldn't
   mind waiting an almost arbitrarily long time for fsync to complete if
   other processes could still get their I/O requests serviced in a
   reasonable amount of time in the meanwhile.
As I wrote in some other email in this thread, using IO priorities for
  data file checkpoint might be actually the right answer. They will work for
  IO submitted by fsync(). The downside is that currently IO priorities / IO
  scheduling classes work only with CFQ IO scheduler.
 
 And I don't see it being implemented anywhere else because it's the
 priority aware scheduling infrastructure in CFQ that causes all the
 problems with IO concurrency and scalability...
  So CFQ has all sorts of problems but I never had the impression that
priority aware scheduling is the culprit. It is all just complex - sync
idling, seeky writer detection, cooperating threads detection, sometimes
even sync vs async distinction isn't exactly what one would want. And I'm
not speaking about the cgroup stuff... So it doesn't seem to me that some
other IO scheduler couldn't reasonably efficiently implement stuff like IO
scheduling classes.

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-22 Thread Jan Kara
On Fri 17-01-14 08:57:25, Robert Haas wrote:
 On Fri, Jan 17, 2014 at 7:34 AM, Jeff Layton jlay...@redhat.com wrote:
  So this says to me that the WAL is a place where DIO should really be
  reconsidered. It's mostly sequential writes that need to hit the disk
  ASAP, and you need to know that they have hit the disk before you can
  proceed with other operations.
 
 Ironically enough, we actually *have* an option to use O_DIRECT here.
 But it doesn't work well.  See below.
 
  Also, is the WAL actually ever read under normal (non-recovery)
  conditions or is it write-only under normal operation? If it's seldom
  read, then using DIO for them also avoids some double buffering since
  they wouldn't go through pagecache.
 
 This is the first problem: if replication is in use, then the WAL gets
 read shortly after it gets written.  Using O_DIRECT bypasses the
 kernel cache for the writes, but then the reads stink.
  OK, yes, this is hard to fix with direct IO.

 However, if you configure wal_sync_method=open_sync and disable
 replication, then you will in fact get O_DIRECT|O_SYNC behavior.
 
 But that still doesn't work out very well, because now the guy who
 does the write() has to wait for it to finish before he can do
 anything else.  That's not always what we want, because WAL gets
 written out from our internal buffers for multiple different reasons.
  Well, you can always use AIO (io_submit) to submit direct IO without
waiting for it to finish. But then you might need to track the outstanding
IO so that you can check with io_getevents() when it has finished.

 If we're forcing the WAL out to disk because of transaction commit or
 because we need to write the buffer protected by a certain WAL record
 only after the WAL hits the platter, then it's fine.  But sometimes
 we're writing WAL just because we've run out of internal buffer space,
 and we don't want to block waiting for the write to complete.  Opening
 the file with O_SYNC deprives us of the ability to control the timing
 of the sync relative to the timing of the write.
  O_SYNC has a heavy performance penalty. For ext4 it means an extra fs
transaction commit whenever any metadata is changed on the filesystem.
Since mtime & ctime of files will be changed often, this will be the case
very often.

  Again, I think this discussion would really benefit from an outline of
  the different files used by pgsql, and what sort of data access
  patterns you expect with them.
 
 I think I more or less did that in my previous email, but here it is
 again in briefer form:
 
 - WAL files are written (and sometimes read) sequentially and fsync'd
 very frequently and it's always good to write the data out to disk as
 soon as possible
 - Temp files are written and read sequentially and never fsync'd.
 They should only be written to disk when memory pressure demands it
 (but are a good candidate when that situation comes up)
 - Data files are read and written randomly.  They are fsync'd at
 checkpoint time; between checkpoints, it's best not to write them
 sooner than necessary, but when the checkpoint arrives, they all need
 to get out to the disk without bringing the system to a standstill
 
 We have other kinds of files, but off-hand I'm not thinking of any
 that are really very interesting, apart from those.
 
 Maybe it'll be useful to have hints that say "always write this file
 to disk as quick as you can" and "always postpone writing this file to
 disk for as long as you can" for WAL and temp files respectively.  But
 the rule for the data files, which are the really important case, is
 not so simple.  fsync() is actually a fine API except that it tends to
 destroy system throughput.  Maybe what we need is just for fsync() to
 be less aggressive, or a less aggressive version of it.  We wouldn't
 mind waiting an almost arbitrarily long time for fsync to complete if
 other processes could still get their I/O requests serviced in a
 reasonable amount of time in the meanwhile.
  As I wrote in some other email in this thread, using IO priorities for
data file checkpoint might be actually the right answer. They will work for
IO submitted by fsync(). The downside is that currently IO priorities / IO
scheduling classes work only with CFQ IO scheduler.

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-16 Thread Jan Kara
On Wed 15-01-14 21:37:16, Robert Haas wrote:
 On Wed, Jan 15, 2014 at 8:41 PM, Jan Kara j...@suse.cz wrote:
  On Wed 15-01-14 10:12:38, Robert Haas wrote:
  On Wed, Jan 15, 2014 at 4:35 AM, Jan Kara j...@suse.cz wrote:
   Filesystems could in theory provide facility like atomic write (at least up
   to a certain size say in MB range) but it's not so easy and when there are
   no strong usecases fs people are reluctant to make their code more complex
   unnecessarily. OTOH without widespread atomic write support I understand
   application developers have similar stance. So it's kind of chicken and egg
   problem. BTW, e.g. ext3/4 has quite a bit of the infrastructure in place
   due to its data=journal mode so if someone on the PostgreSQL side wanted to
   research on this, knitting some experimental ext4 patches should be doable.
 
  Atomic 8kB writes would improve performance for us quite a lot.  Full
  page writes to WAL are very expensive.  I don't remember what
  percentage of write-ahead log traffic that accounts for, but it's not
  small.
OK, and do you need atomic writes on a per-IO basis, or is per-file enough?
  It basically boils down to: is all or most of the IO to a file going to be
  atomic, or is it a smaller fraction?
 
 The write-ahead log wouldn't need it, but data files writes would.  So
 we'd need it a lot, but not for absolutely everything.
 
 For any given file, we'd either care about writes being atomic, or we
 wouldn't.
  OK, when you say that either all writes to a file should be atomic or
none of them should be, then can you try the following:
chattr +j file

  will turn on data journalling for the file on an ext3/ext4 filesystem.
Currently it *won't* guarantee atomicity in all cases but the performance
will be very similar to what it would be if it did. You might also want to
increase the filesystem journal size with 'tune2fs -J size=XXX /dev/yyy',
where XXX is the desired journal size in MB. The default is 128 MB I think,
but with intensive data journalling you might want to have that in the GB
range. I'd be interested in hearing what impact turning on 'atomic write'
support in PostgreSQL and using data journalling on ext4 has.
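Collected as commands (these require root; the file path and device below are placeholders, and tune2fs -J must be run on an unmounted filesystem):

```shell
# Per-file data journalling on ext3/ext4, as Jan suggests:
chattr +j /var/lib/pgsql/data/base/16384/12345
lsattr /var/lib/pgsql/data/base/16384/12345     # verify the 'j' flag is set

# Grow the filesystem journal to 1 GB for intensive data journalling
# (unmount the filesystem first):
tune2fs -J size=1024 /dev/sdb1
```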

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-16 Thread Jan Kara
On Wed 15-01-14 10:12:38, Robert Haas wrote:
 On Wed, Jan 15, 2014 at 4:35 AM, Jan Kara j...@suse.cz wrote:
  Filesystems could in theory provide facility like atomic write (at least up
  to a certain size say in MB range) but it's not so easy and when there are
  no strong usecases fs people are reluctant to make their code more complex
  unnecessarily. OTOH without widespread atomic write support I understand
  application developers have similar stance. So it's kind of chicken and egg
  problem. BTW, e.g. ext3/4 has quite a bit of the infrastructure in place
  due to its data=journal mode so if someone on the PostgreSQL side wanted to
  research on this, knitting some experimental ext4 patches should be doable.
 
 Atomic 8kB writes would improve performance for us quite a lot.  Full
 page writes to WAL are very expensive.  I don't remember what
 percentage of write-ahead log traffic that accounts for, but it's not
 small.
  OK, and do you need atomic writes on a per-IO basis, or is per-file enough?
It basically boils down to: is all or most of the IO to a file going to be
atomic, or is it a smaller fraction?

As Dave notes, unless there is HW support (which is coming with newest
solid state drives), ext4/xfs will have to implement this by writing data
to a filesystem journal and after transaction commit checkpointing them to
a final location. Which is exactly what you do with your WAL logs so
it's not clear it will be a performance win. But it is easy enough to code
for ext4 that I'm willing to try...

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Jan Kara
On Wed 15-01-14 10:27:26, Heikki Linnakangas wrote:
 On 01/15/2014 06:01 AM, Jim Nasby wrote:
 For the sake of completeness... it's theoretically silly that Postgres
 is doing all this stuff with WAL when the filesystem is doing something
 very similar with it's journal. And an SSD drive (and next generation
 spinning rust) is doing the same thing *again* in it's own journal.
 
 If all 3 communities (or even just 2 of them!) could agree on the
 necessary interface a tremendous amount of this duplicated technology
 could be eliminated.
 
 That said, I rather doubt the Postgres community would go this route,
 not so much because of the presumably massive changes needed, but more
 because our community is not a fan of restricting our users to things
 like "Thou shalt use a journaled FS or risk all thy data!"
 
 The WAL is also used for continuous archiving and replication, not
 just crash recovery. We could skip full-page-writes, though, if we
 knew that the underlying filesystem/storage is guaranteeing that a
 write() is atomic.
 
 It might be useful for PostgreSQL somehow tell the filesystem that
 we're taking care of WAL-logging, so that the filesystem doesn't
 need to.
  Well, journalling fs generally cares about its metadata consistency. We
have much weaker guarantees regarding file data because those guarantees
come at a cost most people don't want to pay.

Filesystems could in theory provide facility like atomic write (at least up
to a certain size say in MB range) but it's not so easy and when there are
no strong usecases fs people are reluctant to make their code more complex
unnecessarily. OTOH without widespread atomic write support I understand
application developers have similar stance. So it's kind of chicken and egg
problem. BTW, e.g. ext3/4 has quite a bit of the infrastructure in place
due to its data=journal mode so if someone on the PostgreSQL side wanted to
research on this, knitting some experimental ext4 patches should be doable.

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Jan Kara
On Wed 15-01-14 12:16:50, Hannu Krosing wrote:
 On 01/14/2014 06:12 PM, Robert Haas wrote:
  This would be pretty similar to copy-on-write, except
  without the copying. It would just be
  forget-from-the-buffer-pool-on-write. 
 
 +1
 
 A version of this could probably already be implemented using MADV_DONTNEED
 and MADV_WILLNEED
 
 That is, just after reading the page in, use MADV_DONTNEED on it. When
 evicting a clean page, check that it is still in cache and if it is, then
 MADV_WILLNEED it.
 
 Another nice thing to do would be dynamically adjusting kernel
 dirty_background_ratio
 and other related knobs in real time based on how many buffers are dirty
 inside postgresql.
 Maybe in background writer.
 
 Question to LKM folks - will kernel react well to frequent changes to
 /proc/sys/vm/dirty_*  ?
 How frequent can they be (every few seconds? every second? 100 Hz?)
  So the question is what do you mean by 'react'. We check whether we
should start background writeback every dirty_writeback_centisecs (5s). We
will also check whether we didn't exceed the background dirty limit (and
wake writeback thread) when dirtying pages. However this check happens once
per several dirtied MB (unless we are close to dirty_bytes).

When writeback is running we check roughly once per second (the logic is
more complex there but I don't think explaining details would be useful
here) whether we are below dirty_background_bytes and stop writeback in
that case.

So changing dirty_background_bytes every few seconds should work
reasonably, once a second is pushing it and 100 Hz - no way. But I'd also
note that you have conflicting requirements on the kernel writeback. On one
hand you want checkpoint data to steadily trickle to disk (well, trickle
isn't exactly the proper word since if you need to checkpoint 16 GB every 5
minutes then you need a steady throughput of ~50 MB/s just for
checkpointing) so you want to set dirty_background_bytes low, on the other
hand you don't want temporary files to get to disk so you want to set
dirty_background_bytes high. And also that changes of
dirty_background_bytes probably will not take into account other events
happening on the system (maybe a DB backup is running...). So I'm somewhat
skeptical you will be able to tune dirty_background_bytes frequently in a
useful way.
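The knobs in question can be inspected by anyone and written by root; the write below is a hypothetical checkpoint-time adjustment of the kind discussed (at most every few seconds, per Jan's numbers), not a recommended value:

```shell
# 0 here means the companion *_ratio knob is in effect instead
cat /proc/sys/vm/dirty_background_bytes
cat /proc/sys/vm/dirty_bytes
cat /proc/sys/vm/dirty_writeback_centisecs   # periodic writeback interval (500 = 5 s)

# As root, e.g. lower the background threshold to 256 MB before a checkpoint:
echo $((256 * 1024 * 1024)) > /proc/sys/vm/dirty_background_bytes
```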

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Jan Kara
On Wed 15-01-14 14:38:44, Hannu Krosing wrote:
 On 01/15/2014 02:01 PM, Jan Kara wrote:
  On Wed 15-01-14 12:16:50, Hannu Krosing wrote:
  On 01/14/2014 06:12 PM, Robert Haas wrote:
  This would be pretty similar to copy-on-write, except
  without the copying. It would just be
  forget-from-the-buffer-pool-on-write. 
  +1
 
  A version of this could probably already be implemented using MADV_DONTNEED
  and MADV_WILLNEED
 
  That is, just after reading the page in, use MADV_DONTNEED on it. When
  evicting a clean page, check that it is still in cache and if it is, then
  MADV_WILLNEED it.
 
  Another nice thing to do would be dynamically adjusting kernel
  dirty_background_ratio
  and other related knobs in real time based on how many buffers are dirty
  inside postgresql.
  Maybe in background writer.
 
  Question to LKM folks - will kernel react well to frequent changes to
  /proc/sys/vm/dirty_*  ?
  How frequent can they be (every few seconds? every second? 100 Hz?)
So the question is what do you mean by 'react'. We check whether we
  should start background writeback every dirty_writeback_centisecs (5s). We
  will also check whether we didn't exceed the background dirty limit (and
  wake writeback thread) when dirtying pages. However this check happens once
  per several dirtied MB (unless we are close to dirty_bytes).
 
  When writeback is running we check roughly once per second (the logic is
  more complex there but I don't think explaining details would be useful
  here) whether we are below dirty_background_bytes and stop writeback in
  that case.
 
  So changing dirty_background_bytes every few seconds should work
  reasonably, once a second is pushing it and 100 Hz - no way. But I'd also
  note that you have conflicting requirements on the kernel writeback. On one
  hand you want checkpoint data to steadily trickle to disk (well, trickle
  isn't exactly the proper word since if you need to checkpoint 16 GB every 5
  minutes then you need a steady throughput of ~50 MB/s just for
  checkpointing) so you want to set dirty_background_bytes low, on the other
  hand you don't want temporary files to get to disk so you want to set
  dirty_background_bytes high. 
 Is it possible to have more fine-grained control over writeback, like
 configuring dirty_background_bytes per file system / device (or even
 a file or a group of files) ?
  Currently it isn't possible to tune dirty_background_bytes per device
directly. However see below.

 If not, then how hard would it be to provide this ?
  We do track amount of dirty pages per device and the thread doing the
flushing is also per device. The thing is that currently we compute the
per-device background limit as dirty_background_bytes * p, where p is a
proportion of writeback happening on this device to total writeback in the
system (computed as floating average with exponential time-based backoff).
BTW, similarly maximum per-device dirty limit is derived from global
dirty_bytes in the same way. And you can also set bounds on the proportion
'p' in /sys/block/sda/bdi/{min,max}_ratio so in theory you should be able
to set fixed background limit for a device by setting matching min and max
proportions.
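In command form (root required; "sda" is a placeholder device, and the values are percentages of the global dirty limits, not bytes):

```shell
# Pin the device's proportion 'p': with min_ratio == max_ratio the
# per-device background/dirty limits become effectively fixed.
echo 10 > /sys/block/sda/bdi/min_ratio
echo 10 > /sys/block/sda/bdi/max_ratio
cat /sys/block/sda/bdi/min_ratio /sys/block/sda/bdi/max_ratio
```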

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Jan Kara
On Tue 14-01-14 09:08:40, Hannu Krosing wrote:
  Effectively you end up with buffered read/write that's also mapped into
  the page cache.  It's a pretty awful way to hack around mmap.
  Well, the problem is that you can't really use mmap() for the things we
  do. Postgres' durability works by guaranteeing that our journal entries
  (called WAL := Write Ahead Log) are written  synced to disk before the
  corresponding entries of tables and indexes reach the disk. That also
  allows to group together many random-writes into a few contiguous writes
  fdatasync()ed at once. Only during a checkpointing phase the big bulk of
  the data is then (slowly, in the background) synced to disk.
  Which is the exact algorithm most journalling filesystems use for
  ensuring durability of their metadata updates.  Indeed, here's an
  interesting piece of architecture that you might like to consider:
 
  * Neither XFS nor BTRFS uses the kernel page cache to back their
metadata transaction engines.
 But file system code is supposed to know much more about the
 underlying disk than a mere application program like postgresql.
 
 We do not want to start duplicating OS if we can avoid it.
 
 What we would like is to have a way to tell the kernel
 
 1) here is the modified copy of file page, it is now safe to write
 it back - the current 'lazy' write
 
 2) here is the page, write it back now, before returning success
 to me - unbuffered write or write + sync
 
 but we also would like to have
 
 3) here is the page as it is currently on disk, I may need it soon,
 so keep it together with your other clean pages accessed at time X
 - this is the non-dirtying write discussed

 the page may be in buffer cache, in which case just update its LRU
 position (to either current time or time provided by postgresql), or
 it may not be there, in which case put it there if reasonable by its
 LRU position.
 
 And we would like all this to work together with other current linux
 kernel goodness of managing the whole disk-side interaction of
 efficient reading and writing and managing the buffers :)
  So when I was speaking about the proposed vrange() syscall in this thread,
I thought that instead of injecting pages into pagecache for aging as you
describe in 3), you would mark pages as volatile (i.e. for reclaim by
kernel) through vrange() syscall. Next time you need the page, you check
whether the kernel reclaimed the page or not. If yes, you reload it from
disk, if not, you unmark it and use it.

Now the aging of pages marked as volatile as it is currently implemented
needn't be perfect for your needs but you still have time to influence what
gets implemented... Actually developers of the vrange() syscall were
specifically looking for some ideas what to base aging on. Currently I
think it is first marked - first evicted.

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Jan Kara
 work
reasonably reliably but it is a tough problem, lots of complexity for not so
great gain...).

However, if you really issue the IO from the thread with low priority, it
will have low priority. So specifically if you call fsync() from a thread
with low IO priority, the flushing done by fsync() will have this low
IO priority.

Similarly if you called sync_file_range() once in a while from a thread
with low IO priority, the flushing IO will have low IO priority.  But I
would be really careful about the periodic sync_file_range() calls - it has
a potential of mixing with writeback from flusher thread and mixing these
two on different parts of a file can lead to bad IO patterns...

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Jan Kara
On Tue 14-01-14 06:42:43, Kevin Grittner wrote:
 First off, I want to give a +1 on everything in the recent posts
 from Heikki and Hannu.
 
 Jan Kara j...@suse.cz wrote:
 
  Now the aging of pages marked as volatile as it is currently
  implemented needn't be perfect for your needs but you still have
  time to influence what gets implemented... Actually developers of
  the vrange() syscall were specifically looking for some ideas
  what to base aging on. Currently I think it is first marked -
  first evicted.
 
 The first marked - first evicted seems like what we would want. 
 The ability to unmark and have the page no longer be considered
 preferred for eviction would be very nice.  That seems to me like
 it would cover the multiple layers of buffering *clean* pages very
 nicely (although I know nothing more about vrange() than what has
 been said on this thread, so I could be missing something).
  Here:
http://www.spinics.net/lists/linux-mm/msg67328.html
  is an email which introduces the syscall. As you say, it might be a
reasonable fit for your problems with double caching of clean pages.

 The other side of that is related avoiding multiple writes of the
 same page as much as possible, while avoid write gluts.  The issue
 here is that PostgreSQL tries to hang on to dirty pages for as long
 as possible before writing them to the OS cache, while the OS
 tries to avoid writing them to storage for as long as possible
 until they reach a (configurable) threshold or are fsync'd.  The
 problem is that a under various conditions PostgreSQL may need to
 write and fsync a lot of dirty pages it has accumulated in a short
 time.  That has an avalanche effect, creating a write glut
 which can stall all I/O for a period of many seconds up to a few
 minutes.  If the OS was aware of the dirty pages pending write in
 the application, and counted those for purposes of calculating when
 and how much to write, the glut could be avoided.  Currently,
 people configure the PostgreSQL background writer to be very
 aggressive, configure a small PostgreSQL shared_buffers setting,
 and/or set the OS thresholds low enough to minimize the problem;
 but all of these mitigation strategies have their own costs.
 
 A new hint that the application has dirtied a page could be used by
 the OS to improve things this way:  When the OS is notified that a
 page is dirty, it takes action depending on whether the page is
 considered dirty by the OS.  If it is not dirty, the page is
 immediately discarded from the OS cache.  It is known that the
 application has a modified version of the page that it intends to
 write, so the version in the OS cache has no value.  We don't want
 this page forcing eviction of vrange()-flagged pages.  If it is
 dirty, any write ordering to storage by the OS based on when the
 page was written to the OS would be pushed back as far as possible
 without crossing any write barriers, in hopes that the writes could
 be combined.  Either way, this page is counted toward dirty pages
 for purposes of calculating how much to write from the OS to
 storage, and the later write of the page doesn't redundantly add to
 this number.
  The "evict if clean" part is easy. That could easily be a new fadvise()
option - btw. note that POSIX_FADV_DONTNEED has quite close meaning. Only
that it also starts writeback on a dirty page if backing device isn't
congested. Which is somewhat contrary to what you want to achieve. But I'm
not sure the eviction would be a clear win since filesystem then has to
re-create the mapping from logical file block to disk block (it is cached
in the page) and that potentially needs to go to disk to fetch the mapping
data.

I have a hard time thinking how we would implement pushing back writeback
of a particular page (or better set of pages). When we need to write pages
because we are nearing dirty_bytes limit, we likely want to write these
marked pages anyway to make as many pages freeable as possible. So the only
thing we could do is to ignore these pages during periodic writeback and
I'm not sure that would make a big difference.

Just to get some idea about the sizes - how large are the checkpoints we
are talking about that cause IO stalls?

Honza

-- 
Jan Kara j...@suse.cz
SUSE Labs, CR




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Jan Kara
On Tue 14-01-14 10:04:16, Robert Haas wrote:
 On Tue, Jan 14, 2014 at 5:00 AM, Jan Kara j...@suse.cz wrote:
  I thought that instead of injecting pages into pagecache for aging as you
  describe in 3), you would mark pages as volatile (i.e. for reclaim by
  kernel) through vrange() syscall. Next time you need the page, you check
  whether the kernel reclaimed the page or not. If yes, you reload it from
  disk, if not, you unmark it and use it.
 
  Now the aging of pages marked as volatile as it is currently implemented
  needn't be perfect for your needs but you still have time to influence what
  gets implemented... Actually developers of the vrange() syscall were
  specifically looking for some ideas what to base aging on. Currently I
  think it is first marked - first evicted.
 
 This is an interesting idea but it stinks of impracticality.
 Essentially when the last buffer pin on a page is dropped we'd have to
 mark it as discardable, and then the next person wanting to pin it
 would have to check whether it's still there.  But the system call
 overhead of calling vrange() every time the last pin on a page was
 dropped would probably hose us.
 
 *thinks*
 
 Well, I guess it could be done lazily: make periodic sweeps through
 shared_buffers, looking for pages that haven't been touched in a
 while, and vrange() them.  That's quite a bit of new mechanism, but in
 theory it could work out to a win.  vrange() would have to scale well
 to millions of separate ranges, though.  Will it?
  It is intended to be rather lightweight so I believe millions should be
OK. But I didn't try :).

 And a lot depends on whether the kernel makes the right decision about
 whether to discard data from our vrange() vs. any other page it could have
 reclaimed.
  I think the intent is to reclaim pages in the following order:
used-once pages -> volatile pages -> active pages, swapping

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Jan Kara
On Mon 13-01-14 22:26:45, Mel Gorman wrote:
 The flipside is also meant to hold true. If you know data will be needed
 in the near future then posix_fadvise(POSIX_FADV_WILLNEED). Glancing at
 the implementation it does a forced read-ahead on the range of pages of
 interest. It doesn't look like it would block.
  That's not quite true. POSIX_FADV_WILLNEED still needs to map logical
file offsets to physical disk blocks and create IO requests. This happens
synchronously. So if your disk is congested and relevant metadata is out of
cache, or we simply run out of free IO requests, POSIX_FADV_WILLNEED can
block for a significant amount of time.

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Jan Kara
On Mon 13-01-14 22:36:06, Mel Gorman wrote:
 On Mon, Jan 13, 2014 at 06:27:03PM -0200, Claudio Freire wrote:
  On Mon, Jan 13, 2014 at 5:23 PM, Jim Nasby j...@nasby.net wrote:
   On 1/13/14, 2:19 PM, Claudio Freire wrote:
  
   On Mon, Jan 13, 2014 at 5:15 PM, Robert Haas robertmh...@gmail.com
   wrote:
  
   On a related note, there's also the problem of double-buffering.  When
   we read a page into shared_buffers, we leave a copy behind in the OS
   buffers, and similarly on write-out.  It's very unclear what to do
   about this, since the kernel and PostgreSQL don't have intimate
   knowledge of what each other are doing, but it would be nice to solve
   somehow.
  
  
  
   There you have a much harder algorithmic problem.
  
   You can basically control duplication with fadvise and WONTNEED. The
   problem here is not the kernel and whether or not it allows postgres
   to be smart about it. The problem is... what kind of smarts
   (algorithm) to use.
  
  
    Isn't this a fairly simple matter of when we read a page into shared
    buffers tell the kernel to forget that page? And a corollary to that for
    when we dump a page out of shared_buffers (here kernel, please put this
    back into your cache).
  
  
  That's my point. In terms of kernel-postgres interaction, it's fairly 
  simple.
  
  What's not so simple, is figuring out what policy to use. Remember,
  you cannot tell the kernel to put some page in its page cache without
  reading it or writing it. So, once you make the kernel forget a page,
  evicting it from shared buffers becomes quite expensive.
 
 posix_fadvise(POSIX_FADV_WILLNEED) is meant to cover this case by
 forcing readahead. If you evict it prematurely then you do get kinda
 screwed because you pay the IO cost to read it back in again even if you
 had enough memory to cache it. Maybe this is the type of kernel-postgres
 interaction that is annoying you.
 
 If you don't evict, the kernel eventually steps in and evicts the wrong
 thing. If you do evict and it was unnecessarily you pay an IO cost.
 
 That could be something we look at. There are cases buried deep in the
 VM where pages get shuffled to the end of the LRU and get tagged for
 reclaim as soon as possible. Maybe you need access to something like
 that via posix_fadvise to say reclaim this page if you need memory but
 leave it resident if there is no memory pressure or something similar.
 Not exactly sure what that interface would look like or offhand how it
 could be reliably implemented.
  Well, the kernel-managed user space cache the postgres guys talk about
looks pretty much like what the volatile range patches are trying to
achieve.

Note to postgres guys: I think you should have a look at the proposed
'vrange' system call. The latest posting is here:
http://www.spinics.net/lists/linux-mm/msg67328.html. It contains a rather
detailed description of the feature. And if the feature looks good to you,
you can add your 'me too'. Plus, if anyone would be willing to try that out
with postgres, that would be most welcome (although I understand you might
not want to burn your time on an experimental kernel feature).

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR

