subject:"\[HACKERS\] Linux kernel impact on PostgreSQL performance"

On Wed, Jan 15, 2014 at 8:41 PM, Jan Kara  wrote:
> On Wed 15-01-14 10:12:38, Robert Haas wrote:
>> On Wed, Jan 15, 2014 at 4:35 AM, Jan Kara  wrote:
>> > Filesystems could in theory provide facility like atomic write (at least up
>> > to a certain size say in MB range) but it's not so easy and when there are
>> > no strong usecases fs people are reluctant to make their code more complex
>> > unnecessarily. OTOH without widespread atomic write support I understand
>> > application developers have similar stance. So it's kind of chicken and egg
>> > problem. BTW, e.g. ext3/4 has quite a bit of the infrastructure in place
>> > due to its data=journal mode so if someone on the PostgreSQL side wanted to
>> > research on this, knitting some experimental ext4 patches should be doable.
>>
>> Atomic 8kB writes would improve performance for us quite a lot.  Full
>> page writes to WAL are very expensive.  I don't remember what
>> percentage of write-ahead log traffic that accounts for, but it's not
>> small.
>   OK, and do you need atomic writes on per-IO basis or per-file is enough?
> It basically boils down to - is all or most of IO to a file going to be
> atomic or it's a smaller fraction?

The write-ahead log wouldn't need it, but data files writes would.  So
we'd need it a lot, but not for absolutely everything.

For any given file, we'd either care about writes being atomic, or we wouldn't.

> As Dave notes, unless there is HW support (which is coming with newest
> solid state drives), ext4/xfs will have to implement this by writing data
> to a filesystem journal and after transaction commit checkpointing them to
> a final location. Which is exactly what you do with your WAL logs so
> it's not clear it will be a performance win. But it is easy enough to code
> for ext4 that I'm willing to try...

Yeah, hardware support would be great.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

On Wed, Jan 15, 2014 at 07:08:18PM -0500, Tom Lane wrote:
> Dave Chinner  writes:
> > On Wed, Jan 15, 2014 at 10:12:38AM -0500, Tom Lane wrote:
> >> What we'd really like for checkpointing is to hand the kernel a boatload
> >> (several GB) of dirty pages and say "how about you push all this to disk
> >> over the next few minutes, in whatever way seems optimal given the storage
> >> hardware and system situation.  Let us know when you're done."
> 
> > The issue there is that the kernel has other triggers for needing to
> > clean data. We have no infrastructure to handle variable writeback
> > deadlines at the moment, nor do we have any infrastructure to do
> > roughly metered writeback of such files to disk. I think we could
> > add it to the infrastructure without too much perturbation of the
> > code, but as you've pointed out that still leaves the fact there's
> > no obvious interface to configure such behaviour. Would it need to
> > be persistent?
> 
> No, we'd be happy to re-request it during each checkpoint cycle, as
> long as that wasn't an unduly expensive call to make.  I'm not quite
> sure where such requests ought to "live" though.  One idea is to tie
> them to file descriptors; but the data to be written might be spread
> across more files than we really want to keep open at one time.

It would be a property of the inode, as that is how writeback is
tracked and timed. Set and queried through a file descriptor,
though - it's basically the same context that fadvise works
through.

> But the only other idea that comes to mind is some kind of global sysctl,
> which would probably have security and permissions issues.  (One thing
> that hasn't been mentioned yet in this thread, but maybe is worth pointing
> out now, is that Postgres does not run as root, and definitely doesn't
> want to.  So we don't want a knob that would require root permissions
> to twiddle.)

I have assumed all along that requiring root to do stuff would be a
bad thing. :)

> We could probably live with serially checkpointing data
> in sets of however-many-files-we-can-have-open, if file descriptors are
> the place to keep the requests.

Inodes live longer than file descriptors, but there's no guarantee
that they live from one fd context to another. Hence my question
about persistence ;)

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

On Wed, Jan 15, 2014 at 10:12:38AM -0500, Robert Haas wrote:
> On Wed, Jan 15, 2014 at 4:35 AM, Jan Kara  wrote:
> > Filesystems could in theory provide facility like atomic write (at least up
> > to a certain size say in MB range) but it's not so easy and when there are
> > no strong usecases fs people are reluctant to make their code more complex
> > unnecessarily. OTOH without widespread atomic write support I understand
> > application developers have similar stance. So it's kind of chicken and egg
> > problem. BTW, e.g. ext3/4 has quite a bit of the infrastructure in place
> > due to its data=journal mode so if someone on the PostgreSQL side wanted to
> > research on this, knitting some experimental ext4 patches should be doable.
> 
> Atomic 8kB writes would improve performance for us quite a lot.  Full
> page writes to WAL are very expensive.  I don't remember what
> percentage of write-ahead log traffic that accounts for, but it's not
> small.

Essentially, the "atomic writes" will essentially be journalled data
so initially there is not going to be any different in performance
between journalling the data in userspace and journalling it in the
filesystem journal. Indeed, it could be worse because the filesystem
journal is typically much smaller than a database WAL file, and it
will flush much more frequently and without the database having any
say in when that occurs.

AFAICT, we're stuck with sucky WAL until block layer and hardware
support atomic writes.

FWIW, I've certainly considered adding per-file data journalling
capabilities to XFS in the past. If we decide that this is the way
to proceed (i.e. as a stepping stone towards hardware atomic write
support), then I can go back to my notes from a few years ago and
see what still needs to be done to support it

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

On Wed, Jan 15, 2014 at 07:13:27PM -0500, Tom Lane wrote:
> Dave Chinner  writes:
> > On Wed, Jan 15, 2014 at 02:29:40PM -0800, Jeff Janes wrote:
> >> And most importantly, "Also, please don't freeze up everything else in the
> >> process"
> 
> > If you hand writeback off to the kernel, then writeback for memory
> > reclaim needs to take precedence over "metered writeback". If we are
> > low on memory, then cleaning dirty memory quickly to avoid ongoing
> > allocation stalls, failures and potentially OOM conditions is far more
> > important than anything else.
> 
> I think you're in violent agreement, actually.  Jeff's point is exactly
> that we'd rather the checkpoint deadline slid than that the system goes
> to hell in a handbasket for lack of I/O cycles.  Here "metered" really
> means "do it as a low-priority task".

No, I meant the opposite - in low memory situations, the system is
going to go to hell in a handbasket because we are going to cause a
writeback IO storm cleaning memory regardless of these IO
priorities. i.e. there is no way we'll let "low priority writeback
to avoid IO storms" cause OOM conditions to occur. That is, in OOM
conditions, cleaning dirty pages becomes one of the highest priority
tasks of the system

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

On Wed, Jan 15, 2014 at 02:29:40PM -0800, Jeff Janes wrote:
> On Wed, Jan 15, 2014 at 7:12 AM, Tom Lane  wrote:
> 
> > Heikki Linnakangas  writes:
> > > On 01/15/2014 07:50 AM, Dave Chinner wrote:
> > >> FWIW [and I know you're probably sick of hearing this by now], but
> > >> the blk-io throttling works almost perfectly with applications that
> > >> use direct IO.
> >
> > > For checkpoint writes, direct I/O actually would be reasonable.
> > > Bypassing the OS cache is a good thing in that case - we don't want the
> > > written pages to evict other pages from the OS cache, as we already have
> > > them in the PostgreSQL buffer cache.
> >
> > But in exchange for that, we'd have to deal with selecting an order to
> > write pages that's appropriate depending on the filesystem layout,
> > other things happening in the system, etc etc.  We don't want to build
> > an I/O scheduler, IMO, but we'd have to.
> >
> > > Writing one page at a time with O_DIRECT from a single process might be
> > > quite slow, so we'd probably need to use writev() or asynchronous I/O to
> > > work around that.
> >
> > Yeah, and if the system has multiple spindles, we'd need to be issuing
> > multiple O_DIRECT writes concurrently, no?
> >
> 
> writev effectively does do that, doesn't it?  But they do have to be on the
> same file handle, so that could be a problem.  I think we need something
> like sorted checkpoints sooner or later, anyway.

No, it doesn't. writev() allows you to supply multiple user buffers
for a single IO to fixed offset. If th efile is contiguous, then it
will be issued as a single IO. If you want concurrent DIO, then you
need to use multiple threads or AIO.

> > What we'd really like for checkpointing is to hand the kernel a boatload
> > (several GB) of dirty pages and say "how about you push all this to disk
> > over the next few minutes, in whatever way seems optimal given the storage
> > hardware and system situation.  Let us know when you're done."
> 
> And most importantly, "Also, please don't freeze up everything else in the
> process"

If you hand writeback off to the kernel, then writeback for memory
reclaim needs to take precedence over "metered writeback". If we are
low on memory, then cleaning dirty memory quickly to avoid ongoing
allocation stalls, failures and potentially OOM conditions is far more
important than anything else.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

On Wed, Jan 15, 2014 at 10:12:38AM -0500, Tom Lane wrote:
> Heikki Linnakangas  writes:
> > On 01/15/2014 07:50 AM, Dave Chinner wrote:
> >> FWIW [and I know you're probably sick of hearing this by now], but
> >> the blk-io throttling works almost perfectly with applications that
> >> use direct IO.
> 
> > For checkpoint writes, direct I/O actually would be reasonable. 
> > Bypassing the OS cache is a good thing in that case - we don't want the 
> > written pages to evict other pages from the OS cache, as we already have 
> > them in the PostgreSQL buffer cache.
> 
> But in exchange for that, we'd have to deal with selecting an order to
> write pages that's appropriate depending on the filesystem layout,
> other things happening in the system, etc etc.  We don't want to build
> an I/O scheduler, IMO, but we'd have to.

I don't see that as necessary - nobody else needs to do this with
direct IO. Indeed, if the application does ascending offset order
writeback from within a file, then it's replicating exactly what the
kernel page cache writeback does. If what the kernel does is good
enough for you, then I can't see how doing the same thing with
a background thread doing direct IO is going to need any special
help

> > Writing one page at a time with O_DIRECT from a single process might be 
> > quite slow, so we'd probably need to use writev() or asynchronous I/O to 
> > work around that.
> 
> Yeah, and if the system has multiple spindles, we'd need to be issuing
> multiple O_DIRECT writes concurrently, no?
> 
> What we'd really like for checkpointing is to hand the kernel a boatload
> (several GB) of dirty pages and say "how about you push all this to disk
> over the next few minutes, in whatever way seems optimal given the storage
> hardware and system situation.  Let us know when you're done."

The issue there is that the kernel has other triggers for needing to
clean data. We have no infrastructure to handle variable writeback
deadlines at the moment, nor do we have any infrastructure to do
roughly metered writeback of such files to disk. I think we could
add it to the infrastructure without too much perturbation of the
code, but as you've pointed out that still leaves the fact there's
no obvious interface to configure such behaviour. Would it need to
be persistent?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

Robert Haas  writes:
> I don't see that as a problem.  What we're struggling with today is
> that, until we fsync(), the system is too lazy about writing back
> dirty pages.  And then when we fsync(), it becomes very aggressive and
> system-wide throughput goes into the tank.  What we're aiming to do
> here is get is to start the writeback sooner than it would otherwise
> start so that it is spread out over a longer period of time.

Yeah.  It's sounding more and more like the right semantics are to
give the kernel a hint that we're going to fsync these files later,
so it ought to get on with writing them anytime the disk has nothing
better to do.  I'm not sure if there's value in being specific about
how much later; that would probably depend on details of the scheduler
that I don't know.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

Dave Chinner  writes:
> On Wed, Jan 15, 2014 at 07:08:18PM -0500, Tom Lane wrote:
>> No, we'd be happy to re-request it during each checkpoint cycle, as
>> long as that wasn't an unduly expensive call to make.  I'm not quite
>> sure where such requests ought to "live" though.  One idea is to tie
>> them to file descriptors; but the data to be written might be spread
>> across more files than we really want to keep open at one time.

> It would be a property of the inode, as that is how writeback is
> tracked and timed. Set and queried through a file descriptor,
> though - it's basically the same context that fadvise works
> through.

Ah, got it.  That would be fine on our end, I think.

>> We could probably live with serially checkpointing data
>> in sets of however-many-files-we-can-have-open, if file descriptors are
>> the place to keep the requests.

> Inodes live longer than file descriptors, but there's no guarantee
> that they live from one fd context to another. Hence my question
> about persistence ;)

I plead ignorance about what an "fd context" is.  However, if what you're
saying is that there's a small chance of the kernel forgetting the request
during normal system operation, I think we could probably tolerate that,
if the API is designed so that we ultimately do an fsync on the file
anyway.  The point of the hint would be to try to ensure that the later
fsync had little to do.  If sometimes it didn't work, well, that's life.
We're ahead of the game as long as it usually works.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

On Wed, Jan 15, 2014 at 7:22 PM, Dave Chinner  wrote:
> No, I meant the opposite - in low memory situations, the system is
> going to go to hell in a handbasket because we are going to cause a
> writeback IO storm cleaning memory regardless of these IO
> priorities. i.e. there is no way we'll let "low priority writeback
> to avoid IO storms" cause OOM conditions to occur. That is, in OOM
> conditions, cleaning dirty pages becomes one of the highest priority
> tasks of the system

I don't see that as a problem.  What we're struggling with today is
that, until we fsync(), the system is too lazy about writing back
dirty pages.  And then when we fsync(), it becomes very aggressive and
system-wide throughput goes into the tank.  What we're aiming to do
here is get is to start the writeback sooner than it would otherwise
start so that it is spread out over a longer period of time.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Jim Nasby


On 1/15/14, 12:00 AM, Claudio Freire wrote:

My completely unproven theory is that swapping is overwhelmed by
near-misses. Ie: a process touches a page, and before it's actually
swapped in, another process touches it too, blocking on the other
process' read. But the second process doesn't account for that page
when evaluating predictive models (ie: read-ahead), so the next I/O by
process 2 is unexpected to the kernel. Then the same with 1. Etc... In
essence, swap, by a fluke of its implementation, fails utterly to
predict the I/O pattern, and results in far sub-optimal reads.

Explicit I/O is free from that effect, all read calls are accountable,
and that makes a difference.

Maybe, if the kernel could be fixed in that respect, you could
consider mmap'd files as a suitable form of temporary storage. But
that would depend on the success and availability of such a fix/patch.


Another option is to consider some of the more "radical" ideas in this thread, 
but only for temporary data. Our write sequencing and other needs are far less stringent 
for this stuff.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

Dave Chinner  writes:
> On Wed, Jan 15, 2014 at 02:29:40PM -0800, Jeff Janes wrote:
>> And most importantly, "Also, please don't freeze up everything else in the
>> process"

> If you hand writeback off to the kernel, then writeback for memory
> reclaim needs to take precedence over "metered writeback". If we are
> low on memory, then cleaning dirty memory quickly to avoid ongoing
> allocation stalls, failures and potentially OOM conditions is far more
> important than anything else.

I think you're in violent agreement, actually.  Jeff's point is exactly
that we'd rather the checkpoint deadline slid than that the system goes
to hell in a handbasket for lack of I/O cycles.  Here "metered" really
means "do it as a low-priority task".

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

Dave Chinner  writes:
> On Wed, Jan 15, 2014 at 10:12:38AM -0500, Tom Lane wrote:
>> What we'd really like for checkpointing is to hand the kernel a boatload
>> (several GB) of dirty pages and say "how about you push all this to disk
>> over the next few minutes, in whatever way seems optimal given the storage
>> hardware and system situation.  Let us know when you're done."

> The issue there is that the kernel has other triggers for needing to
> clean data. We have no infrastructure to handle variable writeback
> deadlines at the moment, nor do we have any infrastructure to do
> roughly metered writeback of such files to disk. I think we could
> add it to the infrastructure without too much perturbation of the
> code, but as you've pointed out that still leaves the fact there's
> no obvious interface to configure such behaviour. Would it need to
> be persistent?

No, we'd be happy to re-request it during each checkpoint cycle, as
long as that wasn't an unduly expensive call to make.  I'm not quite
sure where such requests ought to "live" though.  One idea is to tie
them to file descriptors; but the data to be written might be spread
across more files than we really want to keep open at one time.
But the only other idea that comes to mind is some kind of global sysctl,
which would probably have security and permissions issues.  (One thing
that hasn't been mentioned yet in this thread, but maybe is worth pointing
out now, is that Postgres does not run as root, and definitely doesn't
want to.  So we don't want a knob that would require root permissions
to twiddle.)  We could probably live with serially checkpointing data
in sets of however-many-files-we-can-have-open, if file descriptors are
the place to keep the requests.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Jeff Janes

On Wed, Jan 15, 2014 at 7:12 AM, Tom Lane  wrote:

> Heikki Linnakangas  writes:
> > On 01/15/2014 07:50 AM, Dave Chinner wrote:
> >> FWIW [and I know you're probably sick of hearing this by now], but
> >> the blk-io throttling works almost perfectly with applications that
> >> use direct IO.
>
> > For checkpoint writes, direct I/O actually would be reasonable.
> > Bypassing the OS cache is a good thing in that case - we don't want the
> > written pages to evict other pages from the OS cache, as we already have
> > them in the PostgreSQL buffer cache.
>
> But in exchange for that, we'd have to deal with selecting an order to
> write pages that's appropriate depending on the filesystem layout,
> other things happening in the system, etc etc.  We don't want to build
> an I/O scheduler, IMO, but we'd have to.
>
> > Writing one page at a time with O_DIRECT from a single process might be
> > quite slow, so we'd probably need to use writev() or asynchronous I/O to
> > work around that.
>
> Yeah, and if the system has multiple spindles, we'd need to be issuing
> multiple O_DIRECT writes concurrently, no?
>

writev effectively does do that, doesn't it?  But they do have to be on the
same file handle, so that could be a problem.  I think we need something
like sorted checkpoints sooner or later, anyway.



> What we'd really like for checkpointing is to hand the kernel a boatload
> (several GB) of dirty pages and say "how about you push all this to disk
> over the next few minutes, in whatever way seems optimal given the storage
> hardware and system situation.  Let us know when you're done."


And most importantly, "Also, please don't freeze up everything else in the
process"

Cheers,

Jeff

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Jan Kara

On Wed 15-01-14 14:38:44, Hannu Krosing wrote:
> On 01/15/2014 02:01 PM, Jan Kara wrote:
> > On Wed 15-01-14 12:16:50, Hannu Krosing wrote:
> >> On 01/14/2014 06:12 PM, Robert Haas wrote:
> >>> This would be pretty similar to copy-on-write, except
> >>> without the copying. It would just be
> >>> forget-from-the-buffer-pool-on-write. 
> >> +1
> >>
> >> A version of this could probably already be implement using MADV_DONTNEED
> >> and MADV_WILLNEED
> >>
> >> Thet is, just after reading the page in, use MADV_DONTNEED on it. When
> >> evicting
> >> a clean page, check that it is still in cache and if it is, then
> >> MADV_WILLNEED it.
> >>
> >> Another nice thing to do would be dynamically adjusting kernel
> >> dirty_background_ratio
> >> and other related knobs in real time based on how many buffers are dirty
> >> inside postgresql.
> >> Maybe in background writer.
> >>
> >> Question to LKM folks - will kernel react well to frequent changes to
> >> /proc/sys/vm/dirty_*  ?
> >> How frequent can they be (every few second? every second? 100Hz ?)
> >   So the question is what do you mean by 'react'. We check whether we
> > should start background writeback every dirty_writeback_centisecs (5s). We
> > will also check whether we didn't exceed the background dirty limit (and
> > wake writeback thread) when dirtying pages. However this check happens once
> > per several dirtied MB (unless we are close to dirty_bytes).
> >
> > When writeback is running we check roughly once per second (the logic is
> > more complex there but I don't think explaining details would be useful
> > here) whether we are below dirty_background_bytes and stop writeback in
> > that case.
> >
> > So changing dirty_background_bytes every few seconds should work
> > reasonably, once a second is pushing it and 100 Hz - no way. But I'd also
> > note that you have conflicting requirements on the kernel writeback. On one
> > hand you want checkpoint data to steadily trickle to disk (well, trickle
> > isn't exactly the proper word since if you need to checkpoing 16 GB every 5
> > minutes than you need a steady throughput of ~50 MB/s just for
> > checkpointing) so you want to set dirty_background_bytes low, on the other
> > hand you don't want temporary files to get to disk so you want to set
> > dirty_background_bytes high. 
> Is it possible to have more fine-grained control over writeback, like
> configuring dirty_background_bytes per file system / device (or even
> a file or a group of files) ?
  Currently it isn't possible to tune dirty_background_bytes per device
directly. However see below.

> If not, then how hard would it be to provide this ?
  We do track amount of dirty pages per device and the thread doing the
flushing is also per device. The thing is that currently we compute the
per-device background limit as dirty_background_bytes * p, where p is a
proportion of writeback happening on this device to total writeback in the
system (computed as floating average with exponential time-based backoff).
BTW, similarly maximum per-device dirty limit is derived from global
dirty_bytes in the same way. And you can also set bounds on the proportion
'p' in /sys/block/sda/bdi/{min,max}_ratio so in theory you should be able
to set fixed background limit for a device by setting matching min and max
proportions.

Honza
-- 
Jan Kara 
SUSE Labs, CR


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Claudio Freire

On Wed, Jan 15, 2014 at 3:41 PM, Stephen Frost  wrote:
> * Claudio Freire (klaussfre...@gmail.com) wrote:
>> But, still, the implementation is very similar to what postgres needs:
>> sharing a physical page for two distinct logical pages, efficiently,
>> with efficient copy-on-write.
>
> Agreed, except that KSM seems like it'd be slow/lazy about it and I'm
> guessing there's a reason the pagecache isn't included normally..

KSM does an active de-duplication. That's slow. This would be
leveraging KSM structures in the kernel (page sharing) but without all
the de-duplication logic.

>
>> So it'd be just a matter of removing that limitation regarding page
>> cache and shared pages.
>
> Any idea why that limitation is there?

No, but I'm guessing it's because nobody bothered to implement the
required copy-on-write in the page cache, which would be a PITA to
write - think of all the complexities with privilege checks and
everything - even though the benefits for many kinds of applications
would be important.

>> If you asked me, I'd implement it as copy-on-write on the page cache
>> (not the user page). That ought to be low-overhead.
>
> Not entirely sure I'm following this- if it's a shared page, it doesn't
> matter who starts writing to it, as soon as that happens, it need to get
> copied.  Perhaps you mean that the application should keep the
> "original" and that the page-cache should get the "copy" (or, really,
> perhaps just forget about the page existing at that point- we won't want
> it again...).
>
> Would that be a way to go, perhaps?  This does go back to the "make it
> act like mmap, but not *be* mmap", but the idea would be:
> open(..., O_ZEROCOPY_READ)
> read() - Goes to PG's shared buffers, pagecache and PG share the page
> page fault (PG writes to it) - pagecache forgets about the page
> write() / fsync() - operate as normal

Yep.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Stephen Frost

* Claudio Freire (klaussfre...@gmail.com) wrote:
> But, still, the implementation is very similar to what postgres needs:
> sharing a physical page for two distinct logical pages, efficiently,
> with efficient copy-on-write.

Agreed, except that KSM seems like it'd be slow/lazy about it and I'm
guessing there's a reason the pagecache isn't included normally..

> So it'd be just a matter of removing that limitation regarding page
> cache and shared pages.

Any idea why that limitation is there?

> If you asked me, I'd implement it as copy-on-write on the page cache
> (not the user page). That ought to be low-overhead.

Not entirely sure I'm following this- if it's a shared page, it doesn't
matter who starts writing to it, as soon as that happens, it need to get
copied.  Perhaps you mean that the application should keep the
"original" and that the page-cache should get the "copy" (or, really,
perhaps just forget about the page existing at that point- we won't want
it again...).

Would that be a way to go, perhaps?  This does go back to the "make it
act like mmap, but not *be* mmap", but the idea would be:

open(..., O_ZEROCOPY_READ)
read() - Goes to PG's shared buffers, pagecache and PG share the page
page fault (PG writes to it) - pagecache forgets about the page
write() / fsync() - operate as normal

The differences here from O_DIRECT are that the pagecache will keep the
page while clean (absolutely valuable from PG's perspective- we might
have to evict the page from shared buffers sooner than the kernel does),
and the write()'s happen at the kernel's pace, allowing for
write-combining, etc, until an fsync() happens, of course.

This isn't the "big win" of dealing with I/O issues during checkpoints
that we'd like to see, but it certainly feels like it'd be an
improvement over the current double-buffering situation at least.

Thanks,

Stephen

signature.asc
Description: Digital signature

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Claudio Freire

On Wed, Jan 15, 2014 at 1:35 PM, Stephen Frost  wrote:
>> And there's a nice bingo. Had forgotten about KSM. KSM could help lots.
>>
>> I could try to see of madvising shared_buffers as mergeable helps. But
>> this should be an automatic case of KSM - ie, when reading into a
>> page-aligned address, the kernel should summarily apply KSM-style
>> sharing without hinting. The current madvise interface puts the burden
>> of figuring out what duplicates what on the kernel, but postgres
>> already knows.
>
> I'm certainly curious as to if KSM could help here, but on Ubuntu 12.04
> with 3.5.0-23-generic, it's not doing anything with just PG running.
> The page here: http://www.linux-kvm.org/page/KSM seems to indicate why:
>
> 
> KSM is a memory-saving de-duplication feature, that merges anonymous
> (private) pages (not pagecache ones).
> 
>
> Looks like it won't merge between pagecache and private/application
> memory?  Or is it just that we're not madvise()'ing the shared buffers
> region?  I'd be happy to test doing that, if there's a chance it'll
> actually work..


Yes, it's onlyl *intended* for merging private memory.

But, still, the implementation is very similar to what postgres needs:
sharing a physical page for two distinct logical pages, efficiently,
with efficient copy-on-write.

So it'd be just a matter of removing that limitation regarding page
cache and shared pages.

If you asked me, I'd implement it as copy-on-write on the page cache
(not the user page). That ought to be low-overhead.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Stephen Frost

* Claudio Freire (klaussfre...@gmail.com) wrote:
> Yes, that's basically zero-copy reads.
> 
> It could be done. The kernel can remap the page to the physical page
> holding the shared buffer and mark it read-only, then expire the
> buffer and transfer ownership of the page if any page fault happens.
> 
> But that incurrs:
>  - Page faults, lots
>  - Hugely bloated mappings, unless KSM is somehow leveraged for this

The page faults might be a problem but might be worth it.  Bloated
mappings sounds like a real issue though.

> And there's a nice bingo. Had forgotten about KSM. KSM could help lots.
> 
> I could try to see of madvising shared_buffers as mergeable helps. But
> this should be an automatic case of KSM - ie, when reading into a
> page-aligned address, the kernel should summarily apply KSM-style
> sharing without hinting. The current madvise interface puts the burden
> of figuring out what duplicates what on the kernel, but postgres
> already knows.

I'm certainly curious as to if KSM could help here, but on Ubuntu 12.04
with 3.5.0-23-generic, it's not doing anything with just PG running.
The page here: http://www.linux-kvm.org/page/KSM seems to indicate why:

KSM is a memory-saving de-duplication feature, that merges anonymous
(private) pages (not pagecache ones).

Looks like it won't merge between pagecache and private/application
memory?  Or is it just that we're not madvise()'ing the shared buffers
region?  I'd be happy to test doing that, if there's a chance it'll
actually work..

Thanks,

Stephen

signature.asc
Description: Digital signature

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

On Wed, Jan 15, 2014 at 4:35 AM, Jan Kara  wrote:
> Filesystems could in theory provide facility like atomic write (at least up
> to a certain size say in MB range) but it's not so easy and when there are
> no strong usecases fs people are reluctant to make their code more complex
> unnecessarily. OTOH without widespread atomic write support I understand
> application developers have similar stance. So it's kind of chicken and egg
> problem. BTW, e.g. ext3/4 has quite a bit of the infrastructure in place
> due to its data=journal mode so if someone on the PostgreSQL side wanted to
> research on this, knitting some experimental ext4 patches should be doable.

Atomic 8kB writes would improve performance for us quite a lot.  Full
page writes to WAL are very expensive.  I don't remember what
percentage of write-ahead log traffic that accounts for, but it's not
small.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

Heikki Linnakangas  writes:
> On 01/15/2014 07:50 AM, Dave Chinner wrote:
>> FWIW [and I know you're probably sick of hearing this by now], but
>> the blk-io throttling works almost perfectly with applications that
>> use direct IO.

> For checkpoint writes, direct I/O actually would be reasonable. 
> Bypassing the OS cache is a good thing in that case - we don't want the 
> written pages to evict other pages from the OS cache, as we already have 
> them in the PostgreSQL buffer cache.

But in exchange for that, we'd have to deal with selecting an order to
write pages that's appropriate depending on the filesystem layout,
other things happening in the system, etc etc.  We don't want to build
an I/O scheduler, IMO, but we'd have to.

> Writing one page at a time with O_DIRECT from a single process might be 
> quite slow, so we'd probably need to use writev() or asynchronous I/O to 
> work around that.

Yeah, and if the system has multiple spindles, we'd need to be issuing
multiple O_DIRECT writes concurrently, no?

What we'd really like for checkpointing is to hand the kernel a boatload
(several GB) of dirty pages and say "how about you push all this to disk
over the next few minutes, in whatever way seems optimal given the storage
hardware and system situation.  Let us know when you're done."  Right now,
because there's no way to negotiate such behavior, we're reduced to having
to dribble out the pages (in what's very likely a non-optimal order) and
hope that the kernel is neither too lazy nor too aggressive about cleaning
dirty pages in its caches.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Heikki Linnakangas


On 01/15/2014 07:50 AM, Dave Chinner wrote:

However, the first problem is dealing with the IO storm problem on
fsync. Then we can measure the effect of spreading those writes out
in time and determine what triggers read starvations (if they are
apparent). The we can look at whether IO scheduling tweaks or
whether blk-io throttling solves those problems. Or whether
something else needs to be done to make it work in environments
where problems are manifesting.

FWIW [and I know you're probably sick of hearing this by now], but
the blk-io throttling works almost perfectly with applications that
use direct IO.


For checkpoint writes, direct I/O actually would be reasonable. 
Bypassing the OS cache is a good thing in that case - we don't want the 
written pages to evict other pages from the OS cache, as we already have 
them in the PostgreSQL buffer cache.


Writing one page at a time with O_DIRECT from a single process might be 
quite slow, so we'd probably need to use writev() or asynchronous I/O to 
work around that.


We'd still need to issue an fsync() to flush any already-written pages 
from the OS cache to disk, though.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

On Tue, Jan 14, 2014 at 5:23 PM, Dave Chinner  wrote:
> By default, background writeback doesn't start until 10% of memory
> is dirtied, and on your machine that's 25GB of RAM. That's way to
> high for your workload.
>
> It appears to me that we are seeing large memory machines much more
> commonly in data centers - a couple of years ago 256GB RAM was only
> seen in supercomputers. Hence machines of this size are moving from
> "tweaking settings for supercomputers is OK" class to "tweaking
> settings for enterprise servers is not OK"
>
> Perhaps what we need to do is deprecate dirty_ratio and
> dirty_background_ratio as the default values as move to the byte
> based values as the defaults and cap them appropriately.  e.g.
> 10/20% of RAM for small machines down to a couple of GB for large
> machines

I think that's right.  In our case we know we're going to call fsync()
eventually and that's going to produce a torrent of I/O.  If that
torrent fits in downstream caches or can be satisfied quickly without
disrupting the rest of the system too much, then life is good.  But
the downstream caches don't typically grow proportionately to the size
of system memory.  Maybe a machine with 16GB has 1GB of battery-backed
write cache, but it doesn't follow that 256GB machine has 16GB of
battery-backed write cache.

> Essentially, changing dirty_background_bytes, dirty_bytes and
> dirty_expire_centiseconds to be much smaller should make the kernel
> start writeback much sooner and so you shouldn't have to limit the
> amount of buffers the application has to prevent major fsync
> triggered stalls...

I think this has been tried with some success, but I don't know the
details.  I think the bytes values are clearly more useful than the
percentages, because you can set them smaller and with better
granularity.

One thought that occurs to me is that it might be useful to have
PostgreSQL tell the system when we expect to perform an fsync.
Imagine fsync_is_coming(int fd, time_t).  We know long in advance
(minutes) when we're gonna do it, so in some sense what we'd like to
tell the kernel is: we're not in a hurry to get this data on disk
right now, but when the indicated time arrives, we are going to do
fsyncs of a bunch of files in rapid succession, so please arrange to
flush the data as close to that time as possible (to maximize
write-combining) while still finishing by that time (so that the
fsyncs are fast and more importantly so that they don't cause a
system-wide stall).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

On Tue, Jan 14, 2014 at 4:23 PM, James Bottomley
 wrote:
> Yes, that's what I was thinking: it's a cache.  About how many files
> comprise this cache?  Are you thinking it's too difficult for every
> process to map the files?

No, I'm thinking that would throw cache coherency out the window.
Separate mappings are all well and good until somebody decides to
modify the page, but after that point the database processes need to
see the modified version of the page (which is, further, hedged about
with locks) yet the operating system MUST NOT see the modified version
of the page until the write-ahead log entry for the page modification
has been flushed to disk.  There's really no way to do that without
having our own private cache.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Hannu Krosing

On 01/15/2014 02:01 PM, Jan Kara wrote:
> On Wed 15-01-14 12:16:50, Hannu Krosing wrote:
>> On 01/14/2014 06:12 PM, Robert Haas wrote:
>>> This would be pretty similar to copy-on-write, except
>>> without the copying. It would just be
>>> forget-from-the-buffer-pool-on-write. 
>> +1
>>
>> A version of this could probably already be implement using MADV_DONTNEED
>> and MADV_WILLNEED
>>
>> Thet is, just after reading the page in, use MADV_DONTNEED on it. When
>> evicting
>> a clean page, check that it is still in cache and if it is, then
>> MADV_WILLNEED it.
>>
>> Another nice thing to do would be dynamically adjusting kernel
>> dirty_background_ratio
>> and other related knobs in real time based on how many buffers are dirty
>> inside postgresql.
>> Maybe in background writer.
>>
>> Question to LKM folks - will kernel react well to frequent changes to
>> /proc/sys/vm/dirty_*  ?
>> How frequent can they be (every few second? every second? 100Hz ?)
>   So the question is what do you mean by 'react'. We check whether we
> should start background writeback every dirty_writeback_centisecs (5s). We
> will also check whether we didn't exceed the background dirty limit (and
> wake writeback thread) when dirtying pages. However this check happens once
> per several dirtied MB (unless we are close to dirty_bytes).
>
> When writeback is running we check roughly once per second (the logic is
> more complex there but I don't think explaining details would be useful
> here) whether we are below dirty_background_bytes and stop writeback in
> that case.
>
> So changing dirty_background_bytes every few seconds should work
> reasonably, once a second is pushing it and 100 Hz - no way. But I'd also
> note that you have conflicting requirements on the kernel writeback. On one
> hand you want checkpoint data to steadily trickle to disk (well, trickle
> isn't exactly the proper word since if you need to checkpoing 16 GB every 5
> minutes than you need a steady throughput of ~50 MB/s just for
> checkpointing) so you want to set dirty_background_bytes low, on the other
> hand you don't want temporary files to get to disk so you want to set
> dirty_background_bytes high. 
Is it possible to have more fine-grained control over writeback, like
configuring dirty_background_bytes per file system / device (or even
a file or a group of files) ?

If not, then how hard would it be to provide this ?

This is a bit backwards from keeping-the-cache-clean perspective,
but would help a lot with hinting the writer that a big sync is coming.

> And also that changes of
> dirty_background_bytes probably will not take into account other events
> happening on the system (maybe a DB backup is running...). So I'm somewhat
> skeptical you will be able to tune dirty_background_bytes frequently in a
> useful way.
>


Cheers

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Jan Kara

On Wed 15-01-14 12:16:50, Hannu Krosing wrote:
> On 01/14/2014 06:12 PM, Robert Haas wrote:
> > This would be pretty similar to copy-on-write, except
> > without the copying. It would just be
> > forget-from-the-buffer-pool-on-write. 
> 
> +1
> 
> A version of this could probably already be implement using MADV_DONTNEED
> and MADV_WILLNEED
> 
> Thet is, just after reading the page in, use MADV_DONTNEED on it. When
> evicting
> a clean page, check that it is still in cache and if it is, then
> MADV_WILLNEED it.
> 
> Another nice thing to do would be dynamically adjusting kernel
> dirty_background_ratio
> and other related knobs in real time based on how many buffers are dirty
> inside postgresql.
> Maybe in background writer.
> 
> Question to LKM folks - will kernel react well to frequent changes to
> /proc/sys/vm/dirty_*  ?
> How frequent can they be (every few second? every second? 100Hz ?)
  So the question is what do you mean by 'react'. We check whether we
should start background writeback every dirty_writeback_centisecs (5s). We
will also check whether we didn't exceed the background dirty limit (and
wake writeback thread) when dirtying pages. However this check happens once
per several dirtied MB (unless we are close to dirty_bytes).

When writeback is running we check roughly once per second (the logic is
more complex there but I don't think explaining details would be useful
here) whether we are below dirty_background_bytes and stop writeback in
that case.

So changing dirty_background_bytes every few seconds should work
reasonably, once a second is pushing it and 100 Hz - no way. But I'd also
note that you have conflicting requirements on the kernel writeback. On one
hand you want checkpoint data to steadily trickle to disk (well, trickle
isn't exactly the proper word since if you need to checkpoing 16 GB every 5
minutes than you need a steady throughput of ~50 MB/s just for
checkpointing) so you want to set dirty_background_bytes low, on the other
hand you don't want temporary files to get to disk so you want to set
dirty_background_bytes high. And also that changes of
dirty_background_bytes probably will not take into account other events
happening on the system (maybe a DB backup is running...). So I'm somewhat
skeptical you will be able to tune dirty_background_bytes frequently in a
useful way.

Honza
-- 
Jan Kara 
SUSE Labs, CR

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Jan Kara

On Wed 15-01-14 10:27:26, Heikki Linnakangas wrote:
> On 01/15/2014 06:01 AM, Jim Nasby wrote:
> >For the sake of completeness... it's theoretically silly that Postgres
> >is doing all this stuff with WAL when the filesystem is doing something
> >very similar with it's journal. And an SSD drive (and next generation
> >spinning rust) is doing the same thing *again* in it's own journal.
> >
> >If all 3 communities (or even just 2 of them!) could agree on the
> >necessary interface a tremendous amount of this duplicated technology
> >could be eliminated.
> >
> >That said, I rather doubt the Postgres community would go this route,
> >not so much because of the presumably massive changes needed, but more
> >because our community is not a fan of restricting our users to things
> >like "Thou shalt use a journaled FS or risk all thy data!"
> 
> The WAL is also used for continuous archiving and replication, not
> just crash recovery. We could skip full-page-writes, though, if we
> knew that the underlying filesystem/storage is guaranteeing that a
> write() is atomic.
> 
> It might be useful for PostgreSQL somehow tell the filesystem that
> we're taking care of WAL-logging, so that the filesystem doesn't
> need to.
  Well, journalling fs generally cares about its metadata consistency. We
have much weaker guarantees regarding file data because those guarantees
come at a cost most people don't want to pay.

Filesystems could in theory provide facility like atomic write (at least up
to a certain size say in MB range) but it's not so easy and when there are
no strong usecases fs people are reluctant to make their code more complex
unnecessarily. OTOH without widespread atomic write support I understand
application developers have similar stance. So it's kind of chicken and egg
problem. BTW, e.g. ext3/4 has quite a bit of the infrastructure in place
due to its data=journal mode so if someone on the PostgreSQL side wanted to
research on this, knitting some experimental ext4 patches should be doable.

Honza
-- 
Jan Kara 
SUSE Labs, CR


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

On Tue, Jan 14, 2014 at 09:54:20PM -0600, Jim Nasby wrote:
> On 1/14/14, 3:41 PM, Dave Chinner wrote:
> >On Tue, Jan 14, 2014 at 09:40:48AM -0500, Robert Haas wrote:
> >>On Mon, Jan 13, 2014 at 5:26 PM, Mel Gorman 
> >>wrote: Whether the problem is with the system call or the
> >>programmer is harder to determine.  I think the problem is in
> >>part that it's not exactly clear when we should call it.  So
> >>suppose we want to do a checkpoint.  What we used to do a long
> >>time ago is write everything, and then fsync it all, and then
> >>call it good.  But that produced horrible I/O storms.  So what
> >>we do now is do the writes over a period of time, with sleeps in
> >>between, and then fsync it all at the end, hoping that the
> >>kernel will write some of it before the fsyncs arrive so that we
> >>don't get a huge I/O spike.  And that sorta works, and it's
> >>definitely better than doing it all at full speed, but it's
> >>pretty imprecise.  If the kernel doesn't write enough of the
> >>data out in advance, then there's still a huge I/O storm when we
> >>do the fsyncs and everything grinds to a halt.  If it writes out
> >>more data than needed in advance, it increases the total number
> >>of physical writes because we get less write-combining, and that
> >>hurts performance, too.
> 
> I think there's a pretty important bit that Robert didn't mention:
> we have a specific *time* target for when we want all the fsync's
> to complete. People that have problems here tend to tune
> checkpoints to complete every 5-15 minutes, and they want the
> write traffic for the checkpoint spread out over 90% of that time
> interval. To put it another way, fsync's should be done when 90%
> of the time to the next checkpoint hits, but preferably not a lot
> before then.

I think that is pretty much understood. I don't recall anyone
mentioning a typical checkpoint period, though, so knowing the
typical timeframe of IO storms and how much data is typically
written in a checkpoint helps us understand the scale of the
problem.

> >It sounds to me like you want the kernel to start background
> >writeback earlier so that it doesn't build up as much dirty data
> >before you require a flush. There are several ways to do this by
> >tweaking writeback knobs. The simplest is probably just to set
> >/proc/sys/vm/dirty_background_bytes to an appropriate threshold
> >(say 50MB) and dirty_expire_centiseconds to a few seconds so that
> >background writeback starts and walks all dirty inodes almost
> >immediately. This will keep a steady stream of low level
> >background IO going, and fsync should then not take very long.
> 
> Except that still won't throttle writes, right? That's the big
> issue here: our users often can't tolerate big spikes in IO
> latency. They want user requests to always happen within a
> specific amount of time.

Right, but that's a different problem and one that io scheduling
tweaks can have a major effect on. e.g. the deadline scheduler
should be able to provide a maximum upper bound on read IO latency
even while writes are in progress, though how successful it is is
dependent on the nature of the write load and the architecture of
the underlying storage.

However, the first problem is dealing with the IO storm problem on
fsync. Then we can measure the effect of spreading those writes out
in time and determine what triggers read starvations (if they are
apparent). The we can look at whether IO scheduling tweaks or
whether blk-io throttling solves those problems. Or whether
something else needs to be done to make it work in environments
where problems are manifesting.

FWIW [and I know you're probably sick of hearing this by now], but
the blk-io throttling works almost perfectly with applications that
use direct IO.

> So while delaying writes potentially reduces the total amount of
> data you're writing, users that run into problems here ultimately
> care more about ensuring that their foreground IO completes in a
> timely fashion.

Understood. Applications that crunch randomly through large data
sets are almost always read IO latency bound

> >Fundamentally, though, we need bug reports from people seeing
> >these problems when they see them so we can diagnose them on
> >their systems. Trying to discuss/diagnose these problems without
> >knowing anything about the storage, the kernel version, writeback
> >thresholds, etc really doesn't work because we can't easily
> >determine a root cause.
> 
> So is lsf...@linux-foundation.org the best way to accomplish that?

No. That is just the list for organising the LFSMM summit. ;)

For general pagecache and writeback issues, discussions, etc,
linux-fsde...@vger.kernel.org is the list to use. LKML simple has
too much noise to be useful these days, so I'd avoid it. Otherwise
the filesystem specific lists are are good place to get help for
specific problems (e.g. linux-e...@vger.kernel.org and
x...@oss.sgi.com). We tend to cross-post to other relevant lists as
tria

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Mel Gorman

On Mon, Jan 13, 2014 at 02:19:56PM -0800, James Bottomley wrote:
> On Mon, 2014-01-13 at 22:12 +0100, Andres Freund wrote:
> > On 2014-01-13 12:34:35 -0800, James Bottomley wrote:
> > > On Mon, 2014-01-13 at 14:32 -0600, Jim Nasby wrote:
> > > > Well, if we were to collaborate with the kernel community on this then
> > > > presumably we can do better than that for eviction... even to the
> > > > extent of "here's some data from this range in this file. It's (clean|
> > > > dirty). Put it in your cache. Just trust me on this."
> > > 
> > > This should be the madvise() interface (with MADV_WILLNEED and
> > > MADV_DONTNEED) is there something in that interface that is
> > > insufficient?
> > 
> > For one, postgres doesn't use mmap for files (and can't without major
> > new interfaces).
> 
> I understand, that's why you get double buffering: because we can't
> replace a page in the range you give us on read/write.  However, you
> don't have to switch entirely to mmap: you can use mmap/madvise
> exclusively for cache control and still use read/write (and still pay
> the double buffer penalty, of course).  It's only read/write with
> directio that would cause problems here (unless you're planning to
> switch to DIO?).
> 

There are hazards with using mmap/madvise that may or may not be a problem
for them. I think these are well known but just in case;

mmap/munmap intensive workloads may get hammered on taking mmap_sem for
write. The greatest costs are incurred if the application is threaded
if the parallel threads are fault-intensive. I do not think this is the
case for PostgreSQL as it is process based but it is a concern. Even it's
a single-threaded process, the cost of the mmap_sem cache line bouncing
can be a concern. Outside of that, the mmap/munmap paths are just really
costly and take a lot of work.

madvise has different hazards but lets take DONTNEED as an example because
it's the most likely candidate for use. A DONTNEED hint has three potential
downsides. The first is that mmap_sem taken for read can be very costly
for threaded applications as the cache line bounces. On NUMA machines it
can be a major problem for madvise-intensive workloads. The second is that
the page table teardown frees the pages with the associated costs but most
importantly, an IPI is required afterwards to flush the TLB. If that process
has been running on a lot of different CPUs then the IPI cost can be very
high. The third hazard is that a madvise(DONTNEED) region will incur page
faults on the next accesses again hammering into mmap_sem and all the faults
associated with faulting (allocating the same pages again, zeroing etc)

It may be the case that mmap/madvise is still required to handle a double
buffering problem but it's far from being a free lunch and it has costs
that read/write does not have to deal with. Maybe some of these problems
can be fixed or mitigated but it is a case where a test case demonstrates
the problem even if that requires patching PostgreSQL.

-- 
Mel Gorman
SUSE Labs

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Hannu Krosing

On 01/15/2014 12:16 PM, Hannu Krosing wrote:
> On 01/14/2014 06:12 PM, Robert Haas wrote:
>> This would be pretty similar to copy-on-write, except
>> without the copying. It would just be
>> forget-from-the-buffer-pool-on-write. 
> +1
>
> A version of this could probably already be implement using MADV_DONTNEED
> and MADV_WILLNEED
>
> Thet is, just after reading the page in, use MADV_DONTNEED on it. When
> evicting
> a clean page, check that it is still in cache and if it is, then
> MADV_WILLNEED it.
>
> Another nice thing to do would be dynamically adjusting kernel
> dirty_background_ratio
> and other related knobs in real time based on how many buffers are dirty
> inside postgresql.
> Maybe in background writer.
>
> Question to LKM folks - will kernel react well to frequent changes to
> /proc/sys/vm/dirty_*  ?
> How frequent can they be (every few second? every second? 100Hz ?)
One obvious use case of this would be changing dirty_background_bytes
linearly to almost zero during a checkpoint to make final fsync fast.

Cheers

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Hannu Krosing

On 01/14/2014 06:12 PM, Robert Haas wrote:
> This would be pretty similar to copy-on-write, except
> without the copying. It would just be
> forget-from-the-buffer-pool-on-write. 

+1

A version of this could probably already be implement using MADV_DONTNEED
and MADV_WILLNEED

Thet is, just after reading the page in, use MADV_DONTNEED on it. When
evicting
a clean page, check that it is still in cache and if it is, then
MADV_WILLNEED it.

Another nice thing to do would be dynamically adjusting kernel
dirty_background_ratio
and other related knobs in real time based on how many buffers are dirty
inside postgresql.
Maybe in background writer.

Question to LKM folks - will kernel react well to frequent changes to
/proc/sys/vm/dirty_*  ?
How frequent can they be (every few second? every second? 100Hz ?)

Cheers

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Mel Gorman

On Tue, Jan 14, 2014 at 09:30:19AM -0800, Jeff Janes wrote:
> > > What's not so simple, is figuring out what policy to use. Remember,
> > > you cannot tell the kernel to put some page in its page cache without
> > > reading it or writing it. So, once you make the kernel forget a page,
> > > evicting it from shared buffers becomes quite expensive.
> >
> > posix_fadvise(POSIX_FADV_WILLNEED) is meant to cover this case by
> > forcing readahead.
> 
> 
> But telling the kernel to forget a page, then telling it to read it in
> again from disk because it might be needed again in the near future is
> itself very expensive.  We would need to hand the page to the kernel so it
> has it without needing to go to disk to get it.
> 

Yes, this is the unnecessary IO cost I was thinking of.

> 
> > If you evict it prematurely then you do get kinda
> > screwed because you pay the IO cost to read it back in again even if you
> > had enough memory to cache it. Maybe this is the type of kernel-postgres
> > interaction that is annoying you.
> >
> > If you don't evict, the kernel eventually steps in and evicts the wrong
> > thing. If you do evict and it was unnecessarily you pay an IO cost.
> >
> > That could be something we look at. There are cases buried deep in the
> > VM where pages get shuffled to the end of the LRU and get tagged for
> > reclaim as soon as possible. Maybe you need access to something like
> > that via posix_fadvise to say "reclaim this page if you need memory but
> > leave it resident if there is no memory pressure" or something similar.
> > Not exactly sure what that interface would look like or offhand how it
> > could be reliably implemented.
> >
> 
> I think the "reclaim this page if you need memory but leave it resident if
> there is no memory pressure" hint would be more useful for temporary
> working files than for what was being discussed above (shared buffers).
>  When I do work that needs large temporary files, I often see physical
> write IO spike but physical read IO does not.  I interpret that to mean
> that the temporary data is being written to disk to satisfy either
> dirty_expire_centisecs or dirty_*bytes, but the data remains in the FS
> cache and so disk reads are not needed to satisfy it.  So a hint that says
> "this file will never be fsynced so please ignore dirty_*bytes and
> dirty_expire_centisecs. 

It would be good to know if dirty_expire_centisecs or dirty ratio|bytes
were the problem here. An interface that forces a dirty page to stay dirty
regardless of the global system would be a major hazard. It potentially
allows the creator of the temporary file to stall all other processes
dirtying pages for an unbounded period of time. I proposed in another part
of the thread a hint for open inodes to have the background writer thread
ignore dirty pages belonging to that inode. Dirty limits and fsync would
still be obeyed. It might also be workable for temporary files but the
proposal could be full of holes.

Your alternative here is to create a private anonymous mapping as they
are not subject to dirty limits. This is only a sensible option if the
temporarily data is guaranteeed to be relatively small. If the shared
buffers, page cache and your temporary data exceed the size of RAM then
data will get discarded or your temporary data will get pushed to swap
and performance will hit the floor.

FWIW, the performance of some IO "benchmarks" used to depend on whether they
could create, write and delete files before any of the data actually hit
the disk -- pretty much exactly the type of behaviour you are looking for.

-- 
Mel Gorman
SUSE Labs

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Heikki Linnakangas


On 01/15/2014 06:01 AM, Jim Nasby wrote:

For the sake of completeness... it's theoretically silly that Postgres
is doing all this stuff with WAL when the filesystem is doing something
very similar with it's journal. And an SSD drive (and next generation
spinning rust) is doing the same thing *again* in it's own journal.

If all 3 communities (or even just 2 of them!) could agree on the
necessary interface a tremendous amount of this duplicated technology
could be eliminated.

That said, I rather doubt the Postgres community would go this route,
not so much because of the presumably massive changes needed, but more
because our community is not a fan of restricting our users to things
like "Thou shalt use a journaled FS or risk all thy data!"


The WAL is also used for continuous archiving and replication, not just 
crash recovery. We could skip full-page-writes, though, if we knew that 
the underlying filesystem/storage is guaranteeing that a write() is atomic.


It might be useful for PostgreSQL somehow tell the filesystem that we're 
taking care of WAL-logging, so that the filesystem doesn't need to.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Linux kernel impact on PostgreSQL performance

On Wed, Jan 15, 2014 at 1:07 AM, Jim Nasby  wrote:
>>>
>>> Though, it also occurs to me... perhaps it would be better for us to
>>> simply
>>> map temp objects to memory and let the kernel swap them out if needed...
>>
>>
>>
>> Oum... bad idea.
>>
>> Swap logic has very poor taste for I/O patterns.
>
>
> Well, to be honest, so do we. Practically zero in fact...

I've used mmap'd files for years, they're great for sharing mutable
memory across unrelated (as in out-of-heirarchy) processes.

And my experience is, that when swapping to-from disk is expectably a
significant percentage of the workload, explicit I/O of even the
dumbest kind far outperforms swap-based I/O.

I've read the kernel code and I'm not 100% sure of why is that, but I
have a suspect.

My completely unproven theory is that swapping is overwhelmed by
near-misses. Ie: a process touches a page, and before it's actually
swapped in, another process touches it too, blocking on the other
process' read. But the second process doesn't account for that page
when evaluating predictive models (ie: read-ahead), so the next I/O by
process 2 is unexpected to the kernel. Then the same with 1. Etc... In
essence, swap, by a fluke of its implementation, fails utterly to
predict the I/O pattern, and results in far sub-optimal reads.

Explicit I/O is free from that effect, all read calls are accountable,
and that makes a difference.

Maybe, if the kernel could be fixed in that respect, you could
consider mmap'd files as a suitable form of temporary storage. But
that would depend on the success and availability of such a fix/patch.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Linux kernel impact on PostgreSQL performance


On 1/14/14, 6:36 PM, Claudio Freire wrote:

On Tue, Jan 14, 2014 at 9:22 PM, Jim Nasby  wrote:

On 1/14/14, 11:30 AM, Jeff Janes wrote:


I think the "reclaim this page if you need memory but leave it resident if
there is no memory pressure" hint would be more useful for temporary working
files than for what was being discussed above (shared buffers).  When I do
work that needs large temporary files, I often see physical write IO spike
but physical read IO does not.  I interpret that to mean that the temporary
data is being written to disk to satisfy either dirty_expire_centisecs or
dirty_*bytes, but the data remains in the FS cache and so disk reads are not
needed to satisfy it.  So a hint that says "this file will never be fsynced
so please ignore dirty_*bytes and dirty_expire_centisecs.  I will need it
again relatively soon (but not after a reboot), but will do so mostly
sequentially, so please don't evict this without need, but if you do need to
then it is a good candidate" would be good.



I also frequently see this, and it has an even larger impact if pgsql_tmp is
on the same filesystem as WAL. Which *theoretically* shouldn't matter with a
BBU controller, except that when the kernel suddenly decides your
*temporary* data needs to hit the media you're screwed.

Though, it also occurs to me... perhaps it would be better for us to simply
map temp objects to memory and let the kernel swap them out if needed...



Oum... bad idea.

Swap logic has very poor taste for I/O patterns.


Well, to be honest, so do we. Practically zero in fact...

In fact, the kernel might even be in a better position than we are since you 
can presumably count page faults much more cheaply than we can.

BTW, if you guys are looking at ARC you should absolutely read discussion about 
that in our archives (http://lnk.nu/postgresql.org/2zeu/ as a starting point). 
We put considerable effort into it, had it in two minor versions, and then 
switched to a clock-sweep algorithm that's similar to what FreeBSD used, at 
least in the 4.x days. Definitely not claiming what we've got is the best (in 
fact, I think we're hurt by not maintaining a real free list), but the ARC info 
there is probably valuable.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance


On 1/14/14, 10:08 AM, Tom Lane wrote:

Trond Myklebust  writes:

On Jan 14, 2014, at 10:39, Tom Lane  wrote:

"Don't be aggressive" isn't good enough.  The prohibition on early write
has to be absolute, because writing a dirty page before we've done
whatever else we need to do results in a corrupt database.  It has to
be treated like a write barrier.



Then why are you dirtying the page at all? It makes no sense to tell the kernel 
“we’re changing this page in the page cache, but we don’t want you to change it 
on disk”: that’s not consistent with the function of a page cache.


As things currently stand, we dirty the page in our internal buffers,
and we don't write it to the kernel until we've written and fsync'd the
WAL data that needs to get to disk first.  The discussion here is about
whether we could somehow avoid double-buffering between our internal
buffers and the kernel page cache.

I personally think there is no chance of using mmap for that; the
semantics of mmap are pretty much dictated by POSIX and they don't work
for this.  However, disregarding the fact that the two communities
speaking here don't control the POSIX spec, you could maybe imagine
making it work if *both* pending WAL file contents and data file
contents were mmap'd, and there were kernel APIs allowing us to say
"you can write this mmap'd page if you want, but not till you've written
that mmap'd data over there".  That'd provide the necessary
write-barrier semantics, and avoid the cache coherency question because
all the data visible to the kernel could be thought of as the "current"
filesystem contents, it just might not all have reached disk yet; which
is the behavior of the kernel disk cache already.

I'm dubious that this sketch is implementable with adequate efficiency,
though, because in a live system the kernel would be forced to deal with
a whole lot of active barrier restrictions.  Within Postgres we can
reduce write-ordering tests to a very simple comparison: don't write
this page until WAL is flushed to disk at least as far as WAL sequence
number XYZ.  I think any kernel API would have to be a great deal more
general and thus harder to optimize.


For the sake of completeness... it's theoretically silly that Postgres is doing 
all this stuff with WAL when the filesystem is doing something very similar 
with it's journal. And an SSD drive (and next generation spinning rust) is 
doing the same thing *again* in it's own journal.

If all 3 communities (or even just 2 of them!) could agree on the necessary 
interface a tremendous amount of this duplicated technology could be eliminated.

That said, I rather doubt the Postgres community would go this route, not so much because 
of the presumably massive changes needed, but more because our community is not a fan of 
restricting our users to things like "Thou shalt use a journaled FS or risk all thy 
data!"


Another difficulty with merging our internal buffers with the kernel
cache is that when we're in the process of applying a change to a page,
there are intermediate states of the page data that should under no
circumstances reach disk (eg, we might need to shuffle records around
within the page).  We can deal with that fairly easily right now by not
issuing a write() while a page change is in progress.  I don't see that
it's even theoretically possible in an mmap'd world; there are no atomic
updates to an mmap'd page that are larger than whatever is an atomic
update for the CPU.


Yet another problem with trying to combine database and journaled FS efforts... 
:(
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance


On 1/14/14, 3:41 PM, Dave Chinner wrote:

On Tue, Jan 14, 2014 at 09:40:48AM -0500, Robert Haas wrote:

On Mon, Jan 13, 2014 at 5:26 PM, Mel Gorman  wrote:

IOWs, using sync_file_range() does not avoid the need to fsync() a
file for data integrity purposes...


I belive the PG community understands that, but thanks for the heads-up.


Whether the problem is with the system
call or the programmer is harder to determine.  I think the problem is
in part that it's not exactly clear when we should call it.  So
suppose we want to do a checkpoint.  What we used to do a long time
ago is write everything, and then fsync it all, and then call it good.
  But that produced horrible I/O storms.  So what we do now is do the
writes over a period of time, with sleeps in between, and then fsync
it all at the end, hoping that the kernel will write some of it before
the fsyncs arrive so that we don't get a huge I/O spike.
And that sorta works, and it's definitely better than doing it all at
full speed, but it's pretty imprecise.  If the kernel doesn't write
enough of the data out in advance, then there's still a huge I/O storm
when we do the fsyncs and everything grinds to a halt.  If it writes
out more data than needed in advance, it increases the total number of
physical writes because we get less write-combining, and that hurts
performance, too.


I think there's a pretty important bit that Robert didn't mention: we have a 
specific *time* target for when we want all the fsync's to complete. People 
that have problems here tend to tune checkpoints to complete every 5-15 
minutes, and they want the write traffic for the checkpoint spread out over 90% 
of that time interval. To put it another way, fsync's should be done when 90% 
of the time to the next checkpoint hits, but preferably not a lot before then.


Yup, the kernel defaults to maximising bulk write throughput, which
means it waits to the last possible moment to issue write IO. And
that's exactly to maximise write combining, optimise delayed
allocation, etc. There are many good reasons for doing this, and for
the majority of workloads it is the right behaviour to have.

It sounds to me like you want the kernel to start background
writeback earlier so that it doesn't build up as much dirty data
before you require a flush. There are several ways to do this by
tweaking writeback knobs. The simplest is probably just to set
/proc/sys/vm/dirty_background_bytes to an appropriate threshold (say
50MB) and dirty_expire_centiseconds to a few seconds so that
background writeback starts and walks all dirty inodes almost
immediately. This will keep a steady stream of low level background
IO going, and fsync should then not take very long.


Except that still won't throttle writes, right? That's the big issue here: our 
users often can't tolerate big spikes in IO latency. They want user requests to 
always happen within a specific amount of time.

So while delaying writes potentially reduces the total amount of data you're 
writing, users that run into problems here ultimately care more about ensuring 
that their foreground IO completes in a timely fashion.


Fundamentally, though, we need bug reports from people seeing these
problems when they see them so we can diagnose them on their
systems. Trying to discuss/diagnose these problems without knowing
anything about the storage, the kernel version, writeback
thresholds, etc really doesn't work because we can't easily
determine a root cause.


So is lsf...@linux-foundation.org the best way to accomplish that?

Also, along the lines of collaboration, it would also be awesome to see kernel hackers at 
PGCon (http://pgcon.org) for further discussion of this stuff. That is the conference 
that has more Postgres internal developers than any other. There's a variety of different 
ways collaboration could happen there, so it's probably best to start a separate 
discussion with those from the linux community who'd be interested in attending. PGCon 
also directly follows BSDCan (http://bsdcan.org) at the same venue... so we could 
potentially kill two OS birds with one stone, so to speak... :) If there's enough 
interest we could potentially do a "mini Postgres/OS conference" in-between 
BSDCan and the formal PGCon. There's also potential for the Postgres community to sponsor 
attendance for kernel hackers if money is a factor.

Like I said... best to start a separate thread if there's significant interest 
on meeting at PGCon. :)
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Linux kernel impact on PostgreSQL performance


On 1/14/14, 4:21 AM, Mel Gorman wrote:

There is an interesting side-line here. If all IO is initiated by one
process in postgres then the memory locality will be sub-optimal.
The consumer of the data may or may not be running on the same
node as the process that read the data from disk. It is possible to
migrate this from user space but the interface is clumsy and assumes the
data is mapped.


That's really not the case in Postgres. There's essentially 3 main areas for IO 
requests to come from:

- Individual "backends". These are processes forked off of our startup process 
(postmaster) for the purpose of serving user connections. This is always "foreground" IO 
and should be avoided as much as possible (but is still a large percentage).
- autovacuum. This is a set of "clean-up" processes, meant to be low impact, 
background only. Similar to garbage collection is GC languages.
- bgwriter. This process is meant to greatly reduce the need for user backends 
to write data out.

Generally speaking, read requests are most likely to come from user backends. 
autovacuum can issue them too, but it's got a throttling mechanism so generally 
shouldn't be that much of the workload.

Ideally most write traffic would come from bgwriter (and autovacuum, though again we 
don't care too much about it). In reality though, that's going to depend very highly on a 
user's actual workload. To start, backends normally must write all write-ahead-log 
traffic before they finalize (COMMIT) a transaction for the user. COMMIT is sort of 
similar in idea to fsync... "When this returns I guarantee I've permanently stored 
your data."

The amount of WAL data generated for a transaction will vary enormously, even 
as a percentage of raw page data written. In some cases a very small (10s-100s 
of bytes) amount of WAL data will cover 1 or more base data pages (8k by 
default, up to 64k). But to protect against torn page writes, by default we 
write a complete copy of a data page to WAL the first time the page is dirtied 
after a checkpoint. So the opposite scenario is we actually write slightly MORE 
data to WAL than we do to the data pages.

What makes WAL even trickier is that bgwritter tries to write WAL data out 
before backends need to. In a system with a fairly low transaction rate that 
can work... but with a higher rate most WAL data will be written by a backend 
trying to issue a COMMIT. Note however that COMMIT needs to write ALL WAL data 
up to a given point, so one backend that only needs to write 100 bytes can 
easily end up flushing (and fsync'ing) megabytes of data written by some other 
backend.

Further complicating things is temporary storage, either in the form of user 
defined temporary tables, or temporary storage needed by the database itself. 
It's hard to characterize these workloads other than to say that typically 
reading and writing to them will want to move a relatively large amount of data 
at once.

BTW, because Postgres doesn't have terribly sophisticated memory management, 
it's very common to create temporary file data that will never, ever, ever 
actually NEED to hit disk. Where I work being able to tell the kernel to avoid 
flushing those files unless the kernel thinks it's got better things to do with 
that memory would be EXTREMELY valuable, because it's all temp data anyway: if 
the database or server crashes it's just going to get throw away. It might be a 
good idea for the Postgres to look at simply putting this data into plain 
memory now and relying on the OS to swap it as needed. That'd be more 
problematic for temp tables, but in that case mmap might work very well, 
because that data is currently never shared by other processes, though if we 
start doing parallel query execution that will change.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Dave Chinner

On Tue, Jan 14, 2014 at 03:03:39PM -0800, Kevin Grittner wrote:
> Dave Chinner  write:
> 
> > Essentially, changing dirty_background_bytes, dirty_bytes and
> > dirty_expire_centiseconds to be much smaller should make the
> > kernel start writeback much sooner and so you shouldn't have to
> > limit the amount of buffers the application has to prevent major
> > fsync triggered stalls...
> 
> Is there any "rule of thumb" about where to start with these?

There's no absolute rule here, but the threshold for background
writeback needs to consider the amount of dirty data being
generated, the rate at which it can be retired and the checkpoint
period the application is configured with. i.e. it needs to be slow
enough to not cause serious read IO perturbations, but still fast
enough that it avoids peaks at synchronisation points. And most
importantly, it needs to be fast enought that it can complete
writeback of all the dirty data in a checkpoint before the next
checkpoint is triggered.

In general, I find that threshold to be somewhere around 2-5s worth
of data writeback - enough to keep a good amount of write combining
and the IO pipeline full as work is done, but no more.

e.g. if your workload results in writeback rates of 500MB/s, then
I'd be setting the dirty limit somewhere around 1-2GB as an initial
guess. It's basically a simple trade off buffering space for
writeback latency. Some applications perform well with increased
buffering space (e.g. 10-20s of writeback) while others perform
better with extremely low writeback latency (e.g. 0.5-1s). 

>   For
> example, should a database server maybe have dirty_background_bytes
> set to 75% of the non-volatile write cache present on the
> controller, in an attempt to make sure that there is always some
> "slack" space for writes?

I don't think the hardware cache size matters as it's easy to fill
them very quickly and so after a couple of seconds the controller
will fall back to disk speed anyway. IMO, what matters is that the
threshold is large enough to adequately buffer writes to smooth
peaks and troughs in the pipeline.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Linux kernel impact on PostgreSQL performance

On Tue, Jan 14, 2014 at 9:22 PM, Jim Nasby  wrote:
> On 1/14/14, 11:30 AM, Jeff Janes wrote:
>>
>> I think the "reclaim this page if you need memory but leave it resident if
>> there is no memory pressure" hint would be more useful for temporary working
>> files than for what was being discussed above (shared buffers).  When I do
>> work that needs large temporary files, I often see physical write IO spike
>> but physical read IO does not.  I interpret that to mean that the temporary
>> data is being written to disk to satisfy either dirty_expire_centisecs or
>> dirty_*bytes, but the data remains in the FS cache and so disk reads are not
>> needed to satisfy it.  So a hint that says "this file will never be fsynced
>> so please ignore dirty_*bytes and dirty_expire_centisecs.  I will need it
>> again relatively soon (but not after a reboot), but will do so mostly
>> sequentially, so please don't evict this without need, but if you do need to
>> then it is a good candidate" would be good.
>
>
> I also frequently see this, and it has an even larger impact if pgsql_tmp is
> on the same filesystem as WAL. Which *theoretically* shouldn't matter with a
> BBU controller, except that when the kernel suddenly decides your
> *temporary* data needs to hit the media you're screwed.
>
> Though, it also occurs to me... perhaps it would be better for us to simply
> map temp objects to memory and let the kernel swap them out if needed...


Oum... bad idea.

Swap logic has very poor taste for I/O patterns.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Linux kernel impact on PostgreSQL performance

On 1/14/14, 11:30 AM, Jeff Janes wrote:

I think the "reclaim this page if you need memory but leave it resident if there is no memory
pressure" hint would be more useful for temporary working files than for what was being
discussed above (shared buffers). When I do work that needs large temporary files, I often see
physical write IO spike but physical read IO does not. I interpret that to mean that the temporary
data is being written to disk to satisfy either dirty_expire_centisecs or dirty_*bytes, but the
data remains in the FS cache and so disk reads are not needed to satisfy it. So a hint that says
"this file will never be fsynced so please ignore dirty_*bytes and dirty_expire_centisecs. I
will need it again relatively soon (but not after a reboot), but will do so mostly sequentially, so
please don't evict this without need, but if you do need to then it is a good candidate" would
be good.

I also frequently see this, and it has an even larger impact if pgsql_tmp is on
the same filesystem as WAL. Which *theoretically* shouldn't matter with a BBU
controller, except that when the kernel suddenly decides your *temporary* data
needs to hit the media you're screwed.

Though, it also occurs to me... perhaps it would be better for us to simply map
temp objects to memory and let the kernel swap them out if needed...
--
Jim C. Nasby, Data Architect j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

On Tue, 2014-01-14 at 15:09 -0500, Robert Haas wrote:
> On Tue, Jan 14, 2014 at 3:00 PM, James Bottomley
>  wrote:
> >> Doesn't sound exactly like what I had in mind.  What I was suggesting
> >> is an analogue of read() that, if it reads full pages of data to a
> >> page-aligned address, shares the data with the buffer cache until it's
> >> first written instead of actually copying the data.
> >
> > The only way to make this happen is mmap the file to the buffer and use
> > MADV_WILLNEED.
> >
> >>   The pages are
> >> write-protected so that an attempt to write the address range causes a
> >> page fault.  In response to such a fault, the pages become anonymous
> >> memory and the buffer cache no longer holds a reference to the page.
> >
> > OK, so here I thought of another madvise() call to switch the region to
> > anonymous memory.  A page fault works too, of course, it's just that one
> > per page in the mapping will be expensive.
> 
> I don't think either of these ideas works for us.  We start by
> creating a chunk of shared memory that all processes (we do not use
> threads) will have mapped at a common address, and we read() and
> write() into that chunk.

Yes, that's what I was thinking: it's a cache.  About how many files
comprise this cache?  Are you thinking it's too difficult for every
process to map the files?

> > Do you care about handling aliases ... what happens if someone else
> > reads from the file, or will that never occur?  The reason for asking is
> > that it's much easier if someone else mmapping the file gets your
> > anonymous memory than we create an alias in the page cache.
> 
> All reads and writes go through the buffer pool stored in shared
> memory, but any of the processes that have that shared memory region
> mapped could be responsible for any individual I/O request.

That seems to be possible with the abstraction.  The initial mapping
gets the file backed pages: you can do madvise to read them (using
readahead), flush them (using wontneed) and flip them to anonymous
(using something TBD).  Since it's a shared mapping API based on the
file, any of the mapping processes can do any operation.  Future mappers
of the file get the mix of real and anon memory, so it's truly shared.

Given that you want to use this as a shared cache, it seems that the API
to flip back from anon to file mapped is wontneed.  That would also
trigger writeback of any dirty pages in the previously anon region ...
which you could force with msync.  As far as I can see, this is
identical to read/write on a shared region with the exception that you
don't need to copy in and out of the page cache.

>From our point of view, the implementation is nice because the pages
effectively never leave the page cache.  We just use an extra per page
flag (which I'll get shot for suggesting) to alter the writeout path
(which is where the complexity which may kill the implementation is).

James

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Dave Chinner

On Tue, Jan 14, 2014 at 11:40:38AM -0800, Kevin Grittner wrote:
> Robert Haas  wrote:
> > Jan Kara  wrote:
> >
> >> Just to get some idea about the sizes - how large are the
> >> checkpoints we are talking about that cause IO stalls?
> >
> > Big.
> 
> To quantify that, in a production setting we were seeing pauses of
> up to two minutes with shared_buffers set to 8GB and default dirty
   ^
> page settings for Linux, on a machine with 256GB RAM and 512MB
  ^
There's your problem.

By default, background writeback doesn't start until 10% of memory
is dirtied, and on your machine that's 25GB of RAM. That's way to
high for your workload.

It appears to me that we are seeing large memory machines much more
commonly in data centers - a couple of years ago 256GB RAM was only
seen in supercomputers. Hence machines of this size are moving from
"tweaking settings for supercomputers is OK" class to "tweaking
settings for enterprise servers is not OK"

Perhaps what we need to do is deprecate dirty_ratio and
dirty_background_ratio as the default values as move to the byte
based values as the defaults and cap them appropriately.  e.g.
10/20% of RAM for small machines down to a couple of GB for large
machines

> non-volatile cache on the RAID controller.  To eliminate stalls we
> had to drop shared_buffers to 2GB (to limit how many dirty pages
> could build up out-of-sight from the OS), spread checkpoints to 90%
> of allowed time (almost no gap between finishing one checkpoint and
> starting the next) and crank up the background writer so that no
> dirty page sat unwritten in PostgreSQL shared_buffers for more than
> 4 seconds. Less aggressive pushing to the OS resulted in the
> avalanche of writes I previously described, with the corresponding
> I/O stalls.  We approached that incrementally, and that's the point
> where stalls stopped occurring.  We did not adjust the OS
> thresholds for writing dirty pages, although I know of others who
> have had to do so.

Essentially, changing dirty_background_bytes, dirty_bytes and
dirty_expire_centiseconds to be much smaller should make the kernel
start writeback much sooner and so you shouldn't have to limit the
amount of buffers the application has to prevent major fsync
triggered stalls...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Dave Chinner

On Tue, Jan 14, 2014 at 09:40:48AM -0500, Robert Haas wrote:
> On Mon, Jan 13, 2014 at 5:26 PM, Mel Gorman  wrote:
> >> Amen to that.  Actually, I think NUMA can be (mostly?) fixed by
> >> setting zone_reclaim_mode; is there some other problem besides that?
> >
> > Really?
> >
> > zone_reclaim_mode is often a complete disaster unless the workload is
> > partitioned to fit within NUMA nodes. On older kernels enabling it would
> > sometimes cause massive stalls. I'm actually very surprised to hear it
> > fixes anything and would be interested in hearing more about what sort
> > of circumstnaces would convince you to enable that thing.
> 
> By "set" I mean "set to zero".  We've seen multiple of instances of
> people complaining about large amounts of system memory going unused
> because this setting defaulted to 1.
> 
> >> The other thing that comes to mind is the kernel's caching behavior.
> >> We've talked a lot over the years about the difficulties of getting
> >> the kernel to write data out when we want it to and to not write data
> >> out when we don't want it to.
> >
> > Is sync_file_range() broke?
> 
> I don't know.  I think a few of us have played with it and not been
> able to achieve a clear win.

Before you go back down the sync_file_range path, keep in mind that
it is not a guaranteed data integrity operation: it does not force
device cache flushes like fsync/fdatasync(). Hence it does not
guarantee that the metadata that points at the data written nor the
volatile caches in the storage path has been flushed...

IOWs, using sync_file_range() does not avoid the need to fsync() a
file for data integrity purposes...

> Whether the problem is with the system
> call or the programmer is harder to determine.  I think the problem is
> in part that it's not exactly clear when we should call it.  So
> suppose we want to do a checkpoint.  What we used to do a long time
> ago is write everything, and then fsync it all, and then call it good.
>  But that produced horrible I/O storms.  So what we do now is do the
> writes over a period of time, with sleeps in between, and then fsync
> it all at the end, hoping that the kernel will write some of it before
> the fsyncs arrive so that we don't get a huge I/O spike.
> And that sorta works, and it's definitely better than doing it all at
> full speed, but it's pretty imprecise.  If the kernel doesn't write
> enough of the data out in advance, then there's still a huge I/O storm
> when we do the fsyncs and everything grinds to a halt.  If it writes
> out more data than needed in advance, it increases the total number of
> physical writes because we get less write-combining, and that hurts
> performance, too. 

Yup, the kernel defaults to maximising bulk write throughput, which
means it waits to the last possible moment to issue write IO. And
that's exactly to maximise write combining, optimise delayed
allocation, etc. There are many good reasons for doing this, and for
the majority of workloads it is the right behaviour to have.

It sounds to me like you want the kernel to start background
writeback earlier so that it doesn't build up as much dirty data
before you require a flush. There are several ways to do this by
tweaking writeback knobs. The simplest is probably just to set
/proc/sys/vm/dirty_background_bytes to an appropriate threshold (say
50MB) and dirty_expire_centiseconds to a few seconds so that
background writeback starts and walks all dirty inodes almost
immediately. This will keep a steady stream of low level background
IO going, and fsync should then not take very long.

Fundamentally, though, we need bug reports from people seeing these
problems when they see them so we can diagnose them on their
systems. Trying to discuss/diagnose these problems without knowing
anything about the storage, the kernel version, writeback
thresholds, etc really doesn't work because we can't easily
determine a root cause.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

On Tue, 2014-01-14 at 12:39 -0500, Robert Haas wrote:
> On Tue, Jan 14, 2014 at 12:20 PM, James Bottomley
>  wrote:
> > On Tue, 2014-01-14 at 15:15 -0200, Claudio Freire wrote:
> >> On Tue, Jan 14, 2014 at 2:12 PM, Robert Haas  wrote:
> >> > In terms of avoiding double-buffering, here's my thought after reading
> >> > what's been written so far.  Suppose we read a page into our buffer
> >> > pool.  Until the page is clean, it would be ideal for the mapping to
> >> > be shared between the buffer cache and our pool, sort of like
> >> > copy-on-write.  That way, if we decide to evict the page, it will
> >> > still be in the OS cache if we end up needing it again (remember, the
> >> > OS cache is typically much larger than our buffer pool).  But if the
> >> > page is dirtied, then instead of copying it, just have the buffer pool
> >> > forget about it, because at that point we know we're going to write
> >> > the page back out anyway before evicting it.
> >> >
> >> > This would be pretty similar to copy-on-write, except without the
> >> > copying.  It would just be forget-from-the-buffer-pool-on-write.
> >>
> >> But... either copy-on-write or forget-on-write needs a page fault, and
> >> thus a page mapping.
> >>
> >> Is a page fault more expensive than copying 8k?
> >>
> >> (I really don't know).
> >
> > A page fault can be expensive, yes ... but perhaps you don't need one.
> >
> > What you want is a range of memory that's read from a file but treated
> > as anonymous for writeout (i.e. written to swap if we need to reclaim
> > it). Then at some time later, you want to designate it as written back
> > to the file instead so you control the writeout order.  I'm not sure we
> > can do this: the separation between file backed and anonymous pages is
> > pretty deeply ingrained into the OS, but if it were possible, is that
> > what you want?
> 
> Doesn't sound exactly like what I had in mind.  What I was suggesting
> is an analogue of read() that, if it reads full pages of data to a
> page-aligned address, shares the data with the buffer cache until it's
> first written instead of actually copying the data.

The only way to make this happen is mmap the file to the buffer and use
MADV_WILLNEED.

>   The pages are
> write-protected so that an attempt to write the address range causes a
> page fault.  In response to such a fault, the pages become anonymous
> memory and the buffer cache no longer holds a reference to the page.

OK, so here I thought of another madvise() call to switch the region to
anonymous memory.  A page fault works too, of course, it's just that one
per page in the mapping will be expensive.

Do you care about handling aliases ... what happens if someone else
reads from the file, or will that never occur?  The reason for asking is
that it's much easier if someone else mmapping the file gets your
anonymous memory than we create an alias in the page cache.

James




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

Dave Chinner  write:

> Essentially, changing dirty_background_bytes, dirty_bytes and
> dirty_expire_centiseconds to be much smaller should make the
> kernel start writeback much sooner and so you shouldn't have to
> limit the amount of buffers the application has to prevent major
> fsync triggered stalls...

Is there any "rule of thumb" about where to start with these?  For
example, should a database server maybe have dirty_background_bytes
set to 75% of the non-volatile write cache present on the
controller, in an attempt to make sure that there is always some
"slack" space for writes?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

James Bottomley  wrote:

>> We start by creating a chunk of shared memory that all processes
>> (we do not use threads) will have mapped at a common address,
>> and we read() and write() into that chunk.
>
> Yes, that's what I was thinking: it's a cache.  About how many
> files comprise this cache?  Are you thinking it's too difficult
> for every process to map the files?

It occurred to me that I don't remember seeing any indication of
how many processes we're talking about.  There is once process per
database connection, plus some administrative processes, like the
checkpoint process and the background writer.  At the low end,
about 10 processes would be connected to the shared memory.  The
highest I've personally seen is about 3000; I don't know how far
above that people might try to push it.  I always recommend a
connection pool to limit the number of database connections to
something near ((2 * core count) + effective spindle count), since
that's where I typically see best performance; but people don't
always do that.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

I wrote:

> to avoid write gluts it must often be limited to 1GB to 1GB.

That should have been "1GB to 2GB."


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

James Bottomley  wrote:

> About how many files comprise this cache?  Are you thinking it's
> too difficult for every process to map the files?

The shared_buffers area can be mapping anywhere from about 200
files to millions of files, representing a total space of about 6MB
on the low end to over 100TB on the high end.  For many workloads
performance falls off above a shared_buffers size of about 8GB,
although for data warehousing environments larger sizes sometimes
work out and to avoid write gluts it must often be limited to 1GB
to 1GB.

Data access is in fixed-sized pages, normally of 8KB each.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Robert Haas

On Tue, Jan 14, 2014 at 3:00 PM, James Bottomley
 wrote:
>> Doesn't sound exactly like what I had in mind.  What I was suggesting
>> is an analogue of read() that, if it reads full pages of data to a
>> page-aligned address, shares the data with the buffer cache until it's
>> first written instead of actually copying the data.
>
> The only way to make this happen is mmap the file to the buffer and use
> MADV_WILLNEED.
>
>>   The pages are
>> write-protected so that an attempt to write the address range causes a
>> page fault.  In response to such a fault, the pages become anonymous
>> memory and the buffer cache no longer holds a reference to the page.
>
> OK, so here I thought of another madvise() call to switch the region to
> anonymous memory.  A page fault works too, of course, it's just that one
> per page in the mapping will be expensive.

I don't think either of these ideas works for us.  We start by
creating a chunk of shared memory that all processes (we do not use
threads) will have mapped at a common address, and we read() and
write() into that chunk.

> Do you care about handling aliases ... what happens if someone else
> reads from the file, or will that never occur?  The reason for asking is
> that it's much easier if someone else mmapping the file gets your
> anonymous memory than we create an alias in the page cache.

All reads and writes go through the buffer pool stored in shared
memory, but any of the processes that have that shared memory region
mapped could be responsible for any individual I/O request.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

Robert Haas  wrote:
> Jan Kara  wrote:
>
>> Just to get some idea about the sizes - how large are the
>> checkpoints we are talking about that cause IO stalls?
>
> Big.

To quantify that, in a production setting we were seeing pauses of
up to two minutes with shared_buffers set to 8GB and default dirty
page settings for Linux, on a machine with 256GB RAM and 512MB
non-volatile cache on the RAID controller.  To eliminate stalls we
had to drop shared_buffers to 2GB (to limit how many dirty pages
could build up out-of-sight from the OS), spread checkpoints to 90%
of allowed time (almost no gap between finishing one checkpoint and
starting the next) and crank up the background writer so that no
dirty page sat unwritten in PostgreSQL shared_buffers for more than
4 seconds. Less aggressive pushing to the OS resulted in the
avalanche of writes I previously described, with the corresponding
I/O stalls.  We approached that incrementally, and that's the point
where stalls stopped occurring.  We did not adjust the OS
thresholds for writing dirty pages, although I know of others who
have had to do so.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Stephen Frost

* Robert Haas (robertmh...@gmail.com) wrote:
> I dunno what a typical checkpoint size is but I don't think you'll be
> exaggerating much if you imagine that everything that could possibly
> be dirty is.

This is not uncommon for us, at least:

checkpoint complete: wrote 425844 buffers (20.3%); 0 transaction log
file(s) added, 0 removed, 249 recycled; write=175.535 s, sync=17.428 s,
total=196.357 s; sync files=1011, longest=2.675 s, average=0.017 s

That's a checkpoint writing out 20% of 16GB, or over 3GB, and that's
just from one of the four postmasters running- we get this kind of
checkpointing happening on all of them.  All told, it's easy for us to
want to write over 12GB during a single checkpoint period on this box.
(checkpoint_timeout is 5m, checkpoint_target is 0.9).

Thankfully, the box has 256G of RAM and so the shared buffers only use
up 25% of the RAM in the box. :)

I'm sure others could post larger numbers.

Thanks,

Stephen

signature.asc
Description: Digital signature

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Jan Kara

On Tue 14-01-14 10:04:16, Robert Haas wrote:
> On Tue, Jan 14, 2014 at 5:00 AM, Jan Kara  wrote:
> > I thought that instead of injecting pages into pagecache for aging as you
> > describe in 3), you would mark pages as volatile (i.e. for reclaim by
> > kernel) through vrange() syscall. Next time you need the page, you check
> > whether the kernel reclaimed the page or not. If yes, you reload it from
> > disk, if not, you unmark it and use it.
> >
> > Now the aging of pages marked as volatile as it is currently implemented
> > needn't be perfect for your needs but you still have time to influence what
> > gets implemented... Actually developers of the vrange() syscall were
> > specifically looking for some ideas what to base aging on. Currently I
> > think it is first marked - first evicted.
> 
> This is an interesting idea but it stinks of impracticality.
> Essentially when the last buffer pin on a page is dropped we'd have to
> mark it as discardable, and then the next person wanting to pin it
> would have to check whether it's still there.  But the system call
> overhead of calling vrange() every time the last pin on a page was
> dropped would probably hose us.
> 
> *thinks*
> 
> Well, I guess it could be done lazily: make periodic sweeps through
> shared_buffers, looking for pages that haven't been touched in a
> while, and vrange() them.  That's quite a bit of new mechanism, but in
> theory it could work out to a win.  vrange() would have to scale well
> to millions of separate ranges, though.  Will it?
  It is intented to be rather lightweight so I believe milions should be
OK. But I didn't try :).

> And a lot depends on whether the kernel makes the right decision about
> whether to chunk data from our vrange() vs. any other page it could have
> reclaimed.
  I think the intent is to reclaim pages in the following order:
used once pages -> volatile pages -> active pages, swapping

Honza
-- 
Jan Kara 
SUSE Labs, CR


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Robert Haas

On Tue, Jan 14, 2014 at 1:37 PM, Jan Kara  wrote:
> Just to get some idea about the sizes - how large are the checkpoints we
> are talking about that cause IO stalls?

Big.  Potentially, we might have dirtied all of shared_buffers and
then started evicting pages from there to the OS buffer pool and
dirties as much memory as the OS will allow, and then the OS might
have started writeback and filled up all the downstream caches between
the OS and the disk.  And then just then the checkpoint hits.

I dunno what a typical checkpoint size is but I don't think you'll be
exaggerating much if you imagine that everything that could possibly
be dirty is.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Jan Kara

On Tue 14-01-14 06:42:43, Kevin Grittner wrote:
> First off, I want to give a +1 on everything in the recent posts
> from Heikki and Hannu.
> 
> Jan Kara  wrote:
> 
> > Now the aging of pages marked as volatile as it is currently
> > implemented needn't be perfect for your needs but you still have
> > time to influence what gets implemented... Actually developers of
> > the vrange() syscall were specifically looking for some ideas
> > what to base aging on. Currently I think it is first marked -
> > first evicted.
> 
> The "first marked - first evicted" seems like what we would want. 
> The ability to "unmark" and have the page no longer be considered
> preferred for eviction would be very nice.  That seems to me like
> it would cover the multiple layers of buffering *clean* pages very
> nicely (although I know nothing more about vrange() than what has
> been said on this thread, so I could be missing something).
  Here:
http://www.spinics.net/lists/linux-mm/msg67328.html
  is an email which introduces the syscall. As you say, it might be a
reasonable fit for your problems with double caching of clean pages.

> The other side of that is related avoiding multiple writes of the
> same page as much as possible, while avoid write gluts.  The issue
> here is that PostgreSQL tries to hang on to dirty pages for as long
> as possible before "writing" them to the OS cache, while the OS
> tries to avoid writing them to storage for as long as possible
> until they reach a (configurable) threshold or are fsync'd.  The
> problem is that a under various conditions PostgreSQL may need to
> write and fsync a lot of dirty pages it has accumulated in a short
> time.  That has an "avalanche" effect, creating a "write glut"
> which can stall all I/O for a period of many seconds up to a few
> minutes.  If the OS was aware of the dirty pages pending write in
> the application, and counted those for purposes of calculating when
> and how much to write, the glut could be avoided.  Currently,
> people configure the PostgreSQL background writer to be very
> aggressive, configure a small PostgreSQL shared_buffers setting,
> and/or set the OS thresholds low enough to minimize the problem;
> but all of these mitigation strategies have their own costs.
> 
> A new hint that the application has dirtied a page could be used by
> the OS to improve things this way:  When the OS is notified that a
> page is dirty, it takes action depending on whether the page is
> considered dirty by the OS.  If it is not dirty, the page is
> immediately discarded from the OS cache.  It is known that the
> application has a modified version of the page that it intends to
> write, so the version in the OS cache has no value.  We don't want
> this page forcing eviction of vrange()-flagged pages.  If it is
> dirty, any write ordering to storage by the OS based on when the
> page was written to the OS would be pushed back as far as possible
> without crossing any write barriers, in hopes that the writes could
> be combined.  Either way, this page is counted toward dirty pages
> for purposes of calculating how much to write from the OS to
> storage, and the later write of the page doesn't redundantly add to
> this number.
  The evict if clean part is easy. That could be easily a new fadvise()
option - btw. note that POSIX_FADV_DONTNEED has quite close meaning. Only
that it also starts writeback on a dirty page if backing device isn't
congested. Which is somewhat contrary to what you want to achieve. But I'm
not sure the eviction would be a clear win since filesystem then has to
re-create the mapping from logical file block to disk block (it is cached
in the page) and that potentially needs to go to disk to fetch the mapping
data.

I have a hard time thinking how we would implement pushing back writeback
of a particular page (or better set of pages). When we need to write pages
because we are nearing dirty_bytes limit, we likely want to write these
marked pages anyway to make as many pages freeable as possible. So the only
thing we could do is to ignore these pages during periodic writeback and
I'm not sure that would make a big difference.

Just to get some idea about the sizes - how large are the checkpoints we
are talking about that cause IO stalls?

Honza

-- 
Jan Kara 
SUSE Labs, CR


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Stephen Frost

* Claudio Freire (klaussfre...@gmail.com) wrote:
> On Tue, Jan 14, 2014 at 2:17 PM, Robert Haas  wrote:
> > I don't know either.  I wasn't thinking so much that it would save CPU
> > time as that it would save memory.  Consider a system with 32GB of
> > RAM.  If you set shared_buffers=8GB, then in the worst case you've got
> > 25% of your RAM wasted storing pages that already exist, dirtied, in
> > shared_buffers.  It's easy to imagine scenarios in which that results
> > in lots of extra I/O, so that the CPU required to do the accounting
> > comes to seem cheap by comparison.
> 
> Not necessarily, you pay the CPU cost on each page fault (ie: first
> write to the buffer at least), each time the page checks into the
> shared buffers level.

I'm really not sure that this is a real issue for us, but if it is,
perhaps having this as an option for each read() call would work..?
That is to say, rather than have this be an open() flag or similar, it's
normal read() with a flags field where we could decide when we want
pages to be write-protected this way and when we don't (perhaps because
we know we're about to write to them).

I'm not 100% sure it'd be easy for us to manage that flag perfectly, but
it's our issue and it'd be on us to deal with as the kernel can't
possibly guess our intentions.

There were concerns brought up earlier that such a zero-copy-read option
wouldn't be performant though and I'm curious to hear more about those
and if we could avoid the performance issues by manging the
zero-copy-read case ourselves as Robert suggests.

Thanks,

Stephen

signature.asc
Description: Digital signature

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

On Tue, Jan 14, 2014 at 2:39 PM, Robert Haas  wrote:
> On Tue, Jan 14, 2014 at 12:20 PM, James Bottomley
>  wrote:
>> On Tue, 2014-01-14 at 15:15 -0200, Claudio Freire wrote:
>>> On Tue, Jan 14, 2014 at 2:12 PM, Robert Haas  wrote:
>>> > In terms of avoiding double-buffering, here's my thought after reading
>>> > what's been written so far.  Suppose we read a page into our buffer
>>> > pool.  Until the page is clean, it would be ideal for the mapping to
>>> > be shared between the buffer cache and our pool, sort of like
>>> > copy-on-write.  That way, if we decide to evict the page, it will
>>> > still be in the OS cache if we end up needing it again (remember, the
>>> > OS cache is typically much larger than our buffer pool).  But if the
>>> > page is dirtied, then instead of copying it, just have the buffer pool
>>> > forget about it, because at that point we know we're going to write
>>> > the page back out anyway before evicting it.
>>> >
>>> > This would be pretty similar to copy-on-write, except without the
>>> > copying.  It would just be forget-from-the-buffer-pool-on-write.
>>>
>>> But... either copy-on-write or forget-on-write needs a page fault, and
>>> thus a page mapping.
>>>
>>> Is a page fault more expensive than copying 8k?
>>>
>>> (I really don't know).
>>
>> A page fault can be expensive, yes ... but perhaps you don't need one.
>>
>> What you want is a range of memory that's read from a file but treated
>> as anonymous for writeout (i.e. written to swap if we need to reclaim
>> it). Then at some time later, you want to designate it as written back
>> to the file instead so you control the writeout order.  I'm not sure we
>> can do this: the separation between file backed and anonymous pages is
>> pretty deeply ingrained into the OS, but if it were possible, is that
>> what you want?
>
> Doesn't sound exactly like what I had in mind.  What I was suggesting
> is an analogue of read() that, if it reads full pages of data to a
> page-aligned address, shares the data with the buffer cache until it's
> first written instead of actually copying the data.  The pages are
> write-protected so that an attempt to write the address range causes a
> page fault.  In response to such a fault, the pages become anonymous
> memory and the buffer cache no longer holds a reference to the page.


Yes, that's basically zero-copy reads.

It could be done. The kernel can remap the page to the physical page
holding the shared buffer and mark it read-only, then expire the
buffer and transfer ownership of the page if any page fault happens.

But that incurrs:
 - Page faults, lots
 - Hugely bloated mappings, unless KSM is somehow leveraged for this

And there's a nice bingo. Had forgotten about KSM. KSM could help lots.

I could try to see of madvising shared_buffers as mergeable helps. But
this should be an automatic case of KSM - ie, when reading into a
page-aligned address, the kernel should summarily apply KSM-style
sharing without hinting. The current madvise interface puts the burden
of figuring out what duplicates what on the kernel, but postgres
already knows.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

On Tue, Jan 14, 2014 at 2:17 PM, Robert Haas  wrote:
> On Tue, Jan 14, 2014 at 12:15 PM, Claudio Freire  
> wrote:
>> On Tue, Jan 14, 2014 at 2:12 PM, Robert Haas  wrote:
>>> In terms of avoiding double-buffering, here's my thought after reading
>>> what's been written so far.  Suppose we read a page into our buffer
>>> pool.  Until the page is clean, it would be ideal for the mapping to
>>> be shared between the buffer cache and our pool, sort of like
>>> copy-on-write.  That way, if we decide to evict the page, it will
>>> still be in the OS cache if we end up needing it again (remember, the
>>> OS cache is typically much larger than our buffer pool).  But if the
>>> page is dirtied, then instead of copying it, just have the buffer pool
>>> forget about it, because at that point we know we're going to write
>>> the page back out anyway before evicting it.
>>>
>>> This would be pretty similar to copy-on-write, except without the
>>> copying.  It would just be forget-from-the-buffer-pool-on-write.
>>
>> But... either copy-on-write or forget-on-write needs a page fault, and
>> thus a page mapping.
>>
>> Is a page fault more expensive than copying 8k?
>
> I don't know either.  I wasn't thinking so much that it would save CPU
> time as that it would save memory.  Consider a system with 32GB of
> RAM.  If you set shared_buffers=8GB, then in the worst case you've got
> 25% of your RAM wasted storing pages that already exist, dirtied, in
> shared_buffers.  It's easy to imagine scenarios in which that results
> in lots of extra I/O, so that the CPU required to do the accounting
> comes to seem cheap by comparison.

Not necessarily, you pay the CPU cost on each page fault (ie: first
write to the buffer at least), each time the page checks into the
shared buffers level.

It's like a tiered cache.

When promoting is expensive, one must be careful. The traffic to/from
the L0 (shared buffers) and L1 (page cache) will be considerable, even
if everything fits in RAM.

I guess it's the constant battle between inclusive and exclusive caches.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Jeff Janes

On Mon, Jan 13, 2014 at 6:44 PM, Dave Chinner  wrote:

> On Tue, Jan 14, 2014 at 02:26:25AM +0100, Andres Freund wrote:
> > On 2014-01-13 17:13:51 -0800, James Bottomley wrote:
> > > a file into a user provided buffer, thus obtaining a page cache entry
> > > and a copy in their userspace buffer, then insert the page of the user
> > > buffer back into the page cache as the page cache page ... that's
> right,
> > > isn't it postgress people?
> >
> > Pretty much, yes. We'd probably hint (*advise(DONTNEED)) that the page
> > isn't needed anymore when reading. And we'd normally write if the page
> > is dirty.
>
> So why, exactly, do you even need the kernel page cache here?

We don't need it, but it would be nice.

> You've
> got direct access to the copy of data read into userspace, and you
> want direct control of when and how the data in that buffer is
> written and reclaimed. Why push that data buffer back into the
> kernel and then have to add all sorts of kernel interfaces to
> control the page you already have control of?
>

Say 25% of the RAM is dedicated to the database's shared buffers, and 75%
is left to the kernel's judgement.  It sure would be nice if the kernel had
the capability of using some of that 75% for database pages, if it thought
that that was the best use for it.

Which is what we do get now, at the expense of quite a lot of double
buffering (by which I mean, a lot of pages are both in the kernel cache and
the database cache--not just transiently during the copy process, but for
quite a while).  If we had the ability to re-inject clean pages into the
kernel's cache, we would get that benefit without the double buffering.

Cheers,

Jeff

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

On Tue, 2014-01-14 at 10:39 -0500, Tom Lane wrote:
> James Bottomley  writes:
> > The current mechanism for coherency between a userspace cache and the
> > in-kernel page cache is mmap ... that's the only way you get the same
> > page in both currently.
> 
> Right.
> 
> > glibc used to have an implementation of read/write in terms of mmap, so
> > it should be possible to insert it into your current implementation
> > without a major rewrite.  The problem I think this brings you is
> > uncontrolled writeback: you don't want dirty pages to go to disk until
> > you issue a write()
> 
> Exactly.
> 
> > I think we could fix this with another madvise():
> > something like MADV_WILLUPDATE telling the page cache we expect to alter
> > the pages again, so don't be aggressive about cleaning them.
> 
> "Don't be aggressive" isn't good enough.  The prohibition on early write
> has to be absolute, because writing a dirty page before we've done
> whatever else we need to do results in a corrupt database.  It has to
> be treated like a write barrier.
> 
> > The problem is we can't give you absolute control of when pages are
> > written back because that interface can be used to DoS the system: once
> > we get too many dirty uncleanable pages, we'll thrash looking for memory
> > and the system will livelock.
> 
> Understood, but that makes this direction a dead end.  We can't use
> it if the kernel might decide to write anyway.

No, I'm sorry, that's never going to be possible.  No user space
application has all the facts.  If we give you an interface to force
unconditional holding of dirty pages in core you'll livelock the system
eventually because you made a wrong decision to hold too many dirty
pages.   I don't understand why this has to be absolute: if you advise
us to hold the pages dirty and we do up until it becomes a choice to
hold on to the pages or to thrash the system into a livelock, why would
you ever choose the latter?  And if, as I'm assuming, you never would,
why don't you want the kernel to make that choice for you?

James



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

On Tue, 2014-01-14 at 15:15 -0200, Claudio Freire wrote:
> On Tue, Jan 14, 2014 at 2:12 PM, Robert Haas  wrote:
> >
> > In terms of avoiding double-buffering, here's my thought after reading
> > what's been written so far.  Suppose we read a page into our buffer
> > pool.  Until the page is clean, it would be ideal for the mapping to
> > be shared between the buffer cache and our pool, sort of like
> > copy-on-write.  That way, if we decide to evict the page, it will
> > still be in the OS cache if we end up needing it again (remember, the
> > OS cache is typically much larger than our buffer pool).  But if the
> > page is dirtied, then instead of copying it, just have the buffer pool
> > forget about it, because at that point we know we're going to write
> > the page back out anyway before evicting it.
> >
> > This would be pretty similar to copy-on-write, except without the
> > copying.  It would just be forget-from-the-buffer-pool-on-write.
> 
> 
> But... either copy-on-write or forget-on-write needs a page fault, and
> thus a page mapping.
> 
> Is a page fault more expensive than copying 8k?
> 
> (I really don't know).

A page fault can be expensive, yes ... but perhaps you don't need one. 

What you want is a range of memory that's read from a file but treated
as anonymous for writeout (i.e. written to swap if we need to reclaim
it).  Then at some time later, you want to designate it as written back
to the file instead so you control the writeout order.  I'm not sure we
can do this: the separation between file backed and anonymous pages is
pretty deeply ingrained into the OS, but if it were possible, is that
what you want?

James




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance