Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance

2014-01-16 Thread Greg Stark
On Wed, Jan 15, 2014 at 7:53 AM, Mel Gorman mgor...@suse.de wrote:
 The second is to have
 pages that are strictly kept dirty until the application syncs them. An
 unbounded number of these pages would blow up but maybe bounds could be
 placed on it. There are no solid conclusions on that part yet.

I think the interface would be subtler than that. The current
architecture is that if an individual process decides to evict one of
these pages it knows how much of the log needs to be flushed and
fsynced before it can do so and proceeds to do it itself. This is a
situation to be avoided as much as possible but there are workloads
where it's inevitable (the typical example is mass data loads).

There would need to be some similar interface by which the kernel can
force log pages to be written so that it can advance the epoch: either
some way to wake Postgres up and inform it of the urgency, or, better
yet, Postgres would just always be writing out pages without fsyncing
them, instead issuing some other syscall to mark the points in the log
file that correspond to the write barriers that would unpin these
buffers.

Ted Ts'o was concerned this would all be a massive layering violation
and I have to admit that's a huge risk. It would take some clever API
engineering to come up with a clean set of primitives to express the kind
of ordering guarantees we need without being too tied to Postgres's
specific implementation. The reason I think it's more interesting
though is that Postgres's journalling and checkpointing architecture
is pretty bog-standard CS stuff and there are hundreds or thousands of
pieces of software out there that do pretty much the same work and
trying to do it efficiently with fsync or O_DIRECT is like working
with both hands tied to your feet.

-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance

2014-01-16 Thread Jeff Janes
On Thursday, January 16, 2014, Dave Chinner da...@fromorbit.com wrote:

 On Thu, Jan 16, 2014 at 03:58:56PM -0800, Jeff Janes wrote:
  On Thu, Jan 16, 2014 at 3:23 PM, Dave Chinner da...@fromorbit.com
 wrote:
 
   On Wed, Jan 15, 2014 at 06:14:18PM -0600, Jim Nasby wrote:
On 1/15/14, 12:00 AM, Claudio Freire wrote:
My completely unproven theory is that swapping is overwhelmed by
near-misses. Ie: a process touches a page, and before it's
actually swapped in, another process touches it too, blocking on
the other process' read. But the second process doesn't account
for that page when evaluating predictive models (ie: read-ahead),
so the next I/O by process 2 is unexpected to the kernel. Then
the same with 1. Etc... In essence, swap, by a fluke of its
implementation, fails utterly to predict the I/O pattern, and
results in far sub-optimal reads.

Explicit I/O is free from that effect, all read calls are
accountable, and that makes a difference.

Maybe, if the kernel could be fixed in that respect, you could
consider mmap'd files as a suitable form of temporary storage.
But that would depend on the success and availability of such a
fix/patch.
   
Another option is to consider some of the more radical ideas in
this thread, but only for temporary data. Our write sequencing and
other needs are far less stringent for this stuff.  -- Jim C.
  
   I suspect that a lot of the temporary data issues can be solved by
   using tmpfs for temporary files
  
 
  Temp files can collectively reach hundreds of gigs.

 So unless you have terabytes of RAM you're going to have to write
 them back to disk.


If they turn out to be hundreds of gigs, then yes they have to hit disk (at
least on my hardware).  But if they are 10 gig, then maybe not (depending
on whether other people decide to do similar things at the same time I'm
going to be doing it--something which is often hard to predict).   But now
for every action I take, I have to decide: is this going to take 10 gig, or
14 gig, and how absolutely certain am I?  And is someone else going to try
something similar at the same time?  What a hassle.  It would be so much
nicer to say "This is accessed sequentially, and will never be fsynced.
Maybe it will fit entirely in memory, maybe it won't; either way, you know
what to do."

If I start out writing to tmpfs, I can't very easily change my mind 94% of
the way through and decide to go somewhere else.  But the kernel,
effectively, can.


 But there's something here that I'm not getting - you're talking
 about a data set that you want to keep cache resident that is at
 least an order of magnitude larger than the cyclic 5-15 minute WAL
 dataset that ongoing operations need to manage to avoid IO storms.


Those are mostly orthogonal issues.  The permanent files need to be fsynced
on a regular basis, and might have gigabytes of data dirtied at random from
within terabytes of underlying storage.  We had better start writing that
pretty quickly, or when we do issue the fsyncs, the world will fall apart.

The temporary files will never need to be fsynced, and can be written out
sequentially if they do ever need to be written out.  Better to delay this
as much as feasible.


Where do these temporary files fit into this picture, how fast do
 they grow and why do they need to be so large in comparison to
 the ongoing modifications being made to the database?


The permanent files tend to be things like "Jane Doe just bought a pair of
green shoes from Hendrick Green Shoes Limited--record that, charge her
credit card, and schedule delivery."  The temp files are more like "It is
the end of the year; how many shoes have been purchased in each color from
each manufacturer for each quarter over the last 6 years?"  So the temp
files quickly manipulate data that has slowly been accumulating over very
long times, while the permanent files represent the processes of those
accumulations.

If you are Amazon, of course, you have thousands of people who can keep two
sets of records, one organized for fast update and one slightly delayed
copy reorganized for fast analysis, and also do partial analysis on an
ongoing basis and roll them up in ways that can be incrementally updated.
 If you are not Amazon, it would be nice if one system did a better job of
doing both things with the trade off between the two being dynamic and
automatic.

Cheers,

Jeff


Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Mel Gorman
(This thread is now massive and I have not read it all yet. If anything
I say has already been discussed then whoops)

On Tue, Jan 14, 2014 at 12:09:46PM +1100, Dave Chinner wrote:
 On Mon, Jan 13, 2014 at 09:29:02PM +, Greg Stark wrote:
  On Mon, Jan 13, 2014 at 9:12 PM, Andres Freund and...@2ndquadrant.com 
  wrote:
   For one, postgres doesn't use mmap for files (and can't without major
   new interfaces). Frequently mmap()/madvise()/munmap()ing 8kb chunks has
   horrible consequences for performance/scalability - very quickly you
   contend on locks in the kernel.
  
  I may as well dump this in this thread. We've discussed this in person
  a few times, including at least once with Ted Ts'o when he visited
  Dublin last year.
  
  The fundamental conflict is that the kernel understands better the
  hardware and other software using the same resources, Postgres
  understands better its own access patterns. We need to either add
  interfaces so Postgres can teach the kernel what it needs about its
  access patterns or add interfaces so Postgres can find out what it
  needs to know about the hardware context.
 
 In my experience applications don't need to know anything about the
 underlying storage hardware - all they need is for someone to 
 tell them the optimal IO size and alignment to use.
 

That potentially misses details of efficient IO patterns. They might,
for example, submit many small requests, each of the optimal IO size
and alignment, that are sub-optimal taken together. While these still
go through the underlying block layers, there is no guarantee that the
requests will arrive in time for efficient merging to occur.

  The more ambitious and interesting direction is to let Postgres tell
  the kernel what it needs to know to manage everything. To do that we
  would need the ability to control when pages are flushed out. This is
  absolutely necessary to maintain consistency. Postgres would need to
  be able to mark pages as unflushable until some point in time in the
  future when the journal is flushed. We discussed various ways that
  interface could work but it would be tricky to keep it low enough
  overhead to be workable.
 
 IMO, the concept of allowing userspace to pin dirty page cache
 pages in memory is just asking for trouble. Apart from the obvious
 memory reclaim and OOM issues, some filesystems won't be able to
 move their journals forward until the data is flushed. i.e. ordered
 mode data writeback on ext3 will have all sorts of deadlock issues
 that result from pinning pages and then issuing fsync() on another
 file which will block waiting for the pinned pages to be flushed.
 

That applies if the dirty pages are forced to be kept dirty. You call
this "pinned", but "pinned" has special meaning, so I would suggest
calling them something like dirty-sticky pages. It could be that such
hinting would exclude the pages from dirty background writing while
still allowing them to be cleaned if dirty limits are hit or if fsync
is called. It's a hint, not a forced guarantee.

It's still a hand grenade if this is tracked on a per-page basis, because
what happens if the process crashes? Those pages stay dirty
potentially forever. An alternative would be to track this on a per-inode
instead of per-page basis. The hint would only exist where there is an
open fd for that inode.  Treat it as a privileged call with a sysctl
controlling how many dirty-sticky pages can exist in the system with the
information presented during OOM kills and maybe it starts becoming a bit
more manageable. Dirty-sticky pages are not guaranteed to stay dirty
until userspace action, the kernel just stays away until there are no
other sensible options.

 Indeed, what happens if you do pin_dirty_pages(fd); fsync(fd);?
 If fsync() blocks because there are pinned pages, and there's no
 other thread to unpin them, then that code just deadlocked.

Indeed. Forcing pages with this hint to stay dirty until user space decides
to clean them is eventually going to blow up.

 <SNIP>
 Hmmm. What happens if the process crashes after pinning the dirty
 pages? How do we even know what process pinned the dirty pages so
 we can clean up after it? What happens if the same page is pinned by
 multiple processes? What happens on truncate/hole punch if the
 partial pages in the range that need to be zeroed and written are
 pinned? What happens if we do direct IO to a range with pinned,
 unflushable pages in the page cache?
 

Proposal: A process with an open fd can hint that pages managed by this
inode will have dirty-sticky pages. Pages will be ignored by
dirty background writing unless there is an fsync call or
dirty page limits are hit. The hint is cleared when no process
has the file open.

If the process crashes, the hint is cleared and the pages get cleaned as
normal

Multiple processes do not matter as such, since all of them will have the
file open. There is a problem if the processes 

Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Mel Gorman
On Wed, Jan 15, 2014 at 09:44:21AM +, Mel Gorman wrote:
  <SNIP>
  Hmmm. What happens if the process crashes after pinning the dirty
  pages? How do we even know what process pinned the dirty pages so
  we can clean up after it? What happens if the same page is pinned by
  multiple processes? What happens on truncate/hole punch if the
  partial pages in the range that need to be zeroed and written are
  pinned? What happens if we do direct IO to a range with pinned,
  unflushable pages in the page cache?
  
 
 Proposal: A process with an open fd can hint that pages managed by this
   inode will have dirty-sticky pages. Pages will be ignored by
   dirty background writing unless there is an fsync call or
   dirty page limits are hit. The hint is cleared when no process
   has the file open.
 

I'm still processing the rest of the thread and putting it into my head
but it's at least clear that this proposal would only cover the case where
large temporary files are created that do not necessarily need to be
persisted. They still have cases where the ordering of writes matter and
the kernel cleaning pages behind their back would lead to corruption.

-- 
Mel Gorman
SUSE Labs




Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Robert Haas
On Wed, Jan 15, 2014 at 4:44 AM, Mel Gorman mgor...@suse.de wrote:
 That applies if the dirty pages are forced to be kept dirty. You call
 this pinned but pinned has special meaning so I would suggest calling it
 something like dirty-sticky pages. It could be the case that such hinting
 will have the pages excluded from dirty background writing but can still
 be cleaned if dirty limits are hit or if fsync is called. It's a hint,
 not a forced guarantee.

 It's still a hand grenade if this is tracked on a per-page basis, because
 what happens if the process crashes? Those pages stay dirty
 potentially forever. An alternative would be to track this on a per-inode
 instead of per-page basis. The hint would only exist where there is an
 open fd for that inode.  Treat it as a privileged call with a sysctl
 controlling how many dirty-sticky pages can exist in the system with the
 information presented during OOM kills and maybe it starts becoming a bit
 more manageable. Dirty-sticky pages are not guaranteed to stay dirty
 until userspace action, the kernel just stays away until there are no
 other sensible options.

I think this discussion is vividly illustrating why this whole line of
inquiry is a pile of fail.  If all the processes that have the file
open crash, the changes have to be *thrown away* not written to disk
whenever the kernel likes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Mel Gorman
On Wed, Jan 15, 2014 at 10:16:27AM -0500, Robert Haas wrote:
 On Wed, Jan 15, 2014 at 4:44 AM, Mel Gorman mgor...@suse.de wrote:
  That applies if the dirty pages are forced to be kept dirty. You call
  this pinned but pinned has special meaning so I would suggest calling it
  something like dirty-sticky pages. It could be the case that such hinting
  will have the pages excluded from dirty background writing but can still
  be cleaned if dirty limits are hit or if fsync is called. It's a hint,
  not a forced guarantee.
 
  It's still a hand grenade if this is tracked on a per-page basis, because
  what happens if the process crashes? Those pages stay dirty
  potentially forever. An alternative would be to track this on a per-inode
  instead of per-page basis. The hint would only exist where there is an
  open fd for that inode.  Treat it as a privileged call with a sysctl
  controlling how many dirty-sticky pages can exist in the system with the
  information presented during OOM kills and maybe it starts becoming a bit
  more manageable. Dirty-sticky pages are not guaranteed to stay dirty
  until userspace action, the kernel just stays away until there are no
  other sensible options.
 
 I think this discussion is vividly illustrating why this whole line of
 inquiry is a pile of fail.  If all the processes that have the file
 open crash, the changes have to be *thrown away* not written to disk
 whenever the kernel likes.
 

I realise that now and sorry for the noise.

I later read the parts of the thread that covered the strict ordering
requirements and in a summary mail I split the requirements in two. In one,
there are dirty sticky pages that the kernel should not writeback unless
it has no other option or fsync is called. This may be suitable for large
temporary files that Postgres does not necessarily want to hit the platter
but also does not have strict ordering requirements for. The second is to have
pages that are strictly kept dirty until the application syncs them. An
unbounded number of these pages would blow up but maybe bounds could be
placed on it. There are no solid conclusions on that part yet.

-- 
Mel Gorman
SUSE Labs




Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Robert Haas
On Wed, Jan 15, 2014 at 10:53 AM, Mel Gorman mgor...@suse.de wrote:
 I realise that now and sorry for the noise.

 I later read the parts of the thread that covered the strict ordering
 requirements and in a summary mail I split the requirements in two. In one,
 there are dirty sticky pages that the kernel should not writeback unless
 it has no other option or fsync is called. This may be suitable for large
 temporary files that Postgres does not necessarily want to hit the platter
 but also does not have strict ordering requirements for. The second is to have
 pages that are strictly kept dirty until the application syncs them. An
 unbounded number of these pages would blow up but maybe bounds could be
 placed on it. There are no solid conclusions on that part yet.

I think that the bottom line is that we're not likely to make massive
changes to the way that we do block caching now.  Even if some other
scheme could work much better on Linux (and so far I'm unconvinced
that any of the proposals made here would in fact work much better),
we aim to be portable to Windows as well as other UNIX-like systems
(BSD, Solaris, etc.).  So using completely Linux-specific technology
in an overhaul of our block cache seems to me to have no future.

On the other hand, giving the kernel hints about what we're doing that
would enable it to be smarter seems to me to have a lot of potential.
Ideas so far mentioned include:

- Hint that we're going to do an fsync on file X at time Y, so that
the kernel can schedule the write-out to complete right around that
time.
- Hint that a block is a good candidate for reclaim without actually
purging it if there's no memory pressure.
- Hint that a page we modify in our cache should be dropped from the
kernel cache.
- Hint that a page we write back to the operating system should be
dropped from the kernel cache after the I/O completes.

It's hard to say which of these ideas will work well without testing
them, and the overhead of the extra system calls might be significant
in some of those cases, but it seems a promising line of inquiry.

And the idea of being able to do an 8kB atomic write with OS support
so that we don't have to save full page images in our write-ahead log
to cover the torn page scenario seems very intriguing indeed.  If
that worked well, it would be a *big* deal for us.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 I think that the bottom line is that we're not likely to make massive
 changes to the way that we do block caching now.  Even if some other
 scheme could work much better on Linux (and so far I'm unconvinced
 that any of the proposals made here would in fact work much better),
 we aim to be portable to Windows as well as other UNIX-like systems
 (BSD, Solaris, etc.).  So using completely Linux-specific technology
 in an overhaul of our block cache seems to me to have no future.

Unfortunately, I have to agree with this.  Even if there were a way to
merge our internal buffers with the kernel's, it would surely be far
too invasive to coexist with buffer management that'd still work on
more traditional platforms.

But we could add hint calls, or modify the I/O calls we use, and that
ought to be a reasonably localized change.

 And the idea of being able to do an 8kB atomic write with OS support
 so that we don't have to save full page images in our write-ahead log
 to cover the torn page scenario seems very intriguing indeed.  If
 that worked well, it would be a *big* deal for us.

+1.  That would be a significant win, and trivial to implement, since
we already have a way to switch off full-page images for people who
trust their filesystems to do atomic writes.  It's just that safe
use of that switch isn't widely possible ...

regards, tom lane




Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Claudio Freire
On Wed, Jan 15, 2014 at 2:52 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 I think that the bottom line is that we're not likely to make massive
 changes to the way that we do block caching now.  Even if some other
 scheme could work much better on Linux (and so far I'm unconvinced
 that any of the proposals made here would in fact work much better),
 we aim to be portable to Windows as well as other UNIX-like systems
 (BSD, Solaris, etc.).  So using completely Linux-specific technology
 in an overhaul of our block cache seems to me to have no future.

 Unfortunately, I have to agree with this.  Even if there were a way to
 merge our internal buffers with the kernel's, it would surely be far
 too invasive to coexist with buffer management that'd still work on
 more traditional platforms.

 But we could add hint calls, or modify the I/O calls we use, and that
 ought to be a reasonably localized change.


That's what's pretty nice with the zero-copy read idea. It's almost
transparent. You read to a page-aligned address, and it works. The
only code change would be enabling zero-copy reads, which I'm not sure
will be low-overhead enough to leave enabled by default.




Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Hannu Krosing
On 01/13/2014 11:22 PM, James Bottomley wrote:

 The less exciting, more conservative option would be to add kernel
 interfaces to teach Postgres about things like raid geometries. Then
 Postgres could use directio and decide to do prefetching based on the
 raid geometry, how much available i/o bandwidth and iops is available,
 etc.

 Reimplementing i/o schedulers and all the rest of the work that the
 kernel provides inside Postgres just seems like something outside our
 competency and that none of us is really excited about doing.
 This would also be a well trodden path ... I believe that some large
 database company introduced Direct IO for roughly this purpose.

The file systems at that time were much worse than they are now,
so said large companies had no choice but to write their own.

As Linux file handling has been much better for most of PostgreSQL's
active development, we have been able to avoid it and still have
reasonable performance.

What has been pointed out above are some (allegedly
desktop/mobile-influenced) decisions which broke good
performance.

Cheers

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ





Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Dave Chinner
On Mon, Jan 13, 2014 at 09:29:02PM +, Greg Stark wrote:
 On Mon, Jan 13, 2014 at 9:12 PM, Andres Freund and...@2ndquadrant.com wrote:
  For one, postgres doesn't use mmap for files (and can't without major
  new interfaces). Frequently mmap()/madvise()/munmap()ing 8kb chunks has
  horrible consequences for performance/scalability - very quickly you
  contend on locks in the kernel.
 
 I may as well dump this in this thread. We've discussed this in person
 a few times, including at least once with Ted Ts'o when he visited
 Dublin last year.
 
 The fundamental conflict is that the kernel understands better the
 hardware and other software using the same resources, Postgres
 understands better its own access patterns. We need to either add
 interfaces so Postgres can teach the kernel what it needs about its
 access patterns or add interfaces so Postgres can find out what it
 needs to know about the hardware context.

In my experience applications don't need to know anything about the
underlying storage hardware - all they need is for someone to 
tell them the optimal IO size and alignment to use.

 The more ambitious and interesting direction is to let Postgres tell
 the kernel what it needs to know to manage everything. To do that we
 would need the ability to control when pages are flushed out. This is
 absolutely necessary to maintain consistency. Postgres would need to
 be able to mark pages as unflushable until some point in time in the
 future when the journal is flushed. We discussed various ways that
 interface could work but it would be tricky to keep it low enough
 overhead to be workable.

IMO, the concept of allowing userspace to pin dirty page cache
pages in memory is just asking for trouble. Apart from the obvious
memory reclaim and OOM issues, some filesystems won't be able to
move their journals forward until the data is flushed. i.e. ordered
mode data writeback on ext3 will have all sorts of deadlock issues
that result from pinning pages and then issuing fsync() on another
file which will block waiting for the pinned pages to be flushed.

Indeed, what happens if you do pin_dirty_pages(fd); fsync(fd);?
If fsync() blocks because there are pinned pages, and there's no
other thread to unpin them, then that code just deadlocked. If
fsync() doesn't block and skips the pinned pages, then we haven't
done an fsync() at all, and so violated the expectation that users
have that after fsync() returns their data is safe on disk. And if
we return an error to fsync(), then what the hell does the user do
if it is some other application we don't know about that has pinned
the pages? And if the kernel unpins them after some time, then we
just violated the application's consistency guarantees

Hmmm. What happens if the process crashes after pinning the dirty
pages?  How do we even know what process pinned the dirty pages so
we can clean up after it? What happens if the same page is pinned by
multiple processes? What happens on truncate/hole punch if the
partial pages in the range that need to be zeroed and written are
pinned? What happens if we do direct IO to a range with pinned,
unflushable pages in the page cache?

These are all complex corner cases that are introduced by allowing
applications to pin dirty pages in memory. I've only spent a few
minutes coming up with these, and I'm sure there's more of them.
As such, I just don't see that allowing userspace to pin dirty
page cache pages in memory being a workable solution.

 The less exciting, more conservative option would be to add kernel
 interfaces to teach Postgres about things like raid geometries. Then

/sys/block/<dev>/queue/* contains all the information that is
exposed to filesystems to optimise layout for storage geometry.
Some filesystems can already expose the relevant parts of this
information to userspace, others don't.

What I think we really need to provide is a generic interface
similar to the old XFS_IOC_DIOINFO ioctl that can be used to
expose IO characteristics to applications in a simple, easy to
gather manner.  Something like:

struct io_info {
	u64 minimum_io_size;      /* sector size */
	u64 maximum_io_size;      /* currently 2GB */
	u64 optimal_io_size;      /* stripe unit/width */
	u64 optimal_io_alignment; /* stripe unit/width */
	u64 mem_alignment;        /* PAGE_SIZE */
	u32 queue_depth;          /* max IO concurrency */
};

 Postgres could use directio and decide to do prefetching based on the
 raid geometry,

Underlying storage array raid geometry and optimal IO sizes for the
filesystem may be different. Hence you want what the filesystem
considers optimal, not what the underlying storage is configured
with. Indeed, a filesystem might be able to supply per-file IO
characteristics depending on where it is located in the filesystem
(think tiered storage)

 how much available i/o bandwidth and iops is available,
 etc.

The kernel 

Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Gavin Flower

On 14/01/14 14:09, Dave Chinner wrote:

On Mon, Jan 13, 2014 at 09:29:02PM +, Greg Stark wrote:

On Mon, Jan 13, 2014 at 9:12 PM, Andres Freund and...@2ndquadrant.com wrote:

[...]

The more ambitious and interesting direction is to let Postgres tell
the kernel what it needs to know to manage everything. To do that we
would need the ability to control when pages are flushed out. This is
absolutely necessary to maintain consistency. Postgres would need to
be able to mark pages as unflushable until some point in time in the
future when the journal is flushed. We discussed various ways that
interface could work but it would be tricky to keep it low enough
overhead to be workable.

IMO, the concept of allowing userspace to pin dirty page cache
pages in memory is just asking for trouble. Apart from the obvious
memory reclaim and OOM issues, some filesystems won't be able to
move their journals forward until the data is flushed. i.e. ordered
mode data writeback on ext3 will have all sorts of deadlock issues
that result from pinning pages and then issuing fsync() on another
file which will block waiting for the pinned pages to be flushed.

Indeed, what happens if you do pin_dirty_pages(fd); fsync(fd);?
If fsync() blocks because there are pinned pages, and there's no
other thread to unpin them, then that code just deadlocked. If
fsync() doesn't block and skips the pinned pages, then we haven't
done an fsync() at all, and so violated the expectation that users
have that after fsync() returns their data is safe on disk. And if
we return an error to fsync(), then what the hell does the user do
if it is some other application we don't know about that has pinned
the pages? And if the kernel unpins them after some time, then we
just violated the application's consistency guarantees


[...]

What if Postgres could tell the kernel how strongly it wanted to
hold on to the pages?


Say a byte (this is arbitrary, it could be a single hint bit which meant
"please, Please, PLEASE don't flush, if that is okay with you Mr
Kernel..."), so strength would be S = (unsigned byte value)/256, so
0 <= S < 1.


S = 0  =>  flush now.
0 < S < 1  =>  flush if the 'need' is greater than S.
S = 1  =>  never flush (note a value of 1 cannot occur, as max S = 255/256).

Postgres could use low non-zero S values if it thinks that pages /might/ 
still be useful later, and very high values when it is /more certain/.  
I am sure Postgres must sometimes know when some pages are more 
important to hold onto than others, hence my feeling that S should be 
more than one bit.


The kernel might simply flush pages starting at ones with low values of 
S working upwards until it has freed enough memory to resolve its memory 
pressure.  So an explicit numerical value of 'need' (as implied above) 
is not required.  Also any practical implementation would not use 'S' as 
a float/double, but use integer values for 'S' & 'need' - assuming that 
'need' did have to be an actual value, which I suspect would not be 
required.


This way the kernel is free to flush all such pages, when sufficient 
need arises - yet usually, when there is sufficient memory, the pages 
will be held unflushed.



Cheers,
Gavin


Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Robert Haas
On Tue, Jan 14, 2014 at 2:03 PM, Gavin Flower
gavinflo...@archidevsys.co.nz wrote:
 Say a byte (this is arbitrary, it could be a single hint bit which meant
 "please, Please, PLEASE don't flush, if that is okay with you Mr
 Kernel..."), so strength would be S = (unsigned byte value)/256, so
 0 <= S < 1.

 S = 0 => flush now.
 0 < S < 1 => flush if the 'need' is greater than S.
 S = 1 => never flush (note a value of 1 cannot occur, as max S = 255/256)

 Postgres could use low non-zero S values if it thinks that pages might still
 be useful later, and very high values when it is more certain.  I am sure
 Postgres must sometimes know when some pages are more important to hold onto
 than others, hence my feeling that S should be more than one bit.

 The kernel might simply flush pages starting at ones with low values of S
 working upwards until it has freed enough memory to resolve its memory
 pressure.  So an explicit numerical value of 'need' (as implied above) is
 not required.  Also any practical implementation would not use 'S' as a
 float/double, but use integer values for 'S' & 'need' - assuming that 'need'
 did have to be an actual value, which I suspect would not be required.

 This way the kernel is free to flush all such pages, when sufficient need
 arises - yet usually, when there is sufficient memory, the pages will be
 held unflushed.

Well, this just begs the question of what value PG ought to pass as
the parameter.

I think the alternate don't-need semantics (we don't think we need
this but please don't throw it away arbitrarily if there's no memory
pressure) would be a big win.  I don't think we know enough in user
space to be more precise than that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Dave Chinner
On Wed, Jan 15, 2014 at 08:03:28AM +1300, Gavin Flower wrote:
 On 14/01/14 14:09, Dave Chinner wrote:
 On Mon, Jan 13, 2014 at 09:29:02PM +, Greg Stark wrote:
 On Mon, Jan 13, 2014 at 9:12 PM, Andres Freund and...@2ndquadrant.com 
 wrote:
 [...]
 The more ambitious and interesting direction is to let Postgres tell
 the kernel what it needs to know to manage everything. To do that we
 would need the ability to control when pages are flushed out. This is
 absolutely necessary to maintain consistency. Postgres would need to
 be able to mark pages as unflushable until some point in time in the
 future when the journal is flushed. We discussed various ways that
 interface could work but it would be tricky to keep it low enough
 overhead to be workable.
 IMO, the concept of allowing userspace to pin dirty page cache
 pages in memory is just asking for trouble. Apart from the obvious
 memory reclaim and OOM issues, some filesystems won't be able to
 move their journals forward until the data is flushed. i.e. ordered
 mode data writeback on ext3 will have all sorts of deadlock issues
 that result from pinning pages and then issuing fsync() on another
 file which will block waiting for the pinned pages to be flushed.
 
 Indeed, what happens if you do pin_dirty_pages(fd); fsync(fd);?
 If fsync() blocks because there are pinned pages, and there's no
 other thread to unpin them, then that code just deadlocked. If
 fsync() doesn't block and skips the pinned pages, then we haven't
 done an fsync() at all, and so violated the expectation that users
 have that after fsync() returns their data is safe on disk. And if
 we return an error to fsync(), then what the hell does the user do
 if it is some other application we don't know about that has pinned
 the pages? And if the kernel unpins them after some time, then we
 just violated the application's consistency guarantees.
 
 [...]
 
 What if Postgres could tell the kernel how strongly it wanted
 to hold on to the pages?

That doesn't get rid of the problems, it just makes it harder to
diagnose them when they occur. :/

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Jonathan Corbet
On Wed, 15 Jan 2014 09:23:52 +1100
Dave Chinner da...@fromorbit.com wrote:

 It appears to me that we are seeing large memory machines much more
 commonly in data centers - a couple of years ago 256GB RAM was only
 seen in supercomputers. Hence machines of this size are moving from
 "tweaking settings for supercomputers is OK" class to "tweaking
 settings for enterprise servers is not OK".
 
 Perhaps what we need to do is deprecate dirty_ratio and
 dirty_background_ratio as the default values and move to the
 byte-based values as the defaults and cap them appropriately.  e.g.
 10/20% of RAM for small machines down to a couple of GB for large
 machines.
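For reference, the byte-based knobs already exist alongside the ratios; when a *_bytes value is nonzero, the kernel ignores the corresponding *_ratio. A sysctl fragment in the spirit of the capped defaults Dave suggests, with purely illustrative figures:

```shell
# Existing writeback knobs (see Documentation/sysctl/vm.txt). When a
# *_bytes value is nonzero the kernel ignores the matching *_ratio,
# so dirty memory is capped absolutely rather than as a % of RAM.
# The figures are illustrative, not recommendations.
sysctl -w vm.dirty_background_bytes=1073741824   # background writeback from 1GB dirty
sysctl -w vm.dirty_bytes=2147483648              # throttle writers at 2GB dirty
```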

I had thought that was already in the works...it hits people on far
smaller systems than those described here.

http://lwn.net/Articles/572911/

I wonder if anybody ever finished this work out for 3.14?

jon




Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Dave Chinner
On Tue, Jan 14, 2014 at 05:38:10PM -0700, Jonathan Corbet wrote:
 On Wed, 15 Jan 2014 09:23:52 +1100
 Dave Chinner da...@fromorbit.com wrote:
 
  It appears to me that we are seeing large memory machines much more
  commonly in data centers - a couple of years ago 256GB RAM was only
  seen in supercomputers. Hence machines of this size are moving from
  "tweaking settings for supercomputers is OK" class to "tweaking
  settings for enterprise servers is not OK".
  
  Perhaps what we need to do is deprecate dirty_ratio and
  dirty_background_ratio as the default values and move to the
  byte-based values as the defaults and cap them appropriately.  e.g.
  10/20% of RAM for small machines down to a couple of GB for large
  machines.
 
 I had thought that was already in the works...it hits people on far
 smaller systems than those described here.
 
   http://lwn.net/Articles/572911/
 
 I wonder if anybody ever finished this work out for 3.14?

Not that I know of.  This patch was suggested as the solution to the
slow/fast drive issue that started the whole thread:

http://thread.gmane.org/gmane.linux.kernel/1584789/focus=1587059

but I don't see it in a current kernel. It might be in Andrew's tree
for 3.14, but I haven't checked.

However, most of the discussion in that thread about dirty limits
was a side show that rehashed old territory. Rate limiting and
throttling in a generic, scalable manner is a complex problem. We've
got some of the infrastructure we need to solve the problem, but
there was no conclusion as to the correct way to connect all the
dots.  Perhaps it's another topic for the LSFMM conf?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Greg Stark
On Mon, Jan 13, 2014 at 9:12 PM, Andres Freund and...@2ndquadrant.com wrote:
 For one, postgres doesn't use mmap for files (and can't without major
 new interfaces). Frequently mmap()/madvise()/munmap()ing 8kb chunks has
 horrible consequences for performance/scalability - very quickly you
 contend on locks in the kernel.


I may as well dump this in this thread. We've discussed this in person
a few times, including at least once with Ted Ts'o when he visited
Dublin last year.

The fundamental conflict is that the kernel understands better the
hardware and other software using the same resources, Postgres
understands better its own access patterns. We need to either add
interfaces so Postgres can teach the kernel what it needs about its
access patterns or add interfaces so Postgres can find out what it
needs to know about the hardware context.

The more ambitious and interesting direction is to let Postgres tell
the kernel what it needs to know to manage everything. To do that we
would need the ability to control when pages are flushed out. This is
absolutely necessary to maintain consistency. Postgres would need to
be able to mark pages as unflushable until some point in time in the
future when the journal is flushed. We discussed various ways that
interface could work but it would be tricky to keep it low enough
overhead to be workable.

The less exciting, more conservative option would be to add kernel
interfaces to teach Postgres about things like raid geometries. Then
Postgres could use directio and decide to do prefetching based on the
raid geometry, how much available i/o bandwidth and iops is available,
etc.

Reimplementing i/o schedulers and all the rest of the work that the
kernel provides inside Postgres just seems like something outside our
competency and that none of us is really excited about doing.

-- 
greg




Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Josh Berkus
Everyone,

I am looking for one or more hackers to go to Collab with me to discuss
this.  If you think that might be you, please let me know and I'll look
for funding for your travel.


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread James Bottomley
On Mon, 2014-01-13 at 21:29 +, Greg Stark wrote:
 On Mon, Jan 13, 2014 at 9:12 PM, Andres Freund and...@2ndquadrant.com wrote:
  For one, postgres doesn't use mmap for files (and can't without major
  new interfaces). Frequently mmap()/madvise()/munmap()ing 8kb chunks has
  horrible consequences for performance/scalability - very quickly you
  contend on locks in the kernel.
 
 
 I may as well dump this in this thread. We've discussed this in person
 a few times, including at least once with Ted Ts'o when he visited
 Dublin last year.
 
 The fundamental conflict is that the kernel understands better the
 hardware and other software using the same resources, Postgres
 understands better its own access patterns. We need to either add
 interfaces so Postgres can teach the kernel what it needs about its
 access patterns or add interfaces so Postgres can find out what it
 needs to know about the hardware context.
 
 The more ambitious and interesting direction is to let Postgres tell
 the kernel what it needs to know to manage everything. To do that we
 would need the ability to control when pages are flushed out. This is
 absolutely necessary to maintain consistency. Postgres would need to
 be able to mark pages as unflushable until some point in time in the
 future when the journal is flushed. We discussed various ways that
 interface could work but it would be tricky to keep it low enough
 overhead to be workable.

So in this case, the question would be what additional information do we
need to exchange that's not covered by the existing interfaces.  Between
madvise and splice, we seem to have most of what you want; what's
missing?

 The less exciting, more conservative option would be to add kernel
 interfaces to teach Postgres about things like raid geometries. Then
 Postgres could use directio and decide to do prefetching based on the
 raid geometry, how much available i/o bandwidth and iops is available,
 etc.
 
 Reimplementing i/o schedulers and all the rest of the work that the
 kernel provides inside Postgres just seems like something outside our
 competency and that none of us is really excited about doing.

This would also be a well trodden path ... I believe that some large
database company introduced Direct IO for roughly this purpose.

James



