Re: [HACKERS] WALWriteLock contention

2015-05-18 Thread Jeff Janes

 
  My goal there was to further improve group commit.  When running pgbench
  -j10 -c10, it was common to see fsyncs that alternated between flushing 1
  transaction and 9 transactions, because the first one to the gate would
  go through it and slam it on all the others, and it would take one fsync
  cycle for it to reopen.

 Hmm, yeah.  I remember somewhat (Peter Geoghegan, I think) mentioning
 behavior like that before, but I had not made the connection to this
 issue at that time.  This blog post is pretty depressing:

 http://oldblog.antirez.com/post/fsync-different-thread-useless.html

 It suggests that an fsync in progress blocks out not only other
 fsyncs, but other writes to the same file, which for our purposes is
 just awful.  More Googling around reveals that this is apparently
 well-known to Linux kernel developers and that they don't seem excited
 about fixing it.  :-(


I think they already did.  I don't see the effect in ext4, even on a rather
old kernel like 2.6.32, using the code from the link above.
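For anyone who wants to reproduce this, here is a rough Python re-creation of the antirez experiment (a sketch under stated assumptions, not the original C code): one thread keeps an fsync in flight while the main thread times small writes to the same descriptor.  On a filesystem where an in-progress fsync blocks writers, the write latencies spike; point `dir` at the filesystem you actually want to test, since a tmpfs temp directory won't show the effect.

```python
import os
import tempfile
import threading
import time

def run_probe(duration=1.0, write_size=512, dir=None):
    """Time small write()s while another thread fsync()s the same file."""
    fd, path = tempfile.mkstemp(dir=dir)
    stop = threading.Event()

    def syncer():
        # Keep an fsync permanently in flight against the same fd.
        while not stop.is_set():
            os.fsync(fd)

    t = threading.Thread(target=syncer)
    t.start()
    latencies = []
    payload = b"x" * write_size
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        t0 = time.monotonic()
        os.write(fd, payload)
        latencies.append(time.monotonic() - t0)
    stop.set()
    t.join()
    os.close(fd)
    os.unlink(path)
    return latencies

if __name__ == "__main__":
    lat = run_probe(duration=0.5)
    print("writes: %d, worst write latency: %.6fs" % (len(lat), max(lat)))
```

On a filesystem with the problem, the worst-case write latency approaches a full fsync cycle; on one without it, writes stay at in-memory speed.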



 <crazy-idea>I wonder if we could write WAL to two different files in
 alternation, so that we could be writing to one file while fsync-ing
 the other.</crazy-idea>


I thought the most promising thing, once there were timers and sleeps with
resolution much better than a centisecond, was to record the time at which
each fsync finished, and then sleep until then + commit_delay.  That way
you do no harm to the sleeper, since the write head is not positioned to
process the fsync until then anyway, and you give other workers the chance
to get their commit records in.
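That scheduling idea can be put into a toy model (my sketch, not PostgreSQL code; the class and its names are invented for illustration): remember when the last fsync completed, and let a would-be flusher sleep until that time plus commit_delay, on the theory that the disk could not have completed another fsync before then anyway.

```python
import time

class FlushScheduler:
    """Toy model: track when the last fsync finished and compute how long
    a flusher can sleep without delaying durability."""

    def __init__(self, commit_delay):
        self.commit_delay = commit_delay   # seconds (PostgreSQL's GUC is in µs)
        self.last_fsync_end = 0.0          # monotonic timestamp of last fsync

    def wait_budget(self, now):
        # Sleeping this long costs the sleeper nothing: the write head is
        # not positioned to process another fsync yet, and latecomers can
        # add their commit records to the batch in the meantime.
        return max(0.0, self.last_fsync_end + self.commit_delay - now)

    def record_fsync_end(self, now=None):
        self.last_fsync_end = time.monotonic() if now is None else now

s = FlushScheduler(commit_delay=0.003)
s.record_fsync_end(10.0)
print(round(s.wait_budget(10.001), 6))  # 0.002  (still inside the window)
print(s.wait_budget(10.010))            # 0.0    (window already past)
```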

But then I kind of lost interest, because anyone who cares very much about
commit performance will probably get a nonvolatile write cache, and
anything done would be too hardware/platform dependent.

Of course a BBU isn't magic: the kernel still has to spend time scrubbing
the buffer pool and sending the dirty pages to the disk/controller when it
gets an fsync, even if the confirmation does come back quickly.  But it
still seems too hardware/platform dependent to yield a general-purpose
optimization.

Cheers,

Jeff


Re: [HACKERS] WALWriteLock contention

2015-05-18 Thread Amit Kapila
On Mon, May 18, 2015 at 1:53 AM, Robert Haas robertmh...@gmail.com wrote:

 On May 17, 2015, at 11:04 AM, Amit Kapila amit.kapil...@gmail.com wrote:

 On Sun, May 17, 2015 at 7:45 AM, Robert Haas robertmh...@gmail.com
wrote:
 
  <crazy-idea>I wonder if we could write WAL to two different files in
  alternation, so that we could be writing to one file while fsync-ing
  the other.</crazy-idea>

 Won't the order of transaction replay during recovery cause
 problems if we alternate between files while writing?  I think this is one
 of the reasons WAL is written sequentially.  Another thing is that during
 recovery, whenever we currently encounter a mismatch between the stored
 CRC and the actual record CRC, we treat it as the end of recovery; with
 writes going to two files simultaneously we might need to rethink that rule.


 Well, yeah. That's why I said it was a crazy idea.


Another idea could be to write in units of the disk sector size, which I
think in most cases is 512 bytes (some newer disks do have larger sectors,
so it should be configurable in some way).  With this, ideally we wouldn't
need a CRC for each WAL record, as each unit of data would be either fully
written or not written at all.  Even if we don't want to rely on
sector-sized writes being atomic, we could have a configurable CRC per
writable unit (which in this scheme would be 512 bytes).

This could have a dual benefit.  First, it could help minimize the
repeated-writes problem, and second, by eliminating the need for a CRC on
each record, it could reduce WAL volume and CPU load.
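A sketch of what such a per-sector CRC scheme might look like (my illustration only; the 512-byte unit, 4-byte CRC, and zero padding are assumptions, not a worked-out on-disk format): each writable unit carries its own CRC, and recovery stops at the first unit that fails verification, which is also how a torn sector would be detected.

```python
import struct
import zlib

SECTOR = 512
CRC_LEN = 4
PAYLOAD = SECTOR - CRC_LEN          # 508 data bytes per on-disk sector

def encode(wal_bytes):
    """Pack a WAL byte stream into CRC-protected sectors."""
    out = []
    for off in range(0, len(wal_bytes), PAYLOAD):
        chunk = wal_bytes[off:off + PAYLOAD].ljust(PAYLOAD, b"\0")
        out.append(chunk + struct.pack("<I", zlib.crc32(chunk)))
    return b"".join(out)

def decode(disk_bytes):
    """Return payload up to the first bad sector (recovery stops there)."""
    good = []
    for off in range(0, len(disk_bytes), SECTOR):
        sector = disk_bytes[off:off + SECTOR]
        chunk, (crc,) = sector[:PAYLOAD], struct.unpack("<I", sector[PAYLOAD:])
        if zlib.crc32(chunk) != crc:
            break                    # torn / unwritten sector: end of recovery
        good.append(chunk)
    return b"".join(good)

data = bytes(range(256)) * 8         # 2048 bytes of stand-in WAL
img = encode(data)
assert decode(img).startswith(data)
# Simulate a torn write in the second sector:
torn = img[:SECTOR] + b"\0" * SECTOR + img[2 * SECTOR:]
assert len(decode(torn)) == PAYLOAD  # recovery stops after the first good sector
```

Note the overhead is 4 bytes per 512, under 1%, versus a CRC header on every record.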


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] WALWriteLock contention

2015-05-17 Thread Amit Kapila
On Sun, May 17, 2015 at 7:45 AM, Robert Haas robertmh...@gmail.com wrote:

 <crazy-idea>I wonder if we could write WAL to two different files in
 alternation, so that we could be writing to one file while fsync-ing
 the other.</crazy-idea>


Won't the order of transaction replay during recovery cause
problems if we alternate between files while writing?  I think this is one
of the reasons WAL is written sequentially.  Another thing is that during
recovery, whenever we currently encounter a mismatch between the stored
CRC and the actual record CRC, we treat it as the end of recovery; with
writes going to two files simultaneously we might need to rethink that rule.

I think the first point in your mail, about rewriting an 8K block each
time, needs more thought and maybe some experimentation to check whether
writing in smaller units based on the OS page size or sector size leads
to any meaningful gains.  Another thing is that if there is high write
activity, then group commits should help in reducing IO for repeated
writes, and in the tests we can try changing commit_delay to see if that
helps (if the tests are already tuned with respect to commit_delay, then
ignore this point).


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] WALWriteLock contention

2015-05-17 Thread Robert Haas
On May 17, 2015, at 5:57 PM, Thomas Munro thomas.mu...@enterprisedb.com wrote:
 On Sun, May 17, 2015 at 2:15 PM, Robert Haas robertmh...@gmail.com wrote:
 http://oldblog.antirez.com/post/fsync-different-thread-useless.html
 
 It suggests that an fsync in progress blocks out not only other
 fsyncs, but other writes to the same file, which for our purposes is
 just awful.  More Googling around reveals that this is apparently
 well-known to Linux kernel developers and that they don't seem excited
 about fixing it.  :-(
 
 He doesn't say, but I wonder if that is really Linux, or if it is the
 ext2, 3 and maybe 4 filesystems specifically.  This blog post talks
 about the per-inode mutex that is held while writing with direct IO.

Good point. We should probably test ext4 and xfs on a newish kernel.

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WALWriteLock contention

2015-05-17 Thread Thomas Munro
On Sun, May 17, 2015 at 2:15 PM, Robert Haas robertmh...@gmail.com wrote:
 http://oldblog.antirez.com/post/fsync-different-thread-useless.html

 It suggests that an fsync in progress blocks out not only other
 fsyncs, but other writes to the same file, which for our purposes is
 just awful.  More Googling around reveals that this is apparently
 well-known to Linux kernel developers and that they don't seem excited
 about fixing it.  :-(

He doesn't say, but I wonder if that is really Linux, or if it is the
ext2, 3 and maybe 4 filesystems specifically.  This blog post talks
about the per-inode mutex that is held while writing with direct IO.
Maybe fsyncing buffered IO is similarly constrained in those
filesystems.

https://www.facebook.com/notes/mysql-at-facebook/xfs-ext-and-per-inode-mutexes/10150210901610933

 <crazy-idea>I wonder if we could write WAL to two different files in
 alternation, so that we could be writing to one file while fsync-ing
 the other.</crazy-idea>

If that is an ext3-specific problem, using multiple files might not
help you anyway, because ext3 famously fsyncs *all* files when you
ask for one file to be fsynced, as discussed in Greg Smith's
PostgreSQL 9.0 High Performance, chapter 4 (page 79).

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] WALWriteLock contention

2015-05-17 Thread Robert Haas
On May 17, 2015, at 11:04 AM, Amit Kapila amit.kapil...@gmail.com wrote:
 On Sun, May 17, 2015 at 7:45 AM, Robert Haas robertmh...@gmail.com wrote:
 
  <crazy-idea>I wonder if we could write WAL to two different files in
  alternation, so that we could be writing to one file while fsync-ing
  the other.</crazy-idea>
 
 Won't the order of transaction replay during recovery cause
 problems if we alternate between files while writing?  I think this is one
 of the reasons WAL is written sequentially.  Another thing is that during
 recovery, whenever we currently encounter a mismatch between the stored
 CRC and the actual record CRC, we treat it as the end of recovery; with
 writes going to two files simultaneously we might need to rethink that rule.

Well, yeah. That's why I said it was a crazy idea.

 I think the first point in your mail, about rewriting an 8K block each
 time, needs more thought and maybe some experimentation to check whether
 writing in smaller units based on the OS page size or sector size leads
 to any meaningful gains.  Another thing is that if there is high write
 activity, then group commits should help in reducing IO for repeated
 writes, and in the tests we can try changing commit_delay to see if that
 helps (if the tests are already tuned with respect to commit_delay, then
 ignore this point).

I am under the impression that using commit_delay usefully is pretty hard but, 
of course, I could be wrong.

...Robert

Re: [HACKERS] WALWriteLock contention

2015-05-16 Thread Robert Haas
On Fri, May 15, 2015 at 9:15 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 I implemented this 2-3 years ago, just dropping the WALWriteLock immediately
 before the fsync and then picking it up again immediately after, and was
 surprised that I saw absolutely no improvement.  Of course it surely depends
 on the IO stack, but from what I saw it seemed that once an fsync landed in
 the kernel, any future ones on that file were blocked rather than
 consolidated.

Interesting.

 Alas I can't find the patch anymore, I can make more of an
 effort to dig it up if anyone cares.  Although it would probably be easier
 to reimplement it than it would be to find it and rebase it.

 I vaguely recall thinking that the post-fsync bookkeeping could be moved to
 a spin lock, with a fair bit of work, so that the WALWriteLock would not
 need to be picked up again, but the whole avenue didn't seem promising
 enough for me to worry about that part in detail.

 My goal there was to further improve group commit.  When running pgbench
 -j10 -c10, it was common to see fsyncs that alternated between flushing 1
 transaction and 9 transactions, because the first one to the gate would go
 through it and slam it on all the others, and it would take one fsync cycle
 for it to reopen.

Hmm, yeah.  I remember somewhat (Peter Geoghegan, I think) mentioning
behavior like that before, but I had not made the connection to this
issue at that time.  This blog post is pretty depressing:

http://oldblog.antirez.com/post/fsync-different-thread-useless.html

It suggests that an fsync in progress blocks out not only other
fsyncs, but other writes to the same file, which for our purposes is
just awful.  More Googling around reveals that this is apparently
well-known to Linux kernel developers and that they don't seem excited
about fixing it.  :-(

<crazy-idea>I wonder if we could write WAL to two different files in
alternation, so that we could be writing to one file while fsync-ing
the other.</crazy-idea>

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] WALWriteLock contention

2015-05-15 Thread Robert Haas
WALWriteLock contention is measurable on some workloads.  In studying
the problem briefly, a couple of questions emerged:

1. Doesn't it suck to rewrite an entire 8kB block every time, instead
of only the new bytes (and maybe a few bytes following that to spoil
any old data that might be there)?  I mean, the OS page size is 4kB on
Linux.  If we generate 2kB of WAL and then flush, we're likely to
dirty two OS blocks instead of one.  The OS isn't going to be smart
enough to notice that one of those pages didn't really change, so
we're potentially generating some extra disk I/O.  My colleague Jan
Wieck has some (inconclusive) benchmark results that suggest this
might actually be hurting us significantly.  More research is needed,
but I thought I'd ask if we've ever considered NOT doing that, or if
we should consider it.
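The arithmetic behind point 1 can be made concrete with a tiny model (my sketch, assuming 4kB OS pages and that any byte written dirties its whole page):

```python
PAGE = 4096  # assumed Linux OS page size

def pages_dirtied(offset, length):
    """Pages touched by a write of `length` bytes at `offset`
    (simplified: any byte written dirties its whole page)."""
    if length <= 0:
        return 0
    first = offset // PAGE
    last = (offset + length - 1) // PAGE
    return last - first + 1

# Rewriting the whole 8kB WAL block when only 2kB is new:
print(pages_dirtied(0, 8192))  # 2 pages dirtied
# Writing just the 2kB that actually changed:
print(pages_dirtied(0, 2048))  # 1 page dirtied
```

In the model, rewriting the full block doubles the number of dirty OS pages, and so potentially the disk I/O, even though half the block is byte-for-byte unchanged.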

2. I don't really understand why WALWriteLock is set up to prohibit
two backends from flushing WAL at the same time.  That seems
unnecessary.  Suppose we've got two backends that flush WAL one after
the other.  Assume (as is not unlikely) that the second one's flush
position is ahead of the first one's flush position.  So the first one
grabs WALWriteLock and does the flush, and then the second one grabs
WALWriteLock for its turn to flush and has to wait for an entire spin
of the platter to complete before its fsync() can be satisfied.  If
we'd just let the second guy issue his fsync() right away, odds are
good that the disk would have satisfied both in a single rotation.
Now it's possible that the second request would've arrived too late
for that to work out, but AFAICS in that case we're no worse off than
we are now.  And if it does work out we're better off.  The only
reasons I can see why we might NOT want to do this are (1) if we're
trying to compensate for some OS-level bugginess, which is a
horrifying thought, or (2) if we think the extra system calls will
cost more than we save by piggybacking the flushes more efficiently.
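Point 2 can be illustrated with a deliberately crude rotational-disk model (entirely an assumption of mine: an fsync issued at time t becomes durable at the next rotation boundary strictly after t):

```python
import math

ROTATION = 8.0   # ms per platter revolution (an assumed figure)

def completion_times(requests, serialized):
    """Toy model: an fsync issued at time t becomes durable at the next
    rotation boundary strictly after t.  With `serialized` flushing, the
    second fsync cannot be issued until the first has completed."""
    done = []
    free_at = 0.0
    for t in sorted(requests):
        issue = max(t, free_at) if serialized else t
        finish = (math.floor(issue / ROTATION) + 1) * ROTATION
        done.append(finish)
        free_at = finish
    return done

# Two backends flush 1 ms apart:
print(completion_times([1.0, 2.0], serialized=True))   # [8.0, 16.0]
print(completion_times([1.0, 2.0], serialized=False))  # [8.0, 8.0]
```

In the model, letting the second backend issue its fsync immediately lets both commits ride the same rotation, while serializing them costs a full extra revolution, which is exactly the piggybacking argument above.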

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] WALWriteLock contention

2015-05-15 Thread Jeff Janes
On Fri, May 15, 2015 at 9:06 AM, Robert Haas robertmh...@gmail.com wrote:

 WALWriteLock contention is measurable on some workloads.  In studying
 the problem briefly, a couple of questions emerged:

 ...



 2. I don't really understand why WALWriteLock is set up to prohibit
 two backends from flushing WAL at the same time.  That seems
 unnecessary.  Suppose we've got two backends that flush WAL one after
 the other.  Assume (as is not unlikely) that the second one's flush
 position is ahead of the first one's flush position.  So the first one
 grabs WALWriteLock and does the flush, and then the second one grabs
 WALWriteLock for its turn to flush and has to wait for an entire spin
 of the platter to complete before its fsync() can be satisfied.  If
 we'd just let the second guy issue his fsync() right away, odds are
 good that the disk would have satisfied both in a single rotation.
 Now it's possible that the second request would've arrived too late
 for that to work out, but AFAICS in that case we're no worse off than
 we are now.  And if it does work out we're better off.  The only
 reasons I can see why we might NOT want to do this are (1) if we're
 trying to compensate for some OS-level bugginess, which is a
 horrifying thought, or (2) if we think the extra system calls will
 cost more than we save by piggybacking the flushes more efficiently.


I implemented this 2-3 years ago, just dropping the WALWriteLock
immediately before the fsync and then picking it up again immediately
after, and was surprised that I saw absolutely no improvement.  Of course
it surely depends on the IO stack, but from what I saw it seemed that once
an fsync landed in the kernel, any future ones on that file were blocked
rather than consolidated.  Alas I can't find the patch anymore; I can make
more of an effort to dig it up if anyone cares.  Although it would probably
be easier to reimplement it than it would be to find it and rebase it.

I vaguely recall thinking that the post-fsync bookkeeping could be moved to
a spin lock, with a fair bit of work, so that the WALWriteLock would not
need to be picked up again, but the whole avenue didn't seem promising
enough for me to worry about that part in detail.
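For concreteness, here is a rough sketch of the pattern described above, in Python rather than the lost C patch (all names here are invented for illustration): write under the main lock, fsync with the lock released so that fsyncs may overlap, and do the post-fsync bookkeeping under a cheaper lock.

```python
import os
import tempfile
import threading

wal_lock = threading.Lock()     # stands in for WALWriteLock
stats_lock = threading.Lock()   # cheap lock for post-fsync bookkeeping
flushed_up_to = 0               # highest position known durably flushed

def flush_wal(fd, my_lsn):
    """Write under the lock, fsync outside it, then update shared state."""
    global flushed_up_to
    with wal_lock:
        if flushed_up_to >= my_lsn:
            return              # someone already flushed past our position
        os.write(fd, b"x" * 64) # stand-in for writing the WAL buffers
    os.fsync(fd)                # lock released: fsyncs may now overlap
    with stats_lock:
        flushed_up_to = max(flushed_up_to, my_lsn)

fd, path = tempfile.mkstemp()
threads = [threading.Thread(target=flush_wal, args=(fd, i)) for i in range(1, 5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(flushed_up_to)            # 4
os.close(fd)
os.unlink(path)
```

As the thread notes, whether this helps at all depends on the kernel actually consolidating the overlapping fsyncs instead of serializing them.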

My goal there was to further improve group commit.  When running pgbench
-j10 -c10, it was common to see fsyncs that alternated between flushing 1
transaction and 9 transactions, because the first one to the gate would go
through it and slam it on all the others, and it would take one fsync cycle
for it to reopen.

Cheers,

Jeff


Re: [HACKERS] WALWriteLock contention

2015-05-15 Thread Joshua D. Drake


On 05/15/2015 09:06 AM, Robert Haas wrote:


2. I don't really understand why WALWriteLock is set up to prohibit
two backends from flushing WAL at the same time.  That seems
unnecessary.  Suppose we've got two backends that flush WAL one after
the other.  Assume (as is not unlikely) that the second one's flush
position is ahead of the first one's flush position.  So the first one
grabs WALWriteLock and does the flush, and then the second one grabs
WALWriteLock for its turn to flush and has to wait for an entire spin
of the platter to complete before its fsync() can be satisfied.  If
we'd just let the second guy issue his fsync() right away, odds are
good that the disk would have satisfied both in a single rotation.
Now it's possible that the second request would've arrived too late
for that to work out, but AFAICS in that case we're no worse off than
we are now.  And if it does work out we're better off.  The only


This is a bit out of my depth, but from a user perspective it sounds
similar to the difference between synchronous and asynchronous commit.
If we are willing to trust that PostgreSQL and the OS will do what they
are supposed to do, then it seems logical that what you describe above
would definitely be a net win.


JD
--
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing I'm offended is basically telling the world you can't
control your own emotions, so everyone else should do it for you.




Re: [HACKERS] WALWriteLock contention

2015-05-15 Thread Robert Haas
On Fri, May 15, 2015 at 1:09 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 WALWriteLock contention is measurable on some workloads.  In studying
 the problem briefly, a couple of questions emerged:

 1. Doesn't it suck to rewrite an entire 8kB block every time, instead
 of only the new bytes (and maybe a few bytes following that to spoil
 any old data that might be there)?

 It does, but it's not clear how to avoid torn-write conditions without
 that.

Can you elaborate?   I don't understand how repeatedly overwriting the
same bytes with themselves accomplishes anything at all.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] WALWriteLock contention

2015-05-15 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 WALWriteLock contention is measurable on some workloads.  In studying
 the problem briefly, a couple of questions emerged:

 1. Doesn't it suck to rewrite an entire 8kB block every time, instead
 of only the new bytes (and maybe a few bytes following that to spoil
 any old data that might be there)?

It does, but it's not clear how to avoid torn-write conditions without
that.

 2. I don't really understand why WALWriteLock is set up to prohibit
 two backends from flushing WAL at the same time.  That seems
 unnecessary.

Hm, perhaps so.

regards, tom lane

