Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-09-05 Thread didier
hi

On Thu, Sep 4, 2014 at 7:01 PM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Sep 4, 2014 at 3:09 AM, Ants Aasma a...@cybertec.at wrote:
 On Thu, Sep 4, 2014 at 12:36 AM, Andres Freund and...@2ndquadrant.com 
 wrote:
 It's imo quite clearly better to keep it allocated. For one, after the
 postmaster has started the checkpointer successfully you don't need to
 worry about later failures to allocate memory if you allocate it once
 (unless the checkpointer FATALs out, which should be exceedingly rare -
 we're catching ERRORs). It's much, much more likely to succeed
 initially. Secondly, it's not like there's really much time during which
 no checkpointer is running.

 In principle you could do the sort with the full sized array and then
 compress it to a list of buffer IDs that need to be written out. This
 way most of the time you only need a small array and the large array
 is only needed for a fraction of a second.

 It's not the size of the array that's the problem; it's the size of
 the detonation when the allocation fails.

You could use a file-backed memory array.
Or, because it's only a hint and:
- the keys are in the buffers themselves (BufferTag), right?
- the only transition is from 'data I care about' to 'data I don't care
about' if a buffer is concurrently evicted while sorting,

use a pre-allocated buffer index array and read the keys from the
buffers when sorting, without memory barriers, spinlocks, whatever.
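A minimal sketch of that idea in C (simplified stand-in types, not code from the thread): the sort array is allocated once, and the keys are copied out of the shared descriptors with plain unlocked reads, so a buffer evicted mid-sort merely leaves a stale tag behind, which is fine for a hint.

#include <stdint.h>
#include <stdlib.h>

typedef struct HintTag          /* stand-in for BufferTag */
{
    uint32_t rel;
    uint32_t fork;
    uint32_t block;
} HintTag;

typedef struct FakeBufferDesc   /* stand-in for BufferDesc */
{
    HintTag tag;
} FakeBufferDesc;

typedef struct HintEntry        /* one slot of the pre-allocated sort array */
{
    HintTag tag;                /* unlocked snapshot of the buffer's tag */
    int     buf_id;
} HintEntry;

static int
hint_cmp(const void *a, const void *b)
{
    const HintTag *ta = &((const HintEntry *) a)->tag;
    const HintTag *tb = &((const HintEntry *) b)->tag;

    if (ta->rel != tb->rel)     return ta->rel < tb->rel ? -1 : 1;
    if (ta->fork != tb->fork)   return ta->fork < tb->fork ? -1 : 1;
    if (ta->block != tb->block) return ta->block < tb->block ? -1 : 1;
    return 0;
}

/*
 * Snapshot the tags of the dirty buffers with plain reads (no locks, no
 * barriers), then sort the snapshot.  sort_array was allocated once at
 * startup, so nothing can fail here.
 */
void
sort_dirty_hint(const FakeBufferDesc *descs, const int *dirty_ids, int n,
                HintEntry *sort_array)
{
    for (int i = 0; i < n; i++)
    {
        sort_array[i].tag = descs[dirty_ids[i]].tag;    /* racy read is OK */
        sort_array[i].buf_id = dirty_ids[i];
    }
    qsort(sort_array, n, sizeof(HintEntry), hint_cmp);
}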




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-09-05 Thread Robert Haas
On Fri, Sep 5, 2014 at 4:17 AM, didier did...@gmail.com wrote:
 It's not the size of the array that's the problem; it's the size of
 the detonation when the allocation fails.

 You could use a file-backed memory array.
 Or, because it's only a hint and:
 - the keys are in the buffers themselves (BufferTag), right?
 - the only transition is from 'data I care about' to 'data I don't care
 about' if a buffer is concurrently evicted while sorting,

 use a pre-allocated buffer index array and read the keys from the
 buffers when sorting, without memory barriers, spinlocks, whatever.

If by a file-backed memory array you mean mmap() something, that
doesn't do anything to address the possibility of failure.  mmap(),
like malloc(), can fail.
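For illustration (plain POSIX, not code from the thread): both allocation paths need an explicit check, and mmap() reports failure with MAP_FAILED rather than NULL; a file-backed mapping can additionally fault much later, when a missing page is first touched.

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int
main(void)
{
    size_t sz = 16 * 1024 * 1024;

    void *a = malloc(sz);
    if (a == NULL)
        perror("malloc");               /* malloc failure is reported here... */

    void *b = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (b == MAP_FAILED)
        perror("mmap");                 /* ...and so is mmap failure */

    free(a);
    if (b != MAP_FAILED)
        munmap(b, sz);
    return 0;
}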

But I think we don't really need to have this argument.  Andres's
proposal to put this in the main shared memory segment sounds fine to
me.  If allocating that fails, you'll know to reduce shared_buffers
and try again.  If it succeeds, you should be safe after that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-09-05 Thread Claudio Freire
On Fri, Sep 5, 2014 at 3:09 PM, Robert Haas robertmh...@gmail.com wrote:
 On Fri, Sep 5, 2014 at 4:17 AM, didier did...@gmail.com wrote:
 It's not the size of the array that's the problem; it's the size of
 the detonation when the allocation fails.

 You could use a file-backed memory array.
 Or, because it's only a hint and:
 - the keys are in the buffers themselves (BufferTag), right?
 - the only transition is from 'data I care about' to 'data I don't care
 about' if a buffer is concurrently evicted while sorting,

 use a pre-allocated buffer index array and read the keys from the
 buffers when sorting, without memory barriers, spinlocks, whatever.

 If by a file-backed memory array you mean mmap() something, that
 doesn't do anything to address the possibility of failure.  mmap(),
 like malloc(), can fail.

It's worse. If your process needs to page from disk when touching a
byte in the mmap'd array, and the disk somehow fails to work, you get
a segmentation fault.




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-09-04 Thread Ants Aasma
On Sat, Aug 30, 2014 at 8:50 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Andres Freund and...@2ndquadrant.com writes:
 On 2014-08-27 19:23:04 +0300, Heikki Linnakangas wrote:
 A long time ago, Itagaki Takahiro wrote a patch to sort the buffers and write
 them out in order 
 (http://www.postgresql.org/message-id/flat/20070614153758.6a62.itagaki.takah...@oss.ntt.co.jp).
 The performance impact of that was inconclusive, but one thing that it
 allows nicely is to interleave the fsyncs, so that you write all the buffers
 for one file, then fsync it, then next file and so on.

 ...
 So, *very* clearly sorting is a benefit.

 pg_bench alone doesn't convince me on this.  The original thread found
 cases where it was a loss, IIRC; you will need to test many more than
 one scenario to prove the point.

The same objection came up last time I tried to push for sorted
checkpoints. I did not find any reference to where it caused a loss,
nor was I able to come up with a case where writing out in arbitrary
order would be better than writing out in file-sequential order. In
fact, if we ask for low latency, the OS must keep the writeback
backlog small, which eliminates any chance of combining writes that
arrive out of order.

I have a use case where the system continuously loads data into time
partitioned indexed tables, at every checkpoint all of the indexes of
the latest partition need to be written out. The only way I could get
the write out to happen with sequential I/O was to set
checkpoint_completion_target to zero and ensure OS cache allows for
enough dirty pages to absorb the whole checkpoint. The fsync that
followed did obviously nasty things to latency. Whereas sorted
checkpoints were able to do sequential I/O with checkpoint spreading
and low-latency-tuned OS virtual memory settings. I can create a
benchmark that shows this behavior if you need data points in addition
to pgbench's OLTP workload to convince you that sorting checkpoint
writes is a good idea.

I did just come up with a case where plain sorting might cause an
issue. If the writes go to different I/O devices then naive sorting
will first use one device then the other, whereas arbitrary writing
will load balance between the devices. Assuming that separate
tablespaces are used for separate I/O devices, it should be enough to
just interleave writes of each tablespace, weighed by the amount of
writes per tablespace.
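A toy sketch of that interleaving (not from the thread; it assumes each tablespace already has its own sorted list and a write count): always pick the tablespace that is least far through its own list, which loads the devices roughly in proportion to their share of the checkpoint.

#include <stdio.h>

#define NTABLESPACES 3

typedef struct TsProgress
{
    int total;      /* buffers to write in this tablespace's sorted list */
    int written;    /* buffers written so far */
} TsProgress;

static int
pick_next_tablespace(const TsProgress *ts, int n)
{
    int    best = -1;
    double best_frac = 2.0;

    for (int i = 0; i < n; i++)
    {
        if (ts[i].written >= ts[i].total)
            continue;                   /* this tablespace is finished */
        double frac = (double) ts[i].written / ts[i].total;
        if (frac < best_frac)
        {
            best_frac = frac;
            best = i;
        }
    }
    return best;                        /* -1 once everything is written */
}

int
main(void)
{
    TsProgress ts[NTABLESPACES] = {{6, 0}, {3, 0}, {1, 0}};
    int which;

    while ((which = pick_next_tablespace(ts, NTABLESPACES)) >= 0)
    {
        /* a real checkpointer would write that tablespace's next sorted buffer */
        printf("write next buffer of tablespace %d\n", which);
        ts[which].written++;
    }
    return 0;
}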

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-09-04 Thread Ants Aasma
On Thu, Sep 4, 2014 at 12:36 AM, Andres Freund and...@2ndquadrant.com wrote:
 It's imo quite clearly better to keep it allocated. For one, after the
 postmaster has started the checkpointer successfully you don't need to
 worry about later failures to allocate memory if you allocate it once
 (unless the checkpointer FATALs out, which should be exceedingly rare -
 we're catching ERRORs). It's much, much more likely to succeed
 initially. Secondly, it's not like there's really much time during which
 no checkpointer is running.

In principle you could do the sort with the full sized array and then
compress it to a list of buffer IDs that need to be written out. This
way most of the time you only need a small array and the large array
is only needed for a fraction of a second.
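A small sketch of that "sort, then compress" step (simplified types, hypothetical names): the full-sized scratch array is needed only for the qsort() call, after which just the IDs of the buffers that need writing are kept.

#include <stdlib.h>

typedef struct SortItem
{
    unsigned long key;      /* stand-in for a sortable BufferTag */
    int           buf_id;
    int           dirty;    /* does this buffer need to be written out? */
} SortItem;

static int
item_cmp(const void *a, const void *b)
{
    unsigned long ka = ((const SortItem *) a)->key;
    unsigned long kb = ((const SortItem *) b)->key;

    return (ka > kb) - (ka < kb);
}

/*
 * scratch[] has one entry per shared buffer; out_ids[] receives only the
 * dirty buffer IDs, already in write order.  Returns how many there are.
 */
int
sort_and_compress(SortItem *scratch, int nbuffers, int *out_ids)
{
    int nout = 0;

    qsort(scratch, nbuffers, sizeof(SortItem), item_cmp);
    for (int i = 0; i < nbuffers; i++)
    {
        if (scratch[i].dirty)
            out_ids[nout++] = scratch[i].buf_id;
    }
    return nout;
}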

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-09-04 Thread Robert Haas
On Thu, Sep 4, 2014 at 3:09 AM, Ants Aasma a...@cybertec.at wrote:
 On Thu, Sep 4, 2014 at 12:36 AM, Andres Freund and...@2ndquadrant.com wrote:
 It's imo quite clearly better to keep it allocated. For one, after the
 postmaster has started the checkpointer successfully you don't need to
 worry about later failures to allocate memory if you allocate it once
 (unless the checkpointer FATALs out, which should be exceedingly rare -
 we're catching ERRORs). It's much, much more likely to succeed
 initially. Secondly, it's not like there's really much time during which
 no checkpointer is running.

 In principle you could do the sort with the full sized array and then
 compress it to a list of buffer IDs that need to be written out. This
 way most of the time you only need a small array and the large array
 is only needed for a fraction of a second.

It's not the size of the array that's the problem; it's the size of
the detonation when the allocation fails.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-09-03 Thread Robert Haas
On Sat, Aug 30, 2014 at 2:04 PM, Andres Freund and...@2ndquadrant.com wrote:
 If the sort buffer is allocated when the checkpointer is started, not
 everytime we sort, as you've done in your version of the patch I think
 that risk is pretty manageable. If we really want to be sure nothing is
 happening at runtime, even if the checkpointer was restarted, we can put
 the sort array in shared memory.

I don't think that allocating the array at checkpointer start time
helps.  If it works, then you're strictly worse off than if you
allocate it at every checkpoint, because you're holding onto the
memory all the time instead of only when it's being used.  And if it
fails, what then?  Sure, you can have that copy of the checkpointer
process exit, but that does nothing good.  The postmaster will keep on
restarting it and it will keep on dying for lack of memory, and no
checkpoints will complete.  Oops.

So it seems to me that the possibly-sensible approaches are:

1. Allocate an array when we need to sort, and if the allocation
fails, have some kind of fallback strategy, like logging a WARNING and
writing the buffers out without sorting them.  If it succeeds, do the
checkpoint and then free the memory until we need it again.

2. Putting the array in shared memory, so that once the server is
started, we can be sure the memory is allocated and the sort will
work.
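A bare-bones sketch of option 1 above (hypothetical helper names; plain malloc() stands in for the backend allocator and fprintf() for elog(WARNING)):

#include <stdio.h>
#include <stdlib.h>

/* hypothetical stand-ins for the real checkpoint write paths */
extern void write_buffers_sorted(int *sort_array, int nbuffers);
extern void write_buffers_unsorted(int nbuffers);

static void
checkpoint_write_phase(int nbuffers)
{
    int *sort_array = malloc((size_t) nbuffers * sizeof(int));

    if (sort_array == NULL)
    {
        fprintf(stderr, "WARNING: could not allocate sort array, "
                        "writing checkpoint buffers unsorted\n");
        write_buffers_unsorted(nbuffers);
        return;
    }

    /* fills sort_array with sorted buffer IDs and writes them out */
    write_buffers_sorted(sort_array, nbuffers);
    free(sort_array);           /* give the memory back until next time */
}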

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-09-03 Thread Andres Freund
On 2014-09-03 17:08:12 -0400, Robert Haas wrote:
 On Sat, Aug 30, 2014 at 2:04 PM, Andres Freund and...@2ndquadrant.com wrote:
  If the sort buffer is allocated when the checkpointer is started, not
  everytime we sort, as you've done in your version of the patch I think
  that risk is pretty manageable. If we really want to be sure nothing is
  happening at runtime, even if the checkpointer was restarted, we can put
  the sort array in shared memory.
 
 I don't think that allocating the array at checkpointer start time
 helps.  If it works, then you're strictly worse off than if you
 allocate it at every checkpoint, because you're holding onto the
 memory all the time instead of only when it's being used.  And if it
 fails, what then?  Sure, you can have that copy of the checkpointer
 process exit, but that does nothing good.  The postmaster will keep on
 restarting it and it will keep on dying for lack of memory, and no
 checkpoints will complete.  Oops.

It's imo quite clearly better to keep it allocated. For one, after the
postmaster has started the checkpointer successfully you don't need to
worry about later failures to allocate memory if you allocate it once
(unless the checkpointer FATALs out, which should be exceedingly rare -
we're catching ERRORs). It's much, much more likely to succeed
initially. Secondly, it's not like there's really much time during which
no checkpointer is running.

 So it seems to me that the possibly-sensible approaches are:
 
 1. Allocate an array when we need to sort, and if the allocation
 fails, have some kind of fallback strategy, like logging a WARNING and
 writing the buffers out without sorting them.  If it succeeds, do the
 checkpoint and then free the memory until we need it again.

I think if we want to go that way I vote for keeping the array allocated
and continuing to try to allocate it after allocation failures. And, as
you suggest, fall back to a simple sequential search through all
buffers.

 2. Putting the array in shared_memory, so that once the server is
 started, we can be sure the memory is allocated and the sort will
 work.

But I prefer this approach. If we ever want to have more than one
process writing out data for checkpoints we're going to need it
anyway. And that's something not that far away for large setups
imo. Especially due to checksums.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-09-02 Thread Jeff Janes
On Tue, Aug 26, 2014 at 1:02 AM, Fabien COELHO coe...@cri.ensmp.fr wrote:


 Hello again,


  I have not found any means to force bgwriter to send writes when it can.
 (Well, I have: create a process which sends CHECKPOINT every 0.2
 seconds... it works more or less, but this is not my point:-)


 There is scan_whole_pool_milliseconds, which currently forces bgwriter to
 circle the buffer pool at least once every 2 minutes.  It is currently
 fixed, but it should be trivial to turn it into an experimental guc that
 you could use to test your hypothesis.


 I recompiled with the variable coldly set to 1000 instead of 120000. The
 situation is slightly degraded (15% of transactions were above 200 ms
 late). However it seems that bgwriter did not write much more pages:



You should probably try it set to 200 rather than 1000, to put it on an
equal footing with the checkpoint_timeout of 0.2 seconds you reported on.

Not that I think this will improve the situation.  After all, my theory is
that it does not matter who *writes* the pages, it only matters how they
get fsynced.



   buffers_checkpoint = 26065
   buffers_clean = 5263
   buffers_backend = 367

 Or I may have a problem interpreting pg_stat_bgwriter.



For this experiment, what was checkpoint_timeout set to?

Cheers,

Jeff


Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-09-02 Thread Fabien COELHO



There is scan_whole_pool_milliseconds, which currently forces bgwriter to
circle the buffer pool at least once every 2 minutes.  It is currently
fixed, but it should be trivial to turn it into an experimental guc that
you could use to test your hypothesis.


I recompiled with the variable coldly set to 1000 instead of 120000. The
situation is slightly degraded (15% of transactions were above 200 ms
late). However it seems that bgwriter did not write much more pages:



You should probably try it set to 200 rather than 1000, to put it on an
equal footing with the checkpoint_timeout of 0.2 seconds you reported on.


As I understand it, the setting makes the bgwriter process scan all
shared_buffers within this amount of time... but ISTM that the key point
is that bgwriter has no incentive to start writing out buffers anyway
with its current decision rules, and that should not change with the
frequency at which they are scanned (?)



For this experiment, what was checkpoint_timeout set to?


AFAICR, the default, 5min.

--
Fabien.




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-09-02 Thread Jeff Janes
On Tue, Sep 2, 2014 at 8:14 AM, Fabien COELHO coe...@cri.ensmp.fr wrote:


  There is scan_whole_pool_milliseconds, which currently forces bgwriter to
 circle the buffer pool at least once every 2 minutes.  It is currently
 fixed, but it should be trivial to turn it into an experimental guc that
 you could use to test your hypothesis.


 I recompiled with the variable coldly set to 1000 instead of 120000. The
 situation is slightly degraded (15% of transactions were above 200 ms
 late). However it seems that bgwriter did not write much more pages:



 You should probably try it set to 200 rather than 1000, to put it on an
 equal footing with the checkpoint_timeout of 0.2 seconds you reported on.


 As I understand it, the setting makes the bgwriter process scan all
 shared_buffers within this amount of time... but ISTM that the key point
 is that bgwriter has no incentive to start writing out buffers anyway
 with its current decision rules, and that should not change with the
 frequency at which they are scanned (?)


Ah, I see now.  The usage counts are not zero, so it visits the buffer and
then leaves it alone.
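Roughly the rule being described, as a sketch (simplified; the real check lives in SyncOneBuffer() when it is asked to skip recently used buffers):

typedef struct FakeBuf
{
    int dirty;
    int usage_count;    /* bumped on access, decremented by the clock sweep */
} FakeBuf;

/* 1 if the bgwriter LRU scan would write this buffer, 0 if it skips it */
static int
bgwriter_would_write(const FakeBuf *buf)
{
    if (buf->usage_count > 0)
        return 0;       /* recently used: visit it, but leave it alone */
    if (!buf->dirty)
        return 0;       /* clean: nothing to write */
    return 1;           /* not recently used and dirty: write it out */
}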

Cheers,

Jeff


Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-31 Thread Fabien COELHO


Hello Heikki,

For the kicks, I wrote a quick & dirty patch for interleaving the fsyncs, see
attached. It works by repeatedly scanning the buffer pool, writing buffers
belonging to a single relation segment at a time.


I tried this patch on the same host I used with the same -R 25 -L 200 -T 
5000, alas without significant positive effects, about 6% of transactions 
were lost, including stretches of several seconds of unresponsiveness.


Maybe this is because pgbench -N basically only touches 2 tables
(accounts & history), so there are few file syncs involved, thus it does
not help much with spreading a lot of writes.


  2014-08-30 23:52:24.167 CEST:
   LOG:  checkpoint starting: xlog
  2014-08-30 23:54:09.239 CEST:
   LOG:  checkpoint sync: number=1 file=base/16384/11902 time=24.529 msec
  2014-08-30 23:54:23.812 CEST:
   LOG:  checkpoint sync: number=1 file=base/16384/16397_vm time=32.547 msec
  2014-08-30 23:54:24.771 CEST:
   LOG:  checkpoint sync: number=1 file=base/16384/11873 time=557.470 msec
  2014-08-30 23:54:37.419 CEST:
   LOG:  checkpoint complete: wrote 4931 buffers (30.1%);
 0 transaction log file(s) added, 0 removed, 3 recycled;
 write=122.946 s, sync=10.129 s, total=133.252 s;
 sync files=6, longest=9.854 s, average=1.790 s

Note that given the small load and table size, pgbench implies random 
I/Os and basically nothing else.


--
Fabien.




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-30 Thread Andres Freund
On 2014-08-27 19:23:04 +0300, Heikki Linnakangas wrote:
 A long time ago, Itagaki Takahiro wrote a patch to sort the buffers and write
 them out in order 
 (http://www.postgresql.org/message-id/flat/20070614153758.6a62.itagaki.takah...@oss.ntt.co.jp).
 The performance impact of that was inconclusive, but one thing that it
 allows nicely is to interleave the fsyncs, so that you write all the buffers
 for one file, then fsync it, then next file and so on. IIRC the biggest
 worry with that patch was that sorting the buffers requires a fairly large
 amount of memory, and making a large allocation in the checkpointer might
 cause an out-of-memory, which would be bad.
 
 I don't think anyone's seriously worked on this area since. If the impact on
 responsiveness or performance is significant, I'm pretty sure the OOM
 problem could be alleviated somehow.

I've dug up that patch (after a bit of fighting with the archives) and
refreshed it. It's *clearly* beneficial:

master:
andres@awork2:~/src/postgresql$ pgbench -p 5440 -h /tmp postgres -M prepared -c 180 -j 180 -T 180 -L 100 --progress=1
starting vacuum...end.
progress: 1.0 s, 2847.6 tps, lat 53.862 ms stddev 49.219
...
progress: 67.0 s, 3435.4 tps, lat 52.920 ms stddev 48.720
progress: 68.2 s, 2586.9 tps, lat 57.793 ms stddev 64.228
progress: 69.1 s, 546.5 tps, lat 294.940 ms stddev 189.546
progress: 70.0 s, 1741.3 tps, lat 134.298 ms stddev 204.740
progress: 71.0 s, 3868.8 tps, lat 48.423 ms stddev 47.934
..
progress: 89.0 s, 4022.8 tps, lat 45.601 ms stddev 40.685
progress: 90.0 s, 2463.5 tps, lat 61.907 ms stddev 64.342
progress: 91.2 s, 856.3 tps, lat 211.610 ms stddev 149.916
progress: 92.0 s, 1026.9 tps, lat 177.103 ms stddev 144.448
progress: 93.0 s, 736.9 tps, lat 254.230 ms stddev 227.339
progress: 94.1 s, 766.9 tps, lat 208.031 ms stddev 181.340
progress: 95.1 s, 979.7 tps, lat 197.014 ms stddev 193.648
progress: 96.0 s, 868.9 tps, lat 214.060 ms stddev 198.297
progress: 97.1 s, 943.4 tps, lat 178.062 ms stddev 143.224
progress: 98.0 s, 934.5 tps, lat 192.435 ms stddev 197.901
progress: 99.6 s, 623.1 tps, lat 202.954 ms stddev 165.576
progress: 100.0 s, 464.7 tps, lat 683.600 ms stddev 376.520
progress: 101.1 s, 516.0 tps, lat 395.033 ms stddev 480.346
progress: 102.0 s, 364.9 tps, lat 507.933 ms stddev 499.670
progress: 103.3 s, 592.9 tps, lat 214.123 ms stddev 273.411
progress: 104.1 s, 930.2 tps, lat 316.487 ms stddev 335.096
progress: 105.4 s, 627.6 tps, lat 262.496 ms stddev 200.690
progress: 106.1 s, 788.6 tps, lat 235.510 ms stddev 202.366
progress: 107.5 s, 644.8 tps, lat 269.020 ms stddev 223.900
progress: 108.0 s, 725.0 tps, lat 262.692 ms stddev 218.774
progress: 109.0 s, 660.0 tps, lat 272.808 ms stddev 248.501
progress: 110.0 s, 604.3 tps, lat 303.727 ms stddev 264.921
progress: 111.0 s, 723.6 tps, lat 243.224 ms stddev 229.426
progress: 112.1 s, 668.6 tps, lat 257.026 ms stddev 190.247
progress: 113.1 s, 345.0 tps, lat 492.114 ms stddev 312.082
progress: 115.4 s, 390.9 tps, lat 416.708 ms stddev 391.577
progress: 115.4 s, 14598.5 tps, lat 551.617 ms stddev 539.611
progress: 116.1 s, 161.5 tps, lat 741.611 ms stddev 485.498
progress: 117.4 s, 269.1 tps, lat 697.978 ms stddev 576.970
progress: 118.8 s, 262.3 tps, lat 674.887 ms stddev 587.848
progress: 119.1 s, 195.2 tps, lat 833.959 ms stddev 733.592
progress: 120.0 s, 3000.6 tps, lat 104.272 ms stddev 291.851
progress: 121.0 s, 3167.7 tps, lat 56.576 ms stddev 51.976
progress: 122.0 s, 3398.1 tps, lat 51.322 ms stddev 48.057
progress: 123.0 s, 3721.9 tps, lat 50.355 ms stddev 46.994
progress: 124.0 s, 2929.3 tps, lat 50.996 ms stddev 45.553
progress: 125.0 s, 754.5 tps, lat 269.293 ms stddev 287.508
progress: 126.0 s, 3297.0 tps, lat 56.912 ms stddev 77.053
...
progress: 144.0 s, 4207.9 tps, lat 44.440 ms stddev 37.210
progress: 145.9 s, 2036.4 tps, lat 79.025 ms stddev 105.411
progress: 146.0 s, 1003.1 tps, lat 292.934 ms stddev 223.650
progress: 147.4 s, 520.8 tps, lat 318.670 ms stddev 244.596
progress: 148.0 s, 3557.3 tps, lat 71.626 ms stddev 143.174
progress: 149.0 s, 4106.1 tps, lat 43.557 ms stddev 36.444
progress: 150.0 s, 4132.3 tps, lat 43.185 ms stddev 34.611
progress: 151.0 s, 4233.3 tps, lat 43.239 ms stddev 39.121
progress: 152.0 s, 4178.2 tps, lat 43.242 ms stddev 40.377
progress: 153.0 s, 755.1 tps, lat 198.560 ms stddev 155.927
progress: 154.1 s, 773.6 tps, lat 240.044 ms stddev 192.472
progress: 155.0 s, 553.7 tps, lat 303.532 ms stddev 245.491
progress: 156.2 s, 772.7 tps, lat 242.925 ms stddev 226.754
progress: 157.1 s, 541.0 tps, lat 295.132 ms stddev 218.857
progress: 158.1 s, 716.8 tps, lat 281.823 ms stddev 227.488
progress: 159.1 s, 748.7 tps, lat 223.275 ms stddev 186.162
progress: 160.0 s, 503.0 tps, lat 311.621 ms stddev 215.952
progress: 161.1 s, 905.0 tps, lat 239.623 ms stddev 245.539
progress: 162.4 s, 360.4 tps, lat 329.583 ms stddev 250.094
progress: 163.3 s, 348.9 tps, lat 583.476 ms stddev 432.200
progress: 165.5 s, 186.1 tps, lat 765.542 ms stddev 

Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-30 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2014-08-27 19:23:04 +0300, Heikki Linnakangas wrote:
 A long time ago, Itagaki Takahiro wrote a patch to sort the buffers and write
 them out in order 
 (http://www.postgresql.org/message-id/flat/20070614153758.6a62.itagaki.takah...@oss.ntt.co.jp).
 The performance impact of that was inconclusive, but one thing that it
 allows nicely is to interleave the fsyncs, so that you write all the buffers
 for one file, then fsync it, then next file and so on.

 ...
 So, *very* clearly sorting is a benefit.

pg_bench alone doesn't convince me on this.  The original thread found
cases where it was a loss, IIRC; you will need to test many more than
one scenario to prove the point.

Also, it does not matter how good it looks in test cases if it causes
outright failures due to OOM; unlike you, I am not prepared to just wave
away that risk.  A possible compromise is to sort a limited number of
buffers -- say, collect a few thousand dirty buffers then sort, dump and
fsync them, repeat as needed.
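A sketch of that compromise (hypothetical helper names, arbitrary chunk size): collect a bounded batch of dirty buffers, sort and write just that batch, fsync the files it touched, and repeat.

#define CHUNK_SIZE 4096          /* "a few thousand dirty buffers" */

/* hypothetical stand-ins for the real routines */
extern int  collect_dirty_buffers(int *ids, int max);   /* returns count */
extern void sort_buffer_ids(int *ids, int n);
extern void write_buffers(const int *ids, int n);
extern void fsync_touched_files(void);

static void
checkpoint_in_chunks(void)
{
    int ids[CHUNK_SIZE];
    int n;

    while ((n = collect_dirty_buffers(ids, CHUNK_SIZE)) > 0)
    {
        sort_buffer_ids(ids, n);      /* sort only this chunk */
        write_buffers(ids, n);
        fsync_touched_files();        /* a file may get fsynced once per chunk */
    }
}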

regards, tom lane




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-30 Thread Andres Freund
On 2014-08-30 13:50:40 -0400, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  On 2014-08-27 19:23:04 +0300, Heikki Linnakangas wrote:
  A long time ago, Itagaki Takahiro wrote a patch to sort the buffers and write
  them out in order 
  (http://www.postgresql.org/message-id/flat/20070614153758.6a62.itagaki.takah...@oss.ntt.co.jp).
  The performance impact of that was inconclusive, but one thing that it
  allows nicely is to interleave the fsyncs, so that you write all the 
  buffers
  for one file, then fsync it, then next file and so on.
 
  ...
  So, *very* clearly sorting is a benefit.
 
 pg_bench alone doesn't convince me on this.  The original thread found
 cases where it was a loss, IIRC; you will need to test many more than
 one scenario to prove the point.

Sure. And I'm not claiming Itagaki/your old patch is immediately going
to be ready for commit. But our checkpoint performance has sucked for
years in the field. I don't think we can wave that away.

I think the primary reason it wasn't easily visible as being beneficial
back then was that only the throughput, not the latency and such were
looked at.

 Also, it does not matter how good it looks in test cases if it causes
 outright failures due to OOM; unlike you, I am not prepared to just wave
 away that risk.

I'm not waving away any risks.

If the sort buffer is allocated when the checkpointer is started, not
every time we sort, as you've done in your version of the patch I think
that risk is pretty manageable. If we really want to be sure nothing is
happening at runtime, even if the checkpointer was restarted, we can put
the sort array in shared memory.

We're talking about (sizeof(BufferTag) + sizeof(int))/8192 ~= 0.3 %
overhead over shared_buffers here. If we decide to go that way, it's a
pretty darn small price to pay to not regularly have stalls that last
minutes.
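Spelled out with approximate sizes (a BufferTag is a RelFileNode plus a fork number and a block number, about 20 bytes on common platforms; the default block size is 8192 bytes):

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
    size_t buffertag_sz = 5 * sizeof(uint32_t);   /* ~ sizeof(BufferTag) */
    size_t per_buffer   = buffertag_sz + sizeof(int);
    double blcksz       = 8192.0;

    printf("overhead per shared buffer: %zu bytes = %.2f%% of a block\n",
           per_buffer, 100.0 * per_buffer / blcksz);
    /* prints roughly: 24 bytes = 0.29% of a block */
    return 0;
}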

 A possible compromise is to sort a limited number of
 buffers -- say, collect a few thousand dirty buffers then sort, dump and
 fsync them, repeat as needed.

Yea, that's what I suggested nearby. But I don't really like it, because
it robs us of the chance to fsync() a relfilenode immediately after
having synced all its buffers. Doing so is rather beneficial because
then fewer independently dirtied pages have to be flushed out - reducing
the impact of the fsync().

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-30 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2014-08-30 13:50:40 -0400, Tom Lane wrote:
 A possible compromise is to sort a limited number of
 buffers -- say, collect a few thousand dirty buffers then sort, dump and
 fsync them, repeat as needed.

 Yea, that's what I suggested nearby. But I don't really like it, because
 it robs us of the the chance to fsync() a relfilenode immediately after
 having synced all its buffers.

Uh, how so exactly?  You could still do that.  Yeah, you might fsync a rel
once per sort-group and not just once per checkpoint, but it's not clear
that that's a loss as long as the group size isn't tiny.

regards, tom lane




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-30 Thread Andres Freund
On 2014-08-30 14:16:10 -0400, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  On 2014-08-30 13:50:40 -0400, Tom Lane wrote:
  A possible compromise is to sort a limited number of
  buffers -- say, collect a few thousand dirty buffers then sort, dump and
  fsync them, repeat as needed.
 
  Yea, that's what I suggested nearby. But I don't really like it, because
  it robs us of the the chance to fsync() a relfilenode immediately after
  having synced all its buffers.
 
 Uh, how so exactly?  You could still do that.  Yeah, you might fsync a rel
 once per sort-group and not just once per checkpoint, but it's not clear
 that that's a loss as long as the group size isn't tiny.

Because it wouldn't have the benefit of syncing the minimal amount of
data anymore. If lots of other relfilenodes have been synced in between,
the amount of newly dirtied pages in the OS's buffer cache (written by
backends, bgwriter) for an individual relfilenode is much higher.

An fsync() on a file with dirty data often causes *serious* latency
spikes - we should try hard to avoid superfluous calls.

As an example: Calling fsync() on pgbench_accounts's underlying files,
from outside postgres, *before* postgres even started its first
checkpoint does this:
progress: 72.0 s, 4324.9 tps, lat 41.481 ms stddev 40.567
progress: 73.0 s, 4704.9 tps, lat 38.465 ms stddev 35.436
progress: 74.0 s, 4448.5 tps, lat 40.058 ms stddev 32.634
progress: 75.0 s, 4634.5 tps, lat 39.229 ms stddev 33.463
progress: 76.8 s, 2753.1 tps, lat 48.693 ms stddev 75.309
progress: 77.1 s, 126.6 tps, lat 773.433 ms stddev 222.667
progress: 78.0 s, 183.7 tps, lat 786.401 ms stddev 395.954
progress: 79.1 s, 170.3 tps, lat 975.949 ms stddev 596.751
progress: 80.0 s, 2116.6 tps, lat 168.608 ms stddev 398.933
progress: 81.0 s, 4436.1 tps, lat 40.313 ms stddev 34.198
progress: 82.0 s, 4383.9 tps, lat 41.811 ms stddev 37.241

Note the dip from 4k tps to 130 tps.

We can get a handle on that (on some platforms at least) for writes
issued during the buffer sync by forcing the kernel to write out the
pages in small increments; but I doubt we want to do that for writes by
backends themselves.
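On Linux, one way to force that kind of incremental writeback (illustration only, not necessarily what the attached patch does) is sync_file_range() with SYNC_FILE_RANGE_WRITE, which starts writeback of a byte range without waiting for it; the flush increment below is an arbitrary choice.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define FLUSH_INCREMENT (256 * 1024)    /* ask for writeback every 256kB */

static void
write_with_incremental_flush(int fd, const char *buf, size_t len, off_t off)
{
    size_t done = 0;

    while (done < len)
    {
        size_t  chunk = len - done > FLUSH_INCREMENT ? FLUSH_INCREMENT
                                                     : len - done;
        ssize_t n = pwrite(fd, buf + done, chunk, off + (off_t) done);

        if (n <= 0)
        {
            perror("pwrite");
            return;
        }
        /* start async writeback of what was just written; do not wait */
        if (sync_file_range(fd, off + (off_t) done, n,
                            SYNC_FILE_RANGE_WRITE) < 0)
            perror("sync_file_range");
        done += (size_t) n;
    }
}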

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-30 Thread Heikki Linnakangas

On 08/30/2014 09:45 PM, Andres Freund wrote:

On 2014-08-30 14:16:10 -0400, Tom Lane wrote:

Andres Freund and...@2ndquadrant.com writes:

On 2014-08-30 13:50:40 -0400, Tom Lane wrote:

A possible compromise is to sort a limited number of
buffers -- say, collect a few thousand dirty buffers then sort, dump and
fsync them, repeat as needed.



Yea, that's what I suggested nearby. But I don't really like it, because
it robs us of the the chance to fsync() a relfilenode immediately after
having synced all its buffers.


Uh, how so exactly?  You could still do that.  Yeah, you might fsync a rel
once per sort-group and not just once per checkpoint, but it's not clear
that that's a loss as long as the group size isn't tiny.


Because it wouldn't have the benefit of syncing the minimal amount of
data anymore. If lots of other relfilenodes have been synced in between,
the amount of newly dirtied pages in the OS's buffer cache (written by
backends, bgwriter) for an individual relfilenode is much higher.


I wonder how much of the benefit from sorting comes from sorting the 
pages within each file, and how much just from grouping all the writes 
of each file together. In other words, how much difference is there 
between sorting, and fsyncing between each file, or the crude patch I 
posted earlier.


If we're going to fsync between each file, there's no need to sort all 
the buffers at once. It's enough to pick one file as the target - like 
in my crude patch - and sort only the buffers for that file. Then fsync 
that file and move on to the next file. That requires scanning the 
buffers multiple times, but I think that's OK.
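A sketch of that per-file variant (hypothetical helper names): pick one target file at a time, collect and sort only its dirty buffers, write and fsync that file, then move on; the cost is one pass over shared buffers per file.

#define MAX_PER_FILE 65536      /* arbitrary bound for the sketch */

/* hypothetical stand-ins for the real routines */
extern int  next_target_file(void);                     /* -1 when done */
extern int  collect_dirty_buffers_of_file(int file, int *ids, int max);
extern void sort_buffer_ids(int *ids, int n);
extern void write_buffers(const int *ids, int n);
extern void fsync_file(int file);

static void
checkpoint_file_by_file(void)
{
    static int ids[MAX_PER_FILE];
    int file;

    while ((file = next_target_file()) >= 0)
    {
        /* one full scan of shared buffers per target file */
        int n = collect_dirty_buffers_of_file(file, ids, MAX_PER_FILE);

        sort_buffer_ids(ids, n);
        write_buffers(ids, n);
        fsync_file(file);       /* fsync before moving on to the next file */
    }
}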


- Heikki





Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-30 Thread Andres Freund
On 2014-08-31 01:50:48 +0300, Heikki Linnakangas wrote:
 On 08/30/2014 09:45 PM, Andres Freund wrote:
 On 2014-08-30 14:16:10 -0400, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
 On 2014-08-30 13:50:40 -0400, Tom Lane wrote:
 A possible compromise is to sort a limited number of
 buffers -- say, collect a few thousand dirty buffers then sort, dump and
 fsync them, repeat as needed.
 
 Yea, that's what I suggested nearby. But I don't really like it, because
 it robs us of the the chance to fsync() a relfilenode immediately after
 having synced all its buffers.
 
 Uh, how so exactly?  You could still do that.  Yeah, you might fsync a rel
 once per sort-group and not just once per checkpoint, but it's not clear
 that that's a loss as long as the group size isn't tiny.
 
 Because it wouldn't have the benefit of syncing the minimal amount of
 data anymore. If lots of other relfilenodes have been synced in between,
 the amount of newly dirtied pages in the OS's buffer cache (written by
 backends, bgwriter) for an individual relfilenode is much higher.
 
 I wonder how much of the benefit from sorting comes from sorting the pages
 within each file, and how much just from grouping all the writes of each
 file together. In other words, how much difference is there between sorting,
 and fsyncing between each file, or the crude patch I posted earlier.

I haven't implemented fsync()ing between files so far. From the I/O stats
I'm seeing, the performance improvements come from the OS being able to
write data back in bigger chunks, which seems entirely reasonable. If
the database and the write load are big enough that writeback will be
triggered repeatedly during one checkpoint, the OS's buffer cache will
have lots of nonsequential data to flush, leading to much smaller I/Os,
more seeks and deeper queues (= latency increases).

 If we're going to fsync between each file, there's no need to sort all the
 buffers at once. It's enough to pick one file as the target - like in my
 crude patch - and sort only the buffers for that file. Then fsync that file
 and move on to the next file. That requires scanning the buffers multiple
 times, but I think that's OK.

I really can't see that working out. Production instances of postgres
with large shared_buffers settings (say 96GB in one case) have tens of
thousands of relations (~34500 in the same case). And that's a database
with a relatively simple schema. I've seen much worse.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-30 Thread Mitsumasa KONDO
Hi,

2014-08-31 8:10 GMT+09:00 Andres Freund and...@2ndquadrant.com:

 On 2014-08-31 01:50:48 +0300, Heikki Linnakangas wrote:

 If we're going to fsync between each file, there's no need to sort all the
  buffers at once. It's enough to pick one file as the target - like in my
  crude patch - and sort only the buffers for that file. Then fsync that
 file
  and move on to the next file. That requires scanning the buffers multiple
  times, but I think that's OK.

 I really can't see that working out. Production instances of postgres
 with large shared_buffers settings (say 96GB in one case) have tens of
 thousands of relations (~34500 in the same case). And that's a database
 with a relatively simple schema. I've seen much worse.

Yeah, it is impossible in one checkpointer process. The cost of searching
all buffers is higher than we expect. We need a clever algorithm for
efficient, distributed buffer search using multiple processes or threads.

Regards,
--
Mitsumasa KONDO


Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-28 Thread Fabien COELHO


Hello Aidan,


If all you want is to avoid the write storms when fsyncs start happening on
slow storage, can you not just adjust the kernel vm.dirty* tunables to
start making the kernel write out dirty buffers much sooner instead of
letting them accumulate until fsyncs force them out all at once?


I tried that by setting:
  vm.dirty_expire_centisecs = 100
  vm.dirty_writeback_centisecs = 100

So it should start writing returned buffers at most 2s after they are 
returned, if I understood the doc correctly, instead of at most 35s.


The result is that with a 5000s 25tps pretty small load (the system can do
300tps with the default configuration), I lost 2091 (1.7%) of the
transactions, that is, they were beyond the 200ms schedule limit. Another
change is that overall the lost transactions are more spread out than
without this setting, although there still are stretches of unresponsiveness.


So although the situation is significantly better, it is still far from 
good with the much reduced value I tried.


--
Fabien.




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-28 Thread Claudio Freire
On Thu, Aug 28, 2014 at 3:27 AM, Fabien COELHO coe...@cri.ensmp.fr wrote:
 Hello Aidan,


 If all you want is to avoid the write storms when fsyncs start happening
 on
 slow storage, can you not just adjust the kernel vm.dirty* tunables to
 start making the kernel write out dirty buffers much sooner instead of
 letting them accumulate until fsyncs force them out all at once?


 I tried that by setting:
   vm.dirty_expire_centisecs = 100
   vm.dirty_writeback_centisecs = 100

 So it should start writing returned buffers at most 2s after they are
 returned, if I understood the doc correctly, instead of at most 35s.

 The result is that with a 5000s 25tps pretty small load (the system can do
 300tps with the default configuration), I lost 2091 (1.7%) of transactions,
 that is they were beyond the 200ms schedule limit. Another change is that
 overall the lost transactions are more spread than without this setting,
 although there still is stretches of unresponsiveness.

 So although the situation is significantly better, it is still far from good
 with the much reduced value I tried.


What do the checkpoint logs look like now? (especially interested in sync times)




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-28 Thread Fabien COELHO



I tried that by setting:
  vm.dirty_expire_centisecs = 100
  vm.dirty_writeback_centisecs = 100

So it should start writing returned buffers at most 2s after they are
returned, if I understood the doc correctly, instead of at most 35s.

The result is that with a 5000s 25tps pretty small load (the system can do
300tps with the default configuration), I lost 2091 (1.7%) of transactions,
that is they were beyond the 200ms schedule limit. Another change is that
overall the lost transactions are more spread than without this setting,
although there still is stretches of unresponsiveness.

So although the situation is significantly better, it is still far from good
with the much reduced value I tried.


What do the checkpoint logs look like now? (especially interested in sync times)


Here is an extract:

  LOG:  checkpoint starting: xlog
  LOG:  checkpoint complete: wrote 4893 buffers (4.8%);
0 transaction log file(s) added, 0 removed, 0 recycled;
write=128.430 s, sync=0.378 s, total=128.921 s;
sync files=11, longest=0.375 s, average=0.034 s
  LOG:  checkpoint starting: xlog
  LOG:  checkpoint complete: wrote 6200 buffers (6.1%);
0 transaction log file(s) added, 3 removed, 0 recycled;
write=128.934 s, sync=0.240 s, total=129.280 s;
sync files=7, longest=0.132 s, average=0.034 s
  LOG:  checkpoint starting: xlog
  LOG:  checkpoint complete: wrote 6177 buffers (6.0%);
0 transaction log file(s) added, 3 removed, 0 recycled;
write=130.146 s, sync=0.322 s, total=130.592 s;
sync files=7, longest=0.185 s, average=0.046 s
  LOG:  checkpoint starting: xlog
  LOG:  checkpoint complete: wrote 6185 buffers (6.0%);
0 transaction log file(s) added, 3 removed, 0 recycled;
write=132.673 s, sync=0.143 s, total=132.909 s;
sync files=5, longest=0.078 s, average=0.028 s
  LOG:  checkpoint starting: xlog
  LOG:  checkpoint complete: wrote 6188 buffers (6.0%);
0 transaction log file(s) added, 3 removed, 0 recycled;
write=130.415 s, sync=0.170 s, total=130.676 s;
sync files=5, longest=0.132 s, average=0.034 s
  LOG:  checkpoint starting: xlog
  LOG:  checkpoint complete: wrote 6203 buffers (6.1%);
0 transaction log file(s) added, 3 removed, 0 recycled;
   write=131.726 s, sync=0.444 s, total=132.256 s;
   sync files=5, longest=0.441 s, average=0.088 s
  ...

Basically sync is between 0.1-0.5 seconds.

I had one particularly bad stretch of unresponsiveness, which is not
clearly related to a slow checkpoint sync:


  progress: 4065.0 s, 28.7 tps, lat 11.886 ms stddev 48.470, lag 18.851 ms, 0 skipped
  progress: 4066.2 s, 1.6 tps, lat 952.830 ms stddev 280.673, lag 155.310 ms, 25 skipped
  progress: 4067.0 s, 2.7 tps, lat 1067.730 ms stddev 316.321, lag 171.766 ms, 14 skipped
  progress: 4068.4 s, 1.4 tps, lat 1147.392 ms stddev 240.337, lag 132.297 ms, 35 skipped
  progress: 4069.7 s, 2.3 tps, lat 1034.543 ms stddev 213.786, lag 154.453 ms, 35 skipped
  progress: 4070.2 s, 1.8 tps, lat 562.715 ms stddev 0.000, lag 182.833 ms, 12 skipped
  progress: 4071.3 s, 3.6 tps, lat 600.929 ms stddev 108.232, lag 162.723 ms, 25 skipped
  progress: 4072.5 s, 1.7 tps, lat 1077.187 ms stddev 77.654, lag 168.849 ms, 31 skipped
  progress: 4073.3 s, 1.3 tps, lat 1298.093 ms stddev 0.000, lag 168.882 ms, 21 skipped
  progress: 4074.0 s, 2.7 tps, lat 920.704 ms stddev 183.587, lag 165.333 ms, 24 skipped
  progress: 4075.3 s, 2.4 tps, lat 756.051 ms stddev 118.241, lag 176.863 ms, 29 skipped
  progress: 4076.1 s, 1.3 tps, lat 1424.548 ms stddev 0.000, lag 0.000 ms, 17 skipped
  progress: 4077.3 s, 2.4 tps, lat 791.090 ms stddev 338.729, lag 155.403 ms, 26 skipped
  progress: 4078.1 s, 27.4 tps, lat 46.834 ms stddev 198.812, lag 3.915 ms, 0 skipped



--
Fabien




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-27 Thread Fabien COELHO


Hello Andres,


[...]
I think you're misunderstanding how spread checkpoints work.


Yep, definitely:-) On the other hand I thought I was seeking something
simple, namely correct latency under a small load, which I would expect
out of the box.


What you describe is reasonable, and is more or less what I was hoping 
for, although I thought that bgwriter was involved from the start and 
checkpoint would only do what is needed in the end. My mistake.


When the checkpointer process starts a spread checkpoint it first writes 
all buffers to the kernel in a paced manner.
That pace is determined by checkpoint_completion_target and 
checkpoint_timeout.


This pacing does not seem to work, even at a slow pace.


If you have a stall of roughly the same magnitude (say a factor
of two different), the smaller once a minute, the larger once an
hour. Obviously the once-an-hour one will have a better latency in many,
many more transactions.


I do not believe that delaying writes to disk as much as possible is a
viable strategy for handling a small load.  However, to show my good will,
I have tried to follow your advice: I've launched a 5000 seconds test with
50 segments, 30 min timeout, 0.9 completion target, at 25 tps, which is
less than 1/10 of the maximum throughput.


There are only two time-triggered checkpoints:

  LOG:  checkpoint starting: time
  LOG:  checkpoint complete: wrote 48725 buffers (47.6%);
  1 transaction log file(s) added, 0 removed, 0 recycled;
  write=1619.750 s, sync=27.675 s, total=1647.932 s;
  sync files=14, longest=27.593 s, average=1.976 s

  LOG:  checkpoint starting: time
  LOG:  checkpoint complete: wrote 22533 buffers (22.0%);
  0 transaction log file(s) added, 0 removed, 23 recycled;
  write=826.919 s, sync=9.989 s, total=837.023 s;
  sync files=8, longest=6.742 s, average=1.248 s

For the first one, 48725 buffers is 380MB. 1800 * 0.9 = 1620 seconds to 
complete, so it means 30 buffer writes per second... should be ok. However 
sync costs 27 seconds nevertheless, and the server was more or less 
offline for about 30 seconds flat. For the second one, 180 MB to write, 10 
seconds offline. For some reason the target time is reduced. I have also 
tried with the deadline IO scheduler which make more sense than the 
default cfq, but the result was similar. Not sure how software RAID 
interacts with IO scheduling, though.


Overall result: over the 5000s test, I have lost (i.e. more than 200ms 
behind schedule) more than 2.5% of transactions (1/40). Due to the 
unfinished cycle, the long term average is probably about 3%. Although it 
is better than 10%, it is not good. I would expect/hope for something 
pretty close to 0, even with ext4 on Linux, for a dedicated host which has 
nothing else to do but handle two dozen transactions per second.


Current conclusion: I have not found any way to improve the situation to 
good with parameters from the configuration. Currently a small load 
results in periodic offline time, that can be delayed but not avoided. The 
delaying tactic results in less frequent but longer downtime. I prefer 
frequent very short downtime instead.


I really think that something is amiss. Maybe pg does not handle pacing as 
it should.


For the record, a 25tps bench with a small config (default 3 segments, 
5min timeout, 0.5 completion target) and with a parallel:


while true ; do echo "CHECKPOINT;" ; sleep 0.2s ; done | psql

results in losing only 0.01% of transactions (12 transactions out of
125893 were behind more than 200ms in 5000 seconds). Although you may
think it stupid, from my point of view it shows that it is possible to
coerce pg to behave.


With respect to the current status:

(1) the ability to put checkpoint_timeout to values smaller than 30s could 
help, although obviously there would be other consequences. But the 
ability to avoid periodic offline time looks like a desirable objective.


(2) I still think that a parameter to force bgwriter to write more stuff 
could help, but this is not tested.


(3) Any other effective idea to configure for responsiveness is welcome!

If someone wants to repeat these tests, it is easy and only takes a few 
minutes:


  sh> createdb test
  sh> pgbench -i -s 100 -F 95 test
  sh> pgbench -M prepared -N -R 25 -L 200 -c 2 -T 5000 -P 1 test > pgb.out

Note: the -L to limit latency is a submitted patch. Without this, 
unresponsiveness shows as increasing laging time.


--
Fabien.




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-27 Thread Andres Freund
On 2014-08-27 09:32:16 +0200, Fabien COELHO wrote:
 
 Hello Andres,
 
 [...]
 I think you're misunderstanding how spread checkpoints work.
 
 Yep, definitely:-) On the other hand I though I was seeking something
 simple, namely correct latency under small load, that I would expect out
 of the box.

Yea. The current situation *sucks*. Both from the utterly borked
behaviour of ext4 and other filesystems, and from the lack of
workarounds in postgres.

 When the checkpointer process starts a spread checkpoint it first writes
 all buffers to the kernel in a paced manner.
 That pace is determined by checkpoint_completion_target and
 checkpoint_timeout.
 
 This pacing does not seem to work, even at slow pace.

It definitely does in some cases. What's your evidence the pacing
doesn't work? Afaik it's the fsync that causes the problem, not the
writes themselves.

 If you have a stall of roughly the same magnitude (say a factor
 of two different), the smaller once a minute, the larger once an
 hour. Obviously the once-an-hour one will have a better latency in many,
 many more transactions.
 
 I do not believe that delaying writes to disk as much as possible is a
 viable strategy for handling a small load.  However, to show my good will,
 I have tried to follow your advice: I've launched a 5000 seconds test with
 50 segments, 30 min timeout, 0.9 completion target, at 25 tps, which is
 less than 1/10 of the maximum throughput.
 
 There are only two time-triggered checkpoints:
 
   LOG:  checkpoint starting: time
   LOG:  checkpoint complete: wrote 48725 buffers (47.6%);
   1 transaction log file(s) added, 0 removed, 0 recycled;
   write=1619.750 s, sync=27.675 s, total=1647.932 s;
   sync files=14, longest=27.593 s, average=1.976 s
 
   LOG:  checkpoint starting: time
   LOG:  checkpoint complete: wrote 22533 buffers (22.0%);
   0 transaction log file(s) added, 0 removed, 23 recycled;
   write=826.919 s, sync=9.989 s, total=837.023 s;
   sync files=8, longest=6.742 s, average=1.248 s

The write pacing itself doesn't seem to be bad. The bad thing is the
'sync' times here. Those are *NOT* paced, and the kernel has probably
delayed flushing out much of the writes...

 (1) the ability to put checkpoint_timeout to values smaller than 30s could
 help, although obviously there would be other consequences. But the ability
 to avoid periodic offline time looks like a desirable objective.

I'd rather not do that. It's an utterly horrible hack to go this route.

 (2) I still think that a parameter to force bgwriter to write more stuff
 could help, but this is not tested.

It's going to be random writes. That's not going to be helpful.

 (3) Any other effective idea to configure for responsiveness is welcome!

I've a couple of ideas how to improve the situation, but so far I've not
had the time to investigate them properly. Would you be willing to test
a couple of simple patches?

Did you test xfs already?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-27 Thread Fabien COELHO


[...] What's your evidence the pacing doesn't work? Afaik it's the fsync 
that causes the problem, not the the writes themselves.


Hmmm. My (poor) understanding is that fsync would work fine if everything 
was already written beforehand:-) that is it has nothing to do but assess 
that all is already written. If there is remaining write work, it starts 
doing it now with the disastrous effects I'm complaining about.


When I say the pacing does not work, I mean that things were not written
out to disk by the OS; it does not mean that pg did not ask for it.


However it does not make much sense for an OS scheduler to wait several 
minutes with tens of thousands of pages to write and do nothing about 
it... So I'm wondering.



[...]

(1) the ability to put checkpoint_timeout to values smaller than 30s could
help, although obviously there would be other consequences. But the ability
to avoid periodic offline time looks like a desirable objective.


I'd rather not do that. It's a utterly horrible hack to go this write.


Hmmm. It does solve the issue, though:-) It would be the administrator 
choice. It is better than nothing, which is the current status.



(2) I still think that a parameter to force bgwriter to write more stuff
could help, but this is not tested.


It's going to be random writes. That's not going to be helpful.


The -N small OLTP load on a large (GB) table *is* random writes anyway, 
whether they occur at checkpoint or at any other time. Random writes are 
fine in this case, the load is small, there should be no problem.


(3) Any other effective idea to configure for responsiveness is 
welcome!


I've a couple of ideas how to improve the situation, but so far I've not
had the time to investigate them properly. Would you be willing to test
a couple of simple patches?


I can test a couple of patches. I already did one on someone's advice (make
bgwriter go round all the buffers in 1s instead of 120s), without positive
effect.



Did you test xfs already?


No. I cannot without reinstalling, which I cannot do on a remote host, and 
I will probably not have time to do it when I'll have physical access. 
Only one partition on the host. My mistake. Will not do it again. Shame on 
me.


If someone out there has an XFS setup, it is very easy to test and only 
takes a couple of minutes, really. It takes less time to do it than to 
write a mail about it afterwards:-)


I have tested FreeBSD/UFS with similar results, a few periodic offlines. 
UFS journaled file system is probably not ideal for database work, but yet 
again the load is small, it should be able to cope without going offline.


--
Fabien.




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-27 Thread Andres Freund
On 2014-08-27 11:05:52 +0200, Fabien COELHO wrote:
 I can test a couple of patches. I already did one on someone's advice (make
 bgwriter go round all the buffers in 1s instead of 120s), without positive
 effect.

I've quickly cobbled together the attached patch (which at least doesn't
seem to crash & burn). It tries to trigger pages being flushed out
during the paced phase of checkpoints instead of the fsync phase. The
sync_on_checkpoint_flush setting can be used to enable/disable that
behaviour.

I'd be interested to hear whether that improves your latency numbers. I
unfortunately don't have more time to spend on this right now :(.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-27 Thread Andres Freund
On 2014-08-27 11:14:46 +0200, Andres Freund wrote:
 On 2014-08-27 11:05:52 +0200, Fabien COELHO wrote:
  I can test a couple of patches. I already did one on someone's advice (make
  bgwriter go round all the buffers in 1s instead of 120s), without positive
  effect.
 
 I've quickly cobbled together the attached patch (which at least doesn't
 seem to crash & burn). It tries to trigger pages being flushed out
 during the paced phase of checkpoints instead of the fsync phase. The
 sync_on_checkpoint_flush setting can be used to enable/disable that
 behaviour.
 
 I'd be interested to hear whether that improves your latency numbers. I
 unfortunately don't have more time to spend on this right now :(.

And actually attached. Note that it's linux only...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 4576970..eab30a6 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -78,6 +78,7 @@ bool		zero_damaged_pages = false;
 int			bgwriter_lru_maxpages = 100;
 double		bgwriter_lru_multiplier = 2.0;
 bool		track_io_timing = false;
+bool		sync_on_checkpoint_flush = false;
 
 /*
  * How many buffers PrefetchBuffer callers should try to stay ahead of their
@@ -356,7 +357,7 @@ static bool PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy);
 static void PinBuffer_Locked(volatile BufferDesc *buf);
 static void UnpinBuffer(volatile BufferDesc *buf, bool fixOwner);
 static void BufferSync(int flags);
-static int	SyncOneBuffer(int buf_id, bool skip_recently_used);
+static int	SyncOneBuffer(int buf_id, bool skip_recently_used, bool flush_to_disk);
 static void WaitIO(volatile BufferDesc *buf);
 static bool StartBufferIO(volatile BufferDesc *buf, bool forInput);
 static void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
@@ -369,7 +370,7 @@ static volatile BufferDesc *BufferAlloc(SMgrRelation smgr,
 			BlockNumber blockNum,
 			BufferAccessStrategy strategy,
 			bool *foundPtr);
-static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln);
+static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln, bool flush_to_disk);
 static void AtProcExit_Buffers(int code, Datum arg);
 static void CheckForBufferLeaks(void);
 static int	rnode_comparator(const void *p1, const void *p2);
@@ -947,7 +948,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 smgr->smgr_rnode.node.dbNode,
			  smgr->smgr_rnode.node.relNode);
 
-FlushBuffer(buf, NULL);
+FlushBuffer(buf, NULL, false);
LWLockRelease(buf->content_lock);
 
 TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(forkNum, blockNum,
@@ -1573,7 +1574,7 @@ BufferSync(int flags)
 		 */
		if (bufHdr->flags & BM_CHECKPOINT_NEEDED)
 		{
-			if (SyncOneBuffer(buf_id, false) & BUF_WRITTEN)
+			if (SyncOneBuffer(buf_id, false, true) & BUF_WRITTEN)
 			{
 TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
 BgWriterStats.m_buf_written_checkpoints++;
@@ -1850,7 +1851,7 @@ BgBufferSync(void)
 	/* Execute the LRU scan */
 	while (num_to_scan  0  reusable_buffers  upcoming_alloc_est)
 	{
-		int			buffer_state = SyncOneBuffer(next_to_clean, true);
+		int			buffer_state = SyncOneBuffer(next_to_clean, true, false);
 
 		if (++next_to_clean = NBuffers)
 		{
@@ -1927,7 +1928,7 @@ BgBufferSync(void)
  * Note: caller must have done ResourceOwnerEnlargeBuffers.
  */
 static int
-SyncOneBuffer(int buf_id, bool skip_recently_used)
+SyncOneBuffer(int buf_id, bool skip_recently_used, bool flush_to_disk)
 {
	volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];
 	int			result = 0;
@@ -1966,7 +1967,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used)
 	PinBuffer_Locked(bufHdr);
	LWLockAcquire(bufHdr->content_lock, LW_SHARED);
 
-	FlushBuffer(bufHdr, NULL);
+	FlushBuffer(bufHdr, NULL, flush_to_disk);
 
	LWLockRelease(bufHdr->content_lock);
 	UnpinBuffer(bufHdr, true);
@@ -2230,7 +2231,7 @@ BufferGetTag(Buffer buffer, RelFileNode *rnode, ForkNumber *forknum,
  * as the second parameter.  If not, pass NULL.
  */
 static void
-FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln, bool flush_to_disk)
 {
 	XLogRecPtr	recptr;
 	ErrorContextCallback errcallback;
@@ -2342,6 +2343,10 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
	   reln->smgr_rnode.node.spcNode,
	   reln->smgr_rnode.node.dbNode,
	   reln->smgr_rnode.node.relNode);
+	if (flush_to_disk && sync_on_checkpoint_flush)
+		smgrflush(reln,
+  buf->tag.forkNum,
+  buf->tag.blockNum);
 
 	/* Pop the error context stack */
 	error_context_stack = errcallback.previous;
@@ -2769,7 +2774,7 @@ FlushRelationBuffers(Relation rel)
 		{
 			PinBuffer_Locked(bufHdr);
			LWLockAcquire(bufHdr->content_lock, LW_SHARED);
-			FlushBuffer(bufHdr, rel->rd_smgr);
+			

Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-27 Thread Andres Freund
On 2014-08-27 11:19:22 +0200, Andres Freund wrote:
 On 2014-08-27 11:14:46 +0200, Andres Freund wrote:
  On 2014-08-27 11:05:52 +0200, Fabien COELHO wrote:
   I can test a couple of patches. I already did one on someone advice (make
   bgwriter round all stuff in 1s instead of 120s, without positive effect.
  
  I've quickly cobbled together the attached patch (which at least doesn't
  seem to crash  burn). It tries to trigger pages being flushed out
  during the paced phase of checkpoints instead of the fsync phase. The
  sync_on_checkpoint_flush can be used to enable/disable that behaviour.
  
  I'd be interested to hear whether that improves your latency numbers. I
  unfortunately don't have more time to spend on this right now :(.
 
 And actually attached. Note that it's linux only...

I got curious and ran a quick test:

config:
log_checkpoints=on
checkpoint_timeout=1min
checkpoint_completion_target=0.95
checkpoint_segments=100
synchronous_commit=on
fsync=on
huge_pages=on
max_connections=200
shared_buffers=6GB
wal_level=hot_standby

off:

$ pgbench -p 5440 -h /tmp postgres -M prepared -c 16 -j16 -T 120 -R 180 -L 200
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 150
query mode: prepared
number of clients: 16
number of threads: 16
duration: 120 s
number of transactions actually processed: 20189
latency average: 23.136 ms
latency stddev: 59.044 ms
rate limit schedule lag: avg 4.599 (max 199.975) ms
number of skipped transactions: 1345 (6.246 %)
tps = 167.664290 (including connections establishing)
tps = 167.675679 (excluding connections establishing)

LOG:  checkpoint starting: time
LOG:  checkpoint complete: wrote 12754 buffers (1.6%); 0 transaction log 
file(s) added, 0 removed, 2 recycled; write=56.928 s, sync=3.639 s, 
total=60.749 s; sync files=20, longest=2.741 s, average=0.181 s
LOG:  checkpoint starting: time
LOG:  checkpoint complete: wrote 12269 buffers (1.6%); 0 transaction log 
file(s) added, 0 removed, 6 recycled; write=20.701 s, sync=8.568 s, 
total=29.444 s; sync files=10, longest=3.568 s, average=0.856 s


on:

$ pgbench -p 5440 -h /tmp postgres -M prepared -c 16 -j16 -T 120 -R 180 -L 200
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 150
query mode: prepared
number of clients: 16
number of threads: 16
duration: 120 s
number of transactions actually processed: 21327
latency average: 20.735 ms
latency stddev: 14.643 ms
rate limit schedule lag: avg 4.965 (max 185.003) ms
number of skipped transactions: 1 (0.005 %)
tps = 177.214391 (including connections establishing)
tps = 177.225476 (excluding connections establishing)

LOG:  checkpoint starting: time
LOG:  checkpoint complete: wrote 12217 buffers (1.6%); 0 transaction log 
file(s) added, 0 removed, 1 recycled; write=57.022 s, sync=0.203 s, 
total=57.377 s; sync files=19, longest=0.033 s, average=0.010 s
LOG:  checkpoint starting: time
LOG:  checkpoint complete: wrote 13185 buffers (1.7%); 0 transaction log 
file(s) added, 0 removed, 6 recycled; write=56.628 s, sync=0.019 s, 
total=56.803 s; sync files=11, longest=0.017 s, average=0.001 s


That machine is far from idle right now, so the noise is pretty
high. But rather nice initial results.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-27 Thread Fabien COELHO


Hello Amit,

I see there is some merit in your point which is to make bgwriter more 
useful than its current form. I could see 3 top level points to think 
about whether improvement in any of those can improve the current 
situation:



a. Scanning of buffer pool to find the dirty buffers that can
be flushed.


Yep, that could be more aggressive, or the aggressiveness could be made 
into an adjustable parameter. There is a comment about that in the source. 
However I tested setting the round to 1s instead of 120s, and it had no 
positive effect on my test.



b. Deciding on what is criteria to flush a buffer


Indeed.

For instance when the IO load is low, there is no reason not to send out 
some buffers, it is a free lunch even if it must be done again later. If 
the load is high, this is another matter. So it would mean being able to 
decide whether the current IO load is low or not. It could be different on 
different disk... Hmmm, not that simple.



c. Sync of buffers


Yes. That is more or less the effect of my rough approach of calling 
CHECKPOINT every 0.2 seconds.  Andres Freund just implemented a quick 
linux-specific patch which does that within CHECKPOINT pacing, and which 
greatly improves the situation, although it could still be a little 
better.



[separate xlog test]


I do not have a setup available for that right now. Not sure this would be 
an issue for low loads.


--
Fabien.




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-27 Thread Claudio Freire
On Wed, Aug 27, 2014 at 6:05 AM, Fabien COELHO coe...@cri.ensmp.fr wrote:
 [...] What's your evidence the pacing doesn't work? Afaik it's the fsync
 that causes the problem, not the the writes themselves.


 Hmmm. My (poor) understanding is that fsync would work fine if everything
 was already written beforehand:-) that is it has nothing to do but assess
 that all is already written. If there is remaining write work, it starts
 doing it now with the disastrous effects I'm complaining about.

 When I say pacing does not work, I mean that things where not written out
 to disk by the OS, it does not mean that pg did not ask for it.

 However it does not make much sense for an OS scheduler to wait several
 minutes with tens of thousands of pages to write and do nothing about it...
 So I'm wondering.

Maybe what's needed, is to slightly tweak checkpoint logic to give the
kernel some time to flush buffers.

Correct me if I'm wrong, but the checkpointer does the sync right
after the reads. Of course there will be about 30s-worth of
accumulated writes (it's the default amount of time the kernel holds
on to dirty buffers).

Perhaps it should be delayed a small time, say 30s, to let the kernel
do the writing on its own.




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-27 Thread Claudio Freire
On Wed, Aug 27, 2014 at 10:10 AM, Claudio Freire klaussfre...@gmail.com wrote:
 On Wed, Aug 27, 2014 at 6:05 AM, Fabien COELHO coe...@cri.ensmp.fr wrote:
 [...] What's your evidence the pacing doesn't work? Afaik it's the fsync
 that causes the problem, not the the writes themselves.


 Hmmm. My (poor) understanding is that fsync would work fine if everything
 was already written beforehand:-) that is it has nothing to do but assess
 that all is already written. If there is remaining write work, it starts
 doing it now with the disastrous effects I'm complaining about.

 When I say pacing does not work, I mean that things where not written out
 to disk by the OS, it does not mean that pg did not ask for it.

 However it does not make much sense for an OS scheduler to wait several
 minutes with tens of thousands of pages to write and do nothing about it...
 So I'm wondering.

 Maybe what's needed, is to slightly tweak checkpoint logic to give the
 kernel some time to flush buffers.

 Correct me if I'm wrong, but the checkpointer does the sync right
 after the reads. Of course there will be about 30s-worth of
 accumulated writes (it's the default amount of time the kernel holds
 on to dirty buffers).

 Perhaps it should be delayed a small time, say 30s, to let the kernel
 do the writing on its own.


Errata: just after the writes :-p




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-27 Thread Andres Freund
On 2014-08-27 10:10:49 -0300, Claudio Freire wrote:
 On Wed, Aug 27, 2014 at 6:05 AM, Fabien COELHO coe...@cri.ensmp.fr wrote:
  [...] What's your evidence the pacing doesn't work? Afaik it's the fsync
  that causes the problem, not the the writes themselves.
 
 
  Hmmm. My (poor) understanding is that fsync would work fine if everything
  was already written beforehand:-) that is it has nothing to do but assess
  that all is already written. If there is remaining write work, it starts
  doing it now with the disastrous effects I'm complaining about.
 
  When I say pacing does not work, I mean that things where not written out
  to disk by the OS, it does not mean that pg did not ask for it.
 
  However it does not make much sense for an OS scheduler to wait several
  minutes with tens of thousands of pages to write and do nothing about it...
  So I'm wondering.
 
 Maybe what's needed, is to slightly tweak checkpoint logic to give the
 kernel some time to flush buffers.
 
 Correct me if I'm wrong, but the checkpointer does the sync right
 after the reads. Of course there will be about 30s-worth of
 accumulated writes (it's the default amount of time the kernel holds
 on to dirty buffers).
 
 Perhaps it should be delayed a small time, say 30s, to let the kernel
 do the writing on its own.

The kernel *starts* to write out pages after 30s, it doesn't finish
doing so. So I don't think that's going to work.

I think a somewhat smarter version of the explicit flushes in the
hack^Wpatch I posted nearby is more likely to be successful.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-27 Thread Claudio Freire
On Wed, Aug 27, 2014 at 10:15 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2014-08-27 10:10:49 -0300, Claudio Freire wrote:
 On Wed, Aug 27, 2014 at 6:05 AM, Fabien COELHO coe...@cri.ensmp.fr wrote:
  [...] What's your evidence the pacing doesn't work? Afaik it's the fsync
  that causes the problem, not the the writes themselves.
 
 
  Hmmm. My (poor) understanding is that fsync would work fine if everything
  was already written beforehand:-) that is it has nothing to do but assess
  that all is already written. If there is remaining write work, it starts
  doing it now with the disastrous effects I'm complaining about.
 
  When I say pacing does not work, I mean that things where not written out
  to disk by the OS, it does not mean that pg did not ask for it.
 
  However it does not make much sense for an OS scheduler to wait several
  minutes with tens of thousands of pages to write and do nothing about it...
  So I'm wondering.

 Maybe what's needed, is to slightly tweak checkpoint logic to give the
 kernel some time to flush buffers.

 Correct me if I'm wrong, but the checkpointer does the sync right
 after the reads. Of course there will be about 30s-worth of
 accumulated writes (it's the default amount of time the kernel holds
 on to dirty buffers).

 Perhaps it should be delayed a small time, say 30s, to let the kernel
 do the writing on its own.

 The kernel *starts* to write out pages after 30s, it doesn't finish
 doing so. So I don't think that's going to work.

 I think a somewhat smarter version of the explicit flushes in the
 hack^Wpatch I posted nearby is going to more likely to be successful.


That path is dangerous (as in, may not work as intended) if the
filesystem doesn't properly understand range flushes (ehem, like
ext3).




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-27 Thread Andres Freund
On 2014-08-27 10:17:06 -0300, Claudio Freire wrote:
  I think a somewhat smarter version of the explicit flushes in the
  hack^Wpatch I posted nearby is going to more likely to be successful.
 
 
 That path is dangerous (as in, may not work as intended) if the
 filesystem doesn't properly understand range flushes (ehem, like
 ext3).

The sync_file_range(SYNC_FILE_RANGE_WRITE) I used isn't an operation
guaranteeing durability. And - afaik - not implemented in a file system
specific manner. It just initiates writeback for individual pages. It
doesn't cause barrier, journal flushes or anything to be issued. That's
still done by the fsync() later.

The big disadvantage is that it's an OS-specific solution, but I don't
think we're going to find anything that isn't in this area.
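
For concreteness, here is a minimal self-contained sketch (Linux only) of
the write-then-initiate-writeback pattern; the file name, block size and
block number are illustrative, not taken from the patch:

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define BLCKSZ 8192

int main(void)
{
	char	page[BLCKSZ];
	off_t	offset = 0 * BLCKSZ;	/* block number 0, illustrative */
	int		fd;

	memset(page, 'x', sizeof(page));

	fd = open("relation.segment", O_RDWR | O_CREAT, 0600);
	if (fd < 0)
		return 1;

	if (pwrite(fd, page, BLCKSZ, offset) != BLCKSZ)
		return 1;

	/* initiate writeback for just this range: no barrier, no journal flush */
	if (sync_file_range(fd, offset, BLCKSZ, SYNC_FILE_RANGE_WRITE) != 0)
		return 1;

	/* ... write and flush many more blocks during the paced phase ... */

	if (fsync(fd) != 0)				/* durability still comes from the fsync */
		return 1;
	close(fd);
	return 0;
}

The kernel can start pushing such pages to disk right away instead of
letting them sit dirty until the final fsync().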

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-27 Thread Claudio Freire
On Wed, Aug 27, 2014 at 10:20 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2014-08-27 10:17:06 -0300, Claudio Freire wrote:
  I think a somewhat smarter version of the explicit flushes in the
  hack^Wpatch I posted nearby is going to more likely to be successful.


 That path is dangerous (as in, may not work as intended) if the
 filesystem doesn't properly understand range flushes (ehem, like
 ext3).

 The sync_file_range(SYNC_FILE_RANGE_WRITE) I used isn't a operation
 guaranteeing durability. And - afaik - not implemented in a file system
 specific manner. It just initiates writeback for individual pages. It
 doesn't cause barrier, journal flushes or anything to be issued. That's
 still done by the fsync() later.

 The big disadvantage is that it's a OS specific solution, but I don't
 think we're going to find anything that isn't in this area.

I guess it should work then.




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-27 Thread Aidan Van Dyk
On Wed, Aug 27, 2014 at 3:32 AM, Fabien COELHO coe...@cri.ensmp.fr wrote:


 Hello Andres,

  [...]

 I think you're misunderstanding how spread checkpoints work.


 Yep, definitely:-) On the other hand I though I was seeking something
 simple, namely correct latency under small load, that I would expect out
 of the box.

 What you describe is reasonable, and is more or less what I was hoping
 for, although I thought that bgwriter was involved from the start and
 checkpoint would only do what is needed in the end. My mistake.


If all you want is to avoid the write storms when fsyncs start happening on
slow storage, can you not just adjust the kernel vm.dirty* tunables to
start making the kernel write out dirty buffers much sooner instead of
letting them accumulate until fsyncs force them out all at once?
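
(For reference, the knobs in question are the kernel's dirty writeback
sysctls; the values below are purely illustrative, not a recommendation.
The defaults let roughly 10-20% of RAM go dirty before writeback gets
aggressive:

	vm.dirty_background_bytes = 67108864	# start background writeback past 64 MB of dirty data
	vm.dirty_bytes = 268435456		# throttle writers past 256 MB of dirty data
	vm.dirty_expire_centisecs = 1000	# dirty data becomes eligible for writeback after 10 s
	vm.dirty_writeback_centisecs = 100	# wake the flusher threads every 1 s

set via sysctl or /etc/sysctl.conf.)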


a.


-- 
Aidan Van Dyk Create like a god,
ai...@highrise.ca   command like a king,
http://www.highrise.ca/   work like a
slave.


Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-27 Thread Andres Freund
On 2014-08-27 10:32:19 -0400, Aidan Van Dyk wrote:
 On Wed, Aug 27, 2014 at 3:32 AM, Fabien COELHO coe...@cri.ensmp.fr wrote:
 
 
  Hello Andres,
 
   [...]
 
  I think you're misunderstanding how spread checkpoints work.
 
 
  Yep, definitely:-) On the other hand I though I was seeking something
  simple, namely correct latency under small load, that I would expect out
  of the box.
 
  What you describe is reasonable, and is more or less what I was hoping
  for, although I thought that bgwriter was involved from the start and
  checkpoint would only do what is needed in the end. My mistake.
 
 
 If all you want is to avoid the write storms when fsyncs start happening on
 slow storage, can you not just adjust the kernel vm.dirty* tunables to
 start making the kernel write out dirty buffers much sooner instead of
 letting them accumulate until fsyncs force them out all at once?

Well. For one that's an OS-specific global tunable requiring root to be
adjustable. For another we actually do want some buffering: If a backend
writes out a buffer's data itself (happens very frequently) it *should*
get buffered... So I don't think a process independent tunable is going
to do the trick.


Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-27 Thread Fabien COELHO


Hello,


If all you want is to avoid the write storms when fsyncs start happening on
slow storage, can you not just adjust the kernel vm.dirty* tunables to
start making the kernel write out dirty buffers much sooner instead of
letting them accumulate until fsyncs force them out all at once?


I can try, when I have finished with the current round of testing.

Note that, as Andres put it, it currently sucks. Having to tinker with 
linux kernel parameters just to handle a small load without being offline 
10% of the time does not look very good, so even if it works, ISTM that a 
pg side solution is desirable.


--
Fabien.




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-27 Thread Heikki Linnakangas

On 08/27/2014 04:20 PM, Andres Freund wrote:

On 2014-08-27 10:17:06 -0300, Claudio Freire wrote:

I think a somewhat smarter version of the explicit flushes in the
hack^Wpatch I posted nearby is going to more likely to be successful.



That path is dangerous (as in, may not work as intended) if the
filesystem doesn't properly understand range flushes (ehem, like
ext3).


The sync_file_range(SYNC_FILE_RANGE_WRITE) I used isn't a operation
guaranteeing durability. And - afaik - not implemented in a file system
specific manner. It just initiates writeback for individual pages. It
doesn't cause barrier, journal flushes or anything to be issued. That's
still done by the fsync() later.

The big disadvantage is that it's a OS specific solution, but I don't
think we're going to find anything that isn't in this area.


I've been thinking for a long time that we should interleave the writes 
and the fsyncs. That still forces up to 1GB of dirty buffers to disk at 
once, causing a spike, but at least not more than that. Also, the 
scheduling of a spread checkpoint is currently a bit bogus; we don't 
take into account the time needed for the fsync phase.


A long time ago, Itagaki Takahiro wrote a patch to sort the buffers and 
write them out in order 
(http://www.postgresql.org/message-id/flat/20070614153758.6a62.itagaki.takah...@oss.ntt.co.jp). 
The performance impact of that was inconclusive, but one thing that it 
allows nicely is to interleave the fsyncs, so that you write all the 
buffers for one file, then fsync it, then next file and so on. IIRC the 
biggest worry with that patch was that sorting the buffers requires a 
fairly large amount of memory, and making a large allocation in the 
checkpointer might cause an out-of-memory, which would be bad.


I don't think anyone's seriously worked on this area since. If the 
impact on responsiveness or performance is significant, I'm pretty sure 
the OOM problem could be alleviated somehow.


For kicks, I wrote a quick & dirty patch for interleaving the 
fsyncs, see attached. It works by repeatedly scanning the buffer pool, 
writing buffers belonging to a single relation segment at a time. I 
would be interested to hear how this performs in your test case.


- Heikki

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 938c554..0f2e4e0 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1217,6 +1217,12 @@ BufferSync(int flags)
 	int			num_to_write;
 	int			num_written;
 	int			mask = BM_DIRTY;
+	RelFileNode target_rnode = { 0, 0, 0 };
+	ForkNumber	target_forkNum = InvalidForkNumber;
+	int			target_segno = 0;
+	bool		has_target = false;
+	int			outer_num_to_scan = 0;
+	int			outer_buf_id = 0;
 
 	/* Make sure we can handle the pin inside SyncOneBuffer */
 	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
@@ -1281,10 +1287,30 @@ BufferSync(int flags)
 	buf_id = StrategySyncStart(NULL, NULL);
 	num_to_scan = NBuffers;
 	num_written = 0;
-	while (num_to_scan--  0)
+
+	for (;;)
 	{
		volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];
 
+		if (num_to_scan == 0)
+		{
+			if (has_target)
+			{
+/*
+ * We have finished writing out buffers belonging to this
+ * relation segment. fsync it now.
+ */
+mdsync_seg(target_rnode, target_forkNum, target_segno);
+
+/* continue the outer scan where we left */
+num_to_scan = outer_num_to_scan;
+buf_id = outer_buf_id;
+has_target = false;
+			}
+			else
+break; /* all done! */
+		}
+
 		/*
 		 * We don't need to acquire the lock here, because we're only looking
 		 * at a single bit. It's possible that someone else writes the buffer
@@ -1299,7 +1325,36 @@ BufferSync(int flags)
 		 */
		if (bufHdr->flags & BM_CHECKPOINT_NEEDED)
 		{
-			if (SyncOneBuffer(buf_id, false) & BUF_WRITTEN)
+			RelFileNode this_rnode = bufHdr->tag.rnode;
+			ForkNumber	this_forkNum = bufHdr->tag.forkNum;
+			int			this_segno = bufHdr->tag.blockNum / (RELSEG_SIZE / BLCKSZ);
+			bool		skip = false;
+
+			if (has_target)
+			{
+if (!RelFileNodeEquals(this_rnode, target_rnode) ||
+	this_forkNum != target_forkNum ||
+	this_segno != target_segno)
+{
+	/*
+	 * This buffer belongs to another relation than the one
+	 * we're targeting right now. We'll revisit it later.
+	 */
+	skip = true;
+}
+			}
+			else
+			{
+target_rnode = this_rnode;
+target_forkNum = this_forkNum;
+target_segno = this_segno;
+has_target = true;
+/* remember where we left the outer scan */
+outer_buf_id = buf_id;
+outer_num_to_scan = num_to_scan;
+			}
+
+			if (!skip && SyncOneBuffer(buf_id, false) & BUF_WRITTEN)
 			{
 TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
 BgWriterStats.m_buf_written_checkpoints++;
@@ -1328,6 +1383,7 @@ BufferSync(int flags)
 
 		if (++buf_id = NBuffers)
 			buf_id = 0;
+		num_to_scan--;
 	}
 
 	/*
diff --git a/src/backend/storage/smgr/md.c 

Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-27 Thread Andres Freund
On 2014-08-27 19:23:04 +0300, Heikki Linnakangas wrote:
 On 08/27/2014 04:20 PM, Andres Freund wrote:
 On 2014-08-27 10:17:06 -0300, Claudio Freire wrote:
 I think a somewhat smarter version of the explicit flushes in the
 hack^Wpatch I posted nearby is going to more likely to be successful.
 
 
 That path is dangerous (as in, may not work as intended) if the
 filesystem doesn't properly understand range flushes (ehem, like
 ext3).
 
 The sync_file_range(SYNC_FILE_RANGE_WRITE) I used isn't a operation
 guaranteeing durability. And - afaik - not implemented in a file system
 specific manner. It just initiates writeback for individual pages. It
 doesn't cause barrier, journal flushes or anything to be issued. That's
 still done by the fsync() later.
 
 The big disadvantage is that it's a OS specific solution, but I don't
 think we're going to find anything that isn't in this area.
 
 I've been thinking for a long time that we should interleave the writes and
 the fsyncs. That still forces up to 1GB of dirty buffers to disk at once,
 causing a spike, but at least not more than that.

I think there are considerable benefits to sorting checkpoint io by file
and offset in that file. Obviously the likelihood of sequential IO is
higher; but there's also less chance that other processes write out
dirty buffers that have to be flushed out by the fsync() in a drive by
manner.
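
For concreteness, a sketch of that ordering; the struct here is a stand-in
for the relevant buffer tag fields, not the real BufferTag:

#include <stdint.h>
#include <stdlib.h>

typedef struct CkptSortItem
{
	uint32_t	spcNode;	/* tablespace */
	uint32_t	dbNode;		/* database */
	uint32_t	relNode;	/* relation file */
	int			forkNum;
	uint32_t	blockNum;
	int			buf_id;		/* index into the buffer pool */
} CkptSortItem;

/*
 * Sort by file first, then by offset within the file, so that all writes
 * to one segment are issued together and its fsync() mostly sees data that
 * was already handed to the kernel in sequential order.
 */
static int
ckpt_buforder_comparator(const void *pa, const void *pb)
{
	const CkptSortItem *a = pa;
	const CkptSortItem *b = pb;

	if (a->spcNode != b->spcNode)
		return (a->spcNode < b->spcNode) ? -1 : 1;
	if (a->dbNode != b->dbNode)
		return (a->dbNode < b->dbNode) ? -1 : 1;
	if (a->relNode != b->relNode)
		return (a->relNode < b->relNode) ? -1 : 1;
	if (a->forkNum != b->forkNum)
		return (a->forkNum < b->forkNum) ? -1 : 1;
	if (a->blockNum != b->blockNum)
		return (a->blockNum < b->blockNum) ? -1 : 1;
	return 0;
}

int main(void)
{
	CkptSortItem items[2] = {
		{1663, 16384, 16400, 0, 42, 7},
		{1663, 16384, 16397, 0, 3, 2},
	};

	qsort(items, 2, sizeof(CkptSortItem), ckpt_buforder_comparator);
	/* items[0] now refers to relation file 16397, block 3 */
	return 0;
}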

I don't think it's good enough to solve the problem Fabien is talking
about though. 1GB is a heck of a lot of IO to submit at once. That'll
cause latency issues unless you have a write back controller with more
than 1GB of cache.

So I think we need both, control over the amount of dirty data in the
kernel *and* sorted writeouts. To the point that I've been tinkering
with converting buf_table.c into a radix tree. That'd allow us to
efficiently scan all buffers of a filenode in order. Useful for
checkpoints, but also for lots of other things. Unfortunately our buffer
tags are freakishly huge, making the worst case memory requirements and
the depth of tree quite bad.

 A long time ago, Itagaki Takahiro wrote a patch sort the buffers and write
 them out in order 
 (http://www.postgresql.org/message-id/flat/20070614153758.6a62.itagaki.takah...@oss.ntt.co.jp).
 The performance impact of that was inconclusive, but one thing that it
 allows nicely is to interleave the fsyncs, so that you write all the buffers
 for one file, then fsync it, then next file and so on. IIRC the biggest
 worry with that patch was that sorting the buffers requires a fairly large
 amount of memory, and making a large allocation in the checkpointer might
 cause an out-of-memory, which would be bad.
 
 I don't think anyone's seriously worked on this area since. If the impact on
 responsiveness or performance is significant, I'm pretty sure the OOM
 problem could be alleviated somehow.

It's a major problem imo.

What I'd been thinking of is to checkpoint writeout in batches. Collect
100k buffers, sort them, write them out. Go to the next 100k.
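
A rough, self-contained illustration of that batching shape; plain block
numbers and printf stand in for the buffer pool scan and the actual writes,
and BATCH_SIZE is tiny here where the real thing would use ~100k:

#include <stdio.h>
#include <stdlib.h>

#define BATCH_SIZE 4			/* stand-in for ~100k buffers */

static int
cmp_block(const void *a, const void *b)
{
	int		x = *(const int *) a;
	int		y = *(const int *) b;

	return (x > y) - (x < y);
}

int main(void)
{
	/* dirty blocks in the order a buffer pool scan might find them */
	int		dirty[] = {42, 7, 13, 99, 3, 58, 21, 77, 5, 64};
	int		ndirty = sizeof(dirty) / sizeof(dirty[0]);

	for (int start = 0; start < ndirty; start += BATCH_SIZE)
	{
		int		n = ndirty - start < BATCH_SIZE ? ndirty - start : BATCH_SIZE;

		/* sort only this batch, so the memory needed stays bounded */
		qsort(dirty + start, n, sizeof(int), cmp_block);

		for (int i = 0; i < n; i++)	/* write the batch in sorted order */
			printf("write block %d\n", dirty[start + i]);

		/* here the real code would initiate writeback and pace itself */
	}
	return 0;
}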

 For the kicks, I wrote a quick  dirty patch for interleaving the fsyncs,
 see attached. It works by repeatedly scanning the buffer pool, writing
 buffers belonging to a single relation segment at a time. I would be
 interested to hear how this performs in your test case.

I bet it's not fundamentally changing the amount of transactions that
don't make the deadline - there's more than enough dirty buffers in one
1GB segment to cause issues. But I think it might already be a
significant benefit for peak throughput *and* latency if you combine it
with my approach of initiating writeout to disk during the whole
sync. My patch can slow things down considerably in the worst case by
causing superflous random IO, which your patch should alleviate.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-27 Thread Fabien COELHO



off:

$ pgbench -p 5440 -h /tmp postgres -M prepared -c 16 -j16 -T 120 -R 180 -L 200
number of skipped transactions: 1345 (6.246 %)

on:

$ pgbench -p 5440 -h /tmp postgres -M prepared -c 16 -j16 -T 120 -R 180 -L 200
number of skipped transactions: 1 (0.005 %)



That machine is far from idle right now, so the noise is pretty high.


What is the OS and FS? Could it be XFS?


But rather nice initial results.


Indeed, I can confirm:

I did 5000s 25tps tests:
 - Off: 8002 transactions lost (6.3%)
 - On: 158 transactions lost (0.12%).

Although it is still 13 times larger than the 12 (0.01%) lost with my 
every 0.2s CHECKPOINT hack, it is nevertheless much much better than 
before!


The bad news: under an unthrottled pgbench load, the tps is divided by 2 (300 
-> 150, could have been worse), *BUT* it is also much smoother, the tps is 
not going to 0 but stays in the 50-100 range before the next spike.


I'm wondering about the order of operations. It seems to me that you sync 
just after giving back a buffer. Maybe it would be better to pipeline it, 
that is something like:


  round 0:
send buffers 0
sleep?

  round N:
sync buffers N-1
send buffers N
sleep?

  final N sync:
sync buffer N

I have not found how to control the checkpoint pacing interval, if there 
is such a thing. With a 200ms lag limit on pgbench, it would be nice if it 
is less than 200ms.


I found this old thread "Add basic checkpoint sync spreading" by Greg 
Smith and Simon Riggs, dating from 2010: 
http://www.postgresql.org/message-id/4ce07548.4030...@2ndquadrant.com 
https://commitfest.postgresql.org/action/patch_view?id=431 which ended up 
returned with feedback.


--
Fabien.




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-27 Thread Andres Freund
On 2014-08-27 19:00:12 +0200, Fabien COELHO wrote:
 
 off:
 
 $ pgbench -p 5440 -h /tmp postgres -M prepared -c 16 -j16 -T 120 -R 180 -L 
 200
 number of skipped transactions: 1345 (6.246 %)
 
 on:
 
 $ pgbench -p 5440 -h /tmp postgres -M prepared -c 16 -j16 -T 120 -R 180 -L 
 200
 number of skipped transactions: 1 (0.005 %)
 
 That machine is far from idle right now, so the noise is pretty high.
 
 What is the OS and FS? Could it be XFS?

That's linux v3.17-rc2 + ext4.

 But rather nice initial results.
 
 Indeed, I can confirm:
 
 I did 5000s 25tps tests:
  - Off: 8002 transactions lost (6.3%)
  - On: 158 transactions lost (0.12%).
 
 Although it is still 13 times larger than the 12 (0.01%) lost with my every
 0.2s CHECKPOINT hack, it is nevertheless much much better than before!
 
 The bad news, under pgbench unthrottled load, the tps is divided by 2 (300
 - 150, could have been worse), *BUT* is also much smoother, the tps is not
 going to 0, but stay in 50-100 range before the next spike.

Yea, I'm not surprised. With a sensible (aka larger) checkpoint_timeout
the performance penalty isn't that big, but it's there. That's why I
think (as mentioned to Heikki nearby) it needs to be combined with
sorting during the checkpoint phase.

 I'm wondering about he order of operations. It seems to me that you sync
 just after giving back a buffer.

Yep. Was just a rather quick patch...

 Maybe it would be better to pipeline it,
 that is something like:
 
   round 0:
 send buffers 0
 sleep?
 
   round N:
 sync buffers N-1
 send buffers N
 sleep?
 
   final N sync:
 sync buffer N

Yes, I think we're going to need to leave a bit more room for write
combining and such here. But I think it's going to be better to issue
flushes for several buffers together - just not after each write(). To
be really beneficial it needs sorted output though.

 I have not found how to control the checkpoint pacing interval, if there is
 such a thing. With a 200ms lag limit on pgbench, it would be nice if it is
 less than 200ms.

Not sure what you mean.

 I found this old thread Add basic checkpoint sync spreading by Greg Smith
 and Simons Riggs, dating from 2010:
 http://www.postgresql.org/message-id/4ce07548.4030...@2ndquadrant.com
 https://commitfest.postgresql.org/action/patch_view?id=431 which ends up
 returned with feedback.

I didn't really like the unapplied remainder of what was proposed in
there.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-26 Thread Fabien COELHO


Hello Josh,


So I think that you're confusing the roles of bgwriter vs. spread
checkpoint.  What you're experiencing above is pretty common for
nonspread checkpoints on slow storage (and RAID5 is slow for DB updates,
no matter how fast the disks are), or for attempts to do spread
checkpoint on filesystems which don't support it (e.g. Ext3, HFS+).  In
either case, what's happening is that the *OS* is freezing all logical
and physical IO while it works to write out all of RAM, which makes me
suspect you're using Ext3 or HFS+.


I'm using ext4 on debian wheezy with postgresql 9.4b2.

I agree that the OS may be able to help, but this aspect does not 
currently work for me at all out of the box. The all of RAM is really a 
few thousands 8 kB pages written randomly, a few dozen MB.


Also, if pg needs advanced OS tweaking to handle a small load, ISTM that 
it fails at simplicity:-(


As for checkpoint spreading, raising checkpoint_completion_target to 0.9 
degrades the situation (20% of transactions are more than 200 ms late 
instead of 10%, bgwriter wrote less than 1 page per second, on a 500s 
run). So maybe there is a bug here somewhere.



Making the bgwriter more aggressive adds a significant risk of writing
the same pages multiple times between checkpoints, so it's not a simple fix.


Hmmm... This must be balanced with the risk of being offline. Not all 
people are interested in throughput at the price of latency, so there 
could be settings that help latency, even at the price of reducing 
throughput (average tps). After that, it is the administrator choice to 
set pg for higher throughput or lower latency.


Note that writing some least recently used page multiple times does not 
seems to be any issue at all for me under small/medium load, especially as 
the system has nothing else to do: if you have nothing else to do, there 
is no cost in writing a page, even if you may have to write it again some 
time later, and it helps prevent dirty pages accumulation. So it seems to 
me that pg can help, it is not only/merely an OS issue.


--
Fabien.




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-26 Thread Jeff Janes
On Monday, August 25, 2014, Fabien COELHO coe...@cri.ensmp.fr wrote:


 I have not found any mean to force bgwriter to send writes when it can.
 (Well, I have: create a process which sends CHECKPOINT every 0.2
 seconds... it works more or less, but this is not my point:-)


There is scan_whole_pool_milliseconds, which currently forces bgwriter to
circle the buffer pool at least once every 2 minutes.  It is currently
fixed, but it should be trivial to turn it into an experimental guc that
you could use to test your hypothesis.

Cheers,

Jeff


Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-26 Thread Fabien COELHO


[oops, wrong from, resent...]

Hello Jeff,


The culprit I found is bgwriter, which is basically doing nothing to
prevent the coming checkpoint IO storm, even though there would be ample
time to write the accumulating dirty pages so that checkpoint would find a
clean field and pass in a blink. Indeed, at the end of the 500 seconds
throttled test, pg_stat_bgwriter says:


Are you doing pg_stat_reset_shared('bgwriter') after running pgbench -i?


Yes, I did.


You don't want your steady state stats polluted by the bulk load.


Sure!


  buffers_checkpoint = 19046
  buffers_clean = 2995


Out of curiosity, what does buffers_backend show?


buffers_backend = 157


In any event, this almost certainly is a red herring.


Possibly. It is pretty easy to reproduce, though.

Whichever of the three ways are being used to write out the buffers, it 
is the checkpointer that is responsible for fsyncing them, and that is 
where your drama is almost certainly occurring. Writing out with one 
path rather than a different isn't going to change things, unless you 
change the fsync.


Well, I agree partially. ISTM that the OS does not need to wait for fsync 
to start writing pages if it is receiving one minute of buffer writes at 
50 writes per second; I would have thought that some scheduler should 
start handling the flow before fsync... So I thought that if bgwriter was 
to write the buffers it would help, but maybe there is a better way.



Also, are you familiar with checkpoint_completion_target, and what is it
set to?


The default 0.5. Moving to 0.9 seems to worsen the situation.

--
Fabien.




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-26 Thread Fabien COELHO


Hello Amit,


I think another thing to know here is why exactly checkpoint
storm is causing tps to drop so steeply.


Yep. Actually it is not strictly 0, but a few tps that I rounded to 0.

  progress: 63.0 s, 47.0 tps, lat 2.810 ms stddev 5.194, lag 0.354 ms
  progress: 64.1 s, 11.9 tps, lat 81.481 ms stddev 218.026, lag 74.172 ms
  progress: 65.2 s, 1.9 tps, lat 950.455 ms stddev 125.602, lag 1421.203 ms
  progress: 66.1 s, 4.5 tps, lat 604.790 ms stddev 440.073, lag 2418.128 ms
  progress: 67.2 s, 6.0 tps, lat 322.440 ms stddev 68.276, lag 3146.302 ms
  progress: 68.0 s, 2.4 tps, lat 759.509 ms stddev 62.141, lag 4229.239 ms
  progress: 69.4 s, 3.6 tps, lat 440.335 ms stddev 369.207, lag 4842.371 ms

Transactions are 4.8 seconds behind schedule at this point.

One reason could be that backends might need to write more WAL due 
Full_Page_Writes, another could be contention around buffer 
content_lock. To dig more about the reason, the same tests can be tried 
by making Full_Page_Writes = off and/or synchronous_commit = off to see 
if WAL writes are causing tps to go down.


Given the small flow of updates, I do not think that there should be a 
reason to get that big a write contention between WAL & checkpoint.


I tried with full_page_writes = off for 500 seconds: same overall 
behavior, 8.5% of transactions are stuck (instead of 10%). However in 
details pg_stat_bgwriter is quite different:


  buffers_checkpoint = 13906
  buffers_clean = 20748
  buffers_backend = 472

That seems to suggest that bgwriter did some stuff for once, but that did 
not change much the result in the end. This would imply that my suggestion 
to make bgwriter write more would not fix the problem alone.


With synchronous_commit = off, the situation is much improved, with only 
0.3% of transactions stuck. Not a surprise. However, I would not recommend 
that as a solution:-)


Currently, the only way I was able to solve the issue while still 
writing to disk is to send CHECKPOINT every 0.2s, as if I had set 
checkpoint_timeout = 0.2s (although this is not currently allowed).



Similarly for checkpoints, use  checkpoint_completion_target to
spread the checkpoint_writes as suggested by Jeff as well to see
if that can mitigate the problem.


I had already tried, and retried after Jeff's suggestion. This does not seem 
to mitigate anything, on the contrary.


--
Fabien.




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-26 Thread Fabien COELHO


Hello again,


I have not found any mean to force bgwriter to send writes when it can.
(Well, I have: create a process which sends CHECKPOINT every 0.2
seconds... it works more or less, but this is not my point:-)


There is scan_whole_pool_milliseconds, which currently forces bgwriter to
circle the buffer pool at least once every 2 minutes.  It is currently
fixed, but it should be trivial to turn it into an experimental guc that
you could use to test your hypothesis.


I recompiled with the variable coldly set to 1000 instead of 120000. The 
situation is slightly degraded (15% of transactions were above 200 ms 
late). However it seems that bgwriter did not write much more pages:


  buffers_checkpoint = 26065
  buffers_clean = 5263
  buffers_backend = 367

Or I may have a problem interpreting pg_stat_bgwriter.

It seems that changing this value is not enough to persuade bgwriter to 
write more pages. Or I may have done something wrong, but I do not know 
what.


--
Fabien.




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-26 Thread Andres Freund
On 2014-08-26 08:12:48 +0200, Fabien COELHO wrote:
 As for checkpoint spreading, raising checkpoint_completion_target to 0.9
 degrades the situation (20% of transactions are more than 200 ms late
 instead of 10%, bgwriter wrote less that 1 page per second, on on 500s run).
 So maybe there is a bug here somewhere.

What are the other settings here? checkpoint_segments,
checkpoint_timeout, wal_buffers?

Could you show the output of log_checkpoints during that run? Checkpoint
spreading only works halfway efficiently if all checkpoints are
triggered by time and not by xlog.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-26 Thread Fabien COELHO


Hello Andres,


checkpoint when the segments are full... the server is unresponsive about
10% of the time (one in ten transaction is late by more than 200 ms).


That's ext4 I guess?


Yes!


Did you check whether xfs yields a, err, more predictable performance?


No. I cannot test that easily without reinstalling the box. I did some 
quick tests with ZFS/FreeBSD which seemed to freeze the same, but not in 
the very same conditions. Maybe I could try again.


[...] Note that it would *not* be a good idea to make the bgwriter write 
out everything, as much as possible - that'd turn sequential write io 
into random write io.


Hmmm. I'm not sure it would necessarily be the case; it depends on how 
bgwriter would choose the pages to write? If they are chosen randomly then 
indeed that could be bad. If there is a big sequential write, should not 
the backend do the write directly anyway? ISTM that currently checkpoint 
is mostly random writes anyway, at least with the OLTP write load of 
pgbench. I'm just trying to be able to start them earlier so that they can 
be completed quickly.


So although bgwriter is not the solution, ISTM that pg has no reason to 
wait for minutes before starting to write dirty pages, if it has nothing 
else to do. If the OS does some retention later and cannot spread the 
load, as Josh suggests, this could also be a problem, but currently the OS 
seems not to have much to write (except WAL) till the checkpoint.


--
Fabien.




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-26 Thread Andres Freund
On 2014-08-26 10:25:29 +0200, Fabien COELHO wrote:
 Did you check whether xfs yields a, err, more predictable performance?
 
 No. I cannot test that easily without reinstalling the box. I did some quick
 tests with ZFS/FreeBSD which seemed to freeze the same, but not in the very
 same conditions. Maybe I could try again.

After Robert and I went to LSF/MM this spring I sent a test program for
precisely this problem and while it could *crash* machines when using
ext4, xfs yielded much more predictable performance. There's a problem
with prioritization of write vs read IO that's apparently FS dependent.

 [...] Note that it would *not* be a good idea to make the bgwriter write
 out everything, as much as possible - that'd turn sequential write io into
 random write io.
 
 Hmmm. I'm not sure it would be necessary the case, it depends on how
 bgwriter would choose the pages to write? If they are chosen randomly then
 indeed that could be bad.

They essentially have to be random to fulfil its role of reducing the
likelihood of a backend having to write out a buffer itself. Consider
how the clock sweep algorithm (not that I am happy with it) works. When
looking for a new victim buffer all backends scan the buffer cache in
one continuous cycle. If they find a buffer with a usagecount==0 they'll
use that one and throw away its contents. Otherwise they reduce
usagecount by 1 and move on. What the bgwriter *tries* to do is to write
out buffers with usagecount==0 that are dirty and will soon be visited
in the clock cycle. To avoid having the backends do that.
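
A compact sketch of that sweep, with usage counts in a plain array; the
real code also pins, locks and has to deal with dirty buffers, all of
which is omitted here:

#include <stdio.h>

#define NBUFFERS 8

static int	usage_count[NBUFFERS] = {3, 0, 1, 2, 0, 5, 1, 0};
static int	next_victim = 0;

/*
 * One clock sweep: keep advancing around the pool, decrementing non-zero
 * usage counts, until a buffer with usagecount == 0 is found and returned
 * as the eviction victim.
 */
static int
clock_sweep(void)
{
	for (;;)
	{
		int		buf = next_victim;

		next_victim = (next_victim + 1) % NBUFFERS;

		if (usage_count[buf] == 0)
			return buf;			/* victim: evict and reuse this buffer */

		usage_count[buf]--;		/* give it one more lap before eviction */
	}
}

int main(void)
{
	printf("victim buffer: %d\n", clock_sweep());
	printf("victim buffer: %d\n", clock_sweep());
	return 0;
}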

 If there is a big sequential write, should not the
 backend do the write directly anyway? ISTM that currently checkpoint is
 mostly random writes anyway, at least with the OLTP write load of pgbench.
 I'm just trying to be able to start them ealier so that they can be
 completed quickly.

If the IO scheduling worked - which it really doesn't in many cases -
there'd really be no need to make it finish fast. I think you should try
to tune spread checkpoints to have less impact, not make bgwriter do
something it's not written for.

 So although bgwriter is not the solution, ISTM that pg has no reason to wait
 for minutes before starting to write dirty pages, if it has nothing else to
 do.

That precisely *IS* a spread checkpoint.

 If the OS does some retention later and cannot spread the load, as Josh
 suggest, this could also be a problem, but currently the OS seems not to
 have much to write (but WAL) till the checkpoint.

The actual problem is that the writes by the checkpointer - done in the
background - aren't flushed out eagerly enough out of the OS's page
cache. Then, when the final phase of the checkpoint comes, where
relation files need to be fsynced, some filesystems essentially stall
while trying to write out lots and lots of dirty buffers.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-26 Thread Fabien COELHO



What are the other settings here? checkpoint_segments,
checkpoint_timeout, wal_buffers?


They simply are the defaults:

  checkpoint_segments = 3
  checkpoint_timeout = 5min
  wal_buffers = -1

I did some tests with checkpoint_segments = 1, the problem is just more frequent 
but shorter. I also reduced wal_segsize down to 1MB, which also made it 
even more frequent but much shorter, so the overall result was an 
improvement with 5% to 3% of transactions lost instead of 10-14%, if I 
recall correctly. I have found no solution on this path.



Could you show the output of log_checkpoints during that run? Checkpoint
spreading only works halfway efficiently if all checkpoints are
triggered by time and not by xlog.


I do 500 seconds tests, so there could be at most 2 timeout triggered 
checkpoints. Given the write load it takes about 2 minutes to fill the 3 
16 MB buffers (8 kb * 50 tps (there is one page modified per transaction) 
* 120 s ~ 48 MB), so checkpoints are triggered by xlog. The maths are 
consistent with logs (not sure which prooves which, though:-):


  LOG:  received SIGHUP, reloading configuration files
  LOG:  parameter log_checkpoints changed to on
  LOG:  checkpoint starting: xlog
  LOG:  checkpoint complete: wrote 5713 buffers (34.9%); 0 transaction log
file(s) added, 0 removed, 0 recycled; write=51.449 s, sync=4.857 s,
total=56.485 s; sync files=12, longest=2.160 s, average=0.404 s
  LOG:  checkpoint starting: xlog
  LOG:  checkpoint complete: wrote 6235 buffers (38.1%); 0 transaction log
file(s) added, 0 removed, 3 recycled; write=53.500 s, sync=5.102 s,
total=58.670 s; sync files=8, longest=2.689 s, average=0.637 s
  LOG:  checkpoint starting: xlog
  LOG:  checkpoint complete: wrote 6250 buffers (38.1%); 0 transaction log
file(s) added, 0 removed, 3 recycled; write=53.888 s, sync=4.504 s,
total=58.495 s; sync files=8, longest=2.627 s, average=0.563 s
  LOG:  checkpoint starting: xlog
  LOG: checkpoint complete: wrote 6148 buffers (37.5%); 0 transaction log
file(s) added, 0 removed, 3 recycled; write=53.313 s, sync=6.437 s,
total=59.834 s; sync files=8, longest=3.680 s, average=0.804 s
  LOG:  checkpoint starting: xlog
  LOG:  checkpoint complete: wrote 6240 buffers (38.1%); 0 transaction log
file(s) added, 0 removed, 3 recycled; write=149.008 s, sync=5.448 s,
total=154.566 s; sync files=9, longest=3.788 s, average=0.605 s

Note that my current effective solution is to do as if 
checkpoint_timeout = 0.2s: it works fine if I do my own spreading.


--
Fabien.




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-26 Thread Andres Freund
On 2014-08-26 10:49:31 +0200, Fabien COELHO wrote:
 
 What are the other settings here? checkpoint_segments,
 checkpoint_timeout, wal_buffers?
 
 They simply are the defaults:
 
   checkpoint_segments = 3
   checkpoint_timeout = 5min
   wal_buffers = -1
 
 I did some test checkpoint_segments = 1, the problem is just more frequent
 but shorter. I also reduced wal_segsize down to 1MB, which also made it even
 more frequent but much shorter, so the overall result was an improvement
 with 5% to 3% of transactions lost instead of 10-14%, if I recall correctly.
 I have found no solution on this path.

Uh. I'm not surprised you're facing utterly horrible performance with
this. Did you try using a *large* checkpoint_segments setting? To
achieve high performance you likely will have to make checkpoint_timeout
*longer* and increase checkpoint_segments until *all* checkpoints are
started because of time.

There's three reasons:
a) if checkpoint_timeout + completion_target is large and the checkpoint
isn't executed prematurely, most of the dirty data has been written out
by the kernel's background flush processes.

b) The amount of WAL written with less frequent checkpoints is often
*significantly* lower because fewer full page writes need to be
done. I've seen production reduction of *more* than a factor of 4.

c) If checkpoints are infrequent enough, the penalty of them causing
problems, especially if not using ext4, plays less of a role overall.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-26 Thread Fabien COELHO



Uh. I'm not surprised you're facing utterly horrible performance with
this. Did you try using a *large* checkpoints_segments setting? To
achieve high performance


I do not seek high performance per se, I seek lower maximum latency.

I think that the current settings and parameters are designed for high 
throughput, but do not allow to control the latency even with a small 
load.


you likely will have to make checkpoint_timeout *longer* and increase 
checkpoint_segments until *all* checkpoints are started because of 
time.


Well, I want to test a *small* load in a *reasonable* time, so I did 
not enlarge the number of segments, otherwise it would take ages.


If I put a checkpoint_timeout = 1min and checkpoint_completion_target = 
0.9 so that the checkpoints are triggered by the timeout,


  LOG:  checkpoint starting: time
  LOG:  checkpoint complete: wrote 4476 buffers (27.3%); 0 transaction log
file(s) added, 0 removed, 0 recycled; write=53.645 s, sync=5.127 s,
total=58.927 s; sync files=12, longest=2.890 s, average=0.427 s
  ...

The result is basically the same (well, 18% of transactions lost, but the 
results do not seem to be stable one run after the other), only there are 
more checkpoints.


I fail to understand how multiplying both the segments and time would 
solve the latency problem. If I set 30 segments then it takes 20 minutes 
to fill them, and if I set the timeout to 15min then I'll have to wait for 15 
minutes to test.



There's three reasons:
a) if checkpoint_timeout + completion_target is large and the checkpoint
isn't executed prematurely, most of the dirty data has been written out
by the kernel's background flush processes.


Why would they be written by the kernel if bgwriter has not sent them??


b) The amount of WAL written with less frequent checkpoints is often
*significantly* lower because fewer full page writes need to be
done. I've seen production reductions of *more* than a factor of 4.


Sure, I understand that, but ISTM that this test does not exercise this 
issue: the load is small, so the full page writes do not matter much.



c) If checkpoints are infrequent enough, the penalty of them causing
problems, especially if not using ext4, plays less of a role overall.


I think that what you suggest would only delay the issue, not solve it.

I'll try to run a long test.

--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-26 Thread Andres Freund
On 2014-08-26 11:34:36 +0200, Fabien COELHO wrote:
 
 Uh. I'm not surprised you're facing utterly horrible performance with
 this. Did you try using a *large* checkpoint_segments setting? To
 achieve high performance
 
 I do not seek high performance per se, I seek lower maximum latency.

So?

 I think that the current settings and parameters are designed for high
 throughput, but do not allow controlling the latency even with a small load.

The way you're setting them is tuned for 'basically no write
activity'.

 you likely will have to make checkpoint_timeout *longer* and increase
 checkpoint_segments until *all* checkpoints are started because of time.
 
 Well, as I want to test a *small* load in a *reasonable* time, I did not
 enlarge the number of segments; otherwise it would take ages.

Well, that way you're testing something basically meaningless. That's
not helpful either.

 If I put a checkpoint_timeout = 1min and checkpoint_completion_target =
 0.9 so that the checkpoints are triggered by the timeout,
 
   LOG:  checkpoint starting: time
   LOG:  checkpoint complete: wrote 4476 buffers (27.3%); 0 transaction log
 file(s) added, 0 removed, 0 recycled; write=53.645 s, sync=5.127 s,
 total=58.927 s; sync files=12, longest=2.890 s, average=0.427 s
   ...

 The result is basically the same (well, 18% of transactions lost, but the results
 do not seem to be stable from one run to the next), only there are more
 checkpoints.

With these settings you're fsyncing the entire data directory once a
minute. Nearly entirely from the OS's buffer cache, because the OS's
writeback logic didn't have time to kick in.
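
For scale (assuming the default 8 kB block size): the 4476 buffers in the
log above are about 4476 * 8 kB = 35 MB of dirty data handed to the kernel
and then fsync()ed within the same minute, nearly all of it still sitting
in the OS cache when the fsync()s hit.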

 I fail to understand how multiplying both the segments and time would solve
 the latency problem. If I set 30 segments then it takes 20 minutes to fill
 them, and if I set the timeout to 15min then I'll have to wait for 15 minutes to
 test.

a) The kernel's writeback logic only kicks in with delay. b) The amount
of writes you're doing with short checkpoint intervals is overall
significantly higher than with longer intervals. That obviously has an
impact on latency as well as throughput. c) The time it takes for
segments to be filled is mostly irrelevant. The phase that's very likely
causing trouble for you is the fsyncs issued at the end of a
checkpoint.
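
(As a purely illustrative aside on a): on Linux that delay is governed by
the vm.dirty_* sysctls, e.g. with commonly seen defaults of

  vm.dirty_writeback_centisecs = 500   # flusher threads wake every 5 s
  vm.dirty_expire_centisecs = 3000     # dirty pages become eligible after ~30 s

dirty data can sit in the page cache for tens of seconds before background
writeback even starts; the exact defaults vary by kernel and distribution.)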

 There's three reasons:
 a) if checkpoint_timeout + completion_target is large and the checkpoint
 isn't executed prematurely, most of the dirty data has been written out
 by the kernel's background flush processes.
 
 Why would they be written by the kernel if bgwriter has not sent them??

I think you're misunderstanding how spread checkpoints work. When the
checkpointer process starts a spread checkpoint, it first writes all
buffers to the kernel in a paced manner. That pace is determined by
checkpoint_completion_target and checkpoint_timeout. Once all buffers
that are old enough to need to be checkpointed have been written out, the
checkpointer fsync()s all the on-disk files. That part is *NOT*
paced. Then it can go on to remove old WAL files.
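
To put numbers on it with the settings you quoted: checkpoint_timeout = 1min
and checkpoint_completion_target = 0.9 pace the write phase to finish after
roughly 0.9 * 60 s = 54 s, which matches the write=53.645 s in your log; the
sync=5.127 s that follows is the unpaced fsync phase.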

The latency problem is almost guaranteed to be created by the fsync()s
mentioned above. When they're executed the kernel starts flushing out a
lot of dirty buffers at once, creating very deep IO queues which make
it take a long time to process synchronous additions (WAL flushes, reads) to
that queue.

 c) If checkpoints are infrequent enough, the penalty of them causing
 problems, especially if not using ext4, plays less of a role overall.
 
 I think that what you suggest would only delay the issue, not solve it.

The amount of dirty data that needs to be flushed is essentially
bounded. If you have stalls of roughly the same magnitude (say, within a
factor of two of each other), the smaller one once a minute and the larger
one once an hour, then obviously the once-an-hour one will give better
latency in many, many more transactions.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-26 Thread Fabien COELHO


Hello Jeff,


The culprit I found is bgwriter, which is basically doing nothing to
prevent the coming checkpoint IO storm, even though there would be ample
time to write the accumulating dirty pages so that checkpoint would find a
clean field and pass in a blink. Indeed, at the end of the 500 seconds
throttled test, pg_stat_bgwriter says:


Are you doing pg_stat_reset_shared('bgwriter') after running pgbench -i?


Yes, I did.


You don't want your steady state stats polluted by the bulk load.


Sure!


  buffers_checkpoint = 19046
  buffers_clean = 2995


Out of curiosity, what does buffers_backend show?


   buffers_backend = 157


In any event, this almost certainly is a red herring.


Possibly! It is pretty easy to reproduce... can anyone try?

Whichever of the three ways is being used to write out the buffers, it 
is the checkpointer that is responsible for fsyncing them, and that is 
where your drama is almost certainly occurring. Writing out with one 
path rather than another isn't going to change things, unless you 
change the fsync.


Well, ISTM that the OS does not need to wait for fsync to start writing 
pages if it has received one minute of buffer writes at 50 writes per 
second; some scheduler should start handling the flow somewhere... So if 
bgwriter were to write the buffers it would help, but maybe there is a 
better way.
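
(Back-of-envelope, assuming each transaction dirties roughly one 8 kB page:
50 * 8 kB = 400 kB/s, or about 24 MB per minute of dirty data, a small and
steady flow compared to having it all pushed out at fsync time.)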



Also, are you familiar with checkpoint_completion_target, and what is it
set to?


The default 0.5. Moving to 0.9 seems to worsen the situation.

--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-26 Thread Amit Kapila
On Tue, Aug 26, 2014 at 12:53 PM, Fabien COELHO coe...@cri.ensmp.fr wrote:

 Given the small flow of updates, I do not think that there should be any
reason to get that much write contention between WAL & checkpoint.

 I tried with full_page_writes = off for 500 seconds: same overall
behavior, 8.5% of transactions are stuck (instead of 10%). However, in
detail pg_stat_bgwriter is quite different:

   buffers_checkpoint = 13906
   buffers_clean = 20748
   buffers_backend = 472

 That seems to suggest that bgwriter did some stuff for once, but that did
not change the result much in the end. This would imply that my suggestion
 to make bgwriter write more would not fix the problem alone.

I think the reason could be that in most cases bgwriter
passes the sync responsibility to the checkpointer rather than
doing it itself, which causes the IO storm during checkpoint;
another thing is that it will not even proceed to write
a dirty buffer unless the refcount and usage_count of the
buffer are zero.  I see there is some merit in your point, which
is to make bgwriter more useful than its current form.
I can see 3 top-level points to think about, where improvement
in any of them could improve the current situation:
a. Scanning of the buffer pool to find the dirty buffers that can
be flushed.
b. Deciding on the criteria for flushing a buffer.
c. Syncing of buffers.
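
For reference, the existing knobs that touch a. and b. are roughly these
(shown with their default values):
  bgwriter_delay = 200ms              # how often the bgwriter wakes up to scan
  bgwriter_lru_maxpages = 100         # max buffers it will write per round
  bgwriter_lru_multiplier = 2.0       # how far ahead of recent demand it writes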



 With synchronous_commit = off, the situation is much improved, with
only 0.3% of transactions stuck. Not a surprise. However, I would not
 recommend that as a solution :-)

Yeah, actually it was just to test what the actual problem is.
I will also not recommend such a solution; however, it at least gives
us the indication that, due to the IO during checkpoint, backends
are not even able to flush comparatively small WAL data.
How about keeping WAL (pg_xlog) on a separate filesystem,
maybe via creating a symlink or, if possible, with the -X option of initdb?
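
For example (the paths are placeholders only), either at initdb time:
  initdb -D /srv/pgdata -X /mnt/otherdisk/pg_xlog
or for an existing cluster, with the server stopped:
  mv /srv/pgdata/pg_xlog /mnt/otherdisk/pg_xlog
  ln -s /mnt/otherdisk/pg_xlog /srv/pgdata/pg_xlog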


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-25 Thread Josh Berkus
On 08/25/2014 01:23 PM, Fabien COELHO wrote:
 
 Hello pgdevs,
 
 I've been playing with pg for some time now to try to reduce the maximum
 latency of simple requests, to have a responsive server under small to
 medium load.
 
 On an old computer with a software RAID5 HDD attached, pgbench simple
 update script run for some time (scale 100, fillfactor 95)
 
 pgbench -M prepared -N -c 2 -T 500 -P 1 ...
 
 gives 300 tps. However this performance is really +1000 tps for a few
 seconds followed by 16 seconds at about 0 tps for the checkpoint induced
 IO storm. The server is totally unresponsive 75% of the time. That's
 bandwidth optimization for you. Hmmm... why not.

So I think that you're confusing the roles of bgwriter vs. spread
checkpoint.  What you're experiencing above is pretty common for
nonspread checkpoints on slow storage (and RAID5 is slow for DB updates,
no matter how fast the disks are), or for attempts to do spread
checkpoint on filesystems which don't support it (e.g. Ext3, HFS+).  In
either case, what's happening is that the *OS* is freezing all logical
and physical IO while it works to write out all of RAM, which makes me
suspect you're using Ext3 or HFS+.

Making the bgwriter more aggressive adds a significant risk of writing
the same pages multiple times between checkpoints, so it's not a simple fix.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-25 Thread Andres Freund
Hi,

On 2014-08-25 22:23:40 +0200, Fabien COELHO wrote:
 seconds followed by 16 seconds at about 0 tps for the checkpoint induced IO
 storm. The server is totally unresponsive 75% of the time. That's bandwidth
 optimization for you. Hmmm... why not.
 
 Now, given this setup, if pgbench is throttled at 50 tps (1/6 the above
 max):
 
   pgbench -M prepared -N -c 2 -R 50.0 -T 500 -P 1 ...
 
 The same thing more or less happens in a delayed fashion... You get 50 tps
 for some time, followed by sections of 15 seconds at 0 tps for the
 checkpoint when the segments are full... the server is unresponsive about
 10% of the time (one in ten transactions is late by more than 200 ms).

That's ext4 I guess? Did you check whether xfs yields a, err, more
predictable performance?

 It is not satisfying, pg should be able to handle that load easily.
 
 The culprit I found is bgwriter, which is basically doing nothing to
 prevent the coming checkpoint IO storm, even though there would be ample
 time to write the accumulating dirty pages so that checkpoint would find a
 clean field and pass in a blink.

While I agree that the current bgwriter implementation is far from good,
note that this isn't the bgwriter's job. Its goal is to prevent backends
from having to write out buffers themselves, i.e. to ensure that there are
clean victim buffers when shared_buffers < working set.

Note that it would *not* be a good idea to make the bgwriter write out
everything, as much as possible - that'd turn sequential write io into
random write io.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-25 Thread Jeff Janes
On Monday, August 25, 2014, Fabien COELHO coe...@cri.ensmp.fr wrote:



 The culprit I found is bgwriter, which is basically doing nothing to
 prevent the coming checkpoint IO storm, even though there would be ample
 time to write the accumulating dirty pages so that checkpoint would find a
 clean field and pass in a blink. Indeed, at the end of the 500 seconds
 throttled test, pg_stat_bgwriter says:


Are you doing pg_stat_reset_shared('bgwriter') after running pgbench -i?
 You don't want your steady state stats polluted by the bulk load.
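
Something along these lines, just to be explicit about what I mean:
  -- after pgbench -i, before the timed run
  SELECT pg_stat_reset_shared('bgwriter');
  -- after the run
  SELECT buffers_checkpoint, buffers_clean, buffers_backend
    FROM pg_stat_bgwriter;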



   buffers_checkpoint = 19046
   buffers_clean = 2995


Out of curiosity, what does buffers_backend show?

In any event, this almost certainly is a red herring.  Whichever of the
three ways is being used to write out the buffers, it is the checkpointer
that is responsible for fsyncing them, and that is where your drama is
almost certainly occurring. Writing out with one path rather than
another isn't going to change things, unless you change the fsync.

Also, are you familiar with checkpoint_completion_target, and what is it
set to?

Cheers,

Jeff


Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-25 Thread Amit Kapila
On Tue, Aug 26, 2014 at 1:53 AM, Fabien COELHO coe...@cri.ensmp.fr wrote:


 Hello pgdevs,

 I've been playing with pg for some time now to try to reduce the maximum
latency of simple requests, to have a responsive server under small to
medium load.

 On an old computer with a software RAID5 HDD attached, pgbench simple
update script run for some time (scale 100, fillfactor 95)

 pgbench -M prepared -N -c 2 -T 500 -P 1 ...

 gives 300 tps. However this performance is really +1000 tps for a few
seconds followed by 16 seconds at about 0 tps for the checkpoint induced IO
storm. The server is totally unresponsive 75% of the time. That's bandwidth
optimization for you. Hmmm... why not.

 Now, given this setup, if pgbench is throttled at 50 tps (1/6 the above
max):

 pgbench -M prepared -N -c 2 -R 50.0 -T 500 -P 1 ...

 The same thing more or less happens in a delayed fashion... You get 50
tps for some time, followed by sections of 15 seconds at 0 tps for the
checkpoint when the segments are full... the server is unresponsive about
10% of the time (one in ten transactions is late by more than 200 ms).

I think another thing to know here is why exactly the checkpoint
storm is causing tps to drop so steeply.  One reason could be
that backends might need to write more WAL due to full_page_writes;
another could be contention around the buffer content_lock.

To dig deeper into the reason, the same tests can be tried
with full_page_writes = off and/or
synchronous_commit = off to see if WAL writes are causing
tps to go down.

Similarly for checkpoints, use checkpoint_completion_target to
spread the checkpoint writes, as suggested by Jeff, to see
if that can mitigate the problem.
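
As a concrete sketch of those diagnostic settings (not for production use;
in particular full_page_writes = off is unsafe if the system can crash in
the middle of a page write):
  full_page_writes = off
  synchronous_commit = off
  checkpoint_completion_target = 0.9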

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com