Re: [HACKERS] checkpointer continuous flushing - V16

2016-03-07 Thread Andres Freund
On 2016-03-07 09:41:51 -0800, Andres Freund wrote:
> > Due to the difference in amount of RAM, each machine used different scales -
> > the goal is to have small, ~50% RAM, >200% RAM sizes:
> > 
> > 1) Xeon: 100, 400, 6000
> > 2) i5: 50, 200, 3000
> > 
> > The commits actually tested are
> > 
> >cfafd8be  (right before the first patch)
> >7975c5e0  Allow the WAL writer to flush WAL at a reduced rate.
> >db76b1ef  Allow SetHintBits() to succeed if the buffer's LSN ...
> 
> Huh, now I'm a bit confused. These are the commits you tested? Those
> aren't the ones doing sorting and flushing?

To clarify: The reason we'd not expect to see much difference here is
that the above commits really only have any effect above noise if you
use synchronous_commit=off. Without async commit it's just one
additional gettimeofday() call and a few additional branches in the wal
writer every wal_writer_delay.

Andres




Re: [HACKERS] checkpointer continuous flushing - V16

2016-03-07 Thread Andres Freund
On 2016-03-01 16:06:47 +0100, Tomas Vondra wrote:
> 1) HP DL380 G5 (old rack server)
> - 2x Xeon E5450, 16GB RAM (8 cores)
> - 4x 10k SAS drives in RAID-10 on H400 controller (with BBWC)
> - RedHat 6
> - shared_buffers = 4GB
> - min_wal_size = 2GB
> - max_wal_size = 6GB
> 
> 2) workstation with i5 CPU
> - 1x i5-2500k, 8GB RAM
> - 6x Intel S3700 100GB (in RAID0 for this benchmark)
> - Gentoo
> - shared_buffers = 2GB
> - min_wal_size = 1GB
> - max_wal_size = 8GB


Thinking about it, with that hardware I'm not surprised that you're only
seeing small benefits. The amount of RAM limits the amount of dirty data,
and you have plenty of on-storage buffering in comparison to that.


> Both machines were using the same kernel version 4.4.2 and default io
> scheduler (cfq).
> 
> The test procedure was quite simple - pgbench with three different scales,
> for each scale three runs, 1h per run (and 30 minutes of warmup before each
> run).
> 
> Due to the difference in amount of RAM, each machine used different scales -
> the goal is to have small, ~50% RAM, >200% RAM sizes:
> 
> 1) Xeon: 100, 400, 6000
> 2) i5: 50, 200, 3000
> 
> The commits actually tested are
> 
>cfafd8be  (right before the first patch)
>7975c5e0  Allow the WAL writer to flush WAL at a reduced rate.
>db76b1ef  Allow SetHintBits() to succeed if the buffer's LSN ...

Huh, now I'm a bit confused. These are the commits you tested? Those
aren't the ones doing sorting and flushing?


> Also, I really wonder what will happen with non-default io schedulers. I
> believe all the testing so far was done with cfq, so what happens on
> machines that use e.g. "deadline" (as many DB machines actually do)?

deadline and noop showed slightly bigger benefits in my testing.


Greetings,

Andres Freund




Re: [HACKERS] checkpointer continuous flushing - V16

2016-03-01 Thread Fabien COELHO


Hello Tomas,

One of the goals of this thread (as I understand it) was to make the overall 
behavior smoother - eliminate sudden drops in transaction rate due to bursts 
of random I/O etc.


One way to look at this is in terms of how much the tps fluctuates, so let's 
see some charts. I've collected per-second tps measurements (using the 
aggregation built into pgbench) but looking at that directly is pretty 
pointless because it's very difficult to compare two noisy lines jumping up 
and down.


So instead let's see CDF of the per-second tps measurements. I.e. we have 
3600 tps measurements, and given a tps value the question is what percentage 
of the measurements is below this value.


   y = Probability(tps <= x)

We prefer higher values, and the ideal behavior would be that we get exactly 
the same tps every second. Thus an ideal CDF line would be a step line. Of 
course, that's rarely the case in practice. But comparing two CDF curves is 
easy - the line more to the right is better, at least for tps measurements, 
where we prefer higher values.


Very nice and interesting graphs!

Alas, not easy to interpret for the HDD: there are better/worse 
variations all along the distribution and the lines cross one another, so 
how it fares overall is unclear.


Maybe a simple indication would be to compute the standard deviation of 
the per-second tps? The median may be interesting as well.
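
For instance, something along these lines (a minimal C sketch, assuming 
the per-second tps samples are already parsed into an array; the function 
name is illustrative):

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

static int
cmp_double(const void *a, const void *b)
{
    double  x = *(const double *) a;
    double  y = *(const double *) b;

    return (x > y) - (x < y);
}

/*
 * Print median and standard deviation of n per-second tps samples, then
 * dump the empirical CDF: for the i-th smallest sample x, P(tps <= x).
 */
static void
tps_summary(double *tps, int n)
{
    double  sum = 0.0, sumsq = 0.0, mean, var;

    qsort(tps, n, sizeof(double), cmp_double);

    for (int i = 0; i < n; i++)
    {
        sum += tps[i];
        sumsq += tps[i] * tps[i];
    }
    mean = sum / n;
    var = sumsq / n - mean * mean;
    if (var < 0.0)              /* guard against rounding */
        var = 0.0;

    printf("median = %.1f\n",
           (n % 2) ? tps[n / 2] : 0.5 * (tps[n / 2 - 1] + tps[n / 2]));
    printf("stddev = %.1f\n", sqrt(var));

    /* empirical CDF: one point per sample, ready for plotting */
    for (int i = 0; i < n; i++)
        printf("%.1f %.4f\n", tps[i], (double) (i + 1) / n);
}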


I do have some more data, but those are the most interesting charts. The rest 
usually shows about the same thing (or nothing).


Overall, I'm not quite sure the patches actually achieve the intended goals. 
On the 10k SAS drives I got better performance, but apparently much more 
variable behavior. On SSDs, I get a bit worse results.


Indeed.

--
Fabien.




Re: [HACKERS] checkpointer continuous flushing - V16

2016-02-19 Thread Michael Paquier
On Sat, Feb 20, 2016 at 5:08 AM, Fabien COELHO  wrote:
>> Kernel 3.2 is extremely bad for Postgresql, as the vm seems to amplify IO
>> somehow. The difference to 3.13 (the latest LTS kernel for 12.04) is huge.
>>
>>
>> https://medium.com/postgresql-talk/benchmarking-postgresql-with-different-linux-kernel-versions-on-ubuntu-lts-e61d57b70dd4#.6dx44vipu
>
>
> Interesting! To summarize: a 25% performance degradation from the best
> kernel (2.6.32) to the worst (3.2.0), which is indeed significant.

As far as I recall, the OS cache eviction is very aggressive in 3.2,
so it would be possible that data from the FS cache that was just read
could be evicted even if it was not used yet. This represents a large
difference when the database does not fit in RAM.
-- 
Michael




Re: [HACKERS] checkpointer continuous flushing - V16

2016-02-19 Thread Fabien COELHO


Hello Patric,

Kernel 3.2 is extremely bad for PostgreSQL, as the VM seems to amplify 
IO somehow. The difference to 3.13 (the latest LTS kernel for 12.04) is 
huge.


https://medium.com/postgresql-talk/benchmarking-postgresql-with-different-linux-kernel-versions-on-ubuntu-lts-e61d57b70dd4#.6dx44vipu


Interesting! To summarize: a 25% performance degradation from the best 
kernel (2.6.32) to the worst (3.2.0), which is indeed significant.


You might consider upgrading your kernel to 3.13 LTS. It's quite easy 
[...]


There is other stuff running on that hardware which I do not wish to touch, 
so upgrading this particular host is currently not an option; otherwise I 
would have switched to trusty.


Thanks for the pointer.

--
Fabien.




Re: [HACKERS] checkpointer continuous flushing - V16

2016-02-19 Thread Patric Bechtel

Hi Fabien,

Fabien COELHO wrote on 19.02.2016 at 16:04:
> 
>>> [...] Ubuntu 12.04 LTS (precise)
>> 
>> That's with 12.04's standard kernel?
> 
> Yes.

Kernel 3.2 is extremely bad for PostgreSQL, as the VM seems to amplify IO 
somehow. The difference
to 3.13 (the latest LTS kernel for 12.04) is huge.

https://medium.com/postgresql-talk/benchmarking-postgresql-with-different-linux-kernel-versions-on-ubuntu-lts-e61d57b70dd4#.6dx44vipu

You might consider upgrading your kernel to 3.13 LTS. It's quite easy normally:

https://wiki.ubuntu.com/Kernel/LTSEnablementStack

/Patric




Re: [HACKERS] checkpointer continuous flushing - V16

2016-02-19 Thread Fabien COELHO


Hello.


Based on these results I think 32 will be a good default for
checkpoint_flush_after? There are a few cases where 64 proved to be
beneficial, and some where 32 is better. I've seen 64 perform a bit
better in some cases here, but the differences were not too big.


Yes, these many runs show that 32 is basically as good as or better than 64.

I'll do some runs with 16/48 to have some more data.

I gather that you didn't play with 
backend_flush_after/bgwriter_flush_after, i.e. you left them at their 
default values? Especially backend_flush_after can have a significant 
positive and negative performance impact.


Indeed, configuration options not reported here have their default values. 
There were also minor changes in the default options for logging (prefix, 
checkpoint, ...), but nothing significant, and always the same for all 
runs.



 [...] Ubuntu 12.04 LTS (precise)


That's with 12.04's standard kernel?


Yes.


   checkpoint_flush_after = { none, 0, 32, 64 }


Did you re-initdb between the runs?


Yes, all runs are from scratch (initdb, pgbench -i, some warmup...).


I've seen massively varying performance differences due to autovacuum
triggered analyzes. It's not completely deterministic when those run,
and on bigger scale clusters analyze can take ages, while holding a
snapshot.


Yes, I agree that the performance changes between long and short runs 
(andres00c vs andres00b) are probably due to autovacuum.


--
Fabien.




Re: [HACKERS] checkpointer continuous flushing - V16

2016-02-19 Thread Andres Freund
Hi,

On 2016-02-19 10:16:41 +0100, Fabien COELHO wrote:
> Below are the results of a lot of tests with pgbench exercising
> checkpoints on the above version, as fetched.

Wow, that's a great test series.


> Overall comments:
>  - sorting & flushing is basically always a winner
>  - benchmarking with short runs on large databases is a bad idea
>the results are very different if a longer run is used
>(see andres00b vs andres00c)

Based on these results I think 32 will be a good default for
checkpoint_flush_after? There are a few cases where 64 proved to be
beneficial, and some where 32 is better. I've seen 64 perform a bit
better in some cases here, but the differences were not too big.

I gather that you didn't play with
backend_flush_after/bgwriter_flush_after, i.e. you left them at their
default values? Especially backend_flush_after can have a significant
positive and negative performance impact.


>  16 GB 2 cpu 8 cores
>  200 GB RAID1 HDD, ext4 FS
>  Ubuntu 12.04 LTS (precise)

That's with 12.04's standard kernel?



>  postgresql.conf:
>shared_buffers = 1GB
>max_wal_size = 1GB
>checkpoint_timeout = 300s
>checkpoint_completion_target = 0.8
>checkpoint_flush_after = { none, 0, 32, 64 }

Did you re-initdb between the runs?


I've seen massively varying performance differences due to autovacuum
triggered analyzes. It's not completely deterministic when those run,
and on bigger scale clusters analyze can take ages, while holding a
snapshot.


> Hmmm, interesting: maintenance_work_mem seems to have some influence on
> performance, although it is not too consistent between settings, probably
> because as the memory is used to its limit the performance is quite
> sensitive to the available memory.

That's probably because of differing behaviour of autovacuum/vacuum,
which sometimes will have to do several scans of the tables if there are
too many dead tuples.


Regards,

Andres




Re: [HACKERS] checkpointer continuous flushing - V16

2016-02-19 Thread Fabien COELHO


Hello Andres,


I don't want to post a full series right now, but my working state is
available on
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/checkpoint-flush
git://git.postgresql.org/git/users/andresfreund/postgres.git checkpoint-flush


Below are the results of a lot of tests with pgbench exercising 
checkpoints on the above version, as fetched.


Overall comments:
 - sorting & flushing is basically always a winner
 - benchmarking with short runs on large databases is a bad idea
   the results are very different if a longer run is used
   (see andres00b vs andres00c)

# HOST/SOFT

 16 GB 2 cpu 8 cores
 200 GB RAID1 HDD, ext4 FS
 Ubuntu 12.04 LTS (precise)

# ABOUT THE REPORTED STATISTICS

 tps: the "excluding connection" tps, the higher the better
 1-sec tps: average of measured per-second tps
   note - it should be the same as the previous one, but due to various
  hazards in the trace, especially when things go badly and pg gets
  stuck, it may be different. Such hazards also explain why there
  may be some non-integer tps reported for some seconds.
 stddev: standard deviation, the lower the better
 the five figures in brackets give a feel of the distribution:
 - min: minimal per-second tps seen in the trace
 - q1: first-quartile per-second tps seen in the trace
 - med: median per-second tps seen in the trace
 - q3: third-quartile per-second tps seen in the trace
 - max: maximal per-second tps seen in the trace
 the last percentage, dubbed "<=10.0", is the percent of seconds where
   performance was at or below 10 tps: this measures how unresponsive pg was
   during the run (see the sketch below)
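
 For instance, the "<=10.0" figure can be computed like this (a minimal C
 sketch; the threshold is a parameter, names are illustrative):

   static double
   frac_unresponsive(const double *tps, int n, double threshold)
   {
       int     count = 0;

       /* percent of seconds at or below the responsiveness threshold */
       for (int i = 0; i < n; i++)
           if (tps[i] <= threshold)
               count++;
       return 100.0 * count / n;
   }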

## TINY2

 pgbench -M prepared -N -P 1 -T 4000 -j 2 -c 4
   with scale = 10 (~ 200 MB)

 postgresql.conf:
   shared_buffers = 1GB
   max_wal_size = 1GB
   checkpoint_timeout = 300s
   checkpoint_completion_target = 0.8
   checkpoint_flush_after = { none, 0, 32, 64 }

 opts # |   tps / 1-sec tps ± stddev [ min q1 med q3 max ] <=10.0

 head 0 | 2574.1 / 2574.3 ± 367.4 [229.0, 2570.1, 2721.9, 2746.1, 2857.2] 0.0%
  1 | 2575.0 / 2575.1 ± 359.3 [  1.0, 2595.9, 2712.0, 2732.0, 2847.0] 0.1%
  2 | 2602.6 / 2602.7 ± 359.5 [ 54.0, 2607.1, 2735.1, 2768.1, 2908.0] 0.0%

0 0 | 2583.2 / 2583.7 ± 296.4 [164.0, 2580.0, 2690.0, 2717.1, 2833.8] 0.0%
  1 | 2596.6 / 2596.9 ± 307.4 [296.0, 2590.5, 2707.9, 2738.0, 2847.8] 0.0%
  2 | 2604.8 / 2605.0 ± 300.5 [110.9, 2619.1, 2712.4, 2738.1, 2849.1] 0.0%

   32 0 | 2625.5 / 2625.5 ± 250.5 [  1.0, 2645.9, 2692.0, 2719.9, 2839.0] 0.1%
  1 | 2630.2 / 2630.2 ± 243.1 [301.8, 2654.9, 2697.2, 2726.0, 2837.4] 0.0%
  2 | 2648.3 / 2648.4 ± 236.7 [570.1, 2664.4, 2708.9, 2739.0, 2844.9] 0.0%

   64 0 | 2587.8 / 2587.9 ± 306.1 [ 83.0, 2610.1, 2680.0, 2731.0, 2857.1] 0.0%
  1 | 2591.1 / 2591.1 ± 305.2 [455.9, 2608.9, 2680.2, 2734.1, 2859.0] 0.0%
  2 | 2047.8 / 2046.4 ± 925.8 [  0.0, 1486.2, 2592.6, 2691.1, 3001.0] 0.2% ?

Pretty small setup, all data fit in buffers. Good tps performance all around
(best for 32 flushes), and flushing shows a noticeable (360 -> 240) reduction
in tps stddev.

## SMALL

 pgbench -M prepared -N -P 1 -T 4000 -j 2 -c 4
   with scale = 120 (~ 2 GB)

 postgresql.conf:
   shared_buffers = 2GB
   checkpoint_timeout = 300s
   checkpoint_completion_target = 0.8
   checkpoint_flush_after = { none, 0, 32, 64 }

 opts # |   tps / 1-sec tps ± stddev [ min q1 med q3 max ] <=10.0

 head 0 | 209.2 / 204.2 ± 516.5 [0.0,   0.0,   4.0,    5.0, 2251.0] 82.3%
  1 | 207.4 / 204.2 ± 518.7 [0.0,   0.0,   4.0,    5.0, 2245.1] 82.3%
  2 | 217.5 / 211.0 ± 530.3 [0.0,   0.0,   3.0,    5.0, 2255.0] 82.0%
  3 | 217.8 / 213.2 ± 531.7 [0.0,   0.0,   4.0,    6.0, 2261.9] 81.7%
  4 | 230.7 / 223.9 ± 542.7 [0.0,   0.0,   4.0,    7.0, 2282.0] 80.7%

0 0 | 734.8 / 735.5 ± 879.9 [0.0,   1.0,  16.5, 1748.3, 2281.1] 47.0%
  1 | 694.9 / 693.0 ± 849.0 [0.0,   1.0,  29.5, 1545.7, 2428.0] 46.4%
  2 | 735.3 / 735.5 ± 888.4 [0.0,   0.0,  12.0, 1781.2, 2312.1] 47.9%
  3 | 736.0 / 737.5 ± 887.1 [0.0,   1.0,  16.0, 1794.3, 2317.0] 47.5%
  4 | 734.9 / 735.1 ± 885.1 [0.0,   1.0,  15.5, 1781.0, 2297.1] 47.2%

   32 0 | 738.1 / 737.9 ± 415.8 [0.0, 553.0, 679.0,  753.0, 2312.1]  0.2%
  1 | 730.5 / 730.7 ± 413.2 [0.0, 546.5, 671.0,  744.0, 2319.0]  0.1%
  2 | 741.9 / 741.9 ± 416.5 [0.0, 556.0, 682.0,  756.0, 2331.0]  0.2%
  3 | 744.1 / 744.1 ± 414.4 [0.0, 555.5, 685.2,  758.0, 2285.1]  0.1%
  4 | 746.9 / 746.9 ± 416.6 [0.0, 566.6, 685.0,  759.0, 2308.1]  0.1%

   64 0 | 743.0 / 743.1 ± 416.5 [1.0, 555.0, 683.0,  759.0, 2353.0]  0.1%
  1 | 742.5 / 742.5 ± 415.6 [0.0, 558.2, 680.0,  758.2, 2296.0]  0.1%
  2 | 742.5 / 742.5 ± 415.9 [0.0, 559.0, 681.1,  757.0, 2310.0]  0.1%
  3 | 529.0 / 526.6 ± 410.9 [0.0, 245.0, 444.0,  701.0, 2380.9]  1.5% ??
  4 | 734.8 / 735.0 ± 414.1 [0.0, 550.0, 673.0,  754.0, 2298.0]  0.1%

Sorting brings ~3.3x tps; flushing significantly reduces tps 

Re: [HACKERS] checkpointer continuous flushing - V16

2016-02-18 Thread Andres Freund
On 2016-02-18 09:51:20 +0100, Fabien COELHO wrote:
> I've looked at these patches, especially the whole bunch of explanations and
> comments which is a good source for understanding what is going on in the
> WAL writer, a part of pg I'm not familiar with.
> 
> When reading the patch 0002 explanations, I had the following comments:
> 
> AFAICS, there are several levels of actions when writing things in pg:
> 
>  0: the thing is written in some internal buffer
> 
>  1: the buffer is advised to be passed to the OS (hint bits?)

Hint bits aren't related to OS writes. They're about information like
'this transaction committed' or 'all tuples on this page are visible'.


>  2: the buffer is actually passed to the OS (write, flush)
> 
>  3: the OS is advised to send the written data to the io subsystem
> (sync_file_range with SYNC_FILE_RANGE_WRITE)
> 
>  4: the OS is required to send the written data to the disk
> (fsync, sync_file_range with SYNC_FILE_RANGE_WAIT_AFTER)

We can't easily rely on sync_file_range(SYNC_FILE_RANGE_WAIT_AFTER) -
the guarantees it gives aren't well defined, and actually changed across
releases.


0002 is about something different: the WAL writer, which writes WAL to
disk so individual backends don't have to. It does so in the background
every wal_writer_delay or whenever a transaction asynchronously
commits.  The reason this interacts with checkpoint flushing is that,
when we flush writes at a regular pace, the writes by the checkpointer
happen in between the very frequent writes/fdatasync() by the WAL
writer. That means the disk's caches are flushed on every fdatasync() -
which causes considerable slowdowns.  On a decent SSD the WAL writer,
before this patch, often did 500-1000 fdatasync()s a second; the
regular sync_file_range calls slowed things down too much.

That's what caused the large regression when using checkpoint
sorting/flushing with synchronous_commit=off. With that fixed - often a
performance improvement on its own - I don't see that regression anymore.


> After more considerations, my final understanding is that this behavior only
> occurs with "asynchronous commit", aka a situation when COMMIT does not wait
> for data to be really fsynced, but the fsync is to occur within some delay
> so it will not be too far away, some kind of compromise for performance
> where commits can be lost.

Right.


> Now all this is somehow alien to me because the whole point of committing is
> having the data to disk, and I would not consider a database to be safe if
> commit does not imply fsync, but I understand that people may have to
> compromise for performance.

It's obviously not applicable for every scenario, but in a *lot* of
real-world scenarios a sub-second loss window doesn't have any actual
negative implications.


Andres




Re: [HACKERS] checkpointer continuous flushing - V16

2016-02-18 Thread Andres Freund
On 2016-02-11 19:44:25 +0100, Andres Freund wrote:
> The first two commits of the series are pretty close to being ready. I'd
> welcome review of those, and I plan to commit them independently of the
> rest as they're beneficial independently.  The most important bits are
> the comments and docs of 0002 - they weren't particularly good
> beforehand, so I had to rewrite a fair bit.
> 
> 0001: Make SetHintBit() a bit more aggressive, afaics that fixes all the
>   potential regressions of 0002
> 0002: Fix the overaggressive flushing by the wal writer, by only
>   flushing every wal_writer_delay ms or wal_writer_flush_after
>   bytes.

I've pushed these after some more polishing, now working on the next
two.

Greetings,

Andres Freund




Re: [HACKERS] checkpointer continuous flushing - V16

2016-02-18 Thread Fabien COELHO


Hello Andres,


0001: Make SetHintBit() a bit more aggressive, afaics that fixes all the
 potential regressions of 0002
0002: Fix the overaggressive flushing by the wal writer, by only
 flushing every wal_writer_delay ms or wal_writer_flush_after
 bytes.


I've looked at these patches, especially the whole bunch of explanations 
and comments which is a good source for understanding what is going on in 
the WAL writer, a part of pg I'm not familiar with.


When reading the patch 0002 explanations, I had the following comments:

AFAICS, there are several levels of actions when writing things in pg:

 0: the thing is written in some internal buffer

 1: the buffer is advised to be passed to the OS (hint bits?)

 2: the buffer is actually passed to the OS (write, flush)

 3: the OS is advised to send the written data to the io subsystem
(sync_file_range with SYNC_FILE_RANGE_WRITE)

 4: the OS is required to send the written data to the disk
(fsync, sync_file_range with SYNC_FILE_RANGE_WAIT_AFTER)
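
For concreteness, level 3 corresponds to a call like the following (a 
minimal sketch; the descriptor, offset and length are illustrative):

#define _GNU_SOURCE
#include <fcntl.h>

/* ask the kernel to start writeback of a file range, without waiting */
static void
hint_writeback(int fd, off_t offset, off_t nbytes)
{
    (void) sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
}

Level 4 would instead use fsync()/fdatasync() on the whole file.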

It is not clear when reading the text which level is discussed. In 
particular, I'm not sure that "flush" refers to level 2, which is 
misleading. When reading the description, I'm rather under the impression 
that it is about level 4, but then if actual fsyncs were performed every 
200 ms the tps would be very low...


After more consideration, my final understanding is that this behavior 
only occurs with "asynchronous commit", aka a situation when COMMIT does 
not wait for data to be really fsynced, but the fsync is to occur within 
some delay so it will not be too far away, some kind of compromise for 
performance where commits can be lost.


Now all this is somehow alien to me, because the whole point of committing 
is having the data on disk, and I would not consider a database to be safe 
if commit does not imply fsync, but I understand that people may have to 
compromise for performance.


Is my understanding right?

--
Fabien.




Re: [HACKERS] checkpointer continuous flushing - V16

2016-02-11 Thread Robert Haas
On Thu, Feb 11, 2016 at 1:44 PM, Andres Freund  wrote:
> On 2016-02-04 16:54:58 +0100, Andres Freund wrote:
>> Fabien asked me to post a new version of the checkpoint flushing patch
>> series. While this isn't entirely ready for commit, I think we're
>> getting closer.
>>
>> I don't want to post a full series right now, but my working state is
>> available on
>> http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/checkpoint-flush
>> git://git.postgresql.org/git/users/andresfreund/postgres.git checkpoint-flush
>
> The first two commits of the series are pretty close to being ready. I'd
> welcome review of those, and I plan to commit them independently of the
> rest as they're beneficial independently.  The most important bits are
> the comments and docs of 0002 - they weren't particularly good
> beforehand, so I had to rewrite a fair bit.
>
> 0001: Make SetHintBit() a bit more aggressive, afaics that fixes all the
>   potential regressions of 0002
> 0002: Fix the overaggressive flushing by the wal writer, by only
>   flushing every wal_writer_delay ms or wal_writer_flush_after
>   bytes.

I previously reviewed 0001 and I think it's fine.  I haven't reviewed
0002 in detail, but I like the concept.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] checkpointer continuous flushing - V16

2016-02-11 Thread Andres Freund
On 2016-02-04 16:54:58 +0100, Andres Freund wrote:
> Fabien asked me to post a new version of the checkpoint flushing patch
> series. While this isn't entirely ready for commit, I think we're
> getting closer.
> 
> I don't want to post a full series right now, but my working state is
> available on
> http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/checkpoint-flush
> git://git.postgresql.org/git/users/andresfreund/postgres.git checkpoint-flush

The first two commits of the series are pretty close to being ready. I'd
welcome review of those, and I plan to commit them independently of the
rest as they're beneficial independently.  The most important bits are
the comments and docs of 0002 - they weren't particularly good
beforehand, so I had to rewrite a fair bit.

0001: Make SetHintBit() a bit more aggressive, afaics that fixes all the
  potential regressions of 0002
0002: Fix the overaggressive flushing by the wal writer, by only
  flushing every wal_writer_delay ms or wal_writer_flush_after
  bytes.
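
In effect the wal writer's flush decision becomes something like this (a
simplified sketch of the policy just described, not the committed code;
names are illustrative):

#include <stdbool.h>
#include <stdint.h>

/*
 * Flush pending WAL only once enough bytes have accumulated since the
 * last flush, or once wal_writer_delay has elapsed, instead of flushing
 * on every wakeup.
 */
static bool
wal_writer_should_flush(uint64_t bytes_pending, int64_t ms_since_flush,
                        uint64_t flush_after_bytes, int64_t delay_ms)
{
    return bytes_pending >= flush_after_bytes ||
           ms_since_flush >= delay_ms;
}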

Greetings,

Andres Freund
From f3bc3a7c40c21277331689595814b359c55682dc Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Thu, 11 Feb 2016 19:34:29 +0100
Subject: [PATCH 1/6] Allow SetHintBits() to succeed if the buffer's LSN is new
 enough.

Previously we only allowed SetHintBits() to succeed if the commit LSN of
the last transaction touching the page has already been flushed to
disk. We can't generally change the LSN of the page, because we don't
necessarily have the required locks on the page. But the required LSN
interlock does not require the commit record to be flushed, it just
requires that the commit record will be flushed before the page is
written out. Therefore if the buffer LSN is newer than the commit LSN,
the hint bit can be safely set.

In a number of scenarios (e.g. pgbench) this noticeably increases the
number of hint bits set. But more importantly it also keeps the
success rate up when flushing WAL less frequently. That was the original
reason for commit 4de82f7d7, which has negative performance consequences
in a number of scenarios. This will allow a followup commit to reduce the
flush rate.

Discussion: 20160118163908.gw10...@awork2.anarazel.de
---
 src/backend/utils/time/tqual.c | 21 +
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index 465933d..503bd1d 100644
--- a/src/backend/utils/time/tqual.c
+++ b/src/backend/utils/time/tqual.c
@@ -89,12 +89,13 @@ static bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
  * Set commit/abort hint bits on a tuple, if appropriate at this time.
  *
  * It is only safe to set a transaction-committed hint bit if we know the
- * transaction's commit record has been flushed to disk, or if the table is
- * temporary or unlogged and will be obliterated by a crash anyway.  We
- * cannot change the LSN of the page here because we may hold only a share
- * lock on the buffer, so we can't use the LSN to interlock this; we have to
- * just refrain from setting the hint bit until some future re-examination
- * of the tuple.
+ * transaction's commit record is guaranteed to be flushed to disk before the
+ * buffer, or if the table is temporary or unlogged and will be obliterated by
+ * a crash anyway.  We cannot change the LSN of the page here because we may
+ * hold only a share lock on the buffer, so we can only use the LSN to
+ * interlock this if the buffer's LSN already is newer than the commit LSN;
+ * otherwise we have to just refrain from setting the hint bit until some
+ * future re-examination of the tuple.
  *
  * We can always set hint bits when marking a transaction aborted.  (Some
  * code in heapam.c relies on that!)
@@ -122,8 +123,12 @@ SetHintBits(HeapTupleHeader tuple, Buffer buffer,
 		/* NB: xid must be known committed here! */
 		XLogRecPtr	commitLSN = TransactionIdGetCommitLSN(xid);
 
-		if (XLogNeedsFlush(commitLSN) && BufferIsPermanent(buffer))
-			return;/* not flushed yet, so don't set hint */
+		if (BufferIsPermanent(buffer) && XLogNeedsFlush(commitLSN) &&
+			BufferGetLSNAtomic(buffer) < commitLSN)
+		{
+			/* not flushed and no LSN interlock, so don't set hint */
+			return;
+		}
 	}
 
 	tuple->t_infomask |= infomask;
-- 
2.7.0.229.g701fa7f

From e4facce2cf8b982408ff1de174cffc202852adfd Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Thu, 11 Feb 2016 19:34:29 +0100
Subject: [PATCH 2/6] Allow the WAL writer to flush WAL at a reduced rate.

Commit 4de82f7d7 increased the WAL flush rate, mainly to increase the
likelihood that hint bits can be set quickly. More quickly set hint bits
can reduce contention around the clog et al.  But unfortunately the
increased flush rate can have a significant negative performance impact,
I have measured up to a factor of ~4.  The reason for this slowdown is
that if there are independent 

Re: [HACKERS] checkpointer continuous flushing - V16

2016-02-09 Thread Fabien COELHO


I think I would appreciate comments to understand why/how the 
ringbuffer is used, and more comments in general, so it is fine if you 
improve this part.


I'd suggest to leave out the ringbuffer/new bgwriter parts.


Ok, so the patch would only include the checkpointer stuff.

I'll look at this part in detail.

--
Fabien.




Re: [HACKERS] checkpointer continuous flushing - V16

2016-02-09 Thread Andres Freund
On February 9, 2016 10:46:34 AM GMT+01:00, Fabien COELHO  
wrote:
>
>>> I think I would appreciate comments to understand why/how the
>>> ringbuffer is used, and more comments in general, so it is fine if
>>> you improve this part.
>>
>> I'd suggest to leave out the ringbuffer/new bgwriter parts.
>
>Ok, so the patch would only include the checkpointer stuff.
>
>I'll look at this part in detail.

Yes, that's the more pressing part. I've seen pretty good results with the new 
bgwriter, but it's not really worthwhile until sorting and flushing are in...

Andres 

--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.




Re: [HACKERS] checkpointer continuous flushing - V16

2016-02-08 Thread Andres Freund
Hi Fabien,

On 2016-02-04 16:54:58 +0100, Andres Freund wrote:
> I don't want to post a full series right now, but my working state is
> available on
> http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/checkpoint-flush
> git://git.postgresql.org/git/users/andresfreund/postgres.git checkpoint-flush
> 
> The main changes are that:
> 1) the significant performance regressions I saw are addressed by
>changing the wal writer flushing logic
> 2) The flushing API moved up a couple layers, and now deals with buffer
>tags, rather than the physical files
> 3) Writes from checkpoints, bgwriter and files are flushed, configurable
>by individual GUCs. Without that I still saw the spikes in a lot of 
> circumstances.
> 
> There's also a more experimental reimplementation of bgwriter, but I'm
> not sure it's realistic to polish that up within the constraints of 9.6.

Any comments before I spend more time polishing this? I'm currently
updating docs and comments to actually describe the current state...

Andres




Re: [HACKERS] checkpointer continuous flushing - V16

2016-02-08 Thread Fabien COELHO


Hello Andres,


Any comments before I spend more time polishing this?


I'm running tests on various settings; I'll send a report when it is done.
Up to now the performance seems as good as with the previous version.

I'm currently updating docs and comments to actually describe the 
current state...


I did notice the mismatched documentation.

I think I would appreciate comments to understand why/how the ringbuffer 
is used, and more comments in general, so it is fine if you improve this 
part.


Minor details:

"typedefs.list" should be updated to WritebackContext.

"WritebackContext" is a typedef, "struct" is not needed.


I'll look at the code more deeply probably over next weekend.

--
Fabien.




Re: [HACKERS] checkpointer continuous flushing - V16

2016-02-08 Thread Andres Freund
On 2016-02-08 19:52:30 +0100, Fabien COELHO wrote:
> I think I would appreciate comments to understand why/how the ringbuffer is
> used, and more comments in general, so it is fine if you improve this part.

I'd suggest to leave out the ringbuffer/new bgwriter parts. I think
they'd be committed separately, and probably not in 9.6.

Thanks,

Andres




Re: [HACKERS] checkpointer continuous flushing - V16

2016-02-04 Thread Andres Freund
Hi,

Fabien asked me to post a new version of the checkpoint flushing patch
series. While this isn't entirely ready for commit, I think we're
getting closer.

I don't want to post a full series right now, but my working state is
available on
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/checkpoint-flush
git://git.postgresql.org/git/users/andresfreund/postgres.git checkpoint-flush

The main changes are that:
1) the significant performance regressions I saw are addressed by
   changing the wal writer flushing logic
2) The flushing API moved up a couple layers, and now deals with buffer
   tags, rather than the physical files
3) Writes from checkpoints, bgwriter and files are flushed, configurable
   by individual GUCs. Without that I still saw the spikes in a lot of 
circumstances.

There's also a more experimental reimplementation of bgwriter, but I'm
not sure it's realistic to polish that up within the constraints of 9.6.

Regards,

Andres 

