Re: [HACKERS] checkpointer continuous flushing

2016-03-22 Thread Fabien COELHO



To emphasize potential bad effects without having to build too large a host
and involve too many tablespaces, I would suggest significantly reducing
the "checkpoint_flush_after" setting while running these tests.


Meh, that completely distorts the test.


Yep, I agree.

The point would be to show whether there is a significant impact, or not, 
with less hardware & cost involved in the test.


Now if you can put 16 disks with 16 tablespaces and 16 buffers per 
bucket, that is good, fine with me! I'm just trying to point out that you 
could probably get comparable relative results with 4 disks, 4 tablespaces 
and 4 buffers per bucket, so it is an alternative and less 
expensive testing strategy.


This just shows that I usually work on a tight (negligible?) budget :-)

--
Fabien.




Re: [HACKERS] checkpointer continuous flushing

2016-03-22 Thread Fabien COELHO



My impression is that we actually know what we need to know anyway?


Sure, the overall summary is "it is much better with the patch" on this 
large SSD test, which is good news because the patch was really designed 
to help with HDDs.


--
Fabien.




Re: [HACKERS] checkpointer continuous flushing

2016-03-22 Thread Fabien COELHO



You took 5% of the tx on two 12-hour runs, totaling, say, 85M tx on
one and 100M tx on the other, so you get 4.25M tx from the first and
5M from the second.


OK


I'm saying that the percentile should be computed on the largest one
(5M), so that you get a curve like the following, with both curves
having the same transaction density on the y axis, so the second one
does not go up to the top, reflecting that in this case fewer
transactions were processed.


Huh, that seems weird. That's not how percentiles or CDFs work, and I don't 
quite understand what that would tell us.


It would tell us, for a given transaction number (in the 
latency-ordered list), whether its latency is above or below that of the other run.


I think it would probably show that the latency is always better for the 
patched version, by getting rid of the crossing, which has no meaning and 
seems to suggest, wrongly, that in some cases the other is better than the 
first; but as the y axes of the two curves are not in the same unit (not the 
same transaction density), this is just an illusion implied by a misplaced 
normalization.


So I'm basically saying that the y axis should be just the transaction 
number, not a percentage.
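To make the two normalizations concrete, here is a minimal sketch (with invented
sample sizes standing in for the 4.25M and 5M tx samples; this is not the script
used for the actual charts): the standard CDF divides each run by its own size,
while the variant suggested above divides both runs by the larger run's
transaction count, so the smaller run's curve stops below 1.0.

    import numpy as np

    def per_run_cdf(latencies):
        # Standard empirical CDF: each run normalized by its own size.
        x = np.sort(latencies)
        y = np.arange(1, len(x) + 1) / len(x)          # always reaches 1.0
        return x, y

    def shared_normalization(latencies, reference_count):
        # Both runs divided by the same (largest) transaction count, so the
        # smaller run's curve tops out below 1.0.
        x = np.sort(latencies)
        y = np.arange(1, len(x) + 1) / reference_count
        return x, y

    # Hypothetical latency samples (milliseconds), sizes in a 85:100 ratio.
    unpatched = np.random.lognormal(mean=1.5, sigma=0.8, size=85_000)
    patched   = np.random.lognormal(mean=1.4, sigma=0.5, size=100_000)

    ref = max(len(unpatched), len(patched))
    x1, y1 = shared_normalization(unpatched, ref)   # ends near 0.85
    x2, y2 = shared_normalization(patched, ref)     # ends at 1.00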


Anyway, these are just details, your figures show that the patch is a very 
significant win on SSDs, all is well!


--
Fabien.




Re: [HACKERS] checkpointer continuous flushing

2016-03-22 Thread Andres Freund
On 2016-03-22 10:52:55 +0100, Fabien COELHO wrote:
> To emphasize potential bad effects without having to build too large a host
> and involve too many tablespaces, I would suggest significantly reducing
> the "checkpoint_flush_after" setting while running these tests.

Meh, that completely distorts the test.




Re: [HACKERS] checkpointer continuous flushing

2016-03-22 Thread Fabien COELHO



WRT tablespaces: What I'm planning to do, unless somebody has a better
proposal, is to basically rent two big Amazon instances, and run pgbench
in parallel over N tablespaces. Once with local SSD and once with local
HDD storage.


Ok.

Not sure how to ensure that tablespaces are actually on distinct 
dedicated disks with VMs, but that is the idea.


To emphasize potential bad effects without having to build too large a 
host and involve too many tablespaces, I would suggest significantly 
reducing the "checkpoint_flush_after" setting while running these 
tests.


--
Fabien.




Re: [HACKERS] checkpointer continuous flushing

2016-03-22 Thread Andres Freund
On 2016-03-22 10:48:20 +0100, Tomas Vondra wrote:
> Hi,
> 
> On 03/22/2016 10:44 AM, Fabien COELHO wrote:
> >
> >
> 1) regular-latency.png
> >>>
> >>>I'm wondering whether it would be clearer if the percentiles
> >>>were relative to the largest sample, not to each run's own, so that the
> >>>figures from the largest one would still be between 0 and 1, but
> >>>the other (unpatched) one would go between 0 and 0.85, that is,
> >>>would be cut short proportionally to the actual performance.
> >>
> >>I'm not sure what you mean by 'relative to largest sample'?
> >
> >You took 5% of the tx on two 12-hour runs, totaling, say, 85M tx on
> >one and 100M tx on the other, so you get 4.25M tx from the first and
> >5M from the second.
> 
> OK
> 
> >I'm saying that the percentile should be computed on the largest one
> >(5M), so that you get a curve like the following, with both curves
> >having the same transaction density on the y axis, so the second one
> >does not go up to the top, reflecting that in this case fewer
> >transactions were processed.
> 
> Huh, that seems weird. That's not how percentiles or CDFs work, and I don't
> quite understand what that would tell us.

My impression is that we actually know what we need to know anyway?




Re: [HACKERS] checkpointer continuous flushing

2016-03-22 Thread Tomas Vondra

Hi,

On 03/22/2016 10:44 AM, Fabien COELHO wrote:




1) regular-latency.png


I'm wondering whether it would be clearer if the percentiles
were relative to the largest sample, not to each run's own, so that the
figures from the largest one would still be between 0 and 1, but
the other (unpatched) one would go between 0 and 0.85, that is,
would be cut short proportionally to the actual performance.


I'm not sure what you mean by 'relative to largest sample'?


You took 5% of the tx on two 12-hour runs, totaling, say, 85M tx on
one and 100M tx on the other, so you get 4.25M tx from the first and
5M from the second.


OK


I'm saying that the percentile should be computed on the largest one
(5M), so that you get a curve like the following, with both curves
having the same transaction density on the y axis, so the second one
does not go up to the top, reflecting that in this case fewer
transactions were processed.


Huh, that seems weird. That's not how percentiles or CDFs work, and I 
don't quite understand what that would tell us.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] checkpointer continuous flushing

2016-03-22 Thread Fabien COELHO




1) regular-latency.png


I'm wondering whether it would be clearer if the percentiles were
relative to the largest sample, not to each run's own, so that the figures
from the largest one would still be between 0 and 1, but the other
(unpatched) one would go between 0 and 0.85, that is, would be cut
short proportionally to the actual performance.


I'm not sure what you mean by 'relative to largest sample'?


You took 5% of the tx on two 12-hour runs, totaling, say, 85M tx on one 
and 100M tx on the other, so you get 4.25M tx from the first and 5M from 
the second.


I'm saying that the percentile should be computed on the largest one (5M), 
so that you get a curve like the following, with both curves having the 
same transaction density on the y axis, so the second one does not go up 
to the top, reflecting that in this case fewer transactions were 
processed.


  A
  +- # up to 100%
  |   /  ___ # cut short
  |   | /
  |   | |
  | _/ /
  |/__/
  +->

--
Fabien.




Re: [HACKERS] checkpointer continuous flushing

2016-03-22 Thread Andres Freund
Hi,

On 2016-03-21 18:46:58 +0100, Tomas Vondra wrote:
> I've repeated the tests, but this time logged details for 5% of the
> transactions (instead of aggregating the data for each second). I've also
> made the tests shorter - just 12 hours instead of 24, to reduce the time
> needed to complete the benchmark.
> 
> Overall, this means ~300M transactions in total for the un-throttled case,
> so a sample of ~15M transactions is available when computing the following
> charts.
> 
> I've used the same commits as during the previous testing, i.e. a298a1e0
> (before patches) and 23a27b03 (with patches).
> 
> One interesting difference is that while the "patched" version resulted in
> slightly better performance (8122 vs. 8000 tps), the "unpatched" version got
> considerably slower (6790 vs. 7725 tps) - that's ~13% difference, so not
> negligible. Not sure what the cause is - the configuration was exactly the
> same, there's nothing in the log and the machine was dedicated to the
> testing. The only explanation I have is that the unpatched code is a bit
> more unstable when it comes to this type of stress testing.
> 
> The results (including scripts for generating the charts) are here:
> 
> https://github.com/tvondra/flushing-benchmark-2
> 
> Attached are three charts - again, those are using CDFs to illustrate the
> distributions and compare them easily:
> 
> 1) regular-latency.png
> 
> The two curves intersect at ~4ms, where both CDFs reach ~85%. For the shorter
> transactions, the old code is slightly faster (i.e. apparently there's some
> per-transaction overhead). For higher latencies though, the patched code is
> clearly winning - there are far fewer transactions over 6ms, which makes a
> huge difference. (Notice the x-axis is actually log-scale, so the tail on
> the old code is actually much longer than it might appear.)
> 
> 2) throttled-latency.png
> 
> In the throttled case (i.e. when the system is not 100% utilized, so it's
> more representative of actual production use), the difference is quite
> clearly in favor of the new code.
> 
> 3) throttled-schedule-lag.png
> 
> Mostly just an alternative view on the previous chart, showing how much
> later the transactions were scheduled. Again, the new code is winning.

Thanks for running these tests!

I think this shows that we're in good shape, and that the commits
succeeded in what they were attempting. Very glad to hear that.


WRT tablespaces: What I'm planning to do, unless somebody has a better
proposal, is to basically rent two big Amazon instances, and run pgbench
in parallel over N tablespaces. Once with local SSD and once with local
HDD storage.

Greetings,

Andres Freund




Re: [HACKERS] checkpointer continuous flushing

2016-03-22 Thread Tomas Vondra

Hi,

On 03/22/2016 07:35 AM, Fabien COELHO wrote:


Hello Tomas,

Thanks again for these interesting benches.


Overall, this means ~300M transactions in total for the un-throttled
case, so a sample of ~15M transactions is available when computing the
following charts.


Still a very sizable run!


The results (including scripts for generating the charts) are here:

   https://github.com/tvondra/flushing-benchmark-2


This repository seems empty.


Strange. Apparently I forgot to push, or maybe it did not complete 
before I closed the terminal. Anyway, pushing now (it'll take a bit more 
time to complete).





1) regular-latency.png


I'm wondering whether it would be clearer if the percentiles were
relative to the largest sample, not to each run's own, so that the figures
from the largest one would still be between 0 and 1, but the other
(unpatched) one would go between 0 and 0.85, that is, would be cut
short proportionally to the actual performance.



I'm not sure what you mean by 'relative to largest sample'?


The two curves intersect at ~4ms, where both CDFs reach ~85%. For
the shorter transactions, the old code is slightly faster (i.e.
apparently there's some per-transaction overhead).


I'm not sure how meaningful the crossing is, because the two curves do
not reflect the same performance. I think that they may not cross at
all if the normalization were done with the same reference, i.e. the better
run.


Well, I think the curves illustrate exactly the performance difference, 
because with the old code the percentiles after p=0.85 get much higher. 
Which is the point of the crossing, although I agree the exact point 
does not have a particular meaning.



2) throttled-latency.png

In the throttled case (i.e. when the system is not 100% utilized,
so it's more representative of actual production use), the
difference is quite clearly in favor of the new code.


Indeed, it is a no-brainer.


Yep.




3) throttled-schedule-lag.png

Mostly just an alternative view on the previous chart, showing how
much later the transactions were scheduled. Again, the new code is
winning.


No-brainer again. I infer from this figure that with the initial
version 60% of transactions have trouble being processed on time,
while this is maybe about 35% with the new version.


Yep.

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] checkpointer continuous flushing

2016-03-22 Thread Fabien COELHO


Hello Tomas,

Thanks again for these interesting benches.

Overall, this means ~300M transactions in total for the un-throttled case, so 
a sample of ~15M transactions is available when computing the following charts.


Still a very sizable run!


The results (including scripts for generating the charts) are here:

   https://github.com/tvondra/flushing-benchmark-2


This repository seems empty.


1) regular-latency.png


I'm wondering whether it would be clearer if the percentiles were 
relative to the largest sample, not to each run's own, so that the figures from 
the largest one would still be between 0 and 1, but the other (unpatched) 
one would go between 0 and 0.85, that is, would be cut short proportionally 
to the actual performance.


The two curves intersect at ~4ms, where both CDFs reach ~85%. For the 
shorter transactions, the old code is slightly faster (i.e. apparently 
there's some per-transaction overhead).


I'm not sure how meaningful the crossing is, because the two curves do not 
reflect the same performance. I think that they may not cross at all if 
the normalization were done with the same reference, i.e. the better run.



2) throttled-latency.png

In the throttled case (i.e. when the system is not 100% utilized, so it's 
more representative of actual production use), the difference is quite 
clearly in favor of the new code.


Indeed, it is a no-brainer.


3) throttled-schedule-lag.png

Mostly just an alternative view on the previous chart, showing how much later 
the transactions were scheduled. Again, the new code is winning.


No-brainer again. I infer from this figure that with the initial version 
60% of transactions have trouble being processed on time, while this is 
maybe about 35% with the new version.


--
Fabien.




Re: [HACKERS] checkpointer continuous flushing

2016-03-21 Thread Tomas Vondra

Hi,

I've repeated the tests, but this time logged details for 5% of the 
transactions (instead of aggregating the data for each second). I've also 
made the tests shorter - just 12 hours instead of 24, to reduce the time 
needed to complete the benchmark.


Overall, this means ~300M transactions in total for the un-throttled 
case, so a sample of ~15M transactions is available when computing the 
following charts.


I've used the same commits as during the previous testing, i.e. a298a1e0 
(before patches) and 23a27b03 (with patches).


One interesting difference is that while the "patched" version resulted 
in slightly better performance (8122 vs. 8000 tps), the "unpatched" 
version got considerably slower (6790 vs. 7725 tps) - that's ~13% 
difference, so not negligible. Not sure what the cause is - the 
configuration was exactly the same, there's nothing in the log and the 
machine was dedicated to the testing. The only explanation I have is 
that the unpatched code is a bit more unstable when it comes to this 
type of stress testing.


The results (including scripts for generating the charts) are here:

https://github.com/tvondra/flushing-benchmark-2

Attached are three charts - again, those are using CDFs to illustrate the 
distributions and compare them easily:


1) regular-latency.png

The two curves intersect at ~4ms, where both CDFs reach ~85%. For the 
shorter transactions, the old code is slightly faster (i.e. apparently 
there's some per-transaction overhead). For higher latencies though, the 
patched code is clearly winning - there are far fewer transactions over 
6ms, which makes a huge difference. (Notice the x-axis is actually 
log-scale, so the tail on the old code is actually much longer than it 
might appear.)


2) throttled-latency.png

In the throttled case (i.e. when the system is not 100% utilized, so 
it's more representative of actual production use), the difference is 
quite clearly in favor of the new code.


3) throttled-schedule-lag.png

Mostly just an alternative view on the previous chart, showing how much 
later the transactions were scheduled. Again, the new code is winning.



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] checkpointer continuous flushing

2016-03-19 Thread Fabien COELHO


Hello Tomas,

Thanks for these great measures.


* 4 x CPU E5-4620 (2.2GHz)


4*8 = 32 cores / 64 threads.


* 256GB of RAM


Wow!


* 24x SSD on LSI 2208 controller (with 1GB BBWC)


Wow! RAID configuration? The patch is designed to fix very big issues on 
HDDs, but it is good to see that the impact is good on SSDs as well.


Is it possible to run tests with distinct tablespaces on that many 
disks?



* shared_buffers=64GB


1/4 of the available memory.


The pgbench was scale 6, so ~750GB of data on disk,


3x the available memory, so mostly on disk.


or like this ("throttled"):

pgbench -c 32 -j 8 -T 86400 -R 5000 -l --aggregate-interval=1 pgbench

The reason for the throttling is that people generally don't run production 
databases 100% saturated, so it'd be sad to improve the 100% saturated case 
and hurt the common case by increasing latency.


Sure.


The machine does ~8000 tps, so 5000 tps is ~60% of that.


Ok.

I would have suggested using the --latency-limit option to filter out very 
slow queries, otherwise if the system is stuck it may catch up later, but 
then this is not representative of "sustainable" performance.


When pgbench is running under a target rate, in both runs the transaction 
distribution is expected to be the same, around 5000 tps, and the green 
run looks pretty ok with respect to that. The magenta one shows that about 
25% of the time, things are not good at all, and the higher figures just 
show the catching up, which is not really interesting if you asked for a 
web page and it is finally delivered one minute later.



* regular-tps.png (per-second TPS) [...]


Great curves!

consistent. Originally there was ~10% of samples with ~2000 tps, but with the 
flushing you'd have to go to ~4600 tps. It's actually pretty difficult to 
determine this from the chart, because the curve got so steep and I had to 
check the data used to generate the charts.


Similarly for the upper end, but I assume that's a consequence of the 
throttling not having to compensate for the "slow" seconds anymore.


Yep, but they should be filtered out ("sorry, too late"), so that would 
count as unresponsiveness, at least for a large class of applications.


Thanks a lot for these interesting tests!

--
Fabien.




Re: [HACKERS] checkpointer continuous flushing

2016-03-19 Thread Tomas Vondra

Hi,

On 03/17/2016 10:14 PM, Fabien COELHO wrote:



...

I would have suggested using the --latency-limit option to filter out
very slow queries, otherwise if the system is stuck it may catch up
later, but then this is not representative of "sustainable" performance.

When pgbench is running under a target rate, in both runs the
transaction distribution is expected to be the same, around 5000 tps,
and the green run looks pretty ok with respect to that. The magenta one
shows that about 25% of the time, things are not good at all, and the
higher figures just show the catching up, which is not really
interesting if you asked for a web page and it is finally delivered 1
minute later.


Maybe. But that'd only increase the stress on the system, possibly
causing more issues, no? And the magenta line is the old code, thus it
would only increase the improvement of the new code.


Yes and no. I agree that it stresses the system a little more, but
the fact that you have 5000 tps in the end does not show that you can
really sustain 5000 tps with reasonable latency. I find this latter
information more interesting than knowing that you can get 5000 tps
on average, thanks to some catching up. Moreover the non-throttled
runs already showed that the system could do 8000 tps, so the
bandwidth is already there.


Sure, but thanks to the tps charts we *do know* that for the vast majority 
of the intervals (each second) the number of completed transactions is 
very close to 5000. And that wouldn't be possible if a large part of the 
latencies were close to the maximums.


With 5000 tps and 32 clients, that means the average latency should be 
less than ~6ms (5000 tps / 32 clients ≈ 156 tps per client, i.e. ~6.4 ms per 
transaction), otherwise the clients couldn't make ~160 tps each. But we 
do see that the maximum latency for most intervals is way higher. Only 
~10% of the intervals have max latency below 10ms, for example.





Notice the max latency is in microseconds (as logged by pgbench),
so according to the "max latency" charts the latencies are below
10 seconds (old) and 1 second (new) about 99% of the time.


AFAICS, the max latency is aggregated by second, but then it does
not say much about the distribution of individual latencies in the
interval, that is whether they were all close to the max or not.
Having the same chart with median or average might help. Also, with
the stddev chart, the percentages do not correspond with the latency one,
so it may be that the latency is high but the stddev is low, i.e. all
transactions are equally bad on the interval, or not.

>

So I must admit that I'm not clear at all how to interpret the max
latency & stddev charts you provided.


You're right, those charts are not describing distributions of the 
latencies but rather the aggregated metrics. And it's not particularly simple 
to deduce information about the source statistics, for example because 
all the intervals have the same "weight" although the number of 
transactions that completed in each interval may be different.
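A tiny illustration of that weighting effect (the numbers are invented, not
from the benchmark): averaging per-interval averages treats a stalled interval
with one slow transaction the same as a healthy interval with thousands of
fast ones.

    import numpy as np

    # Two 1-second intervals: a healthy one with many fast transactions and
    # a stalled one with a single very slow transaction (latencies in ms).
    interval_a = np.array([2.0] * 5000)     # 5000 tx at 2 ms
    interval_b = np.array([900.0])          # 1 tx at 900 ms

    # Per-interval averages give both intervals equal weight...
    per_interval_mean = np.mean([interval_a.mean(), interval_b.mean()])   # 451 ms

    # ...while the per-transaction mean reflects how few transactions the
    # stalled interval actually completed.
    per_transaction_mean = np.concatenate([interval_a, interval_b]).mean()  # ~2.2 ms

    print(f"mean of per-interval averages: {per_interval_mean:.1f} ms")
    print(f"overall per-transaction mean:  {per_transaction_mean:.1f} ms")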


But I do think it's a very useful tool when it comes to measuring the 
consistency of behavior over time, assuming you're asking questions 
about the intervals and not the original transactions.


For example, had there been intervals with vastly different transaction 
rates, we'd see that on the tps charts (i.e. the chart would be much 
more gradual or wobbly, just like the "unpatched" one). Or if there were 
intervals with much higher variance of latencies, we'd see that on the 
STDDEV chart.


I'll consider repeating the benchmark and logging some reasonable sample 
of transactions - for the 24h run the unthrottled benchmark did ~670M 
transactions. Assuming ~30B per line, that's ~20GB, so 5% sample should 
be ~1GB of data, which I think is enough.


But of course, that's useful for answering questions about distribution 
of the individual latencies in global, not about consistency over time.





So I don't think this would make any measurable difference in practice.


I think that it may show that 25% of the time the system could not
match the target tps, even if it can handle much more on average, so
the tps achieved when discarding late transactions would be under
4000 tps.


You mean the 'throttled-tps' chart? Yes, that one shows that without the 
patches, there's a lot of intervals where the tps was much lower - 
presumably due to a lot of slow transactions.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] checkpointer continuous flushing

2016-03-19 Thread Fabien COELHO


Hello Tomas,

But I do think it's a very useful tool when it comes to measuring the 
consistency of behavior over time, assuming you're asking questions 
about the intervals and not the original transactions.


For a throttled run, I think it is better to check whether or not the 
system could handle the load "as expected", i.e. with reasonable latency, 
so somehow I'm interested in the "original transactions" as scheduled by 
the client, and whether they were processed efficiently, but then it must 
be aggregated by interval to get some statistics.


For example, had there been intervals with vastly different transaction 
rates, we'd see that on the tps charts (i.e. the chart would be much more 
gradual or wobbly, just like the "unpatched" one). Or if there were intervals 
with much higher variance of latencies, we'd see that on the STDDEV chart.


On HDDs what happens is that transactions are "blocked/frozen": the tps 
is very low and the latency very high, but with so few tx (even 1 or 0 at 
a time), all latencies are very bad yet close to one another, in a bad way, 
so the resulting stddev may be quite small anyway.


I'll consider repeating the benchmark and logging some reasonable sample of 
transactions


Beware that this measure is skewed, because on HDDs when the system is 
stuck, it is stuck on very few transactions which are waiting, but they
would seldom show up in the statistics as there are very few of them. That 
is why I'm interested in those that could not make it, hence my interest in 
the --latency-limit option, which reports just that.
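To illustrate the kind of filtering this implies, here is a minimal
post-processing sketch (not a tool from the thread; the file name is
hypothetical, and it assumes a pgbench per-transaction log whose third column
is the latency in microseconds) that counts transactions exceeding a latency
limit and reports the "on-time" tps:

    # Toy post-processing of a pgbench per-transaction log (pgbench -l).
    # Assumed line layout: client_id transaction_no latency_us file_no ...
    LATENCY_LIMIT_MS = 100.0
    DURATION_S = 12 * 3600          # 12-hour run, as in the benchmark above

    on_time = late = 0
    with open("pgbench_log.0") as f:        # hypothetical file name
        for line in f:
            fields = line.split()
            latency_ms = int(fields[2]) / 1000.0
            if latency_ms <= LATENCY_LIMIT_MS:
                on_time += 1
            else:
                late += 1

    total = on_time + late
    print(f"late transactions: {late} ({100.0 * late / total:.1f}%)")
    print(f"on-time tps: {on_time / DURATION_S:.1f}")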



So I don't think this would make any measurable difference in practice.


I think that it may show that 25% of the time the system could not
match the target tps, even if it can handle much more on average, so
the tps achieved when discarding late transactions would be under
4000 tps.


You mean the 'throttled-tps' chart?


Yes.

Yes, that one shows that without the patches, there's a lot of intervals 
where the tps was much lower - presumably due to a lot of slow 
transactions.


Yep. That is what is measured with the latency-limit option, by counting 
the dropped transactions that were not processed in a timely manner.


--
Fabien.




Re: [HACKERS] checkpointer continuous flushing

2016-03-19 Thread Fabien COELHO



Is it possible to run tests with distinct tablespaces on that many disks?


Nope, that'd require reconfiguring the system (and then back), and I don't 
have access to that system (just SSH).


Ok.


Also, I don't quite see what that would tell us.


Currently the flushing context is shared between tablespaces, but I think 
that it should be per tablespace. My tests did not manage to convince 
Andres, so getting some more figures would be great. That will be for 
another time!



I would have suggested using the --latency-limit option to filter out
very slow queries, otherwise if the system is stuck it may catch up
later, but then this is not representative of "sustainable" performance.

When pgbench is running under a target rate, in both runs the
transaction distribution is expected to be the same, around 5000 tps,
and the green run looks pretty ok with respect to that. The magenta one
shows that about 25% of the time, things are not good at all, and the
higher figures just show the catching up, which is not really
interesting if you asked for a web page and it is finally delivered 1
minute later.


Maybe. But that'd only increase the stress on the system, possibly causing 
more issues, no? And the magenta line is the old code, thus it would only 
increase the improvement of the new code.


Yes and no. I agree that it stresses the system a little more, but the 
fact that you have 5000 tps in the end does not show that you can really 
sustain 5000 tps with reasonable latency. I find this latter information 
more interesting than knowing that you can get 5000 tps on average, 
thanks to some catching up. Moreover the non-throttled runs already showed 
that the system could do 8000 tps, so the bandwidth is already there.


Notice the max latency is in microseconds (as logged by pgbench), so 
according to the "max latency" charts the latencies are below 10 seconds 
(old) and 1 second (new) about 99% of the time.


AFAICS, the max latency is aggregated by second, but then it does not say 
much about the distribution of individual latencies in the interval, that 
is whether they were all close to the max or not. Having the same chart 
with median or average might help. Also, with the stddev chart, the 
percentages do not correspond with the latency one, so it may be that the 
latency is high but the stddev is low, i.e. all transactions are equally 
bad on the interval, or not.


So I must admit that I'm not clear at all how to interpret the max latency 
& stddev charts you provided.



So I don't think this would make any measurable difference in practice.


I think that it may show that 25% of the time the system could not match 
the target tps, even if it can handle much more on average, so the tps 
achieved when discarding late transactions would be under 4000 tps.


--
Fabien.




Re: [HACKERS] checkpointer continuous flushing

2016-03-19 Thread Tomas Vondra

Hi,

On 03/11/2016 02:34 AM, Andres Freund wrote:

Hi,

I just pushed the two major remaining patches in this thread. Let's see
what the buildfarm has to say; I'd not be surprised if there's some
lingering portability problem in the flushing code.

There's one remaining issue we definitely want to resolve before the
next release:  Right now we always use one writeback context across all
tablespaces in a checkpoint, but Fabien's testing shows that that's
likely to hurt in a number of cases. I've some data suggesting the
contrary in others.

Things that'd be good:
* Some benchmarking. Right now controlled flushing is enabled by default
  on linux, but disabled by default on other operating systems. Somebody
  running benchmarks on e.g. freebsd or OSX might be good.


So I've done some benchmarks of this, and I think the results are very 
good. I've compared a298a1e06 and 23a27b039d (so the two patches 
mentioned here are in-between those two), and I've done a few long 
pgbench runs - 24h each:


1) master (a298a1e06), regular pgbench
2) master (a298a1e06), throttled to 5000 tps
3) patched (23a27b039), regular pgbench
4) patched (23a27b039), throttled to 5000 tps

All of this was done on a quite large machine:

* 4 x CPU E5-4620 (2.2GHz)
* 256GB of RAM
* 24x SSD on LSI 2208 controller (with 1GB BBWC)

The page cache was using the default config, although in production 
setups we'd probably lower the limits (particularly the background 
threshold):


* vm.dirty_background_ratio = 10
* vm.dirty_ratio = 20

The main PostgreSQL configuration changes are these:

* shared_buffers=64GB
* bgwriter_delay = 10ms
* bgwriter_lru_maxpages = 1000
* checkpoint_timeout = 30min
* max_wal_size = 64GB
* min_wal_size = 32GB

I haven't touched the flush_after values, so those are at default. Full 
config in the github repo, along with all the results and scripts used 
to generate the charts etc:


https://github.com/tvondra/flushing-benchmark

I'd like to see some benchmarks on machines with regular rotational 
storage, but I don't have a suitable system at hand.


The pgbench was scale 6, so ~750GB of data on disk, and was executed 
either like this (the "default"):


pgbench -c 32 -j 8 -T 86400 -l --aggregate-interval=1 pgbench

or like this ("throttled"):

pgbench -c 32 -j 8 -T 86400 -R 5000 -l --aggregate-interval=1 pgbench

The reason for the throttling is that people generally don't run 
production databases 100% saturated, so it'd be sad to improve the 100% 
saturated case and hurt the common case by increasing latency. The 
machine does ~8000 tps, so 5000 tps is ~60% of that.


It's difficult to judge based on a single run (although a long one), but 
it seems the throughput increased a tiny bit from 7725 to 8000. That's 
~4% difference, but I guess more runs would be needed to see if this is 
noise or actual improvement.


Now, let's see at the per-second results, i.e. how much the performance 
fluctuates over time (due to checkpoints etc.). That's where the 
aggregated log (per-second) gets useful, as it's used for generating the 
various charts for tps, max latency, stddev of latency etc.


All those charts are CDFs, i.e. cumulative distribution functions: they 
plot a metric on the x-axis, and the probability P(X <= x) on the y-axis.


In general the steeper the curve the better (more consistent behavior 
over time). It also allows comparing two curves - e.g. for the tps metric 
the "lower" curve is better, as it means higher values are more likely.


default (non-throttled) pgbench runs


Let's see the regular (non-throttled) pgbench runs first:

* regular-tps.png (per-second TPS)

Clearly, the patched version is much more consistent - firstly it's much 
less "wobbly" and it's considerably steeper, which means the per-second 
throughput fluctuates much less. That's good.


We already know the total throughput is almost exactly the same (just a 4% 
difference); this also shows that the medians are almost exactly the 
same (the curves intersect at pretty much exactly 50%).


* regular-max-lat.png (per-second maximum latency)
* regular-stddev-lat.png (per-second latency stddev)

Apparently the additional processing slightly increases both the maximum 
latency and standard deviation, as the green line (patched) is 
consistently below the pink one (unpatched).


Notice however that x-axis is using log scale, so the differences are 
actually very small, and we also know that the total throughput slightly 
increased. So while those two metrics slightly increased, the overall 
impact on latency has to be positive.


throttled pgbench runs
--

* throttled-tps.png (per-second TPS)

OK, this is great - the chart shows that the performance is way more 
consistent. Originally there was ~10% of samples with ~2000 tps, but 
with the flushing you'd have to go to ~4600 tps. It's actually pretty 
difficult to determine this from the chart, because the curve got so 
steep and I had to 

Re: [HACKERS] checkpointer continuous flushing

2016-03-19 Thread Tomas Vondra

Hi,

On 03/17/2016 06:36 PM, Fabien COELHO wrote:


Hello Tomas,

Thanks for these great measures.


* 4 x CPU E5-4620 (2.2GHz)


4*8 = 32 cores / 64 threads.


Yep. I only used 32 clients though, to keep some of the CPU available 
for the rest of the system (also, HT does not really double the number 
of cores).





* 256GB of RAM


Wow!


* 24x SSD on LSI 2208 controller (with 1GB BBWC)


Wow! RAID configuration? The patch is designed to fix very big issues
on HDDs, but it is good to see that the impact is good on SSDs as well.


Yep, RAID-10. I agree that doing the test on a HDD-based system would be 
useful, however (a) I don't have a comparable system at hand at the 
moment, and (b) I was a bit worried that it'll hurt performance on SSDs, 
but thankfully that's not the case.


I will do the test on a much smaller system with HDDs in a few days.



Is it possible to run tests with distinct tablespaces on that many disks?


Nope, that'd require reconfiguring the system (and then back), and I 
don't have access to that system (just SSH). Also, I don't quite see 
what that would tell us.



* shared_buffers=64GB


1/4 of the available memory.


The pgbench was scale 6, so ~750GB of data on disk,


3x the available memory, so mostly on disk.


or like this ("throttled"):

pgbench -c 32 -j 8 -T 86400 -R 5000 -l --aggregate-interval=1 pgbench

The reason for the throttling is that people generally don't run
production databases 100% saturated, so it'd be sad to improve the
100% saturated case and hurt the common case by increasing latency.


Sure.


The machine does ~8000 tps, so 5000 tps is ~60% of that.


Ok.

I would have suggested using the --latency-limit option to filter out
very slow queries, otherwise if the system is stuck it may catch up
later, but then this is not representative of "sustainable" performance.

When pgbench is running under a target rate, in both runs the
transaction distribution is expected to be the same, around 5000 tps,
and the green run looks pretty ok with respect to that. The magenta one
shows that about 25% of the time, things are not good at all, and the
higher figures just show the catching up, which is not really
interesting if you asked for a web page and it is finally delivered 1
minute later.


Maybe. But that'd only increase the stress on the system, possibly 
causing more issues, no? And the magenta line is the old code, thus it 
would only increase the improvement of the new code.


Notice the max latency is in microseconds (as logged by pgbench), so 
according to the "max latency" charts the latencies are below 10 seconds 
(old) and 1 second (new) about 99% of the time. So I don't think this 
would make any measurable difference in practice.



regards


--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] checkpointer continuous flushing - V18

2016-03-13 Thread Jim Nasby

On 3/13/16 6:30 PM, Peter Geoghegan wrote:

On Sat, Mar 12, 2016 at 5:21 PM, Jeff Janes  wrote:

Would the wiki be a good place for such tips?  Not as formal as the
documentation, and more centralized (and editable) than a collection
of blog posts.


That general direction makes sense, but I'm not sure if the Wiki is
something that this will work for. I fear that it could become
something like the TODO list page: a page that contains theoretically
accurate information, but isn't very helpful. The TODO list needs to
be heavily pruned, but that seems like something that will never
happen.

A centralized location for performance tips will probably only work
well if there are still high standards that are actively enforced.
There still needs to be tight editorial control.


I think there are ways to significantly restrict who can edit a page, so 
this could probably still be done via the wiki. IMO we should also be 
encouraging users to test various tips and provide feedback, so maybe a 
wiki page with a big fat request at the top asking users to submit any 
feedback about the page to -performance.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] checkpointer continuous flushing - V18

2016-03-13 Thread Peter Geoghegan
On Sat, Mar 12, 2016 at 5:21 PM, Jeff Janes  wrote:
> Would the wiki be a good place for such tips?  Not as formal as the
> documentation, and more centralized (and editable) than a collection
> of blog posts.

That general direction makes sense, but I'm not sure if the Wiki is
something that this will work for. I fear that it could become
something like the TODO list page: a page that contains theoretically
accurate information, but isn't very helpful. The TODO list needs to
be heavily pruned, but that seems like something that will never
happen.

A centralized location for performance tips will probably only work
well if there are still high standards that are actively enforced.
There still needs to be tight editorial control.

-- 
Peter Geoghegan




Re: [HACKERS] checkpointer continuous flushing - V18

2016-03-12 Thread Jeff Janes
On Thu, Mar 10, 2016 at 11:25 PM, Peter Geoghegan  wrote:
> On Thu, Mar 10, 2016 at 11:18 PM, Fabien COELHO  wrote:
>> I can only concur!
>>
>> The "Performance Tips" chapter (II.14) is more user/query oriented. The
>> "Server Administration" bool (III) does not discuss this much.
>
> That's definitely one area in which the docs are lacking -- I've heard
> several complaints about this myself. I think we've been hesitant to
> do more in part because the docs must always be categorically correct,
> and must not use weasel words. I think it's hard to talk about
> performance while maintaining the general tone of the documentation. I
> don't know what can be done about that.

Would the wiki be a good place for such tips?  Not as formal as the
documentation, and more centralized (and editable) than a collection
of blog posts.

Cheers,

Jeff




Re: [HACKERS] checkpointer continuous flushing - V18

2016-03-10 Thread Peter Geoghegan
On Thu, Mar 10, 2016 at 11:18 PM, Fabien COELHO  wrote:
> I can only concur!
>
> The "Performance Tips" chapter (II.14) is more user/query oriented. The
> "Server Administration" bool (III) does not discuss this much.

That's definitely one area in which the docs are lacking -- I've heard
several complaints about this myself. I think we've been hesitant to
do more in part because the docs must always be categorically correct,
and must not use weasel words. I think it's hard to talk about
performance while maintaining the general tone of the documentation. I
don't know what can be done about that.

-- 
Peter Geoghegan




Re: [HACKERS] checkpointer continuous flushing

2016-03-10 Thread Fabien COELHO



I just pushed the two major remaining patches in this thread.


Hurray! Nine months to get this baby out :-)

--
Fabien.




Re: [HACKERS] checkpointer continuous flushing - V18

2016-03-10 Thread Fabien COELHO



As you wish. I thought that understanding the underlying performance model
with sequential writes written in chunks is important for the admin, and as
this guc would have an impact on performance it should be hinted about,
including the limits of its effect where large bases will converge to random
io performance. But maybe that is not the right place.


I do agree that that's something interesting to document somewhere. But
I don't think any of the current places in the documentation are a good
fit, and it's a topic much more general than the feature we're debating
here.  I'm not volunteering, but a good discussion of storage and the
interactions with postgres surely would be a significant improvement to
the postgres docs.


I can only concur!

The "Performance Tips" chapter (II.14) is more user/query oriented. The 
"Server Administration" bool (III) does not discuss this much.


There is a wiki about performance tuning, but it is not integrated into 
the documentation. It could be a first documentation source.


Also the READMEs in some development directories are very interesting, 
although they contain too many details about the implementation.


There have been a lot of presentations over the years, and blog posts.

--
Fabien.




Re: [HACKERS] checkpointer continuous flushing

2016-03-10 Thread Andres Freund
Hi,

I just pushed the two major remaining patches in this thread. Let's see
what the buildfarm has to say; I'd not be surprised if there's some
lingering portability problem in the flushing code.

There's one remaining issue we definitely want to resolve before the
next release:  Right now we always use one writeback context across all
tablespaces in a checkpoint, but Fabien's testing shows that that's
likely to hurt in a number of cases. I've some data suggesting the
contrary in others.

Things that'd be good:
* Some benchmarking. Right now controlled flushing is enabled by default
  on Linux, but disabled by default on other operating systems. Somebody
  running benchmarks on e.g. FreeBSD or OS X might be good.
* If somebody has the energy to provide a Windows implementation for
  flush control, that might be worthwhile. There are several places that
  could benefit from that.
* The default values are basically based on benchmarking by me and Fabien.

Regards,

Andres




Re: [HACKERS] checkpointer continuous flushing - V18

2016-03-10 Thread Andres Freund
On 2016-03-11 00:23:56 +0100, Fabien COELHO wrote:
> As you wish. I thought that understanding the underlying performance model
> with sequential writes written in chunks is important for the admin, and as
> this guc would have an impact on performance it should be hinted about,
> including the limits of its effect where large bases will converge to random
> io performance. But maybe that is not the right place.

I do agree that that's something interesting to document somewhere. But
I don't think any of the current places in the documentation are a good
fit, and it's a topic much more general than the feature we're debating
here.  I'm not volunteering, but a good discussion of storage and the
interactions with postgres surely would be a significant improvement to
the postgres docs.


- Andres




Re: [HACKERS] checkpointer continuous flushing - V18

2016-03-10 Thread Fabien COELHO


[...]


If the default is in pages, maybe you could state it and afterwards
translate it in size.


Hm, I think that's more complicated for users than it's worth.


As you wish. I liked the number of pages you used initially because it 
really gives a hint of how many random I/Os are avoided when they are 
contiguous, and I do not have the same intuition with sizes. Also it 
is related to the I/O queue length managed by the OS.



The text could say something about sequential write performance because
pages are sorted..., but also that it is lost for large bases and/or short
checkpoints?


I think that's an implementation detail.


As you wish. I thought that understanding the underlying performance model 
with sequential writes written in chunks is important for the admin, and 
as this GUC would have an impact on performance it should be hinted at, 
including the limits of its effect where large bases will converge to 
random I/O performance. But maybe that is not the right place.
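As a rough illustration of that performance model (a toy estimate, not taken
from the patch or the thread): if a checkpoint writes D dirty pages spread
roughly uniformly over a database of B pages, sorting them still leaves an
average gap of about B/D pages between consecutive writes, so the larger the
base relative to the dirty set, the closer the sorted writes get to random I/O.

    # Toy model: average distance between consecutive sorted dirty pages.
    SHARED_BUFFERS_GB = 64
    DIRTY_FRACTION = 0.25        # hypothetical share of shared buffers dirty

    dirty_pages = SHARED_BUFFERS_GB * 1024**3 // 8192 * DIRTY_FRACTION

    for db_gb in (64, 256, 1024, 4096):
        total_pages = db_gb * 1024**3 // 8192
        gap = total_pages / dirty_pages
        print(f"{db_gb:5d} GB database: ~{gap:4.0f}-page average gap between writes")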


--
Fabien




Re: [HACKERS] checkpointer continuous flushing - V18

2016-03-10 Thread Fabien COELHO


Hello Andres,


I'm not sure I've seen these performance... If you have hard evidence,
please feel free to share it.


Man, are you intentionally trying to be hard to work with?


Sorry, I do not understand this remark.

You were referring to some latency measures in your answer, and I was just 
stating that I was interested in seeing these figures which were used to 
justify your choice to keep a shared writeback context.


I did not intend this wish to be an issue; I was just expressing an interest.


To quote the email you responded to:

My current plan is to commit this with the current behaviour (as in 
this week[end]), and then do some actual benchmarking on this specific 
part. It's imo a relatively minor detail.


Good.

From the evidence in the thread, I would have given preference to the 
per-tablespace context, but this is just a personal opinion and I agree 
that it can work the other way around.


I look forward to seeing these benchmarks later on, when you have them.

So all is well, and hopefully will be even better later on.

--
Fabien.




Re: [HACKERS] checkpointer continuous flushing - V18

2016-03-10 Thread Andres Freund
On 2016-03-10 23:43:46 +0100, Fabien COELHO wrote:
> 
> >   
> >Whenever more than bgwriter_flush_after bytes have
> >been written by the bgwriter, attempt to force the OS to issue these
> >writes to the underlying storage.  Doing so will limit the amount of
> >dirty data in the kernel's page cache, reducing the likelihood of
> >stalls when an fsync is issued at the end of a checkpoint, or when
> >the OS writes data back  in larger batches in the background.  Often
> >that will result in greatly reduced transaction latency, but there
> >also are some cases, especially with workloads that are bigger than
> >shared_buffers, but smaller than the OS's page
> >cache, where performance might degrade.  This setting may have no
> >effect on some platforms.  0 disables controlled
> >writeback. The default is 256Kb on Linux, 0
> >otherwise. This parameter can only be set in the
> >postgresql.conf file or on the server command line.
> >   
> >
> >(plus adjustments for the other gucs)

> What about the maximum value?

Added.

  
   bgwriter_flush_after (int)

 Whenever more than bgwriter_flush_after bytes have
 been written by the bgwriter, attempt to force the OS to issue these
 writes to the underlying storage.  Doing so will limit the amount of
 dirty data in the kernel's page cache, reducing the likelihood of
 stalls when an fsync is issued at the end of a checkpoint, or when
 the OS writes data back in larger batches in the background.  Often
 that will result in greatly reduced transaction latency, but there
 also are some cases, especially with workloads that are bigger than
 shared_buffers, but smaller than the OS's page
 cache, where performance might degrade.  This setting may have no
 effect on some platforms.  The valid range is between
 0, which disables controlled writeback, and
 2MB.  The default is 256Kb on Linux,
 0 elsewhere.  (Non-default values of
 BLCKSZ change the default and maximum.)
 This parameter can only be set in the postgresql.conf
 file or on the server command line.

   
  
 


> If the default is in pages, maybe you could state it and afterwards
> translate it in size.

Hm, I think that's more complicated for users than it's worth.


> The text could say something about sequential writes performance because
> pages are sorted.., but that it is lost for large bases and/or short
> checkpoints ?

I think that's an implementation detail.


- Andres




Re: [HACKERS] checkpointer continuous flushing - V18

2016-03-10 Thread Fabien COELHO



   
Whenever more than bgwriter_flush_after bytes have
been written by the bgwriter, attempt to force the OS to issue these
writes to the underlying storage.  Doing so will limit the amount of
dirty data in the kernel's page cache, reducing the likelihood of
stalls when an fsync is issued at the end of a checkpoint, or when
the OS writes data back  in larger batches in the background.  Often
that will result in greatly reduced transaction latency, but there
also are some cases, especially with workloads that are bigger than
shared_buffers, but smaller than the OS's page
cache, where performance might degrade.  This setting may have no
effect on some platforms.  0 disables controlled
writeback. The default is 256Kb on Linux, 0
otherwise. This parameter can only be set in the
postgresql.conf file or on the server command line.
   

(plus adjustments for the other gucs)


Some suggestions:

What about the maximum value?

If the default is in pages, maybe you could state it and afterwards 
translate it into a size.


"The default is 64 pages on Linux (usually 256Kb)..."

The text could say something about sequential write performance because 
pages are sorted..., but also that it is lost for large bases and/or short 
checkpoints?


--
Fabien.




Re: [HACKERS] checkpointer continuous flushing - V18

2016-03-10 Thread Andres Freund
On 2016-03-10 23:38:38 +0100, Fabien COELHO wrote:
> I'm not sure I've seen these performance... If you have hard evidence,
> please feel free to share it.

Man, are you intentionally trying to be hard to work with?  To quote the
email you responded to:

> My current plan is to commit this with the current behaviour (as in this
> week[end]), and then do some actual benchmarking on this specific
> part. It's imo a relatively minor detail.




Re: [HACKERS] checkpointer continuous flushing - V18

2016-03-10 Thread Fabien COELHO


[...]

I had originally kept it with one context per tablespace after 
refactoring this, but found that it gave worse results in rate limited 
loads even over only two tablespaces. That's on SSDs though.


Might just mean that a smaller context size is better on SSDs, and it could 
still be better per tablespace.



The number of pages still in writeback (i.e. for which sync_file_range
has been issued, but which haven't finished running yet) at the end of
the checkpoint matters for the latency hit incurred by the fsync()s from
smgrsync(); at least by my measurement.


I'm not sure I've seen these performance... If you have hard evidence, 
please feel free to share it.


--
Fabien.




Re: [HACKERS] checkpointer continuous flushing - V18

2016-03-10 Thread Andres Freund
On 2016-03-10 17:33:33 -0500, Robert Haas wrote:
> On Thu, Mar 10, 2016 at 5:24 PM, Andres Freund  wrote:
> > On 2016-02-21 09:49:53 +0530, Robert Haas wrote:
> >> I think there might be a semantic distinction between these two terms.
> >> Doesn't writeback mean writing pages to disk, and flushing mean making
> >> sure that they are durably on disk?  So for example when the Linux
> >> kernel thinks there is too much dirty data, it initiates writeback,
> >> not a flush; on the other hand, at transaction commit, we initiate a
> >> flush, not writeback.
> >
> > I don't think terminology is sufficiently clear to make such a
> > distinction. Take e.g. our FlushBuffer()...
> 
> Well then we should clarify it!

Trying that as we speak, err, write. How about:

 Whenever more than bgwriter_flush_after bytes have
 been written by the bgwriter, attempt to force the OS to issue these
 writes to the underlying storage.  Doing so will limit the amount of
 dirty data in the kernel's page cache, reducing the likelihood of
 stalls when an fsync is issued at the end of a checkpoint, or when
 the OS writes data back  in larger batches in the background.  Often
 that will result in greatly reduced transaction latency, but there
 also are some cases, especially with workloads that are bigger than
 , but smaller than the OS's page
 cache, where performance might degrade.  This setting may have no
 effect on some platforms.  0 disables controlled
 writeback. The default is 256Kb on Linux, 0
 otherwise. This parameter can only be set in the
 postgresql.conf file or on the server command line.


(plus adjustments for the other gucs)




Re: [HACKERS] checkpointer continuous flushing - V18

2016-03-10 Thread Robert Haas
On Thu, Mar 10, 2016 at 5:24 PM, Andres Freund  wrote:
> On 2016-02-21 09:49:53 +0530, Robert Haas wrote:
>> I think there might be a semantic distinction between these two terms.
>> Doesn't writeback mean writing pages to disk, and flushing mean making
>> sure that they are durably on disk?  So for example when the Linux
>> kernel thinks there is too much dirty data, it initiates writeback,
>> not a flush; on the other hand, at transaction commit, we initiate a
>> flush, not writeback.
>
> I don't think terminology is sufficiently clear to make such a
> distinction. Take e.g. our FlushBuffer()...

Well then we should clarify it!

:-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] checkpointer continuous flushing - V18

2016-03-10 Thread Andres Freund
On 2016-02-21 09:49:53 +0530, Robert Haas wrote:
> I think there might be a semantic distinction between these two terms.
> Doesn't writeback mean writing pages to disk, and flushing mean making
> sure that they are durably on disk?  So for example when the Linux
> kernel thinks there is too much dirty data, it initiates writeback,
> not a flush; on the other hand, at transaction commit, we initiate a
> flush, not writeback.

I don't think terminology is sufficiently clear to make such a
distinction. Take e.g. our FlushBuffer()...




Re: [HACKERS] checkpointer continuous flushing - V18

2016-03-10 Thread Andres Freund
On 2016-03-08 09:28:15 +0100, Fabien COELHO wrote:
> 
> >>>Now I cannot see how having one context per table space would have a
> >>>significant negative performance impact.
> >>
> >>The 'dirty data' etc. limits are global, not per block device. By having
> >>several contexts with unflushed dirty data the total amount of dirty
> >>data in the kernel increases.
> >
> >Possibly, but how much?  Do you have experimental data to back up that
> >this is really an issue?
> >
> >We are talking about 32 (context size) * #table spaces * 8KB buffers = 4MB
> >of dirty buffers to manage for 16 table spaces, I do not see that as a
> >major issue for the kernel.

We flush in those increments, that doesn't mean there's only that much
dirty data. I regularly see one order of magnitude more being dirty.


I had originally kept it with one context per tablespace after
refactoring this, but found that it gave worse results in rate limited
loads even over only two tablespaces. That's on SSDs though.


> To complete the argument, the 4MB is just a worst case scenario, in reality
> flushing the different context would be randomized over time, so the
> frequency of flushing a context would be exactly the same in both cases
> (shared or per table space context) if the checkpoints are the same size,
> just that with shared table space each flushing potentially targets all
> tablespace with a few pages, while with the other version each flushing
> targets one table space only.

The number of pages still in writeback (i.e. for which sync_file_range
has been issued, but which haven't finished running yet) at the end of
the checkpoint matters for the latency hit incurred by the fsync()s from
smgrsync(); at least by my measurement.
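
To make "issuing writeback" concrete, here is a minimal standalone sketch
(Linux only, plain C, not PostgreSQL code; file name and sizes are arbitrary):
SYNC_FILE_RANGE_WRITE starts writeback of each chunk as it is written, without
waiting for it, so the final fsync() finds less accumulated dirty data:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	const off_t chunk = 256 * 1024;		/* start writeback every 256kB */
	char page[8192] = {0};
	off_t written = 0;
	int fd = open("demo.dat", O_CREAT | O_WRONLY | O_TRUNC, 0644);

	if (fd < 0) { perror("open"); return 1; }

	for (int i = 0; i < 1024; i++)		/* write 8MB in 8kB "pages" */
	{
		if (write(fd, page, sizeof(page)) != (ssize_t) sizeof(page))
		{ perror("write"); return 1; }
		written += sizeof(page);
		if (written % chunk == 0)
			/* ask the kernel to start writing back the last chunk,
			   without waiting for it to complete */
			sync_file_range(fd, written - chunk, chunk,
							SYNC_FILE_RANGE_WRITE);
	}
	if (fsync(fd) != 0)			/* should have little work left to do */
	{ perror("fsync"); return 1; }
	close(fd);
	return 0;
}

Whether the final fsync() actually becomes cheap depends on kernel and storage
behaviour, which is what the benchmarks in this thread are measuring.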


My current plan is to commit this with the current behaviour (as in this
week[end]), and then do some actual benchmarking on this specific
part. It's imo a relatively minor detail.

Greetings,

Andres Freund




Re: [HACKERS] checkpointer continuous flushing - V18

2016-03-08 Thread Fabien COELHO



Now I cannot see how having one context per table space would have a
significant negative performance impact.


The 'dirty data' etc. limits are global, not per block device. By having
several contexts with unflushed dirty data the total amount of dirty
data in the kernel increases.


Possibly, but how much?  Do you have experimental data to back up that this 
is really an issue?


We are talking about 32 (context size) * #table spaces * 8KB buffers = 4MB of 
dirty buffers to manage for 16 table spaces, I do not see that as a major 
issue for the kernel.


More thoughts about your theoretical argument:

To complete the argument: the 4MB is just a worst case scenario. In reality
flushing the different contexts would be randomized over time, so the
frequency of flushing a context would be exactly the same in both cases
(shared or per table space context) if the checkpoints are the same size;
just that with a shared context each flush potentially targets all
tablespaces with a few pages each, while with the other version each flush
targets one table space only.


So my handwaving analysis is that the flow of dirty buffers is the same
with both approaches, but with the shared version the buffers in each flush
are spread more evenly across table spaces, hence reducing sequential write
effectiveness, while with the other version the dirty buffers are grouped
more clearly per table space, so it should get better sequential write
performance.



--
Fabien.




Re: [HACKERS] checkpointer continuous flushing - V18

2016-03-07 Thread Fabien COELHO


Hello Andres,


Now I cannot see how having one context per table space would have a
significant negative performance impact.


The 'dirty data' etc. limits are global, not per block device. By having
several contexts with unflushed dirty data the total amount of dirty
data in the kernel increases.


Possibly, but how much?  Do you have experimental data to back up that 
this is really an issue?


We are talking about 32 (context size) * #table spaces * 8KB buffers = 4MB 
of dirty buffers to manage for 16 table spaces, I do not see that as a 
major issue for the kernel.


Thus you're more likely to see stalls by the kernel moving pages into 
writeback.


I do not see the above data having a 30% negative impact on tps, given the
quite small amount of data under discussion, and switching to random IOs
costs so much that it must really be avoided.


Without further experimental data, I still think that one context per
table space is the reasonable choice.


--
Fabien.




Re: [HACKERS] checkpointer continuous flushing - V18

2016-03-07 Thread Andres Freund
On 2016-03-07 21:10:19 +0100, Fabien COELHO wrote:
> Now I cannot see how having one context per table space would have a
> significant negative performance impact.

The 'dirty data' etc. limits are global, not per block device. By having
several contexts with unflushed dirty data the total amount of dirty
data in the kernel increases. Thus you're more likely to see stalls by
the kernel moving pages into writeback.

Andres




Re: [HACKERS] checkpointer continuous flushing - V18

2016-03-07 Thread Fabien COELHO


Hello Andres,


(1) with 16 tablespaces (1 per table) on 1 disk : 680.0 tps
   per second avg, stddev [ min q1 median d3 max ] <=300tps
   679.6 ± 750.4 [0.0, 317.0, 371.0, 438.5, 2724.0] 19.5%

(2) with 1 tablespace on 1 disk : 956.0 tps
   per second avg, stddev [ min q1 median d3 max ] <=300tps
   956.2 ± 796.5 [3.0, 488.0, 583.0, 742.0, 2774.0] 2.1%


Well, that's not a particularly meaningful workload. You increased the
number of files being flushed to the same number of disks considerably.


It is just a simple workload designed to emphasize the effect of having
one context shared by all tablespaces instead of one per tablespace,
without rewriting the patch and without a large host with multiple disks.


For a meaningful comparison you'd have to compare using one writeback 
context for N tablespaces on N separate disks/raids, and using N 
writeback contexts for the same.


Sure, it would be better to do that, but that would require (1) rewriting
the patch, which is a small amount of work, and also (2) having access to a
machine with a number of disks/raids, which I do NOT have available.



What happens in the 16-tablespace workload is that much smaller flushes are
performed on the 16 files written in parallel, so the tps performance is
significantly degraded, despite the writes being sorted in each file. With
one tablespace, all flushed buffers are in the same file, so flushes are much
more effective.


When the context is shared and checkpointer buffer writes are balanced 
against table spaces, then when the limit is reached the flushing gets few 
buffers per tablespace, so this limits sequential writes to few buffers, 
hence the performance degradation.


So I can explain the performance degradation *because* the flush context 
is shared between the table spaces, which is a logical argument backed 
with experimental data, so it is better than handwaving. Given the 
available hardware, this is the best proof I can have that context should 
be per table space.


Now I cannot see how having one context per table space would have a 
significant negative performance impact.


So the logical conclusion for me is that without further experimental data 
it is better to have one context per table space.


If you have a hardware with plenty disks available for testing, that would 
provide better data, obviously.


--
Fabien.


Re: [HACKERS] checkpointer continuous flushing - V18

2016-03-07 Thread Andres Freund
On 2016-02-22 20:44:35 +0100, Fabien COELHO wrote:
> 
> >>Random updates on 16 tables which total to 1.1GB of data, so this is in
> >>buffer, no significant "read" traffic.
> >>
> >>(1) with 16 tablespaces (1 per table) on 1 disk : 680.0 tps
> >>per second avg, stddev [ min q1 median d3 max ] <=300tps
> >>679.6 ± 750.4 [0.0, 317.0, 371.0, 438.5, 2724.0] 19.5%
> >>
> >>(2) with 1 tablespace on 1 disk : 956.0 tps
> >>per second avg, stddev [ min q1 median d3 max ] <=300tps
> >>956.2 ± 796.5 [3.0, 488.0, 583.0, 742.0, 2774.0] 2.1%
> >
> >Interesting. That doesn't reflect my own tests, even on rotating media,
> >at all. I wonder if it's related to:
> >https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=23d0127096cb91cb6d354bdc71bd88a7bae3a1d5
> >
> >If you use your 12.04 kernel, that'd not be fixed. Which might be a
> >reason to do it as you suggest.
> >
> >Could you share the exact details of that workload?
> 
> See attached scripts (sh to create the 16 tables in the default or 16 table
> spaces, small sql bench script, stat computation script).
> 
> The per-second stats were computed with:
> 
>   grep progress: pgbench.out | cut -d' ' -f4 | avg.py --length=1000 
> --limit=300
> 
> Host is 8 cpu 16 GB, 2 HDD in RAID 1.

Well, that's not a particularly meaningful workload. You increased the
number of files being flushed to the same number of disks considerably. For a
meaningful comparison you'd have to compare using one writeback context
for N tablespaces on N separate disks/raids, and using N writeback
contexts for the same.

Andres




Re: [HACKERS] checkpointer continuous flushing - V16

2016-03-07 Thread Andres Freund
On 2016-03-07 09:41:51 -0800, Andres Freund wrote:
> > Due to the difference in amount of RAM, each machine used different scales -
> > the goal is to have small, ~50% RAM, >200% RAM sizes:
> > 
> > 1) Xeon: 100, 400, 6000
> > 2) i5: 50, 200, 3000
> > 
> > The commits actually tested are
> > 
> >cfafd8be  (right before the first patch)
> >7975c5e0  Allow the WAL writer to flush WAL at a reduced rate.
> >db76b1ef  Allow SetHintBits() to succeed if the buffer's LSN ...
> 
> Huh, now I'm a bit confused. These are the commits you tested? Those
> aren't the ones doing sorting and flushing?

To clarify: The reason we'd not expect to see much difference here is
that the above commits really only have any effect above noise if you
use synchronous_commit=off. Without async commit it's just one
additional gettimeofday() call and a few additional branches in the wal
writer every wal_writer_delay.

Andres




Re: [HACKERS] checkpointer continuous flushing - V16

2016-03-07 Thread Andres Freund
On 2016-03-01 16:06:47 +0100, Tomas Vondra wrote:
> 1) HP DL380 G5 (old rack server)
> - 2x Xeon E5450, 16GB RAM (8 cores)
> - 4x 10k SAS drives in RAID-10 on H400 controller (with BBWC)
> - RedHat 6
> - shared_buffers = 4GB
> - min_wal_size = 2GB
> - max_wal_size = 6GB
> 
> 2) workstation with i5 CPU
> - 1x i5-2500k, 8GB RAM
> - 6x Intel S3700 100GB (in RAID0 for this benchmark)
> - Gentoo
> - shared_buffers = 2GB
> - min_wal_size = 1GB
> - max_wal_size = 8GB


Thinking about it, with that hardware I'm not surprised that you're only seeing
small benefits. The amount of RAM limits the amount of dirty data; and
you have plenty of on-storage buffering in comparison to that.


> Both machines were using the same kernel version 4.4.2 and default io
> scheduler (cfq). The
> 
> The test procedure was quite simple - pgbench with three different scales,
> for each scale three runs, 1h per run (and 30 minutes of warmup before each
> run).
> 
> Due to the difference in amount of RAM, each machine used different scales -
> the goal is to have small, ~50% RAM, >200% RAM sizes:
> 
> 1) Xeon: 100, 400, 6000
> 2) i5: 50, 200, 3000
> 
> The commits actually tested are
> 
>cfafd8be  (right before the first patch)
>7975c5e0  Allow the WAL writer to flush WAL at a reduced rate.
>db76b1ef  Allow SetHintBits() to succeed if the buffer's LSN ...

Huh, now I'm a bit confused. These are the commits you tested? Those
aren't the ones doing sorting and flushing?


> Also, I really wonder what will happen with non-default io schedulers. I
> believe all the testing so far was done with cfq, so what happens on
> machines that use e.g. "deadline" (as many DB machines actually do)?

deadline and noop showed slightly bigger benefits in my testing.


Greetings,

Andres Freund




Re: [HACKERS] checkpointer continuous flushing - V16

2016-03-01 Thread Fabien COELHO


Hello Tomas,

One of the goals of this thread (as I understand it) was to make the overall 
behavior smoother - eliminate sudden drops in transaction rate due to bursts 
of random I/O etc.


One way to look at this is in terms of how much the tps fluctuates, so let's 
see some charts. I've collected per-second tps measurements (using the 
aggregation built into pgbench) but looking at that directly is pretty 
pointless because it's very difficult to compare two noisy lines jumping up 
and down.


So instead let's see CDF of the per-second tps measurements. I.e. we have 
3600 tps measurements, and given a tps value the question is what percentage 
of the measurements is below this value.


   y = Probability(tps <= x)

We prefer higher values, and the ideal behavior would be that we get exactly 
the same tps every second. Thus an ideal CDF line would be a step line. Of 
course, that's rarely the case in practice. But comparing two CDF curves is 
easy - the line more to the right is better, at least for tps measurements, 
where we prefer higher values.


Very nice and interesting graphs!

Alas, not easy to interpret for the HDD, as there are better/worse
variations all along the distribution and the lines cross one another, so how
it fares overall is unclear.


Maybe a simple indication would be to compute the standard deviation of
the per-second tps? The median may be interesting as well.


I do have some more data, but those are the most interesting charts. The rest 
usually shows about the same thing (or nothing).


Overall, I'm not quite sure the patches actually achieve the intended goals. 
On the 10k SAS drives I got better performance, but apparently much more 
variable behavior. On SSDs, I get a bit worse results.


Indeed.

--
Fabien.




Re: [HACKERS] checkpointer continuous flushing - V18

2016-02-22 Thread Fabien COELHO



Random updates on 16 tables which total to 1.1GB of data, so this is in
buffer, no significant "read" traffic.

(1) with 16 tablespaces (1 per table) on 1 disk : 680.0 tps
per second avg, stddev [ min q1 median d3 max ] <=300tps
679.6 ± 750.4 [0.0, 317.0, 371.0, 438.5, 2724.0] 19.5%

(2) with 1 tablespace on 1 disk : 956.0 tps
per second avg, stddev [ min q1 median d3 max ] <=300tps
956.2 ± 796.5 [3.0, 488.0, 583.0, 742.0, 2774.0] 2.1%


Interesting. That doesn't reflect my own tests, even on rotating media,
at all. I wonder if it's related to:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=23d0127096cb91cb6d354bdc71bd88a7bae3a1d5

If you use your 12.04 kernel, that'd not be fixed. Which might be a
reason to do it as you suggest.

Could you share the exact details of that workload?


See attached scripts (sh to create the 16 tables in the default or 16 
table spaces, small sql bench script, stat computation script).


The per-second stats were computed with:

  grep progress: pgbench.out | cut -d' ' -f4 | avg.py --length=1000 --limit=300

Host is 8 cpu 16 GB, 2 HDD in RAID 1.

--
Fabien.

ts_create.sh
Description: Bourne shell script


ts_test.sql
Description: application/sql
#! /usr/bin/env python
# -*- coding: utf-8 -*-
#
# $Id: avg.py 1242 2016-02-06 14:44:02Z coelho $
#

import argparse
ap = argparse.ArgumentParser(description='show stats about data: count average stddev [min q1 median q3 max]...')
ap.add_argument('--median', default=True, action='store_true',
help='compute median and quartile values')
ap.add_argument('--no-median', dest='median', default=True,
action='store_false',
help='do not compute median and quartile values')
ap.add_argument('--more', default=False, action='store_true',
help='show some more stats')
ap.add_argument('--limit', type=float, default=None,
help='set limit for counting below limit values')
ap.add_argument('--length', type=int, default=None,
help='set expected length, assume 0 if beyond')
ap.add_argument('--precision', type=int, default=1,
help='floating point precision')
ap.add_argument('file', nargs='*', help='list of files to process')
opt = ap.parse_args()

# option consistency
if opt.limit != None:
	opt.more = True
if opt.more:
	opt.median = True

# reset arguments for fileinput
import sys
sys.argv[1:] = opt.file

import fileinput

n, skipped, vals = 0, 0, []
k, vmin, vmax = None, None, None
sum1, sum2 = 0.0, 0.0

for line in fileinput.input():
	try:
		v = float(line)
		if opt.median: # keep track only if needed
			vals.append(v)
		if k is None: # first time
			k, vmin, vmax = v, v, v
		else: # next time
			vmin = min(vmin, v)
			vmax = max(vmax, v)
		n += 1
		vmk = v - k
		sum1 += vmk
		sum2 += vmk * vmk
	except ValueError: # float conversion failed
		skipped += 1

if n == 0:
	# avoid ops on None below
	k, vmin, vmax = 0.0, 0.0, 0.0

if opt.length:
	assert n > 0, "some data seen"
	missing = int(opt.length) - n
	assert missing >= 0, "positive number of missing data"
	if missing > 0:
		print("warning: %d missing data, expanding with zeros" % missing)
		if opt.median:
			vals += [ 0.0 ] * missing
		vmin = min(vmin, 0.0)
		sum1 += - k * missing
		sum2 += k * k * missing
		n += missing
		assert len(vals) == int(opt.length)

if opt.median:
	assert len(vals) == n, "consistent length"

# five numbers...
# numpy.percentile requires numpy at least 1.9 to use 'midpoint'
# statistics.median requires python 3.4 (?)
def median(vals, start, length):
	if len(vals) == 1:
		start, length = 0, 1
	m, odd = divmod(length, 2)
	#return 0.5 * (vals[start + m + odd - 1] + vals[start + m])
	return  vals[start + m] if odd else \
		0.5 * (vals[start + m-1] + vals[start + m])

# return ratio of below limit (limit included) values
def below(vals, limit):
	# hmmm... short but generates a list
	#return float(len([ v for v in vals if v <= limit ])) / len(vals)
	below_limit = 0
	for v in vals:
		if v <= limit:
			below_limit += 1
	return float(below_limit) / len(vals)

# float prettyprint with precision
def f(v):
	return ('%.' + str(opt.precision) + 'f') % v

# output
if skipped:
	print("warning: %d lines skipped" % skipped)

if n > 0:
	# show result (hmmm, precision is truncated...)
	from math import sqrt
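	# note: sum1/sum2 were accumulated relative to the first value k ("shifted
	# data"), so the mean is k + sum1/n and the variance (sum2 - sum1^2/n)/n,
	# which avoids most of the cancellation of the naive sum-of-squares formula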
	avg, stddev = k + sum1 / n, sqrt((sum2 - (sum1 * sum1) / n) / n)
	if opt.median:
		vals.sort()
		med = median(vals, 0, len(vals))
		# not sure about odd/even issues here... q3 needs fixing if len is 1
		q1 = median(vals, 0, len(vals) // 2)
		q3 = median(vals, (len(vals)+1) // 2, len(vals) // 2)
		# build summary message
		msg = "avg over %d: %s ± %s [%s, %s, %s, %s, %s]" % \
			  (n, f(avg), f(stddev), f(vmin), f(q1), f(med), f(q3), f(vmax))
		if opt.more:
			limit = opt.limit if opt.limit != None else 0.1 * med
			# msg += " <=%s:" % f(limit)
			msg += " %s%%" % f(100.0 * below(vals, limit))
	else:
		msg = "avg over %d: %s ± %s [%s, %s]" % \
			  (n, f(avg), f(stddev), f(vmin), f(vmax))
else:
	msg = "no data"

print(msg)

Re: [HACKERS] checkpointer continuous flushing - V18

2016-02-22 Thread Andres Freund
On 2016-02-22 11:05:20 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > Interesting. That doesn't reflect my own tests, even on rotating media,
> > at all. I wonder if it's related to:
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=23d0127096cb91cb6d354bdc71bd88a7bae3a1d5
> 
> > If you use your 12.04 kernel, that'd not be fixed. Which might be a
> > reason to do it as you suggest.
> 
> Hmm ... that kernel commit is less than 4 months old.  Would it be
> reflected in *any* production kernels yet?

Probably not - so far I thought it mainly has some performance benefits
on relatively extreme workloads; where without the patch, flushing still
is better performance-wise than not flushing. But in the scenario Fabien
has brought up it seems quite possible that sync_file_range emitting
"storage cache flush" instructions could explain the rather large
performance difference between his and my experiments.

Regards,

Andres




Re: [HACKERS] checkpointer continuous flushing - V18

2016-02-22 Thread Tom Lane
Andres Freund  writes:
> Interesting. That doesn't reflect my own tests, even on rotating media,
> at all. I wonder if it's related to:
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=23d0127096cb91cb6d354bdc71bd88a7bae3a1d5

> If you use your 12.04 kernel, that'd not be fixed. Which might be a
> reason to do it as you suggest.

Hmm ... that kernel commit is less than 4 months old.  Would it be
reflected in *any* production kernels yet?

regards, tom lane




Re: [HACKERS] checkpointer continuous flushing - V18

2016-02-22 Thread Andres Freund
On 2016-02-22 14:11:05 +0100, Fabien COELHO wrote:
> 
> >I did a quick & small test with random updates on 16 tables with
> >checkpoint_flush_after=16 checkpoint_timeout=30
> 
> Another run with more "normal" settings and over 1000 seconds, so less
> "quick & small" that the previous one.
> 
>  checkpoint_flush_after = 16
>  checkpoint_timeout = 5min # default
>  shared_buffers = 2GB # 1/8 of available memory
> 
> Random updates on 16 tables which total to 1.1GB of data, so this is in
> buffer, no significant "read" traffic.
> 
> (1) with 16 tablespaces (1 per table) on 1 disk : 680.0 tps
> per second avg, stddev [ min q1 median d3 max ] <=300tps
> 679.6 ± 750.4 [0.0, 317.0, 371.0, 438.5, 2724.0] 19.5%
> 
> (2) with 1 tablespace on 1 disk : 956.0 tps
> per second avg, stddev [ min q1 median d3 max ] <=300tps
> 956.2 ± 796.5 [3.0, 488.0, 583.0, 742.0, 2774.0] 2.1%

Interesting. That doesn't reflect my own tests, even on rotating media,
at all. I wonder if it's related to:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=23d0127096cb91cb6d354bdc71bd88a7bae3a1d5

If you use your 12.04 kernel, that'd not be fixed. Which might be a
reason to do it as you suggest.

Could you share the exact details of that workload?

Greetings,

Andres Freund




Re: [HACKERS] checkpointer continuous flushing - V18

2016-02-22 Thread Fabien COELHO


I did a quick & small test with random updates on 16 tables with 
checkpoint_flush_after=16 checkpoint_timeout=30


Another run with more "normal" settings and over 1000 seconds, so less
"quick & small" than the previous one.


 checkpoint_flush_after = 16
 checkpoint_timeout = 5min # default
 shared_buffers = 2GB # 1/8 of available memory

Random updates on 16 tables which total to 1.1GB of data, so this is in 
buffer, no significant "read" traffic.


(1) with 16 tablespaces (1 per table) on 1 disk : 680.0 tps
per second avg, stddev [ min q1 median d3 max ] <=300tps
679.6 ± 750.4 [0.0, 317.0, 371.0, 438.5, 2724.0] 19.5%

(2) with 1 tablespace on 1 disk : 956.0 tps
per second avg, stddev [ min q1 median d3 max ] <=300tps
956.2 ± 796.5 [3.0, 488.0, 583.0, 742.0, 2774.0] 2.1%

--
Fabien.


Re: [HACKERS] checkpointer continuous flushing - V18

2016-02-22 Thread Fabien COELHO


Hallo Andres,


AFAICR I used a "flush context" for each table space in some version
I submitted, because I do think that this whole writeback logic really
does make sense *per table space*, which suggests that there should be as
many writeback contexts as table spaces, otherwise the positive effect
may be totally lost if tablespaces are used. Any thoughts?


Leads to less regular IO, because if your tablespaces are evenly sized
(somewhat common) you'll sometimes end up issuing sync_file_range's
shortly after each other.  For latency outside checkpoints it's
important to control the total amount of dirty buffers, and that's
obviously independent of tablespaces.


I did a quick & small test with random updates on 16 tables with 
checkpoint_flush_after=16 checkpoint_timeout=30


(1) with 16 tablespaces (1 per table, but same disk) :
tps = 1100, 27% time under 100 tps

(2) with 1 tablespace :
tps = 1200,  3% time under 100 tps

This result is logical: with one writeback context shared between
tablespaces, sync_file_range is issued on a few buffers per file at a
time over the 16 files, no coalescing occurs there, so this results in random
IOs, while with one tablespace all writes are aggregated per file.
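
(Roughly: with checkpoint_flush_after = 16 and writes balanced over 16
tablespaces, each flush ends up covering on the order of one page per file,
i.e. essentially random IO.)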


ISTM that this quick test shows that a writeback context per tablespace is
relevant, as I expected.


--
Fabien.




Re: [HACKERS] checkpointer continuous flushing - V18

2016-02-21 Thread Fabien COELHO



ISTM that "progress" and "progress_slice" only depend on num_scanned and
per-tablespace num_to_scan and total num_to_scan, so they are somehow
redundant and the progress could be recomputed from the initial figures
when needed.


They don't cause much space usage, and we access the values frequently. 
So why not store them?


The same question would work the other way around: these values are one 
division away, why not compute them when needed? No big deal.


[...] Given realistic amounts of memory the max potential "skew" seems 
fairly small with float8. If we ever flush one buffer "too much" for a 
tablespace it's pretty much harmless.


I do agree. I'm suggesting that a comment should be added to justify why 
float8 accuracy is okay.



I see a binary_heap_allocate but no corresponding deallocation, this
looks like a memory leak... or is there some magic involved?


Hm. I think we really should use a memory context for all of this - we
could after all error out somewhere in the middle...


I'm not sure that a memory context is justified here, there are only two 
mallocs and the checkpointer works for very long times. I think that it is 
simpler to just get the malloc/free right.


[...] I'm not arguing for ripping it out, what I mean is that we don't 
set a nondefault value for the GUCs on platforms with just 
posix_fadvise available...


Ok with that.

--
Fabien.




Re: [HACKERS] checkpointer continuous flushing - V18

2016-02-21 Thread Fabien COELHO



[...] I do think that this whole writeback logic really does make
sense *per table space*,


Leads to less regular IO, because if your tablespaces are evenly sized
(somewhat common) you'll sometimes end up issuing sync_file_range's
shortly after each other.  For latency outside checkpoints it's
important to control the total amount of dirty buffers, and that's
obviously independent of tablespaces.


I do not understand/buy this argument.

The underlying IO queue is per device, and table spaces should be per device
as well (otherwise what's the point?), so you should want to coalesce and
"writeback" pages per device as well. Calling sync_file_range on distinct
devices should probably be issued more or less randomly, and should not
interfere one with the other.


The kernel's dirty buffer accounting is global, not per block device.


Sure, but this is not my point. My point is that "sync_file_range" moves 
buffers to the device io queues, which are per device. If there is one 
queue in pg and many queues on many devices, the whole point of coalescing 
to get sequential writes is somehow lost.



It's also actually rather common to have multiple tablespaces on a
single block device. Especially if SANs and such are involved; where you
don't even know which partitions are on which disks.


Ok, some people would not benefit if they use many tablespaces on one
device, too bad, but that does not look like a very useful setting anyway,
and I do not think it would harm much in this case.



If you use just one context, the more table spaces the less performance
gains, because there is less and less aggregation thus sequential writes per
device.

So for me there should really be one context per tablespace. That would
suggest a hashtable or some other structure to keep and retrieve them, which
would not be that bad, and I think that it is what is needed.


That'd be much easier to do by just keeping the context in the
per-tablespace struct. But anyway, I'm really doubtful about going for
that; I had it that way earlier, and observing IO showed it not being
beneficial.


ISTM that you would need a significant number of tablespaces to see the 
benefit. If you do not do that, the more table spaces the more random the 
IOs, which is disappointing. Also, "the cost is marginal", so I do not see 
any good argument not to do it.


--
Fabien.




Re: [HACKERS] checkpointer continuous flushing - V18

2016-02-21 Thread Andres Freund
Hi,

On 2016-02-21 10:52:45 +0100, Fabien COELHO wrote:
> * CpktSortItem:
> 
> I think that allocating 20 bytes per buffer in shared memory is a little on
> the heavy side. Some compression can be achieved: sizeof(ForlNum) is 4 bytes
> to hold 4 values, could be one byte or even 2 bits somewhere. Also, there
> are very few tablespaces, they could be given a small number and this number
> could be used instead of the Oid, so the space requirement could be reduced
> to say 16 bytes per buffer by combining space & fork in 2 shorts and keeping
> 4 bytes alignement and also getting 8 byte alignement... If this is too
> much, I have shown that it can work with only 4 bytes per buffer, as the
> sorting is really just a performance optimisation and is not broken if some
> stuff changes between sorting & writeback, but you did not like the idea. If
> the amount of shared memory required is a significant concern, it could be
> resurrected, though.

This is less than 0.2 % of memory related to shared buffers. We have the
same amount of memory allocated in CheckpointerShmemSize(), and nobody
has complained so far.  And sorry, going back to the previous approach
isn't going to fly, and I've no desire to discuss that *again*.


> ISTM that "progress" and "progress_slice" only depend on num_scanned and
> per-tablespace num_to_scan and total num_to_scan, so they are somehow
> redundant and the progress could be recomputed from the initial figures
> when needed.

They don't cause much space usage, and we access the values
frequently. So why not store them?


> If these fields are kept, I think that a comment should justify why float8
> precision is okay for the purpose. I think it is quite certainly fine in the
> worst case with 32 bits buffer_ids, but it would not be if this size is
> changed someday.

That seems pretty much unrelated to having the fields - the question of
accuracy plays a role regardless, no? Given realistic amounts of memory
the max potential "skew" seems fairly small with float8. If we ever
flush one buffer "too much" for a tablespace it's pretty much harmless.

> ISTM that nearly all of the collected data on the second sweep could be
> collected on the first sweep, so that this second sweep could be avoided
> altogether. The only missing data is the index of the first buffer in the
> array, which can be computed by considering tablespaces only, sweeping over
> buffers is not needed. That would suggest creating the heap or using a hash
> in the initial buffer sweep to keep this information. This would also
> provide a point where to number tablespaces for compressing the CkptSortItem
> struct.

Doesn't seem worth the complexity to me.


> I'm wondering about calling CheckpointWriteDelay on each round, maybe
> a minimum amount of write would make sense.

Why? There's not really much benefit of doing more work than needed. I
think we should sleep far shorter in many cases, but that's indeed a
separate issue.

> I see a binary_heap_allocate but no corresponding deallocation, this
> looks like a memory leak... or is there some magic involved?

Hm. I think we really should use a memory context for all of this - we
could after all error out somewhere in the middle...


> >I think this patch primarily needs:
> >* Benchmarking on FreeBSD/OSX to see whether we should enable the
> > mmap()/msync(MS_ASYNC) method by default. Unless somebody does so, I'm
> > inclined to leave it off till then.
> 
> I do not have that. As "msync" seems available on Linux, it is possible to
> force using it with a "ifdef 0" to skip sync_file_range and check whether it
> does some good there.

Unfortunately it doesn't work well on linux:
 * On many OSs msync() on a mmap'ed file triggers writeback. On linux
 * it only does so when MS_SYNC is specified, but then it does the
 * writeback synchronously. Luckily all common linux systems have
 * sync_file_range().  This is preferrable over FADV_DONTNEED because
 * it doesn't flush out clean data.

I've verified beforehand, with a simple demo program, that
msync(MS_ASYNC) does something reasonable on freebsd...


> Idem for the "posix_fadvise" stuff. I can try to do
> that, but it takes time to do so, if someone can test on other OS it would
> be much better. I think that if it works it should be kept in, so it is just
> a matter of testing it.

I'm not arguing for ripping it out, what I mean is that we don't set a
nondefault value for the GUCs on platforms with just posix_fadvise
available...

Greetings,

Andres Freund




Re: [HACKERS] checkpointer continuous flushing - V18

2016-02-21 Thread Andres Freund
On 2016-02-21 08:26:28 +0100, Fabien COELHO wrote:
> >>In the discussion in the wal section, I'm not sure about the effect of
> >>setting writebacks on SSD, [...]
> >
> >Yea, that paragraph needs some editing. I think we should basically
> >remove that last sentence.
> 
> Ok, fine with me. Does that mean that flushing as a significant positive
> impact on SSD in your tests?

Yes. The reason we need flushing is that the kernel amasses dirty pages,
and then flushes them at once. That hurts for both SSDs and rotational
media. Sorting is the bigger question, but I've seen it have clearly
beneficial performance impacts. I guess if you look at devices with an
internal block size bigger than 8k, you'd even see larger differences.

> >>Maybe the merging strategy could be more aggressive than just strict
> >>neighbors?
> >
> >I don't think so. If you flush more than neighbouring writes you'll
> >often end up flushing buffers dirtied by another backend, causing
> >additional stalls.
> 
> Ok. Maybe the neightbor definition could be relaxed just a little bit so
> that small holes are overtake, but not large holes? If there is only a few
> pages in between, even if written by another process, then writing them
> together should be better? Well, this can wait for a clear case, because
> hopefully the OS will recoalesce them behind anyway.

I'm against doing so without clear measurements of a benefit.

> >Also because the infrastructure is used for more than checkpoint
> >writes. There's absolutely no ordering guarantees there.
> 
> Yep, but not much benefit to expect from a few dozens random pages either.

Actually, there's kinda frequently a benefit observable. Even if few
requests can be merged, doing IO requests in an order more likely doable
within a few rotations is beneficial. Also, the cost is marginal, so why
worry?

> >>[...] I do think that this whole writeback logic really does make
> >>sense *per table space*,
> >
> >Leads to less regular IO, because if your tablespaces are evenly sized
> >(somewhat common) you'll sometimes end up issuing sync_file_range's
> >shortly after each other.  For latency outside checkpoints it's
> >important to control the total amount of dirty buffers, and that's
> >obviously independent of tablespaces.
> 
> I do not understand/buy this argument.
> 
> The underlying IO queue is per device, and table spaces should be per device
> as well (otherwise what the point?), so you should want to coalesce and
> "writeback" pages per device as wel. Calling sync_file_range on distinct
> devices should probably be issued more or less randomly, and should not
> interfere one with the other.

The kernel's dirty buffer accounting is global, not per block device.
It's also actually rather common to have multiple tablespaces on a
single block device. Especially if SANs and such are involved; where you
don't even know which partitions are on which disks.


> If you use just one context, the more table spaces the less performance
> gains, because there is less and less aggregation thus sequential writes per
> device.
> 
> So for me there should really be one context per tablespace. That would
> suggest a hashtable or some other structure to keep and retrieve them, which
> would not be that bad, and I think that it is what is needed.

That'd be much easier to do by just keeping the context in the
per-tablespace struct. But anyway, I'm really doubtful about going for
that; I had it that way earlier, and observing IO showed it not being
beneficial.


> >>For the checkpointer, a key aspect is that the scheduling process goes
> >>to sleep from time to time, and this sleep time looked like a great
> >>opportunity to do this kind of flushing. You choose not to take advantage
> >>of the behavior, why?
> >
> >Several reasons: Most importantly there's absolutely no guarantee that
> >you'll ever end up sleeping, it's quite common to happen only seldomly.
> 
> Well, that would be under a situation when pg is completely unresponsive.
> More so, this behavior *makes* pg unresponsive.

No. The checkpointer being bottlenecked on actual IO performance doesn't
impact production that badly. It'll just sometimes block in
sync_file_range(), but the IO queues will have enough space to
frequently give way to other backends, particularly to synchronous reads
(most pg reads) and synchronous writes (fdatasync()).  So a single
checkpoint will take a bit longer, but otherwise the system will mostly
keep up the work in a regular manner.  Without the sync_file_range()
calls the kernel will amass dirty buffers until global dirty limits are
reached, which then will bring the whole system to a standstill.

It's pretty common that checkpoint_timeout is too short to be able to
write all shared_buffers out, in that case it's much better to slow down
the whole checkpoint, instead of being incredibly slow at the end.

> >I also don't really believe it helps that much, although that's a complex
> >argument to make.
> 
> Yep. My 

Re: [HACKERS] checkpointer continuous flushing - V18

2016-02-21 Thread Fabien COELHO


Hallo Andres,

Here is a review for the second patch.


For 0002 I've recently changed:
* Removed the sort timing information, we've proven sufficiently that
 it doesn't take a lot of time.


I put it there initially to demonstrate that there was no cache performance
issue when sorting on just buffer indexes. As it is always small, I agree
that it is not needed. Well, it could still be in seconds on a very
large shared_buffers setting with a very large checkpoint, but then the
checkpoint would be tremendously huge...



* Minor comment polishing.


Patch applies and checks on Linux.

* CkptSortItem:

I think that allocating 20 bytes per buffer in shared memory is a little
on the heavy side. Some compression can be achieved: sizeof(ForkNumber) is 4
bytes to hold 4 values, could be one byte or even 2 bits somewhere. Also,
there are very few tablespaces, they could be given a small number and
this number could be used instead of the Oid, so the space requirement
could be reduced to say 16 bytes per buffer by combining space & fork in 2
shorts and keeping 4 byte alignment and also getting 8 byte
alignment... If this is too much, I have shown that it can work with only
4 bytes per buffer, as the sorting is really just a performance
optimisation and is not broken if some stuff changes between sorting &
writeback, but you did not like the idea. If the amount of shared memory
required is a significant concern, it could be resurrected, though.
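
To make the 16-byte idea concrete, one possible layout could look roughly like
this (purely illustrative: made-up field names, with plain stdint types
standing in for Oid, ForkNumber and BlockNumber; this is not the patch's
struct):

#include <stdint.h>

typedef struct PackedCkptSortItem
{
	uint32_t	relNode;	/* relation file node (Oid) */
	uint16_t	tsNum;		/* small per-checkpoint tablespace number */
	uint16_t	forkNum;	/* ForkNumber has only a handful of values */
	uint32_t	blockNum;	/* block within the fork (BlockNumber) */
	int32_t		buf_id;		/* buffer index in shared buffers */
} PackedCkptSortItem;		/* 16 bytes, still naturally 4-byte aligned */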


* CkptTsStatus:

As I suggested in the other mail, I think that this structure should also keep
a per tablespace WritebackContext so that coalescing is done per tablespace.

ISTM that "progress" and "progress_slice" only depend on num_scanned and
per-tablespace num_to_scan and total num_to_scan, so they are somehow
redundant and the progress could be recomputed from the initial figures
when needed.

If these fields are kept, I think that a comment should justify why float8 
precision is okay for the purpose. I think it is quite certainly fine in 
the worst case with 32 bits buffer_ids, but it would not be if this size 
is changed someday.


* BufferSync

After a first sweep to collect buffers to write, they are sorted, and then
those buffers are swept again to compute some per-tablespace data
and organise a heap.


ISTM that nearly all of the collected data on the second sweep could be 
collected on the first sweep, so that this second sweep could be avoided 
altogether. The only missing data is the index of the first buffer in the 
array, which can be computed by considering tablespaces only, sweeping 
over buffers is not needed. That would suggest creating the heap or using 
a hash in the initial buffer sweep to keep this information. This would 
also provide a point where to number tablespaces for compressing the 
CkptSortItem struct.


I'm wondering about calling CheckpointWriteDelay on each round, maybe
a minimum amount of write would make sense. This remark is independent of 
this patch. Probably it works fine because after a sleep the checkpointer 
is behind enough so that it will write a bunch of buffers before sleeping

again.

I see a binary_heap_allocate but no corresponding deallocation, this
looks like a memory leak... or is there some magic involved?

There are some debug stuff to remove in #ifdefs.

I think that the buffer/README should be updated with explanations about
sorting in the checkpointer.


I think this patch primarily needs:
* Benchmarking on FreeBSD/OSX to see whether we should enable the
 mmap()/msync(MS_ASYNC) method by default. Unless somebody does so, I'm
 inclined to leave it off till then.


I do not have that. As "msync" seems available on Linux, it is possible to 
force using it with a "ifdef 0" to skip sync_file_range and check whether 
it does some good there. Idem for the "posix_fadvise" stuff. I can try to 
do that, but it takes time to do so, if someone can test on other OS it 
would be much better. I think that if it works it should be kept in, so it 
is just a matter of testing it.


--
Fabien.




Re: [HACKERS] checkpointer continuous flushing - V18

2016-02-21 Thread Fabien COELHO


Hallo Andres,

[...] I do think that this whole writeback logic really does make sense 
*per table space*,


Leads to less regular IO, because if your tablespaces are evenly sized
(somewhat common) you'll sometimes end up issuing sync_file_range's
shortly after each other.  For latency outside checkpoints it's
important to control the total amount of dirty buffers, and that's
obviously independent of tablespaces.


I do not understand/buy this argument.

The underlying IO queue is per device, and table spaces should be per device
as well (otherwise what's the point?), so you should want to coalesce and
"writeback" pages per device as well. Calling sync_file_range on distinct
devices should probably be issued more or less randomly, and should not
interfere one with the other.


If you use just one context, the more table spaces the less performance 
gains, because there is less and less aggregation thus sequential writes per 
device.


So for me there should really be one context per tablespace. That would 
suggest a hashtable or some other structure to keep and retrieve them, which 
would not be that bad, and I think that it is what is needed.


Note: I think that an easy way to do that in the "checkpoint sort" patch 
is simply to keep a WritebackContext in CkptTsStatus structure which is 
per table space in the checkpointer.


For bgwriter & backends it can wait, there is little "writeback" coalescing
because IO should be pretty random, so it does not matter much.


--
Fabien.




Re: [HACKERS] checkpointer continuous flushing - V18

2016-02-20 Thread Fabien COELHO


Hallo Andres,


In some previous version I think a warning was shown if the feature was
requested but not available.


I think we should either silently ignore it, or error out. Warnings
somewhere in the background aren't particularly meaningful.


I like "ignoring with a warning" in the log file, because when things do 
not behave as expected that is where I'll be looking. I do not think that 
it should error out.


The sgml documentation about "*_flush_after" configuration parameter 
talks about bytes, but the actual unit should be buffers.


The unit actually is buffers, but you can configure it using
bytes. We've done the same for other GUCs (shared_buffers, wal_buffers,
...). Referring to bytes is easier because you don't have to explain that
it depends on compilation settings how much data it actually is and
such.


So I understand that it works with kb as well. Now I do not think that it
would need a lot of explanation if you say that it is a number of pages,
and I think that a number of pages is significant because it is a number
of IO requests to be coalesced, eventually.



In the discussion in the wal section, I'm not sure about the effect of
setting writebacks on SSD, [...]


Yea, that paragraph needs some editing. I think we should basically
remove that last sentence.


Ok, fine with me. Does that mean that flushing has a significant positive
impact on SSD in your tests?


However it does not address the point that bgwriter and backends 
basically issue random writes, [...]


The benefit is primarily that you don't collect large amounts of dirty
buffers in the kernel page cache. In most cases the kernel will not be
able to coalesce these writes either...  I've measured *massive*
performance latency differences for workloads that are bigger than
shared buffers - because suddenly bgwriter / backends do the majority of
the writes. Flushing in the checkpoint quite possibly makes nearly no
difference in such cases.


So I understand that there is a positive impact under some load. Good!


Maybe the merging strategy could be more aggressive than just strict
neighbors?


I don't think so. If you flush more than neighbouring writes you'll
often end up flushing buffers dirtied by another backend, causing
additional stalls.


Ok. Maybe the neighbor definition could be relaxed just a little bit so
that small holes are jumped over, but not large holes? If there are only a few
pages in between, even if written by another process, then writing them
together should be better? Well, this can wait for a clear case, because
hopefully the OS will recoalesce them behind anyway.
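
To illustrate the idea, here is a self-contained sketch of such a merge step
(not the patch's IssuePendingWritebacks; names and types are made up), where
max_gap = 0 gives the strict-neighbor behaviour and a small nonzero value
lets small holes be bridged:

#include <stddef.h>
#include <stdint.h>

typedef struct PendingWrite
{
	int			fd;			/* which file */
	uint32_t	offset;		/* first block of the range */
	uint32_t	nblocks;	/* length of the range in blocks */
} PendingWrite;

/* 'req' must be sorted by (fd, offset); returns the new number of entries.
 * max_gap = 0 merges only strictly adjacent ranges. */
static size_t
coalesce_writes(PendingWrite *req, size_t n, uint32_t max_gap)
{
	size_t		out = 0;

	for (size_t i = 0; i < n; i++)
	{
		PendingWrite *prev = (out > 0) ? &req[out - 1] : NULL;

		if (prev && prev->fd == req[i].fd &&
			req[i].offset <= prev->offset + prev->nblocks + max_gap)
		{
			/* extend the previous range to cover this one (and the hole) */
			uint32_t	end = req[i].offset + req[i].nblocks;

			if (end > prev->offset + prev->nblocks)
				prev->nblocks = end - prev->offset;
		}
		else
			req[out++] = req[i];
	}
	return out;
}

Merging across a hole means the eventual sync_file_range() also covers pages
possibly dirtied by someone else, which is exactly the stall risk mentioned
above, so a nonzero gap would need actual measurements to justify it.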



struct WritebackContext: keeping a pointer to guc variables is a kind of
trick, I think it deserves a comment.


It has, it's just in WritebackContextInit(). Can duplicate it.


I missed it, I expected something in the struct definition. Do not 
duplicate, but cross reference it?



IssuePendingWritebacks: I understand that qsort is needed "again"
because when balancing writes over tablespaces they may be intermixed.


Also because the infrastructure is used for more than checkpoint
writes. There's absolutely no ordering guarantees there.


Yep, but not much benefit to expect from a few dozen random pages either.

[...] I do think that this whole writeback logic really does make sense 
*per table space*,


Leads to less regular IO, because if your tablespaces are evenly sized
(somewhat common) you'll sometimes end up issuing sync_file_range's
shortly after each other.  For latency outside checkpoints it's
important to control the total amount of dirty buffers, and that's
obviously independent of tablespaces.


I do not understand/buy this argument.

The underlying IO queue is per device, and table spaces should be per
device as well (otherwise what's the point?), so you should want to coalesce
and "writeback" pages per device as well. Calling sync_file_range on
distinct devices should probably be issued more or less randomly, and
should not interfere one with the other.


If you use just one context, the more table spaces the less performance 
gains, because there is less and less aggregation thus sequential writes 
per device.


So for me there should really be one context per tablespace. That would 
suggest a hashtable or some other structure to keep and retrieve them, 
which would not be that bad, and I think that it is what is needed.



For the checkpointer, a key aspect is that the scheduling process goes
to sleep from time to time, and this sleep time looked like a great
opportunity to do this kind of flushing. You choose not to take advantage
of the behavior, why?


Several reasons: Most importantly there's absolutely no guarantee that 
you'll ever end up sleeping, it's quite common to happen only seldomly.


Well, that would be under a situation when pg is completely unresponsive. 
More so, this behavior *makes* pg unresponsive.



If you're bottlenecked on IO, you can end up being behind all the time.


Hopefully 

Re: [HACKERS] checkpointer continuous flushing - V18

2016-02-20 Thread Robert Haas
On Sun, Feb 21, 2016 at 3:37 AM, Andres Freund  wrote:
>> The documentation seems to use "flush" but the code talks about "writeback"
>> or "flush", depending. I think one vocabulary, whichever it is, should be
>> chosen and everything should stick to it, otherwise everything look kind of
>> fuzzy and raises doubt for the reader (is it the same thing? is it something
>> else?). I initially used "flush", but it seems a bad idea because it has
>> nothing to do with the flush function, so I'm fine with writeback or anything
>> else, I just think that *one* word should be chosen and used everywhere.
>
> Hm.

I think there might be a semantic distinction between these two terms.
Doesn't writeback mean writing pages to disk, and flushing mean making
sure that they are durably on disk?  So for example when the Linux
kernel thinks there is too much dirty data, it initiates writeback,
not a flush; on the other hand, at transaction commit, we initiate a
flush, not writeback.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] checkpointer continuous flushing - V18

2016-02-20 Thread Andres Freund
Hi,

On 2016-02-20 20:56:31 +0100, Fabien COELHO wrote:
> >* Currently *_flush_after can be set to a nonzero value, even if there's
> > no support for flushing on that platform. Imo that's ok, but perhaps
> > other people's opinion differ.
> 
> In some previous version I think a warning was shown if the feature was
> requested but not available.

I think we should either silently ignore it, or error out. Warnings
somewhere in the background aren't particularly meaningful.

> Here are some quick comments on the patch:
> 
> Patch applies cleanly on head. Compiled and checked on Linux. Compilation
> issues on other systems, see below.

For those I've already pushed a small fixup commit to git... Stupid
mistake.


> The documentation seems to use "flush" but the code talks about "writeback"
> or "flush", depending. I think one vocabulary, whichever it is, should be
> chosen and everything should stick to it, otherwise everything look kind of
> fuzzy and raises doubt for the reader (is it the same thing? is it something
> else?). I initially used "flush", but it seems a bad idea because it has
> nothing to do with the flush function, so I'm fine with writeback or anything
> else, I just think that *one* word should be chosen and used everywhere.

Hm.


> The sgml documentation about "*_flush_after" configuration parameter talks
> about bytes, but the actual unit should be buffers.

The unit actually is buffers, but you can configure it using
bytes. We've done the same for other GUCs (shared_buffers, wal_buffers,
...). Refering to bytes is easier because you don't have to explain that
it depends on compilation settings how many data it actually is and
such.
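(As a concrete example: with the default 8kB block size, a setting of
256kB corresponds to 32 buffers.)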

> Also, the maximum value (128 ?) should appear in the text. \

Right.


> In the discussion in the wal section, I'm not sure about the effect of
> setting writebacks on SSD, but I think that you have made some tests
> so maybe you have an answer and the corresponding section could be
> written with some more definitive text than "probably brings no
> benefit on SSD".

Yea, that paragraph needs some editing. I think we should basically
remove that last sentence.


> A good point of the whole approach is that it is available to all kind
> of pg processes.

Exactly.


> However it does not address the point that bgwriter and
> backends basically issue random writes, so I would not expect much positive
> effect before these writes are somehow sorted, which means doing some
> compromise in the LRU/LFU logic...

The benefit is primarily that you don't collect large amounts of dirty
buffers in the kernel page cache. In most cases the kernel will not be
able to coalesce these writes either...  I've measured *massive*
performance latency differences for workloads that are bigger than
shared buffers - because suddenly bgwriter / backends do the majority of
the writes. Flushing in the checkpoint quite possibly makes nearly no
difference in such cases.


> well, all this is best kept for later, and I'm fine to have the logic
> flushing logic there. I'm wondering why you choose 16 & 64 as default
> for backends & bgwriter, though.

I chose a small value for backends because there often are a large
number of backends, and thus the amount of dirty data of each adds up. I
used a larger value for bgwriter because I saw that ending up using
bigger IOs.


> IssuePendingWritebacks: you merge only strictly neightboring writes.
> Maybe the merging strategy could be more aggressive than just strict
> neighbors?

I don't think so. If you flush more than neighbouring writes you'll
often end up flushing buffers dirtied by another backend, causing
additional stalls. And if the writes aren't actually neighbouring
there's not much gained from issuing them in one sync_file_range call.
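
To make that concrete, here is a rough sketch of the "strict neighbours
only" coalescing (illustration only, not the actual IssuePendingWritebacks
code; it assumes the pending block numbers for one file are already sorted):

#include <stdio.h>

#define BLCKSZ 8192

/* blocks[] holds sorted block numbers of pending writes for one file */
static void
issue_writeback_ranges(const unsigned *blocks, int nblocks)
{
    int         i = 0;

    while (i < nblocks)
    {
        unsigned    start = blocks[i];
        unsigned    end = start;

        /* extend the range only while blocks are exactly consecutive */
        while (i + 1 < nblocks && blocks[i + 1] == end + 1)
            end = blocks[++i];
        i++;

        /* here one sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE)
         * call would cover the whole run [start, end] */
        printf("writeback offset=%llu nbytes=%llu\n",
               (unsigned long long) start * BLCKSZ,
               (unsigned long long) (end - start + 1) * BLCKSZ);
    }
}

int
main(void)
{
    unsigned    blocks[] = {10, 11, 12, 40, 41, 99};

    issue_writeback_ranges(blocks, 6);  /* three runs, so three calls */
    return 0;
}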


> mdwriteback: all variables could be declared within the while, I do not
> understand why some are in and some are out.

Right.


> ISTM that putting writeback management at the relation level does not
> help a lot, because you have to translate again from relation to
> files.

Sure, but what's the problem with that? That's how normal read/write IO
works as well?


> struct WritebackContext: keeping a pointer to guc variables is a kind of
> trick, I think it deserves a comment.

It has, it's just in WritebackContextInit(). Can duplicate it.


> ScheduleBufferTagForWriteback: the "pending" variable is not very
> useful.

Shortens line length a good bit, at no cost.



> IssuePendingWritebacks: I understand that qsort is needed "again"
> because when balancing writes over tablespaces they may be intermixed.

Also because the infrastructure is used for more than checkpoint
writes. There's absolutely no ordering guarantees there.


> AFAICR I used a "flush context" for each table space in some version
> I submitted, because I do think that this whole writeback logic really
> does make sense *per table space*, which suggest that there should be as
> many write backs contexts as table spaces, otherwise the 

Re: [HACKERS] checkpointer continuous flushing - V18

2016-02-20 Thread Fabien COELHO


Hello Andres,


For 0001 I've recently changed:
* Don't schedule writeback after smgrextend() - that defeats linux
 delayed allocation mechanism, increasing fragmentation noticeably.
* Add docs for the new GUC variables
* comment polishing
* BackendWritebackContext now isn't dynamically allocated anymore


I think this patch primarily needs:
* review of the docs, not sure if they're easy enough to
 understand. Some language polishing might also be needed.


Yep, see below.


* review of the writeback API, combined with the smgr/md.c changes.


See various comments below.


* Currently *_flush_after can be set to a nonzero value, even if there's
 no support for flushing on that platform. Imo that's ok, but perhaps
 other people's opinion differ.


In some previous version I think a warning was shown if the feature was
requested but not available.


Here are some quick comments on the patch:

Patch applies cleanly on head. Compiled and checked on Linux. Compilation 
issues on other systems, see below.


When pages are written by a process (checkpointer, bgwriter, backend worker),
the list of recently written pages is kept and every so often an advisory
fsync (sync_file_range, other options for other systems) is issued so that
the data is sent to the io system without relying on more or less
(un)controllable os policy.

The documentation seems to use "flush" but the code talks about "writeback"
or "flush", depending. I think one vocabulary, whichever it is, should be
chosen and everything should stick to it, otherwise everything looks kind of
fuzzy and raises doubts for the reader (is it the same thing? is it something
else?). I initially used "flush", but it seems a bad idea because it has
nothing to do with the flush function, so I'm fine with writeback or anything
else, I just think that *one* word should be chosen and used everywhere.

The sgml documentation about "*_flush_after" configuration parameter talks
about bytes, but the actual unit should be buffers. I think that keeping
a number of buffers should be fine, because that is what the internal stuff
will manage, not bytes. Also, the maximum value (128 ?) should appear in
the text. In the discussion in the wal section, I'm not sure about the effect
of setting writebacks on SSD, but I think that you have made some tests so
maybe you have an answer and the corresponding section could be written with
some more definitive text than "probably brings no benefit on SSD".

A good point of the whole approach is that it is available to all kinds
of pg processes. However it does not address the point that bgwriter and
backends basically issue random writes, so I would not expect much positive
effect before these writes are somehow sorted, which means doing some
compromise in the LRU/LFU logic... well, all this is best kept for later,
and I'm fine with having the flushing logic there. I'm wondering why you
chose 16 & 64 as defaults for backends & bgwriter, though.

IssuePendingWritebacks: you merge only strictly neighboring writes.
Maybe the merging strategy could be more aggressive than just strict
neighbors?

mdwriteback: all variables could be declared within the while, I do not
understand why some are in and some are out. ISTM that putting writeback
management at the relation level does not help a lot, because you have to
translate again from relation to files. The good news is that it should work
as well, and that it does avoid the issue that the file may have been closed
in between, so why not.

The PendingWriteback struct looks useless. I think it should be removed,
and maybe put back one day if it is needed, which I rather doubt.

struct WritebackContext: keeping a pointer to guc variables is a kind of
trick, I think it deserves a comment.

ScheduleBufferTagForWriteback: the "pending" variable is not very useful.
Maybe consider shortening the "pending_writebacks" field name to "writebacks"?

IssuePendingWritebacks: I understand that qsort is needed "again"
because when balancing writes over tablespaces they may be intermixed.
AFAICR I used a "flush context" for each table space in some version
I submitted, because I do think that this whole writeback logic really
does make sense *per table space*, which suggests that there should be as
many writeback contexts as table spaces, otherwise the positive effect
may be totally lost if table spaces are used. Any thoughts?

Assert(*context->max_pending <= WRITEBACK_MAX_PENDING_FLUSHES); is always
true, I think, it is already checked in the initialization and when setting
gucs.

SyncOneBuffer: I wonder why you copy the tag after releasing the lock.
I guess it is okay because it is still pinned.

pg_flush_data: in the first #elif, "context" is undeclared line 446.
Label "out" is not defined line 455. In the second #elif, "context" is
undeclared line 490 and label "out" line 500 is not defined either.

For the checkpointer, a key aspect is that the scheduling process goes
to sleep from time to 

Re: [HACKERS] checkpointer continuous flushing - V16

2016-02-19 Thread Michael Paquier
On Sat, Feb 20, 2016 at 5:08 AM, Fabien COELHO  wrote:
>> Kernel 3.2 is extremely bad for Postgresql, as the vm seems to amplify IO
>> somehow. The difference to 3.13 (the latest LTS kernel for 12.04) is huge.
>>
>>
>> https://medium.com/postgresql-talk/benchmarking-postgresql-with-different-linux-kernel-versions-on-ubuntu-lts-e61d57b70dd4#.6dx44vipu
>
>
> Interesting! To summarize it, 25% performance degradation from best kernel
> (2.6.32) to worst (3.2.0), that is indeed significant.

As far as I recall, the OS cache eviction is very aggressive in 3.2,
so it would be possible that data from the FS cache that was just read
could be evicted even if it was not used yet. This represents a large
difference when the database does not fit in RAM.
-- 
Michael




Re: [HACKERS] checkpointer continuous flushing - V18

2016-02-19 Thread Andres Freund
On 2016-02-19 22:46:44 +0100, Fabien COELHO wrote:
> 
> Hello Andres,
> 
> >Here's the next two (the most important) patches of the series:
> >0001: Allow to trigger kernel writeback after a configurable number of 
> >writes.
> >0002: Checkpoint sorting and balancing.
> 
> I will look into these two in depth.
> 
> Note that I would have ordered them in reverse because sorting is nearly
> always very beneficial, and "writeback" (formely called flushing) is then
> nearly always very beneficial on sorted buffers.

I had it that way earlier. I actually saw pretty large regressions from
sorting alone in some cases as well, apparently because the kernel
submits much larger IOs to disk; although that probably only shows on
SSDs.  This way the modifications imo look a trifle better ;). I'm
intending to commit both at the same time, keep them separate only
because they're easier to understand separately.

Andres




Re: [HACKERS] checkpointer continuous flushing - V18

2016-02-19 Thread Fabien COELHO


Hello Andres,


Here's the next two (the most important) patches of the series:
0001: Allow to trigger kernel writeback after a configurable number of writes.
0002: Checkpoint sorting and balancing.


I will look into these two in depth.

Note that I would have ordered them in reverse because sorting is nearly 
always very beneficial, and "writeback" (formerly called flushing) is then
nearly always very beneficial on sorted buffers.


--
Fabien.




Re: [HACKERS] checkpointer continuous flushing - V18

2016-02-19 Thread Andres Freund
On 2016-02-04 16:54:58 +0100, Andres Freund wrote:
> Hi,
> 
> Fabien asked me to post a new version of the checkpoint flushing patch
> series. While this isn't entirely ready for commit, I think we're
> getting closer.
> 
> I don't want to post a full series right now, but my working state is
> available on
> http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/checkpoint-flush
> git://git.postgresql.org/git/users/andresfreund/postgres.git checkpoint-flush

I've updated the git tree.

Here's the next two (the most important) patches of the series:
0001: Allow to trigger kernel writeback after a configurable number of writes.
0002: Checkpoint sorting and balancing.

For 0001 I've recently changed:
* Don't schedule writeback after smgrextend() - that defeats linux
  delayed allocation mechanism, increasing fragmentation noticeably.
* Add docs for the new GUC variables
* comment polishing
* BackendWritebackContext now isn't dynamically allocated anymore


I think this patch primarily needs:
* review of the docs, not sure if they're easy enough to
  understand. Some language polishing might also be needed.
* review of the writeback API, combined with the smgr/md.c changes.
* Currently *_flush_after can be set to a nonzero value, even if there's
  no support for flushing on that platform. Imo that's ok, but perhaps
  other people's opinion differ.


For 0002 I've recently changed:
* Removed the sort timing information, we've proven sufficiently that
  it doesn't take a lot of time.
* Minor comment polishing.

I think this patch primarily needs:
* Benchmarking on FreeBSD/OSX to see whether we should enable the
  mmap()/msync(MS_ASYNC) method by default. Unless somebody does so, I'm
  inclined to leave it off till then.


Regards,

Andres
>From 58aee659417372f3dda4420d8f2a4f4d41c56d31 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Fri, 19 Feb 2016 12:13:05 -0800
Subject: [PATCH 1/4] Allow to trigger kernel writeback after a configurable
 number of writes.

Currently writes to the main data files of postgres all go through the
OS page cache. This means that currently several operating systems can
end up collecting a large number of dirty buffers in their respective
page caches.  When these dirty buffers are flushed to storage rapidly,
be it because of fsync(), timeouts, or dirty ratios, latency for other
writes can increase massively.  This is the primary reason for regular
massive stalls observed in real world scenarios and artificial
benchmarks; on rotating disks stalls on the order of hundreds of seconds
have been observed.

On linux it is possible to control this by reducing the global dirty
limits significantly, reducing the above problem. But global
configuration is rather problematic because it'll affect other
applications; also PostgreSQL itself doesn't always generally want this
behavior, e.g. for temporary files it's undesirable.

Several operating systems allow some control over the kernel page
cache. Linux has sync_file_range(2), several posix systems have msync(2)
and posix_fadvise(2). sync_file_range(2) is preferable because it
requires no special setup, whereas msync() requires the to-be-flushed
range to be mmap'ed. For the purpose of flushing dirty data
posix_fadvise(2) is the worst alternative, as flushing dirty data is
just a side-effect of POSIX_FADV_DONTNEED, which also removes the pages
from the page cache.  Thus the feature is enabled by default only on
linux, but can be enabled on all systems that have any of the above
APIs.

With the infrastructure added, writes made via checkpointer, bgwriter
and normal user backends can be flushed after a configurable number of
writes. Each of these sources of writes is controlled by a separate GUC,
checkpointer_flush_after, bgwriter_flush_after and backend_flush_after
respectively; they're separate because the number of writes after which
flushing is beneficial differs for each, and because the performance
considerations of controlled flushing for each of these are different.

A later patch will add checkpoint sorting - after that flushes from the
checkpoint will almost always be desirable. Bgwriter flushes are most of
the time going to be random, which are slow on lots of storage hardware.
Flushing in backends works well if the storage and bgwriter can keep up,
but if not it can have negative consequences.  This patch is likely to
have negative performance consequences without checkpoint sorting, but
unfortunately so has sorting without flush control.

TODO:
* verify msync codepath
* properly detect mmap() && msync(MS_ASYNC) support, use it by default
  if available and sync_file_range is *not* available

Discussion: alpine.DEB.2.10.150601132.28433@sto
Author: Fabien Coelho and Andres Freund
---
 doc/src/sgml/config.sgml  |  81 +++
 doc/src/sgml/wal.sgml |  13 +++
 src/backend/postmaster/bgwriter.c |   8 +-
 src/backend/storage/buffer/buf_init.c |   5 +
 

Re: [HACKERS] checkpointer continuous flushing - V16

2016-02-19 Thread Fabien COELHO


Hallo Patric,

Kernel 3.2 is extremely bad for Postgresql, as the vm seems to amplify 
IO somehow. The difference to 3.13 (the latest LTS kernel for 12.04) is 
huge.


https://medium.com/postgresql-talk/benchmarking-postgresql-with-different-linux-kernel-versions-on-ubuntu-lts-e61d57b70dd4#.6dx44vipu


Interesting! To summarize it, 25% performance degradation from best kernel 
(2.6.32) to worst (3.2.0), that is indeed significant.


You might consider upgrading your kernel to 3.13 LTS. It's quite easy 
[...]


There is other stuff running on the hardware that I do not wish to touch,
so upgrading the particular host is currently not an option, otherwise I 
would have switched to trusty.


Thanks for the pointer.

--
Fabien.




Re: [HACKERS] checkpointer continuous flushing - V16

2016-02-19 Thread Patric Bechtel

Hi Fabien,

Fabien COELHO schrieb am 19.02.2016 um 16:04:
> 
>>> [...] Ubuntu 12.04 LTS (precise)
>> 
>> That's with 12.04's standard kernel?
> 
> Yes.

Kernel 3.2 is extremely bad for Postgresql, as the vm seems to amplify IO 
somehow. The difference
to 3.13 (the latest LTS kernel for 12.04) is huge.

https://medium.com/postgresql-talk/benchmarking-postgresql-with-different-linux-kernel-versions-on-ubuntu-lts-e61d57b70dd4#.6dx44vipu

You might consider upgrading your kernel to 3.13 LTS. It's quite easy normally:

https://wiki.ubuntu.com/Kernel/LTSEnablementStack

/Patric




Re: [HACKERS] checkpointer continuous flushing - V16

2016-02-19 Thread Fabien COELHO


Hello.


Based on these results I think 32 will be a good default for
checkpoint_flush_after? There's a few cases where 64 showed to be
beneficial, and some where 32 is better. I've seen 64 perform a bit
better in some cases here, but the differences were not too big.


Yes, these many runs show that 32 is basically as good or better than 64.

I'll do some runs with 16/48 to have some more data.

I gather that you didn't play with 
backend_flush_after/bgwriter_flush_after, i.e. you left them at their 
default values? Especially backend_flush_after can have a significant 
positive and negative performance impact.


Indeed, non-reported configuration options have their default values.
There were also minor changes in the default options for logging (prefix, 
checkpoint, ...), but nothing significant, and always the same for all 
runs.



 [...] Ubuntu 12.04 LTS (precise)


That's with 12.04's standard kernel?


Yes.


   checkpoint_flush_after = { none, 0, 32, 64 }


Did you re-initdb between the runs?


Yes, all runs are from scratch (initdb, pgbench -i, some warmup...).


I've seen massively varying performance differences due to autovacuum
triggered analyzes. It's not completely deterministic when those run,
and on bigger scale clusters analyze can take ages, while holding a
snapshot.


Yes, I agree that the performance difference between long and short runs
(andres00c vs andres00b) is probably due to autovacuum.


--
Fabien.




Re: [HACKERS] checkpointer continuous flushing - V16

2016-02-19 Thread Andres Freund
Hi,

On 2016-02-19 10:16:41 +0100, Fabien COELHO wrote:
> Below the results of a lot of tests with pgbench to exercise checkpoints on
> the above version when fetched.

Wow, that's a great test series.


> Overall comments:
>  - sorting & flushing is basically always a winner
>  - benchmarking with short runs on large databases is a bad idea
>the results are very different if a longer run is used
>(see andres00b vs andres00c)

Based on these results I think 32 will be a good default for
checkpoint_flush_after? There's a few cases where 64 showed to be
beneficial, and some where 32 is better. I've seen 64 perform a bit
better in some cases here, but the differences were not too big.

I gather that you didn't play with
backend_flush_after/bgwriter_flush_after, i.e. you left them at their
default values? Especially backend_flush_after can have a significant
positive and negative performance impact.


>  16 GB 2 cpu 8 cores
>  200 GB RAID1 HDD, ext4 FS
>  Ubuntu 12.04 LTS (precise)

That's with 12.04's standard kernel?



>  postgresql.conf:
>shared_buffers = 1GB
>max_wal_size = 1GB
>checkpoint_timeout = 300s
>checkpoint_completion_target = 0.8
>checkpoint_flush_after = { none, 0, 32, 64 }

Did you re-initdb between the runs?


I've seen massively varying performance differences due to autovacuum
triggered analyzes. It's not completely deterministic when those run,
and on bigger scale clusters analyze can take ages, while holding a
snapshot.


> Hmmm, interesting: maintenance_work_mem seems to have some influence on
> performance, although it is not too consistent between settings, probably
> because as the memory is used to its limit the performance is quite
> sensitive to the available memory.

That's probably because of differing behaviour of autovacuum/vacuum,
which sometime will have to do several scans of the tables if there are
too many dead tuples.


Regards,

Andres




Re: [HACKERS] checkpointer continuous flushing - V16

2016-02-19 Thread Fabien COELHO


Hello Andres,


I don't want to post a full series right now, but my working state is
available on
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/checkpoint-flush
git://git.postgresql.org/git/users/andresfreund/postgres.git checkpoint-flush


Below the results of a lot of tests with pgbench to exercise checkpoints 
on the above version when fetched.


Overall comments:
 - sorting & flushing is basically always a winner
 - benchmarking with short runs on large databases is a bad idea
   the results are very different if a longer run is used
   (see andres00b vs andres00c)

# HOST/SOFT

 16 GB 2 cpu 8 cores
 200 GB RAID1 HDD, ext4 FS
 Ubuntu 12.04 LTS (precise)

# ABOUT THE REPORTED STATISTICS

 tps: is the "excluding connection" time tps, the higher the better
 1-sec tps: average of measured per-second tps
   note - it should be the same as the previous one, but due to various
  hazards in the trace, especially when things go badly and pg gets
  stuck, it may be different. Such hazards also explain why there
  may be some non-integer tps reported for some seconds.
 stddev: standard deviation, the lower the better
 the five figures in brackets give a feel for the distribution:
 - min: minimal per-second tps seen in the trace
 - q1: first quarter per-second tps seen in the trace
 - med: median per-second tps seen in the trace
 - q3: third quarter per-second tps seen in the trace
 - max: maximal per-second tps seen in the trace
 the last percentage dubbed "<=10.0" is the percent of seconds where performance
   is below 10 tps: this measures how unresponsive pg was during the run
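
For reference, a rough sketch of how such per-second figures can be derived
from the -P 1 trace (one tps value per line on stdin; this is an
illustration, not the actual script used for this report):

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

static int
cmp_double(const void *a, const void *b)
{
    double      da = *(const double *) a;
    double      db = *(const double *) b;

    return (da > db) - (da < db);
}

int
main(void)
{
    double     *v = NULL;
    double      x, sum = 0.0, sum2 = 0.0;
    int         n = 0, cap = 0, slow = 0;

    while (scanf("%lf", &x) == 1)
    {
        if (n == cap)
        {
            cap = cap ? cap * 2 : 1024;
            v = realloc(v, cap * sizeof(double));
            if (v == NULL)
                return 1;
        }
        v[n++] = x;
        sum += x;
        sum2 += x * x;
        if (x <= 10.0)
            slow++;                 /* seconds where pg was unresponsive */
    }
    if (n == 0)
        return 1;

    qsort(v, n, sizeof(double), cmp_double);
    printf("avg %.1f ± %.1f [%.1f, %.1f, %.1f, %.1f, %.1f] <=10.0: %.1f%%\n",
           sum / n, sqrt(sum2 / n - (sum / n) * (sum / n)),
           v[0], v[n / 4], v[n / 2], v[(3 * n) / 4], v[n - 1],
           100.0 * slow / n);
    free(v);
    return 0;
}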

## TINY2

 pgbench -M prepared -N -P 1 -T 4000 -j 2 -c 4
   with scale = 10 (~ 200 MB)

 postgresql.conf:
   shared_buffers = 1GB
   max_wal_size = 1GB
   checkpoint_timeout = 300s
   checkpoint_completion_target = 0.8
   checkpoint_flush_after = { none, 0, 32, 64 }

 opts # |   tps / 1-sec tps ± stddev [ min q1 med q3 max ] <=10.0

 head 0 | 2574.1 / 2574.3 ± 367.4 [229.0, 2570.1, 2721.9, 2746.1, 2857.2] 0.0%
  1 | 2575.0 / 2575.1 ± 359.3 [  1.0, 2595.9, 2712.0, 2732.0, 2847.0] 0.1%
  2 | 2602.6 / 2602.7 ± 359.5 [ 54.0, 2607.1, 2735.1, 2768.1, 2908.0] 0.0%

0 0 | 2583.2 / 2583.7 ± 296.4 [164.0, 2580.0, 2690.0, 2717.1, 2833.8] 0.0%
  1 | 2596.6 / 2596.9 ± 307.4 [296.0, 2590.5, 2707.9, 2738.0, 2847.8] 0.0%
  2 | 2604.8 / 2605.0 ± 300.5 [110.9, 2619.1, 2712.4, 2738.1, 2849.1] 0.0%

   32 0 | 2625.5 / 2625.5 ± 250.5 [  1.0, 2645.9, 2692.0, 2719.9, 2839.0] 0.1%
  1 | 2630.2 / 2630.2 ± 243.1 [301.8, 2654.9, 2697.2, 2726.0, 2837.4] 0.0%
  2 | 2648.3 / 2648.4 ± 236.7 [570.1, 2664.4, 2708.9, 2739.0, 2844.9] 0.0%

   64 0 | 2587.8 / 2587.9 ± 306.1 [ 83.0, 2610.1, 2680.0, 2731.0, 2857.1] 0.0%
  1 | 2591.1 / 2591.1 ± 305.2 [455.9, 2608.9, 2680.2, 2734.1, 2859.0] 0.0%
  2 | 2047.8 / 2046.4 ± 925.8 [  0.0, 1486.2, 2592.6, 2691.1, 3001.0] 0.2% ?

Pretty small setup, all data fit in buffers. Good tps performance all around
(best for 32 flushes), and flushing shows a noticeable (360 -> 240) reduction
in tps stddev.

## SMALL

 pgbench -M prepared -N -P 1 -T 4000 -j 2 -c 4
   with scale = 120 (~ 2 GB)

 postgresql.conf:
   shared_buffers = 2GB
   checkpoint_timeout = 300s
   checkpoint_completion_target = 0.8
   checkpoint_flush_after = { none, 0, 32, 64 }

 opts # |   tps / 1-sec tps ± stddev [ min q1 med q3 max ] <=10.0

 head 0 | 209.2 / 204.2 ± 516.5 [0.0,   0.0,   4.0,5.0, 2251.0] 82.3%
  1 | 207.4 / 204.2 ± 518.7 [0.0,   0.0,   4.0,5.0, 2245.1] 82.3%
  2 | 217.5 / 211.0 ± 530.3 [0.0,   0.0,   3.0,5.0, 2255.0] 82.0%
  3 | 217.8 / 213.2 ± 531.7 [0.0,   0.0,   4.0,6.0, 2261.9] 81.7%
  4 | 230.7 / 223.9 ± 542.7 [0.0,   0.0,   4.0,7.0, 2282.0] 80.7%

0 0 | 734.8 / 735.5 ± 879.9 [0.0,   1.0,  16.5, 1748.3, 2281.1] 47.0%
  1 | 694.9 / 693.0 ± 849.0 [0.0,   1.0,  29.5, 1545.7, 2428.0] 46.4%
  2 | 735.3 / 735.5 ± 888.4 [0.0,   0.0,  12.0, 1781.2, 2312.1] 47.9%
  3 | 736.0 / 737.5 ± 887.1 [0.0,   1.0,  16.0, 1794.3, 2317.0] 47.5%
  4 | 734.9 / 735.1 ± 885.1 [0.0,   1.0,  15.5, 1781.0, 2297.1] 47.2%

   32 0 | 738.1 / 737.9 ± 415.8 [0.0, 553.0, 679.0,  753.0, 2312.1]  0.2%
  1 | 730.5 / 730.7 ± 413.2 [0.0, 546.5, 671.0,  744.0, 2319.0]  0.1%
  2 | 741.9 / 741.9 ± 416.5 [0.0, 556.0, 682.0,  756.0, 2331.0]  0.2%
  3 | 744.1 / 744.1 ± 414.4 [0.0, 555.5, 685.2,  758.0, 2285.1]  0.1%
  4 | 746.9 / 746.9 ± 416.6 [0.0, 566.6, 685.0,  759.0, 2308.1]  0.1%

   64 0 | 743.0 / 743.1 ± 416.5 [1.0, 555.0, 683.0,  759.0, 2353.0]  0.1%
  1 | 742.5 / 742.5 ± 415.6 [0.0, 558.2, 680.0,  758.2, 2296.0]  0.1%
  2 | 742.5 / 742.5 ± 415.9 [0.0, 559.0, 681.1,  757.0, 2310.0]  0.1%
  3 | 529.0 / 526.6 ± 410.9 [0.0, 245.0, 444.0,  701.0, 2380.9]  1.5% ??
  4 | 734.8 / 735.0 ± 414.1 [0.0, 550.0, 673.0,  754.0, 2298.0]  0.1%

Sorting brings * 3.3 tps, flushing significantly reduces tps 

Re: [HACKERS] checkpointer continuous flushing - V16

2016-02-18 Thread Andres Freund
On 2016-02-18 09:51:20 +0100, Fabien COELHO wrote:
> I've looked at these patches, especially the whole bench of explanations and
> comments which is a good source for understanding what is going on in the
> WAL writer, a part of pg I'm not familiar with.
> 
> When reading the patch 0002 explanations, I had the following comments:
> 
> AFAICS, there are several levels of actions when writing things in pg:
> 
>  0: the thing is written in some internal buffer
> 
>  1: the buffer is advised to be passed to the OS (hint bits?)

Hint bits aren't related to OS writes. They're about information like
'this transaction committed' or 'all tuples on this page are visible'.


>  2: the buffer is actually passed to the OS (write, flush)
> 
>  3: the OS is advised to send the written data to the io subsystem
> (sync_file_range with SYNC_FILE_RANGE_WRITE)
> 
>  4: the OS is required to send the written data to the disk
> (fsync, sync_file_range with SYNC_FILE_RANGE_WAIT_AFTER)

We can't easily rely on sync_file_range(SYNC_FILE_RANGE_WAIT_AFTER) -
the guarantees it gives aren't well defined, and actually changed across
releases.


0002 is about something different, it's about the WAL writer, which
writes WAL to disk so individual backends don't have to. It does so in
the background every wal_writer_delay or whenever a transaction
asynchronously commits.  The reason this interacts with checkpoint
flushing is that, when we flush writes at a regular pace, the writes by
the checkpointer happen in between the very frequent writes/fdatasync()
by the WAL writer. That means the disk's caches are flushed every
fdatasync() - which causes considerable slowdowns.  On a decent SSD the
WAL writer, before this patch, often did 500-1000 fdatasync()s a second;
the regular sync_file_range calls slowed things down too much.

That's what caused the large regression when using checkpoint
sorting/flushing with synchronous_commit=off. With that fixed - often a
performance improvement on its own - I don't see that regression anymore.


> After more considerations, my final understanding is that this behavior only
> occurs with "asynchronous commit", aka a situation when COMMIT does not wait
> for data to be really fsynced, but the fsync is to occur within some delay
> so it will not be too far away, some kind of compromise for performance
> where commits can be lost.

Right.


> Now all this is somehow alien to me because the whole point of committing is
> having the data to disk, and I would not consider a database to be safe if
> commit does not imply fsync, but I understand that people may have to
> compromise for performance.

It's obviously not applicable for every scenario, but in a *lot* of
real-world scenarios a sub-second loss window doesn't have any actual
negative implications.


Andres




Re: [HACKERS] checkpointer continuous flushing - V16

2016-02-18 Thread Andres Freund
On 2016-02-11 19:44:25 +0100, Andres Freund wrote:
> The first two commits of the series are pretty close to being ready. I'd
> welcome review of those, and I plan to commit them independently of the
> rest as they're beneficial independently.  The most important bits are
> the comments and docs of 0002 - they weren't particularly good
> beforehand, so I had to rewrite a fair bit.
> 
> 0001: Make SetHintBit() a bit more aggressive, afaics that fixes all the
>   potential regressions of 0002
> 0002: Fix the overaggressive flushing by the wal writer, by only
>   flushing every wal_writer_delay ms or wal_writer_flush_after
>   bytes.

I've pushed these after some more polishing, now working on the next
two.

Greetings,

Andres Freund




Re: [HACKERS] checkpointer continuous flushing - V16

2016-02-18 Thread Fabien COELHO


Hello Andres,


0001: Make SetHintBit() a bit more aggressive, afaics that fixes all the
 potential regressions of 0002
0002: Fix the overaggressive flushing by the wal writer, by only
 flushing every wal_writer_delay ms or wal_writer_flush_after
 bytes.


I've looked at these patches, especially the whole bunch of explanations
and comments, which is a good source for understanding what is going on in
the WAL writer, a part of pg I'm not familiar with.


When reading the patch 0002 explanations, I had the following comments:

AFAICS, there are several levels of actions when writing things in pg:

 0: the thing is written in some internal buffer

 1: the buffer is advised to be passed to the OS (hint bits?)

 2: the buffer is actually passed to the OS (write, flush)

 3: the OS is advised to send the written data to the io subsystem
(sync_file_range with SYNC_FILE_RANGE_WRITE)

 4: the OS is required to send the written data to the disk
(fsync, sync_file_range with SYNC_FILE_RANGE_WAIT_AFTER)

It is not clear when reading the text which level is discussed. In 
particular, I'm not sure that "flush" refers to level 2, which is 
misleading. When reading the description, I'm rather under the impression 
that it is about level 4, but then if actual fsyncs are performed every 200
ms the tps would be very low...


After more considerations, my final understanding is that this behavior 
only occurs with "asynchronous commit", aka a situation when COMMIT does 
not wait for data to be really fsynced, but the fsync is to occur within 
some delay so it will not be too far away, some kind of compromise for 
performance where commits can be lost.


Now all this is somehow alien to me because the whole point of committing 
is having the data on disk, and I would not consider a database to be safe
if commit does not imply fsync, but I understand that people may have to 
compromise for performance.


Is my understanding right?

--
Fabien.




Re: [HACKERS] checkpointer continuous flushing - V16

2016-02-11 Thread Robert Haas
On Thu, Feb 11, 2016 at 1:44 PM, Andres Freund  wrote:
> On 2016-02-04 16:54:58 +0100, Andres Freund wrote:
>> Fabien asked me to post a new version of the checkpoint flushing patch
>> series. While this isn't entirely ready for commit, I think we're
>> getting closer.
>>
>> I don't want to post a full series right now, but my working state is
>> available on
>> http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/checkpoint-flush
>> git://git.postgresql.org/git/users/andresfreund/postgres.git checkpoint-flush
>
> The first two commits of the series are pretty close to being ready. I'd
> welcome review of those, and I plan to commit them independently of the
> rest as they're beneficial independently.  The most important bits are
> the comments and docs of 0002 - they weren't particularly good
> beforehand, so I had to rewrite a fair bit.
>
> 0001: Make SetHintBit() a bit more aggressive, afaics that fixes all the
>   potential regressions of 0002
> 0002: Fix the overaggressive flushing by the wal writer, by only
>   flushing every wal_writer_delay ms or wal_writer_flush_after
>   bytes.

I previously reviewed 0001 and I think it's fine.  I haven't reviewed
0002 in detail, but I like the concept.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] checkpointer continuous flushing - V16

2016-02-11 Thread Andres Freund
On 2016-02-04 16:54:58 +0100, Andres Freund wrote:
> Fabien asked me to post a new version of the checkpoint flushing patch
> series. While this isn't entirely ready for commit, I think we're
> getting closer.
> 
> I don't want to post a full series right now, but my working state is
> available on
> http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/checkpoint-flush
> git://git.postgresql.org/git/users/andresfreund/postgres.git checkpoint-flush

The first two commits of the series are pretty close to being ready. I'd
welcome review of those, and I plan to commit them independently of the
rest as they're beneficial independently.  The most important bits are
the comments and docs of 0002 - they weren't particularly good
beforehand, so I had to rewrite a fair bit.

0001: Make SetHintBit() a bit more aggressive, afaics that fixes all the
  potential regressions of 0002
0002: Fix the overaggressive flushing by the wal writer, by only
  flushing every wal_writer_delay ms or wal_writer_flush_after
  bytes.

Greetings,

Andres Freund
>From f3bc3a7c40c21277331689595814b359c55682dc Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Thu, 11 Feb 2016 19:34:29 +0100
Subject: [PATCH 1/6] Allow SetHintBits() to succeed if the buffer's LSN is new
 enough.

Previously we only allowed SetHintBits() to succeed if the commit LSN of
the last transaction touching the page has already been flushed to
disk. We can't generally change the LSN of the page, because we don't
necessarily have the required locks on the page. But the required LSN
interlock does not require the commit record to be flushed, it just
requires that the commit record will be flushed before the page is
written out. Therefore if the buffer LSN is newer than the commit LSN,
the hint bit can be safely set.

In a number of scenarios (e.g. pgbench) this noticeably increases the
number of hint bits that are set. But more importantly it also keeps the
success rate up when flushing WAL less frequently. That was the original
reason for commit 4de82f7d7, which has negative performance consequences
in a number of scenarios. This will allow a followup commit to reduce the
flush rate.

Discussion: 20160118163908.gw10...@awork2.anarazel.de
---
 src/backend/utils/time/tqual.c | 21 +
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index 465933d..503bd1d 100644
--- a/src/backend/utils/time/tqual.c
+++ b/src/backend/utils/time/tqual.c
@@ -89,12 +89,13 @@ static bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
  * Set commit/abort hint bits on a tuple, if appropriate at this time.
  *
  * It is only safe to set a transaction-committed hint bit if we know the
- * transaction's commit record has been flushed to disk, or if the table is
- * temporary or unlogged and will be obliterated by a crash anyway.  We
- * cannot change the LSN of the page here because we may hold only a share
- * lock on the buffer, so we can't use the LSN to interlock this; we have to
- * just refrain from setting the hint bit until some future re-examination
- * of the tuple.
+ * transaction's commit record is guaranteed to be flushed to disk before the
+ * buffer, or if the table is temporary or unlogged and will be obliterated by
+ * a crash anyway.  We cannot change the LSN of the page here because we may
+ * hold only a share lock on the buffer, so we can only use the LSN to
+ * interlock this if the buffer's LSN already is newer than the commit LSN;
+ * otherwise we have to just refrain from setting the hint bit until some
+ * future re-examination of the tuple.
  *
  * We can always set hint bits when marking a transaction aborted.  (Some
  * code in heapam.c relies on that!)
@@ -122,8 +123,12 @@ SetHintBits(HeapTupleHeader tuple, Buffer buffer,
 		/* NB: xid must be known committed here! */
 		XLogRecPtr	commitLSN = TransactionIdGetCommitLSN(xid);
 
-		if (XLogNeedsFlush(commitLSN) && BufferIsPermanent(buffer))
-			return;/* not flushed yet, so don't set hint */
+		if (BufferIsPermanent(buffer) && XLogNeedsFlush(commitLSN) &&
+			BufferGetLSNAtomic(buffer) < commitLSN)
+		{
+			/* not flushed and no LSN interlock, so don't set hint */
+			return;
+		}
 	}
 
 	tuple->t_infomask |= infomask;
-- 
2.7.0.229.g701fa7f

>From e4facce2cf8b982408ff1de174cffc202852adfd Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Thu, 11 Feb 2016 19:34:29 +0100
Subject: [PATCH 2/6] Allow the WAL writer to flush WAL at a reduced rate.

Commit 4de82f7d7 increased the WAL flush rate, mainly to increase the
likelihood that hint bits can be set quickly. More quickly set hint bits
can reduce contention around the clog et al.  But unfortunately the
increased flush rate can have a significant negative performance impact,
I have measured up to a factor of ~4.  The reason for this slowdown is
that if there are independent 

Re: [HACKERS] checkpointer continuous flushing - V16

2016-02-09 Thread Fabien COELHO


I think I would appreciate comments to understand why/how the 
ringbuffer is used, and more comments in general, so it is fine if you 
improve this part.


I'd suggest to leave out the ringbuffer/new bgwriter parts.


Ok, so the patch would only include the checkpointer stuff.

I'll look at this part in detail.

--
Fabien.




Re: [HACKERS] checkpointer continuous flushing - V16

2016-02-09 Thread Andres Freund
On February 9, 2016 10:46:34 AM GMT+01:00, Fabien COELHO  
wrote:
>
>>> I think I would appreciate comments to understand why/how the 
>>> ringbuffer is used, and more comments in general, so it is fine if
>you 
>>> improve this part.
>>
>> I'd suggest to leave out the ringbuffer/new bgwriter parts.
>
>Ok, so the patch would only onclude the checkpointer stuff.
>
>I'll look at this part in detail.

Yes, that's the more pressing part. I've seen pretty good results with the new 
bgwriter, but it's not really worthwhile until sorting and flushing is in...

Andres 

--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.




Re: [HACKERS] checkpointer continuous flushing - V16

2016-02-08 Thread Andres Freund
Hi Fabien,

On 2016-02-04 16:54:58 +0100, Andres Freund wrote:
> I don't want to post a full series right now, but my working state is
> available on
> http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/checkpoint-flush
> git://git.postgresql.org/git/users/andresfreund/postgres.git checkpoint-flush
> 
> The main changes are that:
> 1) the significant performance regressions I saw are addressed by
>changing the wal writer flushing logic
> 2) The flushing API moved up a couple layers, and now deals with buffer
>tags, rather than the physical files
> 3) Writes from checkpoints, bgwriter and files are flushed, configurable
>by individual GUCs. Without that I still saw the spiked in a lot of 
> circumstances.
> 
> There's also a more experimental reimplementation of bgwriter, but I'm
> not sure it's realistic to polish that up within the constraints of 9.6.

Any comments before I spend more time polishing this? I'm currently
updating docs and comments to actually describe the current state...

Andres




Re: [HACKERS] checkpointer continuous flushing - V16

2016-02-08 Thread Fabien COELHO


Hello Andres,


Any comments before I spend more time polishing this?


I'm running tests on various settings, I'll send a report when it is done.
Up to now the performance seems as good as with the previous version.

I'm currently updating docs and comments to actually describe the 
current state...


I did notice the mismatched documentation.

I think I would appreciate comments to understand why/how the ringbuffer 
is used, and more comments in general, so it is fine if you improve this 
part.


Minor details:

"typedefs.list" should be updated to WritebackContext.

"WritebackContext" is a typedef, "struct" is not needed.


I'll look at the code more deeply probably over next weekend.

--
Fabien.




Re: [HACKERS] checkpointer continuous flushing - V16

2016-02-08 Thread Andres Freund
On 2016-02-08 19:52:30 +0100, Fabien COELHO wrote:
> I think I would appreciate comments to understand why/how the ringbuffer is
> used, and more comments in general, so it is fine if you improve this part.

I'd suggest to leave out the ringbuffer/new bgwriter parts. I think
they'd be committed separately, and probably not in 9.6.

Thanks,

Andres




Re: [HACKERS] checkpointer continuous flushing - V16

2016-02-04 Thread Andres Freund
Hi,

Fabien asked me to post a new version of the checkpoint flushing patch
series. While this isn't entirely ready for commit, I think we're
getting closer.

I don't want to post a full series right now, but my working state is
available on
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/checkpoint-flush
git://git.postgresql.org/git/users/andresfreund/postgres.git checkpoint-flush

The main changes are that:
1) the significant performance regressions I saw are addressed by
   changing the wal writer flushing logic
2) The flushing API moved up a couple layers, and now deals with buffer
   tags, rather than the physical files
3) Writes from checkpoints, bgwriter and files are flushed, configurable
   by individual GUCs. Without that I still saw the spikes in a lot of
   circumstances.

There's also a more experimental reimplementation of bgwriter, but I'm
not sure it's realistic to polish that up within the constraints of 9.6.

Regards,

Andres 




Re: [HACKERS] checkpointer continuous flushing

2016-02-01 Thread Alvaro Herrera
This patch got its fair share of reviewer attention this commitfest.
Moving to the next one.  Andres, if you want to commit ahead of time
you're of course encouraged to do so.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] checkpointer continuous flushing

2016-01-27 Thread Robert Haas
On Wed, Jan 20, 2016 at 9:02 AM, Andres Freund  wrote:
> Chatting on IM with Heikki, I noticed that we're pretty pessimistic in
> SetHintBits(). Namely we don't set the bit if XLogNeedsFlush(commitLSN),
> because we can't easily set the LSN. But, it's actually fairly common
> that the pages LSN is already newer than the commitLSN - in which case
> we, afaics, just can go ahead and set the hint bit, no?
>
> So, instead of
> if (XLogNeedsFlush(commitLSN) && BufferIsPermanent(buffer)
> return; /* not flushed yet, 
> so don't set hint */
> we do
> if (BufferIsPermanent(buffer) && XLogNeedsFlush(commitLSN)
> && BufferGetLSNAtomic(buffer) < commitLSN)
> return; /* not flushed yet, 
> so don't set hint */
>
> In my tests with pgbench -s 100, 2GB of shared buffers, that's recovers
> a large portion of the hint writes that we currently skip.

Dang.  That's a really good idea.  Although I think you'd probably
better revise the comment, since it will otherwise be false.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] checkpointer continuous flushing

2016-01-21 Thread Andres Freund
On 2016-01-21 11:33:15 +0530, Amit Kapila wrote:
> On Wed, Jan 20, 2016 at 9:07 PM, Andres Freund  wrote:
> > I don't think it's strongly related - the contention here is on read
> > access to the clog, not on write access.
> 
> Aren't reads on clog contended with parallel writes to clog?

Sure. But you're not going to beat "no access to the clog" due to hint
bits, by making parallel writes a bit better citizens.




Re: [HACKERS] checkpointer continuous flushing

2016-01-20 Thread Amit Kapila
On Wed, Jan 20, 2016 at 9:07 PM, Andres Freund  wrote:
>
> On 2016-01-20 12:16:24 -0300, Alvaro Herrera wrote:
> > Andres Freund wrote:
> >
> > > The relevant thread is at
> > >
http://archives.postgresql.org/message-id/CA%2BTgmoaCr3kDPafK5ygYDA9mF9zhObGp_13q0XwkEWsScw6h%3Dw%40mail.gmail.com
> > > what I didn't remember is that I voiced concern back then about
exactly this:
> > >
http://archives.postgresql.org/message-id/201112011518.29964.andres%40anarazel.de
> > > ;)
> >
> > Interesting.  If we consider for a minute that part of the cause for the
> > slowdown is slowness in pg_clog, maybe we should reconsider the initial
> > decision to flush as quickly as possible (i.e. adopt a strategy where
> > walwriter sleeps a bit between two flushes) in light of the group-update
> > feature for CLOG being proposed by Amit Kapila in another thread -- it
> > seems that these things might go hand-in-hand.
>
> I don't think it's strongly related - the contention here is on read
> access to the clog, not on write access.

Aren't reads on clog contended with parallel writes to clog?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] checkpointer continuous flushing

2016-01-20 Thread Alvaro Herrera
Andres Freund wrote:

> The relevant thread is at
> http://archives.postgresql.org/message-id/CA%2BTgmoaCr3kDPafK5ygYDA9mF9zhObGp_13q0XwkEWsScw6h%3Dw%40mail.gmail.com
> what I didn't remember is that I voiced concern back then about exactly this:
> http://archives.postgresql.org/message-id/201112011518.29964.andres%40anarazel.de
> ;)

Interesting.  If we consider for a minute that part of the cause for the
slowdown is slowness in pg_clog, maybe we should reconsider the initial
decision to flush as quickly as possible (i.e. adopt a strategy where
walwriter sleeps a bit between two flushes) in light of the group-update
feature for CLOG being proposed by Amit Kapila in another thread -- it
seems that these things might go hand-in-hand.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] checkpointer continuous flushing

2016-01-20 Thread Andres Freund
On 2016-01-20 12:16:24 -0300, Alvaro Herrera wrote:
> Andres Freund wrote:
> 
> > The relevant thread is at
> > http://archives.postgresql.org/message-id/CA%2BTgmoaCr3kDPafK5ygYDA9mF9zhObGp_13q0XwkEWsScw6h%3Dw%40mail.gmail.com
> > what I didn't remember is that I voiced concern back then about exactly 
> > this:
> > http://archives.postgresql.org/message-id/201112011518.29964.andres%40anarazel.de
> > ;)
> 
> Interesting.  If we consider for a minute that part of the cause for the
> slowdown is slowness in pg_clog, maybe we should reconsider the initial
> decision to flush as quickly as possible (i.e. adopt a strategy where
> walwriter sleeps a bit between two flushes) in light of the group-update
> feature for CLOG being proposed by Amit Kapila in another thread -- it
> seems that these things might go hand-in-hand.

I don't think it's strongly related - the contention here is on read
access to the clog, not on write access. While Amit's patch will reduce
the impact of that a bit, I don't see it making a fundamental
difference.

Andres




Re: [HACKERS] checkpointer continuous flushing

2016-01-20 Thread Andres Freund
On 2016-01-19 22:43:21 +0100, Andres Freund wrote:
> On 2016-01-19 12:58:38 -0500, Robert Haas wrote:
> > This seems like a problem with the WAL writer quite independent of
> > anything else.  It seems likely to be inadvertent fallout from this
> > patch:
> > 
> > Author: Simon Riggs 
> > Branch: master Release: REL9_2_BR [4de82f7d7] 2011-11-13 09:00:57 +
> > 
> > Wakeup WALWriter as needed for asynchronous commit performance.
> > Previously we waited for wal_writer_delay before flushing WAL. Now
> > we also wake WALWriter as soon as a WAL buffer page has filled.
> > Significant effect observed on performance of asynchronous commits
> > by Robert Haas, attributed to the ability to set hint bits on tuples
> > earlier and so reducing contention caused by clog lookups.
> 
> In addition to that the "powersaving" effort also plays a role - without
> the latch we'd not wake up at any meaningful rate at all atm.

The relevant thread is at
http://archives.postgresql.org/message-id/CA%2BTgmoaCr3kDPafK5ygYDA9mF9zhObGp_13q0XwkEWsScw6h%3Dw%40mail.gmail.com
what I didn't remember is that I voiced concern back then about exactly this:
http://archives.postgresql.org/message-id/201112011518.29964.andres%40anarazel.de
;)

Simon: CCed you, as the author of the above commit. Quick summary:
The frequent wakeups of wal writer can lead to significant performance
regressions in workloads that are bigger than shared_buffers, because
the super-frequent fdatasync()s by the wal writer slow down concurrent
writes (bgwriter, checkpointer, individual backend writes)
dramatically. To the point that SIGSTOPing the wal writer gets a pgbench
workload from 2995 to 10887 tps.  The reason fdatasyncs cause a slowdown
is that they prevent real use of queuing to the storage devices.


On 2016-01-19 22:43:21 +0100, Andres Freund wrote:
> On 2016-01-19 12:58:38 -0500, Robert Haas wrote:
> > If I understand correctly, prior to that commit, WAL writer woke up 5
> > times per second and flushed just that often (unless you changed the
> > default settings).But as the commit message explained, that turned
> > out to suck - you could make performance go up very significantly by
> > radically decreasing wal_writer_delay.  This commit basically lets it
> > flush at maximum velocity - as fast as we finish one flush, we can
> > start the next.  That must have seemed like a win at the time from the
> > way the commit message was written, but you seem to now be seeing the
> > opposite effect, where performance is suffering because flushes are
> > too frequent rather than too infrequent.  I wonder if there's an ideal
> > flush rate and what it is, and how much it depends on what hardware
> > you have got.
> 
> I think the problem isn't really that it's flushing too much WAL in
> total, it's that it's flushing WAL in a too granular fashion. I suspect
> we want something where we attempt a minimum number of flushes per
> second (presumably tied to wal_writer_delay) and, once exceeded, a
> minimum number of pages per flush. I think we even could continue to
> write() the data at the same rate as today, we just would need to reduce
> the number of fdatasync()s we issue. And possibly could make the
> eventual fdatasync()s cheaper by hinting the kernel to write them out
> earlier.
> 
> Now the question what the minimum number of pages we want to flush for
> (setting wal_writer_delay triggered ones aside) isn't easy to answer. A
> simple model would be to statically tie it to the size of wal_buffers;
> say, don't flush unless at least 10% of XLogBuffers have been written
> since the last flush. More complex approaches would be to measure the
> continuous WAL writeout rate.
> 
> By tying it to both a minimum rate under activity (ensuring things go to
> disk fast) and a minimum number of pages to sync (ensuring a reasonable
> number of cache flush operations) we should be able to mostly accomodate
> the different types of workloads. I think.

This unfortunately leaves out part of the reasoning for the above
commit: We want WAL to be flushed fast, so we immediately can set hint
bits.

One, relatively extreme, approach would be to continue *writing* WAL in
the background writer as today, but use rules like suggested above
guiding the actual flushing. Additionally using operations like
sync_file_range() (and equivalents on other OSs).  Then, to address the
regression of SetHintBits() having to bail out more often, actually
trigger a WAL flush whenever WAL is already written, but not flushed.
That has the potential to be bad in a number of other cases tho :(
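
For concreteness, here is a minimal, self-contained sketch of the flush
rule discussed above (not the actual wal writer code; all names are
invented for illustration):

    #include <stdbool.h>

    typedef struct SketchFlushState
    {
        int     pages_since_flush;  /* pages write()n but not yet flushed */
        int     wal_buffer_pages;   /* analogue of wal_buffers, in pages */
    } SketchFlushState;

    static bool
    sketch_flush_needed(const SketchFlushState *st, bool delay_elapsed)
    {
        int     threshold = st->wal_buffer_pages / 10;  /* the "10%" rule */

        if (threshold < 1)
            threshold = 1;

        if (st->pages_since_flush == 0)
            return false;           /* nothing written since the last flush */

        /* flush on the wal_writer_delay timer, or once enough pages piled up */
        return delay_elapsed || st->pages_since_flush >= threshold;
    }

Between flushes, the written-but-not-yet-flushed range could additionally be
handed to the kernel with sync_file_range(fd, offset, nbytes,
SYNC_FILE_RANGE_WRITE) on Linux, so that the eventual fdatasync() has little
left to do.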

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpointer continuous flushing

2016-01-20 Thread Andres Freund
On 2016-01-20 11:13:26 +0100, Andres Freund wrote:
> On 2016-01-19 22:43:21 +0100, Andres Freund wrote:
> > On 2016-01-19 12:58:38 -0500, Robert Haas wrote:
> > I think the problem isn't really that it's flushing too much WAL in
> > total, it's that it's flushing WAL in a too granular fashion. I suspect
> > we want something where we attempt a minimum number of flushes per
> > second (presumably tied to wal_writer_delay) and, once exceeded, a
> > minimum number of pages per flush. I think we even could continue to
> > write() the data at the same rate as today, we just would need to reduce
> > the number of fdatasync()s we issue. And possibly could make the
> > eventual fdatasync()s cheaper by hinting the kernel to write them out
> > earlier.
> >
> > Now the question what the minimum number of pages we want to flush for
> > (setting wal_writer_delay triggered ones aside) isn't easy to answer. A
> > simple model would be to statically tie it to the size of wal_buffers;
> > say, don't flush unless at least 10% of XLogBuffers have been written
> > since the last flush. More complex approaches would be to measure the
> > continuous WAL writeout rate.
> >
> > By tying it to both a minimum rate under activity (ensuring things go to
> > disk fast) and a minimum number of pages to sync (ensuring a reasonable
> > number of cache flush operations) we should be able to mostly accommodate
> > the different types of workloads. I think.
>
> This unfortunately leaves out part of the reasoning for the above
> commit: We want WAL to be flushed fast, so we immediately can set hint
> bits.
>
> One, relatively extreme, approach would be to continue *writing* WAL in
> the background writer as today, but use rules like suggested above
> guiding the actual flushing. Additionally using operations like
> sync_file_range() (and equivalents on other OSs).  Then, to address the
> regression of SetHintBits() having to bail out more often, actually
> trigger a WAL flush whenever WAL is already written, but not flushed.
> That has the potential to be bad in a number of other cases tho :(

Chatting on IM with Heikki, I noticed that we're pretty pessimistic in
SetHintBits(). Namely we don't set the bit if XLogNeedsFlush(commitLSN),
because we can't easily set the LSN. But it's actually fairly common
that the page's LSN is already newer than the commitLSN - in which case
we can, afaics, just go ahead and set the hint bit, no?

So, instead of
    if (XLogNeedsFlush(commitLSN) && BufferIsPermanent(buffer))
        return;             /* not flushed yet, so don't set hint */
we do
    if (BufferIsPermanent(buffer) && XLogNeedsFlush(commitLSN) &&
        BufferGetLSNAtomic(buffer) < commitLSN)
        return;             /* not flushed yet, so don't set hint */

In my tests with pgbench -s 100, 2GB of shared buffers, that recovers
a large portion of the hint bit writes that we currently skip.

Right now, on my laptop, I get (-M prepared -c 32 -j 32):
current wal-writer:                         12827 tps, 95 % IO util, 93 % CPU
no flushing in wal writer *:                13185 tps, 46 % IO util, 93 % CPU
no flushing in wal writer & above change:   16366 tps, 41 % IO util, 95 % CPU
flushing in wal writer & above change:      14812 tps, 94 % IO util, 95 % CPU

* sometimes the results initially were much lower, with lots of lock
  contention. Can't figure out why that's only sometimes the case. In
  those cases the results were more like 8967 tps.

these aren't meant as thorough benchmarks, just to provide some
orientation.


Now that solution won't improve every situation, e.g. for a workload
that inserts a lot of rows in one transaction, and only does inserts, it
probably won't do all that much. But it still seems like a pretty good
mitigation strategy. I hope that with a smarter write strategy (getting
that 50% reduction in IO util) and the above we should be ok.

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpointer continuous flushing

2016-01-19 Thread Fabien COELHO





I measured it in a different number of cases, both on SSDs and spinning
rust. I just reproduced it with:

postgres-ckpt14 \
  -D /srv/temp/pgdev-dev-800/ \
  -c maintenance_work_mem=2GB \
  -c fsync=on \
  -c synchronous_commit=off \
  -c shared_buffers=2GB \
  -c wal_level=hot_standby \
  -c max_wal_senders=10 \
  -c max_wal_size=100GB \
  -c checkpoint_timeout=30s

Using a fresh cluster each time (copied from a "template" to save time)
and using
pgbench -M prepared -c 16 -j 16 -T 300 -P 1


I must say that I have not succeeded in reproducing any significant 
regression up to now on an HDD. I'm running some more tests again because 
I had left out some options above that I thought were non essential.


I have deep problems with the 30-second checkpoint tests: basically the
checkpoints take much more than 30 seconds to complete, the system is not
stable, and the 300-second runs last more than 900 seconds because the
clients are stuck for a long time. The overall behavior is appalling, as
most of the time is spent in IO panic at 0 tps.


Also, the performance level is around 160 tps on HDDs, which makes sense to
me for a 7200 rpm HDD capable of about x00 random writes per second. It
seems to me that you reported much better performance on HDD, but I cannot
really see how this would be possible if data are indeed written to disk.
Any idea?


Also, what is the exact postgres version & patch used in your
tests on HDDs?



both before/after patch are higher) if I disable full_page_writes,
thereby eliminating a lot of other IO.


Maybe this is an explanation

--
Fabien.



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpointer continuous flushing

2016-01-19 Thread Andres Freund
On 2016-01-19 10:27:31 +0100, Fabien COELHO wrote:
> Also, the performance level is around 160 tps on HDDs, which makes sense to
> me for a 7200 rpm HDD capable of about x00 random writes per second. It
> seems to me that you reported much better performance on HDD, but I cannot
> really see how this would be possible if data are indeed written to disk. Any
> idea?

synchronous_commit = off does make a significant difference.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpointer continuous flushing

2016-01-19 Thread Andres Freund
On 2016-01-19 12:58:38 -0500, Robert Haas wrote:
> This seems like a problem with the WAL writer quite independent of
> anything else.  It seems likely to be inadvertent fallout from this
> patch:
> 
> Author: Simon Riggs 
> Branch: master Release: REL9_2_BR [4de82f7d7] 2011-11-13 09:00:57 +
> 
> Wakeup WALWriter as needed for asynchronous commit performance.
> Previously we waited for wal_writer_delay before flushing WAL. Now
> we also wake WALWriter as soon as a WAL buffer page has filled.
> Significant effect observed on performance of asynchronous commits
> by Robert Haas, attributed to the ability to set hint bits on tuples
> earlier and so reducing contention caused by clog lookups.

In addition to that the "powersaving" effort also plays a role - without
the latch we'd not wake up at any meaningful rate at all atm.


> If I understand correctly, prior to that commit, WAL writer woke up 5
> times per second and flushed just that often (unless you changed the
> default settings).  But as the commit message explained, that turned
> out to suck - you could make performance go up very significantly by
> radically decreasing wal_writer_delay.  This commit basically lets it
> flush at maximum velocity - as fast as we finish one flush, we can
> start the next.  That must have seemed like a win at the time from the
> way the commit message was written, but you seem to now be seeing the
> opposite effect, where performance is suffering because flushes are
> too frequent rather than too infrequent.  I wonder if there's an ideal
> flush rate and what it is, and how much it depends on what hardware
> you have got.

I think the problem isn't really that it's flushing too much WAL in
total, it's that it's flushing WAL in a too granular fashion. I suspect
we want something where we attempt a minimum number of flushes per
second (presumably tied to wal_writer_delay) and, once exceeded, a
minimum number of pages per flush. I think we even could continue to
write() the data at the same rate as today, we just would need to reduce
the number of fdatasync()s we issue. And possibly could make the
eventual fdatasync()s cheaper by hinting the kernel to write them out
earlier.

Now the question what the minimum number of pages we want to flush for
(setting wal_writer_delay triggered ones aside) isn't easy to answer. A
simple model would be to statically tie it to the size of wal_buffers;
say, don't flush unless at least 10% of XLogBuffers have been written
since the last flush. More complex approaches would be to measure the
continuous WAL writeout rate.

By tying it to both a minimum rate under activity (ensuring things go to
disk fast) and a minimum number of pages to sync (ensuring a reasonable
number of cache flush operations) we should be able to mostly accommodate
the different types of workloads. I think.

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpointer continuous flushing

2016-01-19 Thread Robert Haas
On Mon, Jan 18, 2016 at 11:39 AM, Andres Freund  wrote:
> On 2016-01-16 10:01:25 +0100, Fabien COELHO wrote:
>> Hello Andres,
>>
>> >I measured it in a different number of cases, both on SSDs and spinning
>> >rust. I just reproduced it with:
>> >
>> >postgres-ckpt14 \
>> >   -D /srv/temp/pgdev-dev-800/ \
>> >   -c maintenance_work_mem=2GB \
>> >   -c fsync=on \
>> >   -c synchronous_commit=off \
>> >   -c shared_buffers=2GB \
>> >   -c wal_level=hot_standby \
>> >   -c max_wal_senders=10 \
>> >   -c max_wal_size=100GB \
>> >   -c checkpoint_timeout=30s
>> >
>> >Using a fresh cluster each time (copied from a "template" to save time)
>> >and using
>> >pgbench -M prepared -c 16 -j 16 -T 300 -P 1
>
> So, I've analyzed the problem further, and I think I found something
> rather interesting. I'd profiled the kernel looking at where it blocks in
> the IO request queues, and found that the wal writer was involved
> surprisingly often.
>
> So, in a workload where everything (checkpoint, bgwriter, backend
> writes) is flushed: 2995 tps
> After I kill the wal writer with -STOP: 10887 tps
>
> Stracing the wal writer shows:
>
> 17:29:02.001517 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17857, 
> si_uid=1000} ---
> 17:29:02.001538 rt_sigreturn({mask=[]}) = 0
> 17:29:02.001582 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily 
> unavailable)
> 17:29:02.001615 write(3, 
> "\210\320\5\0\1\0\0\0\0@\330_/\0\0\0w\f\0\0\0\0\0\0\0\4\0\2\t\30\0\372"..., 
> 49152) = 49152
> 17:29:02.001671 fdatasync(3)= 0
> 17:29:02.005022 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17825, 
> si_uid=1000} ---
> 17:29:02.005043 rt_sigreturn({mask=[]}) = 0
> 17:29:02.005081 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily 
> unavailable)
> 17:29:02.005111 write(3, 
> "\210\320\5\0\1\0\0\0\0\0\331_/\0\0\0\7\26\0\0\0\0\0\0T\251\0\0\0\0\0\0"..., 
> 8192) = 8192
> 17:29:02.005147 fdatasync(3)= 0
> 17:29:02.008688 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17866, 
> si_uid=1000} ---
> 17:29:02.008705 rt_sigreturn({mask=[]}) = 0
> 17:29:02.008730 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily 
> unavailable)
> 17:29:02.008757 write(3, "\210\320\5\0\1\0\0\0\0 
> \331_/\0\0\0\267\30\0\0\0\0\0\0"..., 98304) = 98304
> 17:29:02.008822 fdatasync(3)= 0
> 17:29:02.016125 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17865, 
> si_uid=1000} ---
> 17:29:02.016141 rt_sigreturn({mask=[]}) = 0
> 17:29:02.016174 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily 
> unavailable)
> 17:29:02.016204 write(3, 
> "\210\320\5\0\1\0\0\0\0\240\332_/\0\0\0s\5\0\0\0\0\0\0\t\30\0\2|8\2u"..., 
> 57344) = 57344
> 17:29:02.016281 fdatasync(3)= 0
> 17:29:02.019181 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17865, 
> si_uid=1000} ---
> 17:29:02.019199 rt_sigreturn({mask=[]}) = 0
> 17:29:02.019226 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily 
> unavailable)
> 17:29:02.019249 write(3, 
> "\210\320\5\0\1\0\0\0\0\200\333_/\0\0\0\307\f\0\0\0\0\0\0"..., 73728) 
> = 73728
> 17:29:02.019355 fdatasync(3)= 0
> 17:29:02.022680 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17865, 
> si_uid=1000} ---
> 17:29:02.022696 rt_sigreturn({mask=[]}) = 0
>
> I.e. we're fdatasync()ing small numbers of pages. Roughly 500 times a
> second. As soon as the wal writer is stopped, it's much bigger chunks,
> on the order of 50-130 pages. And, not that surprisingly, that improves
> performance, because there's far fewer cache flushes submitted to the
> hardware.

This seems like a problem with the WAL writer quite independent of
anything else.  It seems likely to be inadvertent fallout from this
patch:

Author: Simon Riggs 
Branch: master Release: REL9_2_BR [4de82f7d7] 2011-11-13 09:00:57 +

Wakeup WALWriter as needed for asynchronous commit performance.
Previously we waited for wal_writer_delay before flushing WAL. Now
we also wake WALWriter as soon as a WAL buffer page has filled.
Significant effect observed on performance of asynchronous commits
by Robert Haas, attributed to the ability to set hint bits on tuples
earlier and so reducing contention caused by clog lookups.

If I understand correctly, prior to that commit, WAL writer woke up 5
times per second and flushed just that often (unless you changed the
default settings).  But as the commit message explained, that turned
out to suck - you could make performance go up very significantly by
radically decreasing wal_writer_delay.  This commit basically lets it
flush at maximum velocity - as fast as we finish one flush, we can
start the next.  That must have seemed like a win at the time from the
way the commit message was written, but you seem to now be seeing the
opposite effect, where performance is suffering 

Re: [HACKERS] checkpointer continuous flushing

2016-01-19 Thread Fabien COELHO



synchronous_commit = off does make a significant difference.


Sure, but I had thought about that and kept this one...


But why are you then saying this is fundamentally limited to 160
xacts/sec?


I'm just saying that the tested load generates mostly random IOs (probably
on average over 1 page per transaction); random IOs are very slow on an
HDD, so I do not expect great tps.



I think I found one possible culprit: I automatically wrote 300 seconds for
checkpoint_timeout, instead of 30 seconds in your settings. I'll have to
rerun the tests with this (unreasonable) figure to check whether I really
get a regression.


I've not seen meaningful changes in the size of the regression between 30/300s.


At 300 seconds (5 minutes), the checkpoints of the accumulated writes take
15-25 minutes, during which the database is mostly offline, and there is no
clear difference with/without sort+flush.



Other tests I ran with "reasonable" settings on a large (scale=800) db did
not show any significant performance regression, up to now.


Try running it so that the data set nearly, but not entirely, fits into
the OS page cache, while definitely not fitting into shared_buffers. The
scale=800 just worked for that on my hardware, no idea how it is for yours.
That seems to be the point where the effect is the worst.


I have 16GB memory on the tested host, same as your hardware I think, so I
use scale 800 => 12GB at the beginning of the run. Not sure it fits the
bill, as I think it fits in memory, so the load is mostly writes with no/very
few reads. I'll also try with scale 1000.


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpointer continuous flushing

2016-01-19 Thread Fabien COELHO



synchronous_commit = off does make a significant difference.


Sure, but I had thought about that and kept this one...

I think I found one possible culprit: I automatically wrote 300 seconds 
for checkpoint_timeout, instead of 30 seconds in your settings. I'll have 
to rerun the tests with this (unreasonable) figure to check whether I 
really get a regression.


Other tests I ran with "reasonable" settings on a large (scale=800) db
did not show any significant performance regression, up to now.


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpointer continuous flushing

2016-01-19 Thread Andres Freund
On 2016-01-19 13:34:14 +0100, Fabien COELHO wrote:
> 
> >synchronous_commit = off does make a significant difference.
> 
> Sure, but I had thought about that and kept this one...

But why are you then saying this is fundamentally limited to 160
xacts/sec?

> I think I found one possible culprit: I automatically wrote 300 seconds for
> checkpoint_timeout, instead of 30 seconds in your settings. I'll have to
> rerun the tests with this (unreasonable) figure to check whether I really
> get a regression.

I've not seen meaningful changes in the size of the regression between 30/300s.

> Other tests I ran with "reasonable" settings on a large (scale=800) db did
> not show any significant performance regression, up to now.

Try running it so that the data set nearly, but not entirely, fits into
the OS page cache, while definitely not fitting into shared_buffers. The
scale=800 just worked for that on my hardware, no idea how it is for yours.

That seems to be the point where the effect is the worst.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpointer continuous flushing

2016-01-18 Thread Andres Freund
On 2016-01-16 10:01:25 +0100, Fabien COELHO wrote:
> 
> Hello Andres,
> 
> >I measured it in a different number of cases, both on SSDs and spinning
> >rust. I just reproduced it with:
> >
> >postgres-ckpt14 \
> >   -D /srv/temp/pgdev-dev-800/ \
> >   -c maintenance_work_mem=2GB \
> >   -c fsync=on \
> >   -c synchronous_commit=off \
> >   -c shared_buffers=2GB \
> >   -c wal_level=hot_standby \
> >   -c max_wal_senders=10 \
> >   -c max_wal_size=100GB \
> >   -c checkpoint_timeout=30s
> >
> >Using a fresh cluster each time (copied from a "template" to save time)
> >and using
> >pgbench -M prepared -c 16 -j 16 -T 300 -P 1

So, I've analyzed the problem further, and I think I found something
rather interesting. I'd profiled the kernel looking at where it blocks in
the IO request queues, and found that the wal writer was involved
surprisingly often.

So, in a workload where everything (checkpoint, bgwriter, backend
writes) is flushed: 2995 tps
After I kill the wal writer with -STOP: 10887 tps

Stracing the wal writer shows:

17:29:02.001517 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17857, 
si_uid=1000} ---
17:29:02.001538 rt_sigreturn({mask=[]}) = 0
17:29:02.001582 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily 
unavailable)
17:29:02.001615 write(3, 
"\210\320\5\0\1\0\0\0\0@\330_/\0\0\0w\f\0\0\0\0\0\0\0\4\0\2\t\30\0\372"..., 
49152) = 49152
17:29:02.001671 fdatasync(3)= 0
17:29:02.005022 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17825, 
si_uid=1000} ---
17:29:02.005043 rt_sigreturn({mask=[]}) = 0
17:29:02.005081 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily 
unavailable)
17:29:02.005111 write(3, 
"\210\320\5\0\1\0\0\0\0\0\331_/\0\0\0\7\26\0\0\0\0\0\0T\251\0\0\0\0\0\0"..., 
8192) = 8192
17:29:02.005147 fdatasync(3)= 0
17:29:02.008688 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17866, 
si_uid=1000} ---
17:29:02.008705 rt_sigreturn({mask=[]}) = 0
17:29:02.008730 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily 
unavailable)
17:29:02.008757 write(3, "\210\320\5\0\1\0\0\0\0 
\331_/\0\0\0\267\30\0\0\0\0\0\0"..., 98304) = 98304
17:29:02.008822 fdatasync(3)= 0
17:29:02.016125 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17865, 
si_uid=1000} ---
17:29:02.016141 rt_sigreturn({mask=[]}) = 0
17:29:02.016174 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily 
unavailable)
17:29:02.016204 write(3, 
"\210\320\5\0\1\0\0\0\0\240\332_/\0\0\0s\5\0\0\0\0\0\0\t\30\0\2|8\2u"..., 
57344) = 57344
17:29:02.016281 fdatasync(3)= 0
17:29:02.019181 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17865, 
si_uid=1000} ---
17:29:02.019199 rt_sigreturn({mask=[]}) = 0
17:29:02.019226 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily 
unavailable)
17:29:02.019249 write(3, 
"\210\320\5\0\1\0\0\0\0\200\333_/\0\0\0\307\f\0\0\0\0\0\0"..., 73728) = 
73728
17:29:02.019355 fdatasync(3)= 0
17:29:02.022680 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17865, 
si_uid=1000} ---
17:29:02.022696 rt_sigreturn({mask=[]}) = 0

I.e. we're fdatasync()ing small numbers of pages. Roughly 500 times a
second. As soon as the wal writer is stopped, it's much bigger chunks,
on the order of 50-130 pages. And, not that surprisingly, that improves
performance, because there's far fewer cache flushes submitted to the
hardware.
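
(Output like the above can be obtained with something along the lines of
"strace -tt -p <wal writer pid> -e trace=read,write,fdatasync"; the exact
invocation used here may have differed.)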


> I'm running some tests similar to those above...

> Do you do some warmup when testing? I guess the answer is "no".

Doesn't make a difference here, I tried both. As long as before/after
benchmarks start from the same state...


> I understand that you have 8 cores/16 threads on your host?

On one of them, 4 cores/8 threads on the laptop.


> Loading the scale 800 data for 300-second tests takes much more than 300
> seconds (init takes ~360 seconds, vacuum & index are slow). With 30-second
> checkpoint cycles and without any warmup, I feel that these tests are really
> on the very short (too short) side, so I'm not sure how much I can trust
> such results as significant. The data I reported were with more real-life-like
> parameters.

I see exactly the same with 300s or 1000s checkpoint cycles, it just
takes a lot longer to repeat. They're also similar (although obviously
both before/after patch are higher) if I disable full_page_writes,
thereby eliminating a lot of other IO.

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpointer continuous flushing

2016-01-16 Thread Fabien COELHO


Hello Andres,


Hello Tomas.


Ooops, sorry Andres, I mixed up the thread in my head, so it was not clear
who was asking the questions to whom.



I was/am using ext4, and it turns out that, when enabling flushing, the
results are hugely dependent on barriers=on/off, with the latter making
flushing rather advantageous. Additionally data=ordered/writeback makes a
measurable difference too.


These are very interesting tests, I'm looking forward to having a look at the
results.


The fact that these options change performance is expected. Personally the 
test I submitted on the thread used ext4 with default mount options plus 
"relatime".


I confirm that: nothing special but "relatime" on ext4 on my test host.

If I had a choice, I would tend to take the safest options, because the point 
of a database is to keep data safe. That's why I'm not found of the 
"synchronous_commit=off" chosen above.


"found" -> "fond". I confirm this opinion. If you have BBU on you 
disk/raid system probably playing with some of these options is safe, 
though. Not the case with my basic hardware.


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpointer continuous flushing

2016-01-16 Thread Fabien COELHO


Hello Andres,


I measured it in a different number of cases, both on SSDs and spinning
rust. I just reproduced it with:

postgres-ckpt14 \
   -D /srv/temp/pgdev-dev-800/ \
   -c maintenance_work_mem=2GB \
   -c fsync=on \
   -c synchronous_commit=off \
   -c shared_buffers=2GB \
   -c wal_level=hot_standby \
   -c max_wal_senders=10 \
   -c max_wal_size=100GB \
   -c checkpoint_timeout=30s

Using a fresh cluster each time (copied from a "template" to save time)
and using
pgbench -M prepared -c 16 -j 16 -T 300 -P 1


I'm running some tests similar to those above...

Do you do some warmup when testing? I guess the answer is "no".

I understand that you have 8 cores/16 threads on your host?

Loading the scale 800 data for 300-second tests takes much more than 300
seconds (init takes ~360 seconds, vacuum & index are slow). With 30-second
checkpoint cycles and without any warmup, I feel that these tests
are really on the very short (too short) side, so I'm not sure how much I
can trust such results as significant. The data I reported were with more
real-life-like parameters.


Anyway, I'll have some results to show with a setting more or less similar 
to yours.


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpointer continuous flushing

2016-01-15 Thread Andres Freund
Hi Fabien,

On 2016-01-11 14:45:16 +0100, Andres Freund wrote:
> I measured it in a different number of cases, both on SSDs and spinning
> rust. I just reproduced it with:
> 
> postgres-ckpt14 \
> -D /srv/temp/pgdev-dev-800/ \
> -c maintenance_work_mem=2GB \
> -c fsync=on \
> -c synchronous_commit=off \
> -c shared_buffers=2GB \
> -c wal_level=hot_standby \
> -c max_wal_senders=10 \
> -c max_wal_size=100GB \
> -c checkpoint_timeout=30s

What kernel, filesystem and filesystem option did you measure with?

I was/am using ext4, and it turns out that, when enabling flushing, the
results are hugely dependent on barriers=on/off, with the latter making
flushing rather advantageous. Additionally data=ordered/writeback makes a
measurable difference too.
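
(Concretely, the comparison is between mounts along the lines of

    mount -o data=writeback,barrier=0 /dev/XXX /mountpoint

and the ext4 defaults, data=ordered,barrier=1; device and mountpoint above
are just placeholders.)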

Reading kernel sources trying to understand some more of the performance
impact.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpointer continuous flushing

2016-01-15 Thread Fabien COELHO



Hi Fabien,


Hello Tomas.


On 2016-01-11 14:45:16 +0100, Andres Freund wrote:

I measured it in a different number of cases, both on SSDs and spinning
rust. I just reproduced it with:

postgres-ckpt14 \
-D /srv/temp/pgdev-dev-800/ \
-c maintenance_work_mem=2GB \
-c fsync=on \
-c synchronous_commit=off \
-c shared_buffers=2GB \
-c wal_level=hot_standby \
-c max_wal_senders=10 \
-c max_wal_size=100GB \
-c checkpoint_timeout=30s


What kernel, filesystem and filesystem option did you measure with?


Andres did these measurements, not me, so I do not know.


I was/am using ext4, and it turns out that, when enabling flushing, the
results are hugely dependent on barriers=on/off, with the latter making
flushing rather advantageous. Additionally data=ordered/writeback makes a
measurable difference too.


These are very interesting tests, I'm looking forward to having a look at
the results.


The fact that these options change performance is expected. Personally the 
test I submitted on the thread used ext4 with default mount options plus 
"relatime".


If I had a choice, I would tend to take the safest options, because the 
point of a database is to keep data safe. That's why I'm not found of the 
"synchronous_commit=off" chosen above.



Reading kernel sources trying to understand some more of the performance
impact.


Wow!

--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


  1   2   3   >