Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

2016-01-17 Thread Fabien COELHO



Coming in late here, but I always thought the fact that the FPWs happen
mostly at the start of the checkpoint, and the checkpoint writes/fsyncs
happen mostly in the first half of the checkpoint period, was
suboptimal, i.e. it would be nice if one of these was more active in the
second half of the checkpoint period.  I assume that is what is being
discussed here.


Yes, this is the subject of the thread.

On the one hand, whether it is the first half, the first quarter, or the
first tenth really depends on the actual load, so how much to rebalance
depends on that dynamic information. At the beginning there should be a
short spike for index pages, which are quickly reused, and a longer spike
for data pages, depending on the access pattern and the size of the table.


On the other hand, the rebalancing also depends on the measure chosen to
track overall progress, either WAL written or time, and the two do not
behave the same, so this should be taken into account.


My conclusion is that there is no simple static fix to this issue, such as
the one proposed in the submitted patch. The problem needs thinking and maths.


--
Fabien.




Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

2016-01-16 Thread Bruce Momjian
On Wed, Dec 23, 2015 at 04:37:00PM +0100, Fabien COELHO wrote:
> Hmmm. Let us try with both hands:
> 
> AFAICR with xlog-triggered checkpoints, the checkpointer progress is
> measured with respect to the size of the WAL file, which does not
> grow linearly in time for the reason you pointed out above (a lot of FPW
> at the beginning, less in the end). As the WAL file is growing
> quickly, the checkpointer thinks that it is late and that it has
> some catching up to do, so it will start to try writing quickly as well.
> There is a double whammy as both are writing more, and are probably
> not succeeding.
> 
> For time-triggered checkpoints, the WAL file gets filled up *but*
> the checkpointer load is balanced against time. This is a "simple"
> whammy, where the checkpointer uses I/O bandwidth which is needed for
> the WAL, and it could wait a little bit because the WAL will need
> less later, but it is not trying to catch up by writing even more,
> so the load shifting needed in this case is not the same as in the
> previous case.
> 
> As you point out there is a WAL spike in both cases, but in one case
> there is also a checkpointer spike and in the other the checkpointer
> load is flat.

Coming in late here, but I always thought the fact that the FPWs happen
mostly at the start of the checkpoint, and the checkpoint writes/fsyncs
happen mostly in the first half of the checkpoint period, was
suboptimal, i.e. it would be nice if one of these was more active in the
second half of the checkpoint period.  I assume that is what is being
discussed here.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription +




Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

2015-12-23 Thread Robert Haas
On Wed, Dec 23, 2015 at 9:22 AM, Fabien COELHO  wrote:
>> Wait, what?  On what workload does the FPW spike last only a few
>> seconds? [...]
>
> Ok. AFAICR, a relatively small part at the beginning of the checkpoint, but
> possibly more than a few seconds.

On a pgbench test, and probably many other workloads, the impact of
FPWs declines exponentially (or maybe geometrically, but I think
exponentially) as we get further into the checkpoint.  The first write
is dead certain to need an FPW; after that, if access is more or less
random, the chance of not needing an FPW for the next write increases in
proportion to the number of FPWs already written.  As the chances of
NOT needing an FPW grow higher, the tps rate starts to increase,
initially just a bit, but then faster and faster as the percentage of
the working set that has already had an FPW grows.  If the working set
is large, we're still doing FPWs pretty frequently when the next
checkpoint hits - if it's small, then it'll tail off sooner.
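
A toy model shows that shape clearly. The following is only an illustrative
standalone sketch (uniform random access over a made-up working set of
N_PAGES pages; nothing here is taken from the patch): it prints, for each
batch of N_PAGES writes, the fraction that touch a page for the first time
since the "checkpoint" started, i.e. the fraction that would need an FPW.

    /* Toy model of FPW decay under uniform random page access. */
    #include <stdio.h>
    #include <stdlib.h>

    #define N_PAGES  100000           /* working set size, arbitrary */
    #define N_WRITES (4 * N_PAGES)    /* writes in one checkpoint cycle */

    int
    main(void)
    {
        static char touched[N_PAGES]; /* zero-initialized */
        int     fpw_in_batch = 0;

        srandom(42);
        for (int n = 1; n <= N_WRITES; n++)
        {
            long    page = random() % N_PAGES;

            if (!touched[page])
            {
                touched[page] = 1;    /* first touch => full-page image */
                fpw_in_batch++;
            }
            if (n % N_PAGES == 0)
            {
                printf("batch %d: FPW fraction %.3f\n",
                       n / N_PAGES, (double) fpw_in_batch / N_PAGES);
                fpw_in_batch = 0;
            }
        }
        return 0;
    }

With these parameters the fractions come out near 0.63, 0.23, 0.09, 0.03 -
a geometric-looking decay, which is exactly the tail-off described above.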

> My actual point is that it should be tested with different and especially
> smaller values, because 1.5 changes the overall load distribution *a lot*.
> For testing purposes I suggested that a GUC would help, but the patch author
> has never come back to the thread to discuss the arguments or provide
> another patch.

Well, somebody else should be able to hack a GUC into the patch.

I think one thing that this conversation exposes is that the size of
the working set matters a lot.   For example, if the workload is
pgbench, you're going to see a relatively short FPW-related spike at
scale factor 100, but at scale factor 3000 it's going to be longer and
at some larger scale factor it will be longer still.  Therefore you're
probably right that 1.5 is unlikely to be optimal for everyone.

Another point (which Jan Wieck made me think of) is that the optimal
behavior here likely depends on whether xlog and data are on the same
disk controller.  If they aren't, the FPW spike and background writes
may not interact as much.

>>> Another issue I raised is that the load change occurs both with xlog- and
>>> time-triggered checkpoints, and I'm sure it should be applied in both
>>> cases.
>>
>> Is this sentence missing a "not"?
> Indeed. I think that it makes sense for xlog-triggered checkpoints, but less
> so with time-triggered checkpoints. I may be wrong, but I think that this
> deserves careful analysis.

Hmm, off-hand I don't see why that should make any difference.  No
matter what triggers the checkpoint, there is going to be a spike of
FPI activity at the beginning.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

2015-12-23 Thread Fabien COELHO


Hello Robert,


I think that the 1.5 value somewhere in the patch is much too high for the
purpose, because it shifts the checkpoint load quite a lot (50% more load at
the end of the checkpoint) just to avoid a spike which lasts a few seconds
(I think) at the beginning. A much smaller value should be used
(1.0 <= factor < 1.1), as it would be much less disruptive and would
probably avoid the issue just the same. I recommend not committing with a
1.5 factor in any case.


Wait, what?  On what workload does the FPW spike last only a few
seconds? [...]


Ok. AFAICR, a relatively small part at the beginning of the checkpoint,
but possibly more than a few seconds.


My actual point is that it should be tested with different and especially 
smaller values, because 1.5 changes the overall load distribution *a lot*. 
For testing purposes I suggested that a GUC would help, but the patch
author has never come back to the thread to discuss the arguments or
provide another patch.



Another issue I raised is that the load change occurs both with xlog- and
time-triggered checkpoints, and I'm sure it should be applied in both cases.


Is this sentence missing a "not"?


Indeed. I think that it makes sense for xlog-triggered checkpoints, but
less so with time-triggered checkpoints. I may be wrong, but I think that
this deserves careful analysis.


--
Fabien.




Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

2015-12-23 Thread Fabien COELHO


Hello Robert,


On a pgbench test, and probably many other workloads, the impact of
FPWs declines exponentially (or maybe geometrically, but I think
exponentially) as we get further into the checkpoint.


Indeed. If the probability of hitting a page is uniform, I think that the 
FPW probability is exp(-n/N) for the n-th page access.
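
For completeness, a one-line sketch of where that comes from, assuming N
equally likely pages and independent accesses: the n-th access needs an FPW
exactly when its page was untouched by the previous n-1 accesses, so

    P(FPW at access n) = (1 - 1/N)^(n-1) ~ exp(-n/N)    for large N.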


The first write is dead certain to need an FPW; after that, if access is 
more or less random, the chance of not needing an FPW for the next write 
increases in proportion to the number of FPWs already written.  As the 
chances of NOT needing an FPW grow higher, the tps rate starts to 
increase, initially just a bit, but then faster and faster as the 
percentage of the working set that has already had an FPW grows.  If the 
working set is large, we're still doing FPWs pretty frequently when the 
next checkpoint hits - if it's small, then it'll tail off sooner.


Yes.


My actual point is that it should be tested with different and especially
smaller values, because 1.5 changes the overall load distribution *a lot*.
For testing purposes I suggested that a GUC would help, but the patch author
has never come back to the thread to discuss the arguments or provide
another patch.


Well, somebody else should be able to hack a GUC into the patch.


Yep. But I'm so far behind everything that I was basically waiting for the 
author to do it:-)



I think one thing that this conversation exposes is that the size of
the working set matters a lot.   For example, if the workload is
pgbench, you're going to see a relatively short FPW-related spike at
scale factor 100, but at scale factor 3000 it's going to be longer and
at some larger scale factor it will be longer still.  Therefore you're
probably right that 1.5 is unlikely to be optimal for everyone.

Another point (which Jan Wieck made me think of) is that the optimal
behavior here likely depends on whether xlog and data are on the same
disk controller.  If they aren't, the FPW spike and background writes
may not interact as much.


Yep, I pointed that out as well. In that case the patch just disrupts the
checkpoint load for no benefit... which would make a GUC mandatory.


[...]. I think that it makes sense for xlog-triggered checkpoints, but
less so with time-triggered checkpoints. I may be wrong, but I think
that this deserves careful analysis.


Hmm, off-hand I don't see why that should make any difference.  No
matter what triggers the checkpoint, there is going to be a spike of
FPI activity at the beginning.


Hmmm. Let us try with both hands:

AFAICR with xlog-triggered checkpoints, the checkpointer progress is 
measured with respect to the size of the WAL file, which does not grow 
linearly in time for the reason you pointed out above (a lot of FPW at the 
beginning, less in the end). As the WAL file is growing quickly, the 
checkpointer thinks that it is late and that it has some catching up to do, so 
it will start to try writing quickly as well. There is a double whammy as 
both are writing more, and are probably not succeeding.


For time-triggered checkpoints, the WAL file gets filled up *but* the
checkpointer load is balanced against time. This is a "simple" whammy,
where the checkpointer uses I/O bandwidth which is needed for the WAL, and
it could wait a little bit because the WAL will need less later, but it is
not trying to catch up by writing even more, so the load shifting needed
in this case is not the same as in the previous case.


As you point out there is a WAL spike in both cases, but in one case there 
is also a checkpointer spike and in the other the checkpointer load is 
flat.


So I think that the correction should not be the same in both cases.
Moreover, no correction is needed if WAL & relations are on different
disks. Also, as you pointed out, it also depends on the load (for a large
database the FPWs are spread more evenly, for smaller databases there is a
spike), so the corrective formula should take that information into
account, which means that some evaluation of the FPW distribution would
have to be collected...


All this is non-trivial. I may do some math to try to solve this, but I'm
pretty sure that a blanket 1.5 correction in all cases is not the solution.


--
Fabien.




Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

2015-12-23 Thread Robert Haas
On Wed, Dec 23, 2015 at 2:16 PM, Tomas Vondra
 wrote:
>> Another point (which Jan Wieck made me think of) is that the optimal
>> behavior here likely depends on whether xlog and data are on the same
>> disk controller. If they aren't, the FPW spike and background writes
>> may not interact as much.
>
> I'm not sure what exactly you mean by "optimal behavior" here. Surely if you
> want to minimize interference between WAL and regular I/O, you'll do that.
>
> But I don't see what that has to do with the writes generated by the
> checkpoint? If we do many more writes at the beginning of the checkpoint
> (due to getting confused by FPWs), and the OS starts flushing that to disk
> because we exceed dirty_(background)_bytes, that surely interferes with
> reads (which is a major issue for queries).

Well, it's true that the checkpointer dirty page writes could
interfere with reads, but if you've also got lots of FPW-bloated WAL
records being written to the same disk at the same time, I would think
that'd be worse.  No?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

2015-12-23 Thread Tomas Vondra

Hi,

On 12/23/2015 03:38 PM, Robert Haas wrote:


I think one thing that this conversation exposes is that the size of
the working set matters a lot. For example, if the workload is
pgbench, you're going to see a relatively short FPW-related spike at
scale factor 100, but at scale factor 3000 it's going to be longer
and at some larger scale factor it will be longer still. Therefore
you're probably right that 1.5 is unlikely to be optimal for
everyone.


Right.

Also, when you say "pgbench" you probably mean the default uniform 
distribution. But we now have gaussian and exponential distributions 
which might be handy to simulate other types of workloads.




Another point (which Jan Wieck made me think of) is that the optimal
behavior here likely depends on whether xlog and data are on the same
disk controller. If they aren't, the FPW spike and background writes
may not interact as much.


I'm not sure what exactly you mean by "optimal behavior" here. Surely if 
you want to minimize interference between WAL and regular I/O, you'll do 
that.


But I don't see what that has to do with the writes generated by the 
checkpoint? If we do many more writes at the beginning of the checkpoint 
(due to getting confused by FPWs), and the OS starts flushing that to disk 
because we exceed dirty_(background)_bytes, that surely interferes with 
reads (which is a major issue for queries).


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

2015-12-23 Thread Tomas Vondra



On 12/21/2015 01:11 PM, Heikki Linnakangas wrote:

On 21/12/15 13:53, Tomas Vondra wrote:

On 12/21/2015 12:03 PM, Heikki Linnakangas wrote:

On 17/12/15 19:07, Robert Haas wrote:

If it works well empirically, does it really matter that it's
arbitrary? I mean, the entire planner is full of fairly arbitrary
assumptions about which things to consider in the cost model and
which to ignore. The proof that we have made good decisions there
is in the query plans it generates. (The proof that we have made
bad decisions in some cases in the query plans, too.)


Agreed.


What if it only seems to work well because it was tested on cases it was
designed for? What about the workloads that behave differently?

Whenever we make changes to costing and query planning, we carefully
consider counter-examples and cases where it might fail. I see nothing
like that in this thread - all I see is a bunch of pgbench tests, which
seems rather insufficient to me.


Agreed on that too.


I'm ready to spend some time on this, assuming we can agree on what
tests to run. Can we come up with realistic workloads where we expect
the patch might actually work poorly?


I think the worst case scenario would be the case where there is no
FPW-related WAL burst at all, and checkpoints are always triggered by
max_wal_size rather than checkpoint_timeout. In that scenario, the
compensation formula will cause the checkpoint to be too lazy in the
beginning, and it will have to catch up more aggressively towards the
end of the checkpoint cycle.

One such scenario might be to do only COPYs into a table with no
indexes. Or hack pgbench to concentrate all the updates on only a few
rows. There will be an FPW on those few pages initially, but the
spike will be much shorter. Or turn full_page_writes=off, and hack the
patch to do compensation even when full_page_writes=off, and then just
run pgbench.


OK, the COPY scenario sounds interesting and also realistic, because it
probably applies to systems doing batch loads.


So that's one test to do, can we come up with some other?

We probably do want to do a bunch of pgbench tests, with various scales
and also distributions - the gaussian/exponential distributions seem
useful for simulating OLTP systems that usually have just a small active
set (instead of touching all the data). This surely affects how much FPW
we do and at what point - my expectation is that the non-uniform
distributions will have a long tail of FPWs.


So I was thinking about these combinations:

* modes: uniform, gaussian, exponential
* scales: 1000 (15GB), 10000 (150GB)
* clients: 1, 2, 4, 8, 16 (to see impact on scalability, if any)

Each combination needs to run for at least an hour or two, possibly with 
multiple runs. I'll also try running this both on an SSD-based system and a 
system with 10k drives, because those will probably behave differently.
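
For the gaussian/exponential modes, a minimal custom script along these
lines should do; this is a sketch assuming the \setrandom
gaussian/exponential syntax added in 9.4, with an arbitrarily chosen 5.0
threshold:

    \set naccounts 100000 * :scale
    \setrandom aid 1 :naccounts gaussian 5.0
    UPDATE pgbench_accounts SET abalance = abalance + 1 WHERE aid = :aid;

run against an initialized database with something like
"pgbench -n -f gauss.sql -c 8 -j 8 -T 7200".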


Also, are we tracking the amount of FPW during the checkpoint, 
somewhere? That'd be useful, at least for this patch. Or do we need to 
just track the amount of WAL produced?


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

2015-12-23 Thread Robert Haas
On Wed, Dec 23, 2015 at 10:37 AM, Fabien COELHO  wrote:
> Hmmm. Let us try with both hands:
>
> AFAICR with xlog-triggered checkpoints, the checkpointer progress is
> measured with respect to the size of the WAL file, which does not grow
> linearly in time for the reason you pointed out above (a lot of FPW at the
> beginning, less in the end). As the WAL file is growing quickly, the
> checkpointer thinks that it is late and that it has some catching up to do, so
> it will start to try writing quickly as well. There is a double whammy as
> both are writing more, and are probably not succeeding.
>
> For time-triggered checkpoints, the WAL file gets filled up *but* the
> checkpointer load is balanced against time. This is a "simple" whammy, where
> the checkpointer uses I/O bandwidth which is needed for the WAL, and it could
> wait a little bit because the WAL will need less later, but it is not trying
> to catch up by writing even more, so the load shifting needed in this case
> is not the same as in the previous case.

I see your point, but this isn't a function of what triggered the
checkpoint.  It's a function of how we measure whether the
already-triggered checkpoint is on schedule - we may be behind either
because of time, or because of xlog, or both.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

2015-12-23 Thread Tomas Vondra

Hi,

On 12/23/2015 08:22 PM, Robert Haas wrote:

On Wed, Dec 23, 2015 at 2:16 PM, Tomas Vondra
 wrote:

Another point (which Jan Wieck made me think of) is that the optimal
behavior here likely depends on whether xlog and data are on the same
disk controller. If they aren't, the FPW spike and background writes
may not interact as much.


I'm not sure what exactly you mean by "optimal behavior" here. Surely if you
want to minimize interference between WAL and regular I/O, you'll do that.

But I don't see what that has to do with the writes generated by the
checkpoint? If we do many more writes at the beginning of the checkpoint
(due to getting confused by FPWs), and the OS starts flushing that to disk
because we exceed dirty_(background)_bytes, that surely interferes with
reads (which is a major issue for queries).


Well, it's true that the checkpointer dirty page writes could
interfere with reads, but if you've also got lots of FPW-bloated WAL
records being written to the same disk at the same time, I would think
that'd be worse.  No?


Yes, sure. My point was that in both cases the "optimal behavior" is not
to get confused by the initially higher amount of WAL (due to FPWs), and
to track the "real" un-skewed checkpoint progress.


Placing both data and WAL on the same device/controller makes the 
interference worse, especially when we have a lot of FPW at the 
beginning of the checkpoint.


I.e. there's only one "optimal" behavior for both cases.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

2015-12-23 Thread Fabien COELHO



AFAICR with xlog-triggered checkpoints, the checkpointer progress is
measured with respect to the size of the WAL file, which does not grow
linearly in time for the reason you pointed out above (a lot of FPW at the
beginning, less in the end). As the WAL file is growing quickly, the
checkpointer thinks that it is late and that it has some catching up to do, so
it will start to try writing quickly as well. There is a double whammy as
both are writing more, and are probably not succeeding.

For time-triggered checkpoints, the WAL file gets filled up *but* the
checkpointer load is balanced against time. This is a "simple" whammy, where
the checkpointer uses I/O bandwidth which is needed for the WAL, and it could
wait a little bit because the WAL will need less later, but it is not trying
to catch up by writing even more, so the load shifting needed in this case
is not the same as in the previous case.


I see your point, but this isn't a function of what triggered the
checkpoint.  It's a function of how we measure whether the
already-triggered checkpoint is on schedule - we may be behind either
because of time, or because of xlog, or both.


Yes. Indeed the current implementation uses some combination of both time & xlog.

My reasoning was that for time-triggered checkpoints (probably average to
low load) the time is likely to be used for the checkpoint schedule, while
for xlog-triggered checkpoints (probably higher load) it would be more
likely to be the xlog, which is skewed.


Anyway, careful thinking is needed to balance WAL and checkpointer I/O,
with a correction applied only when needed, not a rough formula applied blindly.


--
Fabien.




Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

2015-12-22 Thread Robert Haas
On Mon, Dec 21, 2015 at 7:51 AM, Fabien COELHO  wrote:
> I think that the 1.5 value somewhere in the patch is much too high for the
> purpose, because it shifts the checkpoint load quite a lot (50% more load at
> the end of the checkpoint) just to avoid a spike which lasts a few seconds
> (I think) at the beginning. A much smaller value should be used
> (1.0 <= factor < 1.1), as it would be much less disruptive and would
> probably avoid the issue just the same. I recommend not committing with a
> 1.5 factor in any case.

Wait, what?  On what workload does the FPW spike last only a few
seconds?  That's certainly not the case in testing I've done.  It
would have to be the case that almost all the writes were concentrated
on a very few pages.

> Another issue I raised is that the load change occurs both with xlog- and
> time-triggered checkpoints, and I'm sure it should be applied in both cases.

Is this sentence missing a "not"?

> Another issue is that the patch makes sense when the WAL & relations are on
> the same disk, but might degrade performance otherwise.

Yes, that would be a good case to test.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

2015-12-21 Thread Heikki Linnakangas

On 17/12/15 19:07, Robert Haas wrote:

On Mon, Dec 14, 2015 at 6:08 PM, Tomas Vondra
 wrote:

So we know that we should expect about

   (prev_wal_bytes - wal_bytes) + (prev_wal_fpw_bytes - wal_fpw_bytes)

   (   regular WAL) + (  FPW WAL )

to be produced until the end of the current checkpoint. I don't have a clear
idea how to transform this into the 'progress' yet, but I'm pretty sure
tracking the two types of WAL is a key to a better solution. The x^1.5 is
probably a step in the right direction, but I don't feel particularly
confident about the 1.5 (which is rather arbitrary).


If it works well empirically, does it really matter that it's
arbitrary?  I mean, the entire planner is full of fairly arbitrary
assumptions about which things to consider in the cost model and which
to ignore.  The proof that we have made good decisions there is in the
query plans it generates.  (The proof that we have made bad decisions
in some cases in the query plans, too.)


Agreed.


I think a bigger problem for this patch is that Heikki seems to have
almost completely disappeared.


Yeah, there's that problem too :-).

The reason I didn't commit this back then was lack of performance 
testing. I'm fairly confident that this would be a significant 
improvement for some workloads, and shouldn't hurt much even in the 
worst case. But I did only a little testing on my laptop. I think Simon 
was in favor of just committing it immediately, and Fabien wanted to see 
more performance testing before committing.


I was hoping that Digoal would re-run his original test case, and report 
back on whether it helps. Fabien had a performance test setup for 
testing another patch, but he didn't want to run it to test this patch. 
Amit did some testing, but didn't see a difference. We can take that as 
a positive sign - no regression - or as a negative sign, but I think 
that basically means that his test was just not sensitive to the FPW issue.


So Tomas, if you're willing to do some testing on this, that would be 
brilliant!


- Heikki





Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

2015-12-21 Thread Fabien COELHO


Hello Heikki,

The reason I didn't commit this back then was lack of performance testing. 
I'm fairly confident that this would be a significant improvement for some 
workloads, and shouldn't hurt much even in the worst case. But I did only a 
little testing on my laptop. I think Simon was in favor of just committing it 
immediately, and



Fabien wanted to see more performance testing before committing.


I confirm. To summarize my opinion:

I think that the 1.5 value somewhere in the patch is much too high for the
purpose, because it shifts the checkpoint load quite a lot (50% more load
at the end of the checkpoint) just to avoid a spike which lasts a few
seconds (I think) at the beginning. A much smaller value should be used
(1.0 <= factor < 1.1), as it would be much less disruptive and would
probably avoid the issue just the same. I recommend not committing with a
1.5 factor in any case.


Another issue I raised is that the load change occurs both with xlog- and
time-triggered checkpoints, and I'm sure it should be applied in both cases.


Another issue is that the patch makes sense when the WAL & relations are 
on the same disk, but might degrade performance otherwise.


Another point is that it potentially interacts with a patch I submitted
which has a large impact on performance (an order of magnitude better in
some cases, by sorting & flushing blocks on checkpoints), so it would make
sense to check that.


So more testing is definitely needed. A GUC would be nice for this
purpose, especially to look at different factors.


I was hoping that Digoal would re-run his original test case, and report
back on whether it helps. Fabien had a performance test setup for
testing another patch, but he didn't want to run it to test this patch.


Indeed I have, but I'm quite behind at the moment, so I cannot promise
anything. Moreover, I'm not sure I see this "spike" issue in my setup,
AFAICR.


--
Fabien.




Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

2015-12-21 Thread Tomas Vondra

Hi,

On 12/21/2015 12:03 PM, Heikki Linnakangas wrote:

On 17/12/15 19:07, Robert Haas wrote:

On Mon, Dec 14, 2015 at 6:08 PM, Tomas Vondra
 wrote:

So we know that we should expect about

   (prev_wal_bytes - wal_bytes) + (prev_wal_fpw_bytes - wal_fpw_bytes)

   (   regular WAL) + (  FPW WAL )

to be produced until the end of the current checkpoint. I don't
have a clear idea how to transform this into the 'progress' yet,
but I'm pretty sure tracking the two types of WAL is a key to a
better solution. The x^1.5 is probably a step in the right
direction, but I don't feel particularly confident about the 1.5
(which is rather arbitrary).


If it works well empirically, does it really matter that it's
arbitrary? I mean, the entire planner is full of fairly arbitrary
assumptions about which things to consider in the cost model and
which to ignore. The proof that we have made good decisions there
is in the query plans it generates. (The proof that we have made
bad decisions in some cases in the query plans, too.)


Agreed.


What if it only seems to work well because it was tested on cases it was 
designed for? What about the workloads that behave differently?


Whenever we make changes to costing and query planning, we carefully 
consider counter-examples and cases where it might fail. I see nothing 
like that in this thread - all I see is a bunch of pgbench tests, which 
seems rather insufficient to me.





I think a bigger problem for this patch is that Heikki seems to have
almost completely disappeared.


Yeah, there's that problem too :-).

The reason I didn't commit this back then was lack of performance
testing. I'm fairly confident that this would be a significant
improvement for some workloads, and shouldn't hurt much even in the
worst case. But I did only a little testing on my laptop. I think
Simon was in favor of just committing it immediately, and Fabien
wanted to see more performance testing before committing.

I was hoping that Digoal would re-run his original test case, and
report back on whether it helps. Fabien had a performance test setup,
for testing another patch, but he didn't want to run it to test this
patch. Amit did some testing, but didn't see a difference. We can
take that as a positive sign - no regression - or as a negative sign,
but I think that basically means that his test was just not sensitive
to the FPW issue.

So Tomas, if you're willing to do some testing on this, that would
be brilliant!


I'm ready to spend some time on this, assuming we can agree on what 
tests to run. Can we come up with realistic workloads where we expect 
the patch might actually work poorly?


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

2015-12-21 Thread Heikki Linnakangas

On 21/12/15 13:53, Tomas Vondra wrote:

On 12/21/2015 12:03 PM, Heikki Linnakangas wrote:

On 17/12/15 19:07, Robert Haas wrote:

If it works well empirically, does it really matter that it's
arbitrary? I mean, the entire planner is full of fairly arbitrary
assumptions about which things to consider in the cost model and
which to ignore. The proof that we have made good decisions there
is in the query plans it generates. (The proof that we have made
bad decisions in some cases in the query plans, too.)


Agreed.


What if it only seems to work well because it was tested on cases it was
designed for? What about the workloads that behave differently?

Whenever we make changes to costing and query planning, we carefully
consider counter-examples and cases where it might fail. I see nothing
like that in this thread - all I see is a bunch of pgbench tests, which
seems rather insufficient to me.


Agreed on that too.


I'm ready to spend some time on this, assuming we can agree on what
tests to run. Can we come up with realistic workloads where we expect
the patch might actually work poorly?


I think the worst case scenario would be the case where there is no 
FPW-related WAL burst at all, and checkpoints are always triggered by 
max_wal_size rather than checkpoint_timeout. In that scenario, the 
compensation formula will cause the checkpoint to be too lazy in the 
beginning, and it will have to catch up more aggressively towards the 
end of the checkpoint cycle.


One such scenario might be to do only COPYs into a table with no 
indexes. Or hack pgbench to concentrate all the updates on only a few 
rows. There will be an FPW on those few pages initially, but the 
spike will be much shorter. Or turn full_page_writes=off, and hack the 
patch to do compensation even when full_page_writes=off, and then just 
run pgbench.


- Heikki





Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

2015-12-17 Thread Robert Haas
On Mon, Dec 14, 2015 at 6:08 PM, Tomas Vondra
 wrote:
> So we know that we should expect about
>
>   (prev_wal_bytes - wal_bytes) + (prev_wal_fpw_bytes - wal_fpw_bytes)
>
>   (   regular WAL) + (  FPW WAL )
>
> to be produced until the end of the current checkpoint. I don't have a clear
> idea how to transform this into the 'progress' yet, but I'm pretty sure
> tracking the two types of WAL is a key to a better solution. The x^1.5 is
> probably a step in the right direction, but I don't feel particularly
> confident about the 1.5 (which is rather arbitrary).

If it works well empirically, does it really matter that it's
arbitrary?  I mean, the entire planner is full of fairly arbitrary
assumptions about which things to consider in the cost model and which
to ignore.  The proof that we have made good decisions there is in the
query plans it generates.  (The proof that we have made bad decisions
in some cases in the query plans, too.)

I think a bigger problem for this patch is that Heikki seems to have
almost completely disappeared.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

2015-12-14 Thread Tomas Vondra

Hi,

I was planning to do some review/testing on this patch, but then I 
noticed it was rejected with feedback in 2015-07 and never resubmitted 
into another CF. So I won't waste time in testing this unless someone 
shouts that I should do that anyway. Instead I'll just post some ideas 
about how we might improve the patch, because I'd forget about them 
otherwise.


On 07/05/2015 09:48 AM, Heikki Linnakangas wrote:


The ideal correction formula f(x) would be such that f(g(X)) = X, where:

  X is time, 0 = beginning of checkpoint, 1.0 = targeted end of
checkpoint (checkpoint_segments), and

  g(X) is the amount of WAL generated. 0 = beginning of checkpoint, 1.0
= targeted end of checkpoint (derived from max_wal_size).

Unfortunately, we don't know the shape of g(X), as that depends on the
workload. It might be linear, if there is no effect at all from
full_page_writes. Or it could be a step-function, where every write
causes a full page write, until all pages have been touched, and after
that none do (something like an UPDATE without a where-clause might
cause that). In pgbench-like workloads, it's something like sqrt(x). I
picked X^1.5 as a reasonable guess. It's close enough to linear that it
shouldn't hurt too much if g(x) is linear. But it cuts the worst spike
at the very beginning, if g(x) is more like sqrt(x).


Exactly. I think the main "problem" here is that we do mix two types of 
WAL records, with quite different characteristics:


 (a) full_page_writes - very high volume right after checkpoint, then
 usually drops to much lower volume

 (b) regular records - about the same volume over time (well, lower
 volume right after the checkpoint, as that's where FPWs happen)

We completely ignore this when computing elapsed_xlogs, because we 
compute it (about) like this:


elapsed_xlogs = wal_since_checkpoint / CheckPointSegments;

which of course gets confused when we write a lot of WAL right after a 
checkpoint, because of FPW. But what if we actually tracked the amount 
of WAL produced by FPW in a checkpoint (which we currently don't, AFAIK)?
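
(For context, the xlog-side schedule check is roughly the following,
paraphrased from memory of 9.5's checkpointer.c - names and casts are
approximate, not a verbatim quote:

    /* xlog-based part of IsCheckpointOnSchedule(progress) */
    recptr = GetInsertRecPtr();
    elapsed_xlogs = (((double) (recptr - ckpt_start_recptr)) / XLogSegSize) /
                    CheckPointSegments;
    if (progress < elapsed_xlogs)
        return false;       /* deemed behind schedule: hurry up */

so any FPW burst directly inflates elapsed_xlogs.)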


Then we could compute the expected *remaining* amount of WAL to be 
produced within the checkpoint interval, and use that to compute a 
better progress estimate like this:


  wal_bytes  - WAL (total)
  wal_fpw_bytes  - WAL (due to FPW)
  prev_wal_bytes - WAL (total) in previous checkpoint
  prev_wal_fpw_bytes - WAL (due to FPW) in previous checkpoint

So we know that we should expect about

  (prev_wal_bytes - wal_bytes) + (prev_wal_fpw_bytes - wal_fpw_bytes)

  (   regular WAL) + (  FPW WAL )

to be produced until the end of the current checkpoint. I don't have a 
clear idea how to transform this into the 'progress' yet, but I'm pretty 
sure tracking the two types of WAL is a key to a better solution. The 
x^1.5 is probably a step in the right direction, but I don't feel 
particularly confident about the 1.5 (which is rather arbitrary).
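
One naive way to fold those counters into a progress estimate - purely a
sketch; all four inputs are the hypothetical counters named above, none of
which exist in the tree today:

    /*
     * Estimate checkpoint progress, predicting this cycle's remaining
     * WAL from the previous cycle's totals, per the formula above.
     */
    static double
    unskewed_progress(double wal_bytes, double wal_fpw_bytes,
                      double prev_wal_bytes, double prev_wal_fpw_bytes)
    {
        double  remaining = (prev_wal_bytes - wal_bytes) +
                            (prev_wal_fpw_bytes - wal_fpw_bytes);

        if (remaining <= 0)
            return 1.0;     /* already past last cycle's volume */
        return wal_bytes / (wal_bytes + remaining);
    }

The first cycle after startup has no history, so it would have to fall
back to the current linear estimate.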


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

2015-08-25 Thread Michael Paquier
On Mon, Jul 6, 2015 at 12:30 PM, Amit Kapila wrote:
 Yes, we definitely want to see the effect on TPS at the beginning of
 checkpoint, but even measuring the IO during checkpoint with the way
 Digoal was capturing the data can show the effect of this patch.

I am marking this patch as returned with feedback.
-- 
Michael




Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

2015-07-05 Thread Heikki Linnakangas

On 07/04/2015 07:34 PM, Fabien COELHO wrote:



In summary, the X^1.5 correction seems to work pretty well. It doesn't
completely eliminate the problem, but it makes it a lot better.


I've looked at the maths.

I think that the load is distributed as the derivative of this function,
that is (1.5 * x ** 0.5): it starts at 0 but very quickly reaches 0.5, it
passes 1.0 (the average load) at about 44% progress (1.5 * sqrt(x) = 1 at
x = 4/9), and ends up at 1.5, that is, the finishing load is 1.5 times the
average load, just before fsyncing files. This looks like a recipe for a
bad time: I would say this is too large an overload. I would suggest a
much lower value, say around 1.1...


Hmm. Load is distributed as a derivative of that, but probably not the way 
you think. Note that X means the amount of WAL consumed, not time. The 
goal is that I/O is constant over time, but the consumption of WAL over 
time is non-linear, with a lot more WAL consumed in the beginning of a 
checkpoint cycle. The function compensates for that.



The other issue with this function is that it should only degrade
performance by disrupting the write distribution if someone has WAL on a
different disk. As I understand it this thing only makes sense if the
WAL & the data are on the same disk. This really suggests a GUC.


No, the I/O storm caused by full-page-writes is a problem even if WAL is 
on a different disk. Even though the burst of WAL I/O then happens on a 
different disk, the fact that we consume a lot of WAL in the beginning 
of a checkpoint makes the checkpointer think that it needs to hurry up, 
in order to meet the deadline. It will flush a lot of pages in a rush, 
so you get a burst of I/O on the data disk too. Yes, it's even worse 
when WAL and data are on the same disk, but even then, I think the 
random I/O caused by the checkpointer hurrying is more significant than 
the extra WAL I/O, which is sequential.


To illustrate that, imagine that the checkpoint begins now. The 
checkpointer calculates that it has 10 minutes to complete the 
checkpoint (checkpoint_timeout), or until 1 GB of WAL has been generated 
(derived from max_wal_size), whichever happens first. Immediately after 
the Redo-point has been established, in the very beginning of the 
checkpoint, the WAL storm begins. Every backend that dirties a page also 
writes a full-page image. After just 10 seconds, those backends have 
already written 200 MB of WAL. That's 1/5 of the quota, and based on 
that, the checkpointer will quickly flush 1/5 of all buffers. In 
reality, the WAL consumption is not linear, and will slow down as time 
passes and less full-page writes happen. So in reality, the checkpointer 
would have a lot more time to complete the checkpoint - it is 
unnecessarily aggressive in the beginning of the checkpoint.


The correction factor in the patch compensates for that. With the X^1.5 
formula, when 20% of the WAL has already been consumed, the checkpointer 
will have flushed only ~9% of the buffers (0.2^1.5 ~ 0.089), not 20% as 
without the patch.


The ideal correction formula f(x) would be such that f(g(X)) = X, where:

 X is time, 0 = beginning of checkpoint, 1.0 = targeted end of 
checkpoint (checkpoint_segments), and


 g(X) is the amount of WAL generated. 0 = beginning of checkpoint, 1.0 
= targeted end of checkpoint (derived from max_wal_size).


Unfortunately, we don't know the shape of g(X), as that depends on the 
workload. It might be linear, if there is no effect at all from 
full_page_writes. Or it could be a step-function, where every write 
causes a full page write, until all pages have been touched, and after 
that none do (something like an UPDATE without a where-clause might 
cause that). In pgbench-like workloads, it's something like sqrt(x). I 
picked X^1.5 as a reasonable guess. It's close enough to linear that it 
shouldn't hurt too much if g(x) is linear. But it cuts the worst spike 
at the very beginning, if g(x) is more like sqrt(x).
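
To make the "reasonable guess" concrete: the ideal f is just the inverse
of g, so the two extreme shapes above give

    g(x) = x        (no FPW effect)  =>  ideal f(x) = x
    g(x) = sqrt(x)  (pgbench-like)   =>  ideal f(x) = x^2

and f(x) = x^1.5 sits between the two: at x = 0.2 it yields ~0.089, versus
0.2 for the linear case and 0.04 for the quadratic one.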


This is all assuming that the application load is constant. If it's 
not, g(x) can obviously have any shape, and there's no way we can 
predict that. But that's a different story, nothing to do with 
full_page_writes.



I have run some tests with this patch and the detailed results of the 
runs are attached to this mail.


I do not really understand the aggregated figures in the attached files.


Me neither. It looks like Amit measured the time spent in mdread and 
mdwrite, but I'm not sure what conclusions one can draw from that.



I thought the patch should show a difference if I keep max_wal_size at a
somewhat lower or moderate value so that checkpoints get triggered due to
WAL size, but I am not seeing any major difference in the spreading of
writes.


I'm not sure I understand your point. I would say that at full speed
pgbench the disk is always busy writing as much as possible, either
checkpoint writes or wal writes, so the write load as such should not be
that different anyway?

I understood that the point of the patch is to check whether [...]

Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

2015-07-05 Thread Heikki Linnakangas

On 07/05/2015 08:19 AM, Fabien COELHO wrote:

I am a bit skeptical about this.  We need test scenarios that clearly
show the benefit of having and of not having this behavior. It might be
that doing this always is fine for everyone.


Do you mean I have to prove that there is an actual problem induced from
this patch?


You don't have to do anything if you don't want to. I said myself that 
this needs performance testing of the worst-case scenario, one where we 
would expect this to perform worse than without the patch. Then we can 
look at how bad that effect is, and decide if that's acceptable.


That said, if you could do that testing, that would be great! I'm not 
planning to spend much time on this myself, and it would take me a fair 
amount of time to set up the hardware and tools to test this. I was 
hoping Digoal would have the time to do that, since he started this 
thread, or someone else that has a system ready for this kind of 
testing. If no-one steps up to the plate to test this more, however, 
we'll have to just forget about this.



Having a GUC would also help to test the feature with different values
than 1.5, which really seems harmful from a math point of view. I'm not
sure at all that a power formula is the right approach.


Yeah, a GUC would be helpful in testing this. I'm hoping that we would 
come up with a reasonable formula that would work well enough for 
everyone that we wouldn't need to have a GUC in the final patch, though.


- Heikki





Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

2015-07-05 Thread Fabien COELHO



You don't have to do anything if you don't want to.


Sure :-) What I mean is that I think that this patch is not ripe, and I
understood that some people were suggesting that it could be applied as-is
right away. I really disagree with that.


I said myself that this needs performance testing of the worst-case 
scenario, one where we would expect this to perform worse than without 
the patch. Then we can look at how bad that effect is, and decide if 
that's acceptable.


Ok, I'm fine with that. It's quite different from "looks ok, apply now".


That said, if you could do that testing, that would be great!


Hmmm. I was not really planning to. On the other hand, I have some scripts 
and a small setup that I've been using to test checkpointer flushing, and 
it would be easy to start some tests.



Having a GUC would also help to test the feature with different values
than 1.5, which really seems harmful from a math point of view. I'm not
sure at all that a power formula is the right approach.


Yeah, a GUC would be helpful in testing this. I'm hoping that we would come 
up with a reasonable formula that would work well enough for everyone that we 
wouldn't need to have a GUC in the final patch, though.


Yep. If it is a GUC, testing is quite easy and I may run my scripts...

--
Fabien.




Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

2015-07-05 Thread Fabien COELHO


Hello Heikki,


I think that the load is distributed as the derivative of this function,
that is (1.5 * x ** 0.5): it starts at 0 but very quickly reaches 0.5, it
passes 1.0 (the average load) at about 44% progress (1.5 * sqrt(x) = 1 at
x = 4/9), and ends up at 1.5, that is, the finishing load is 1.5 times the
average load, just before fsyncing files. This looks like a recipe for a
bad time: I would say this is too large an overload. I would suggest a
much lower value, say around 1.1...


Hmm. Load is distributed as a derivative of that, but probably not the way you 
think. Note that X means the amount of WAL consumed, not time.


Interesting point. After a look at IsCheckpointOnSchedule, and if I 
understand the code correctly, it is actually *both*, so it really depends 
on whether the checkpoint was xlog- or time-triggered, and especially which 
one (time/xlog) is prominent at the beginning of the checkpoint.


If it is time-triggered and paced, my reasoning is probably right and 
things will go bad/worse in the end, but if it is xlog-triggered and paced, 
your line of argument is probably closer to what happens.


This suggests that the corrective function should be applied with more 
care, maybe only in the xlog-based on-schedule test, but not in the 
time-based check.
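
Concretely, that would mean compensating only one of the two comparisons
inside IsCheckpointOnSchedule, along these lines (a hypothetical variant
for discussion, not the submitted patch, which compensates unconditionally):

    /* hypothetical: compensate only the xlog-based schedule check */
    if (progress < pow(elapsed_xlogs, 1.5))
        return false;    /* behind per (compensated) WAL consumption */
    if (progress < elapsed_time)
        return false;    /* behind per elapsed time, left linear */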


The goal is that I/O is constant over time, but the consumption of WAL 
over time is non-linear, with a lot more WAL consumed in the beginning 
of a checkpoint cycle. The function compensates for that.


*If* the checkpointer pacing comes from WAL size, which may or may not be 
the case.



[...]

Unfortunately, we don't know the shape of g(X), as that depends on the 
workload. It might be linear, if there is no effect at all from 
full_page_writes. Or it could be a step-function, where every write causes a 
full page write, until all pages have been touched, and after that none do 
(something like an UPDATE without a where-clause might cause that).


If PostgreSQL is running in its cache (i.e. within shared buffers), the 
usual assumption would be an unknown exponential probability decreasing 
with time as the same pages are hit over and over.


If PostgreSQL is running from memory or disk (effective database size 
greater than shared buffers), pages are statistically not reused by another 
update before being written out, so full page writes would always be used 
during the whole checkpoint; there is no WAL storm (or it is always a 
storm, depending on the point of view) and the corrective factor would 
only create issues...


So basically I would say that what to do heavily depends on the database 
size and the checkpoint trigger (time vs xlog), which really suggests that 
a GUC is indispensable, and maybe that the place where the correction is 
applied is currently not the right one.



In pgbench-like workloads, it's something like sqrt(x).


Probably for a small database size?

I picked X^1.5 as a reasonable guess. It's close enough to linear that 
it shouldn't hurt too much if g(x) is linear.


My understanding is still a 50% overload at the end of the checkpoint just 
before issuing fsync... I think that could hurt in some cases.


But it cuts the worst spike at the very beginning, if g(x) is more like 
sqrt(x).


Hmmm. It's a balance between saving the first 10 seconds of the checkpoint 
and risking a panic at the end of it.


Now the right approach might be for pg to know what is happening by 
collecting statistics while running, and to apply a correction when it is 
needed, for the amount needed.


--
Fabien.




Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

2015-07-05 Thread Robert Haas
On Sun, Jul 5, 2015 at 1:19 AM, Fabien COELHO coe...@cri.ensmp.fr wrote:
 Do you mean I have to prove that there is an actual problem induced from
 this patch?

No, I'm not saying anyone *has* to do anything.  What I'm saying is
that I'm not convinced by your analysis.  I don't think we have enough
evidence at this point to conclude that a GUC is necessary, and I hope
it isn't, because I can't imagine what advice we would be able to give
people about how to set it, other than try all the values and see what
works best, which isn't going to be satisfying.

More broadly, I don't really know how to test this patch and show when
it helps and when it hurts.  And I think we need that, rather than
just a theoretical analysis, to tune the behavior.  Heikki, can you
describe what you think a good test setup would be?  Like, what
workload should we run, and what measurements should we gather to see
what the patch is doing that is good or bad?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

2015-07-05 Thread Fabien COELHO



No, I'm not saying anyone *has* to do anything.  What I'm saying is
that I'm not convinced by your analysis.


Well, the gist of my analysis is really to say that there are potential 
performance issues with the proposed change, and that it must be tested 
thoroughly. The details may vary :-)


I don't think we have enough evidence at this point to conclude that a 
GUC is necessary, and I hope it isn't, because I can't imagine what 
advice we would be able to give people about how to set it, other than 
try all the values and see what works best, which isn't going to be 
satisfying.


At least for testing, ISTM that a GUC would be really useful.

More broadly, I don't really know how to test this patch and show when 
it helps and when it hurts. And I think we need that, rather than just a 
theoretical analysis, to tune the behavior.


The point of an analysis is to think about how it works and what to test, 
but it is not a substitute for testing, obviously.


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

2015-07-05 Thread Andres Freund
On 2015-07-05 11:05:28 -0400, Robert Haas wrote:
 More broadly, I don't really know how to test this patch and show when
 it helps and when it hurts.  And I think we need that, rather than
 just a theoretical analysis, to tune the behavior.  Heikki, can you
 describe what you think a good test setup would be?  Like, what
 workload should we run, and what measurements should we gather to see
 what the patch is doing that is good or bad?

I think a good start would be to graph the writeout rate over several
checkpoints.  It'd be cool if there were a better way, but it's probably
easiest to just graph the number of bytes written (using iostat) and the
number of dirty bytes in the kernel. That'll unfortunately include WAL,
but I can't immediately see how to avoid that.
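
For the "dirty bytes in the kernel" series, on Linux that number can be 
sampled from /proc/meminfo; a minimal standalone sampler, shown only as an 
illustration of the measurement (not part of any patch here):

/* Print the kernel's Dirty total from /proc/meminfo once per second. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	char		line[256];

	for (;;)
	{
		FILE	   *f = fopen("/proc/meminfo", "r");

		if (f == NULL)
			return 1;
		while (fgets(line, sizeof(line), f))
		{
			if (strncmp(line, "Dirty:", 6) == 0)
			{
				fputs(line, stdout);	/* e.g. "Dirty:  123456 kB" */
				fflush(stdout);
				break;
			}
		}
		fclose(f);
		sleep(1);
	}
}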


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

2015-07-05 Thread Amit Kapila
On Sun, Jul 5, 2015 at 1:18 PM, Heikki Linnakangas hlinn...@iki.fi wrote:

 On 07/04/2015 07:34 PM, Fabien COELHO wrote:

 I have run some tests with this patch and the detailed results of the
 runs are attached to this mail.


 I do not really understand the aggregated figures in the attached files.


 Me neither. It looks like Amit measured the time spent in mdread and
 mdwrite, but I'm not sure what conclusions one can draw from that.


As Heikki has pointed out, it is stats data for mdread and mdwrite
between the checkpoints (in the data, you need to search for
checkpoint start/checkpoint done). In between checkpoint
start and checkpoint done, all the data shows the amount of read/write
done (I am just trying to reproduce what Digoal has reported, so
I am using his script and I also don't understand everything, but I think
we can look at the count between checkpoints to deduce whether the IO
is flattened after the patch). Digoal was seeing a spike at the beginning
of the checkpoint (after checkpoint start) in his configuration without
this patch, and the spike seems to be reduced with this patch, whereas in
my tests I don't see the spike immediately after the checkpoint (although
there are some spikes in-between) even without the patch, which means that
either I might not be using the right configuration to measure the IO or
there is some other difference between the way Digoal ran the test and the
way I ran it. I have done the setup (even though the hardware will not be
the same, at least I can run the tests and collect the data in a format
similar to Digoal's), so if you have suggestions about which parameters we
should tweak or which tests to run to gather results, I can present the
results here for further discussion.


  I thought the patch should show a difference if I keep max_wal_size at a
 somewhat lower or moderate value, so that checkpoints get triggered by wal
 size, but I am not seeing any major difference in the write spreading.


 I'm not sure I understand your point. I would say that at full speed
 pgbench the disk is always busy writing as much as possible, either
 checkpoint writes or wal writes, so the write load as such should not be
 that different anyway?

 I understood that the point of the patch is to check whether there is a
 tps dip or not when the checkpoint begins, but I'm not sure how this can
 be inferred from the many aggregated data you sent, and from my recent 
 tests the tps is very variable anyway on HDD.


Yes, we definitely want to see the effect on TPS at the beginning of the
checkpoint, but even measuring the IO during the checkpoint the way Digoal
was capturing the data can show the effect of this patch.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

2015-07-04 Thread Fabien COELHO



In summary, the X^1.5 correction seems to work pretty well. It doesn't
completely eliminate the problem, but it makes it a lot better.


I've looked at the maths.

I think that the load is distributed as the derivative of this function, 
that is (1.5 * x ** 0.5): it starts at 0 but very quickly reaches 0.5, it 
passes 1.0 (the average load) at about 44% progress, and ends up at 1.5, 
that is, the finishing load is 1.5 times the average load, just before 
fsyncing files. This looks like a recipe for a bad time: I would say this 
is too large an overload. I would suggest a much lower value, say around 1.1...
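
To make the arithmetic above explicit:

$$f(x) = x^{1.5} \;\Rightarrow\; f'(x) = 1.5\,\sqrt{x}, \qquad
f'\!\left(\tfrac{4}{9}\right) = 1, \qquad f'(1) = 1.5$$

So, if real WAL growth were linear, the implied write load would cross the 
average at 4/9, i.e. about 44% of progress, and finish at 1.5 times the 
average, just before the fsync phase.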


The other issue with this function is that, if someone has WAL on a 
different disk, it can only degrade performance by disrupting the write 
distribution. As I understand it, this thing only makes sense if the WAL & 
the data are on the same disk. This really suggests a guc.


I have run some tests with this patch and the detailed results of the 
runs are attached to this mail.


I do not really understand the aggregated figures in the attached files.

I guess that maybe between end markers there is a summary of figures 
collected for 28 backends over 300-second runs (?), but I do not know what 
the min/max/avg/sum/count figures are about.


I thought the patch should show a difference if I keep max_wal_size at a 
somewhat lower or moderate value, so that checkpoints get triggered by wal 
size, but I am not seeing any major difference in the write spreading.


I'm not sure I understand your point. I would say that at full speed 
pgbench the disk is always busy writing as much as possible, either 
checkpoint writes or wal writes, so the write load as such should not be 
that different anyway?


I understood that the point of the patch is to check whether there is a 
tps dip or not when the checkpoint begins, but I'm not sure how this can 
be inferred from the many aggregated data you sent, and from my recent 
tests the tps is very variable anyway on HDD.


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

2015-07-04 Thread Robert Haas
On Jul 4, 2015, at 11:34 AM, Fabien COELHO coe...@cri.ensmp.fr wrote:
 In summary, the X^1.5 correction seems to work pretty well. It doesn't
 completely eliminate the problem, but it makes it a lot better.
 
 I've looked at the maths.
 
 I think that the load is distributed as the derivative of this function, that 
 is (1.5 * x ** 0.5): it starts at 0 but very quickly reaches 0.5, it passes 
 1.0 (the average load) at about 44% progress, and ends up at 1.5, that is, 
 the finishing load is 1.5 times the average load, just before fsyncing files. 
 This looks like a recipe for a bad time: I would say this is too large an 
 overload. I would suggest a much lower value, say around 1.1...
 
 The other issue with this function is that, if someone has WAL on a different 
 disk, it can only degrade performance by disrupting the write distribution. 
 As I understand it, this thing only makes sense if the WAL & the data are 
 on the same disk. This really suggests a guc.

I am a bit skeptical about this.  We need test scenarios that clearly show the 
benefit of having and of not having this behavior. It might be that doing this 
always is fine for everyone.

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

2015-07-04 Thread Fabien COELHO


Hello Robert,


I've looked at the maths.

I think that the load is distributed as the derivative of this 
function, that is (1.5 * x ** 0.5): it starts at 0 but very quickly 
reaches 0.5, it passes 1.0 (the average load) at about 44% progress, and 
ends up at 1.5, that is, the finishing load is 1.5 times the average 
load, just before fsyncing files. This looks like a recipe for a bad time: 
I would say this is too large an overload. I would suggest a much lower 
value, say around 1.1...


The other issue with this function is that, if someone has WAL on a 
different disk, it can only degrade performance by disrupting the write 
distribution. As I understand it, this thing only makes sense if 
the WAL & the data are on the same disk. This really suggests a guc.


I am a bit skeptical about this.  We need test scenarios that clearly 
show the benefit of having and of not having this behavior. It might be 
that doing this always is fine for everyone.


Do you mean I have to prove that there is an actual problem induced by 
this patch?


The logic fails me: I thought the patch submitter would have to show that 
his/her patch did not harm performance in various reasonable cases. At 
least this is what I'm told in another thread:-)


Currently this patch heavily changes the checkpoint write load 
distribution in many cases, with a proof which consists in showing that it 
may improve tps *briefly* on *one* example, as far as I understood the 
issue and the tests. If this is enough proof to apply the patch, then the 
minimum is that it should be possible to deactivate it, hence a guc.


Having a guc would also help to test the feature with values other than 
1.5, which really seems harmful from a math point of view. I'm not 
sure at all that a power formula is the right approach.


The potential impact I see would be to significantly aggravate the write 
stall issues I'm working on, but the measures provided in these tests do 
not even look at or measure that.


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

2015-07-03 Thread Fabien COELHO


power 1.5 is almost certainly not right for all cases, but it is simple 
and better.


It is better in some cases, as I've been told regarding my own patch. If 
you have a separate disk for WAL writes, the power formula may just 
degrade performance, or maybe not, or not too much, or it really should be a guc.


Well, I just think that it needs more performance testing with various 
loads and sizes, really. I'm not against this patch at all.



And easy to remove if something even better arrives.

I don't see the two patches being in conflict.


They are not in conflict from a git point of view, and even if they were, 
it would be trivial to resolve.


They are in conflict in the sense that the patch changes the checkpoint 
load significantly, which would mean that my X00 hours of performance 
testing on the checkpoint scheduler should more or less be run again. Ok, 
it is somewhat egoistic, but I'm trying to avoid wasting people's time.


Another point is that I'm not sure I understand the decision process: for 
some patches in this area extensive performance tests are required, while 
for other patches in the same area they are not.


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

2015-07-03 Thread Fabien COELHO


Hello Andres,


In conclusion, and very egoistically, I would prefer if this patch could
wait for the checkpoint scheduling patch to be considered, as it would
basically invalidate the X00 hours of performance tests I ran:-)


These two patches target pretty independent mechanics. If your patch were
significantly influenced by this, something would be wrong. It might
decrease the benefit of your patch a mite, but that's not really a
problem.


That is not the issue I see. On the principle of performance testing, it 
really means that I should rerun the tests, even if I expect that the 
overall influence would be pretty small in this case. This is my egoistic 
argument. Well, probably I would just rerun a few cases to check that the 
impact is, as you said, a mite, rather than rerunning all cases.


Another point is that I'm not sure that this patch is ripe; in particular 
I'm skeptical about the hardcoded 1.5 without further testing. Maybe it is 
good, maybe 1.3 or 1.6 is better, maybe it depends and it should just be a 
guc with some advice about how to set it. So I really think that it needs 
more performance figures than a positive effect on one load.


Well, this is just my opinion, no need to care too much about it:-)

--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

2015-07-03 Thread Simon Riggs
On 3 July 2015 at 06:38, Fabien COELHO coe...@cri.ensmp.fr wrote:


 Hello Simon,

  We could do better, but that is not a reason not to commit this, as is.
 Commit, please.


 My 0,02€: Please do not commit without further testing...

  I've submitted a patch to improve checkpoint write scheduling, including
  X00 hours of performance tests on various cases. This patch changes the
  load distribution over the whole checkpoint significantly, and AFAICS it
  has been tested on rather small cases.

 I'm not sure that the power 1.5 is the right one for all cases. For a big
 checkpoint over 30 minutes, it may or may not have very large and possibly
 unwanted effects. Maybe the 1.5 factor should really be a guc. Well, what I
 really think is that it needs performance measures.


power 1.5 is almost certainly not right for all cases, but it is simple and
better. And easy to remove if something even better arrives.

I don't see the two patches being in conflict.


 In conclusion, and very egoistically, I would prefer if this patch could
 wait for the checkpoint scheduling patch to be considered, as it would
 basically invalidate the X00 hours of performance tests I ran:-)


 I recommend making peace with yourself that probably 50% of development
time is wasted. But we try to keep the best half.

Thank you for your time spent contributing.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

2015-07-03 Thread Andres Freund
On 2015-07-03 07:38:15 +0200, Fabien COELHO wrote:
 I've submitted a patch to improve checkpoint write scheduling, including X00
 hours of performance tests on various cases. This patch changes the load
 distribution over the whole checkpoint significantly, and AFAICS it has been
 tested on rather small cases.
 
 I'm not sure that the power 1.5 is the right one for all cases. For a big
 checkpoint over 30 minutes, it may or may not have very large and possibly
 unwanted effects. Maybe the 1.5 factor should really be a guc. Well, what I
 really think is that it needs performance measures.
 
 In conclusion, and very egoistically, I would prefer if this patch could
 wait for the checkpoint scheduling patch to be considered, as it would
 basically invalidate the X00 hours of performance tests I ran:-)

These two patches target pretty independent mechanics. If your patch were
significantly influenced by this, something would be wrong. It might
decrease the benefit of your patch a mite, but that's not really a
problem.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

2015-07-02 Thread Amit Kapila
On Thu, Jul 2, 2015 at 4:16 PM, Simon Riggs si...@2ndquadrant.com wrote:

 On 13 May 2015 at 09:35, Heikki Linnakangas hlinn...@iki.fi wrote:


 In summary, the X^1.5 correction seems to work pretty well. It doesn't
 completely eliminate the problem, but it makes it a lot better.


 Agreed


Do we want to consider whether wal_compression is enabled, as that
can reduce the effect of full_page_writes?


Also I am planning to run some tests for this patch, but I am not sure
whether the tps and/or latency numbers from pgbench are sufficient, or do
you people want to see actual read/write counts via some form of
dynamic tracing (stap), as done by the reporter of this issue?



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

2015-07-02 Thread Simon Riggs
On 13 May 2015 at 09:35, Heikki Linnakangas hlinn...@iki.fi wrote:


 In summary, the X^1.5 correction seems to work pretty well. It doesn't
 completely eliminate the problem, but it makes it a lot better.


Agreed


 I don't want to over-compensate for the full-page-write effect either,
 because there are also applications where that effect isn't so big. For
 example, an application that performs a lot of updates, but all the updates
 are on a small number of pages, so the full-page-write storm immediately
 after checkpoint doesn't last long. A worst case for this patch would be
 such an application - lots of updates on only a few pages - with a long
 checkpoint_timeout but relatively small checkpoint_segments, so that
 checkpoints are always driven by checkpoint_segments. I'd like to see some
 benchmarking of that worst case before committing anything like this.


We could do better, but that is not a reason not to commit this, as is.
Commit, please.

This has been in place for a while and still remains: TODO: reduce impact
of full page writes

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

2015-07-02 Thread Fabien COELHO


Hello Simon,


We could do better, but that is not a reason not to commit this, as is.
Commit, please.


My 0,02€: Please do not commit without further testing...

I've submitted a patch to improve checkpoint write scheduling, including 
X00 hours of performance tests on various cases. This patch changes the 
load distribution over the whole checkpoint significantly, and AFAICS it 
has been tested on rather small cases.


I'm not sure that the power 1.5 is the right one for all cases. For a big 
checkpoint over 30 minutes, it may or may not have very large and possibly 
unwanted effects. Maybe the 1.5 factor should really be a guc. Well, what 
I really think is that it needs performance measures.


In conclusion, and very egoistically, I would prefer if this patch could 
wait for the checkpoint scheduling patch to be considered, as it would 
basically invalidate the X00 hours of performance tests I ran:-)


--
Fabien.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

2015-05-13 Thread Heikki Linnakangas

(please keep the mailing list CC'd, and please don't top-post)

On 05/13/2015 05:00 AM, digoal zhou wrote:

I tested it, but the exponent is not perfect in every environment.
Why can't we use time only?


As you mentioned yourself earlier, if you only use time but you reach 
checkpoint_segments before checkpoint_timeout, you will not complete the 
checkpoint before you already need to begin the next one. You 
can't completely ignore checkpoint_segments.
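
To restate that in code form, a simplified sketch of the scheduling check 
(paraphrasing the shape of IsCheckpointOnSchedule() in checkpointer.c; the 
two helper calls are hypothetical stand-ins for the real bookkeeping, and 
the caching and flag handling of the real function are omitted):

#include <stdbool.h>

extern int	CheckPointSegments;		/* existing settings */
extern int	CheckPointTimeout;

/* hypothetical stand-ins for the real bookkeeping */
extern double wal_segments_consumed(void);
extern double seconds_elapsed(void);

/* Simplified: a checkpoint is on schedule only against *both* yardsticks. */
static bool
is_checkpoint_on_schedule_sketch(double progress)
{
	/* fraction of CheckPointSegments consumed since the checkpoint started */
	double		elapsed_xlogs = wal_segments_consumed() / (double) CheckPointSegments;

	/* fraction of checkpoint_timeout elapsed since the checkpoint started */
	double		elapsed_time = seconds_elapsed() / (double) CheckPointTimeout;

	if (progress < elapsed_xlogs)
		return false;			/* behind the WAL-volume target */
	if (progress < elapsed_time)
		return false;			/* behind the time target */

	return true;
}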


Comparing the numbers you give below, with 
compensate-fpw-effect-on-checkpoint-scheduling-1.patch applied, against 
the ones from your first post, it looks like the patch already made the 
situation much better. You still have a significant burst at the beginning 
of the checkpoint cycle, but it's a lot smaller than without the patch. 
Before the patch, the count topped out at 9078; below, it tops out at 2964. 
There is a strange lull after the burst, I'm not sure what's going on 
there, but overall it seems like a big improvement.


Did the patch alleviate the bump in latency that pgbench reports?

I put the count numbers from your original post and below into a 
spreadsheet, and created some fancy charts. See attached. It shows the 
same thing but with pretty pictures. Assuming we want the checkpoint to 
be spread as evenly as possible across the cycle, the ideal would be a 
straight line from 0 to the total dirty-buffer count (about 156,000) in 
270 seconds in the cumulative chart. You didn't give the full data, but 
you can extrapolate the lines to get a rough picture of how close the 
different versions are to that ideal.


In summary, the X^1.5 correction seems to work pretty well. It doesn't 
completely eliminate the problem, but it makes it a lot better.


I don't want to over-compensate for the full-page-write effect either, 
because there are also applications where that effect isn't so big. For 
example, an application that performs a lot of updates, but all the 
updates are on a small number of pages, so the full-page-write storm 
immediately after checkpoint doesn't last long. A worst case for this 
patch would be such an application - lots of updates on only a few pages 
- with a long checkpoint_timeout but relatively small 
checkpoint_segments, so that checkpoints are always driven by 
checkpoint_segments. I'd like to see some benchmarking of that worst 
case before committing anything like this.



--end-
checkpoint start
buffer__sync__start num_buffers: 524288, dirty_buffers: 156931
r1_or_w2 2, pid: 29132, min: 44, max: 151, avg: 52, sum: 49387, count: 932
--end-
r1_or_w2 2, pid: 29132, min: 44, max: 95, avg: 49, sum: 41532, count: 837
--end-
r1_or_w2 2, pid: 29132, min: 44, max: 747, avg: 54, sum: 100419, count: 1849
--end-
r1_or_w2 2, pid: 29132, min: 44, max: 372, avg: 52, sum: 110701, count: 2090
--end-
r1_or_w2 2, pid: 29132, min: 44, max: 115, avg: 57, sum: 147510, count: 2575
--end-
r1_or_w2 2, pid: 29132, min: 44, max: 470, avg: 58, sum: 145217, count: 2476
--end-
r1_or_w2 2, pid: 29132, min: 44, max: 120, avg: 54, sum: 161401, count: 2964
--end-
r1_or_w2 2, pid: 29132, min: 44, max: 208, avg: 59, sum: 170280, count: 2847
--end-
r1_or_w2 2, pid: 29132, min: 44, max: 10089, avg: 62, sum: 136106, count: 2181
--end-
r1_or_w2 2, pid: 29132, min: 41, max: 487, avg: 56, sum: 88990, count: 1570
--end-
r1_or_w2 2, pid: 29132, min: 39, max: 102, avg: 55, sum: 59807, count: 1083
--end-
r1_or_w2 2, pid: 29132, min: 40, max: 557, avg: 56, sum: 117274, count: 2083
--end-
r1_or_w2 2, pid: 29132, min: 44, max: 537, avg: 58, sum: 169867, count: 2882
--end-
r1_or_w2 2, pid: 29132, min: 44, max: 147, avg: 60, sum: 92835, count: 1538
--end-
r1_or_w2 2, pid: 29132, min: 30, max: 93, avg: 55, sum: 14641, count: 264
--end-
r1_or_w2 2, pid: 29132, min: 48, max: 92, avg: 56, sum: 11834, count: 210
--end-
r1_or_w2 2, pid: 29132, min: 45, max: 91, avg: 56, sum: 9151, count: 162
--end-
r1_or_w2 2, pid: 29132, min: 46, max: 92, avg: 

Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

2015-05-12 Thread Heikki Linnakangas

On 05/12/2015 03:27 AM, digoal zhou wrote:

PostgreSQL (>= 9.4) tries to spread buffer writes smoothly across a
checkpoint_completion_target fraction of the cycle (checkpoint_timeout or
checkpoint_segments), but when we use synchronous_commit=off there is a
little problem with the checkpoint_segments target: the xlog is written
fast (because of the full page write the first time a page is touched
after a checkpoint), so the checkpointer cannot sleep and the buffer
writes are not smooth.
...
I think we can add a condition to IsCheckpointOnSchedule:
     if (synchronous_commit != SYNCHRONOUS_COMMIT_OFF)
     {
         recptr = GetInsertRecPtr();
         elapsed_xlogs = (((double) (recptr - ckpt_start_recptr)) / XLogSegSize) / CheckPointSegments;

         if (progress < elapsed_xlogs)
         {
             ckpt_cached_elapsed = elapsed_xlogs;
             return false;
         }
     }


This has nothing to do with synchronous_commit, except that setting 
synchronous_commit=off makes your test case run faster, and hit the 
problem harder.


I think the real problem here is that IsCheckpointOnSchedule assumes 
that the rate of WAL generated is constant throughout the checkpoint 
cycle, but in reality you generate a lot more WAL immediately after the 
checkpoint begins, thanks to full_page_writes. For example, in the 
beginning of the cycle, you quickly use up, say, 20% of the WAL space in 
the first 10 seconds, and the scheduling thinks it's in a big hurry 
to finish the checkpoint because it extrapolates that the rest of the 
WAL will be used up in the next 40 seconds. But in reality, the WAL 
consumption levels off, and you have many minutes left until 
CheckPointSegments.
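
Plugging that example's numbers into the correction (simple arithmetic, 
using only the 1.5 exponent from the patch below):

$$\text{elapsed\_xlogs} = 0.20 \text{ at } t = 10\,\mathrm{s}, \qquad
0.20^{1.5} \approx 0.089$$

Uncorrected, the scheduler demands 20% of the buffers written after 10 
seconds, because linear extrapolation predicts the WAL budget exhausted at 
50 seconds; corrected, it demands only about 9%, which better matches the 
consumption levelling off afterwards.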


Can you try the attached patch? It modifies the above calculation to 
take the full-page-write effect into account. I used X^1.5 as the 
corrective function, which roughly reflects the typical WAL consumption 
pattern. You can adjust the exponent, 1.5, to make the correction more 
or less aggressive.


- Heikki

diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 0dce6a8..fb02f56 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -763,6 +763,19 @@ IsCheckpointOnSchedule(double progress)
 		recptr = GetInsertRecPtr();
 		elapsed_xlogs = (((double) (recptr - ckpt_start_recptr)) / XLogSegSize) / CheckPointSegments;
 
+		/*
+		 * Immediately after a checkpoint, a lot more WAL is generated when
+		 * full_page_write is enabled, because every WAL record has to include
+		 * a full image of the modified page. It levels off as time passes and
+		 * more updates fall on pages that have already been modified since
+		 * the last checkpoint.
+		 *
+		 * To correct for that effect, apply a corrective factor on the
+		 * amount of WAL consumed so far.
+		 */
+		if (fullPageWrites)
+			elapsed_xlogs = pow(elapsed_xlogs, 1.5);
+
 		if (progress < elapsed_xlogs)
 		{
 			ckpt_cached_elapsed = elapsed_xlogs;
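
For reference, the size of the correction at a few points of the cycle 
(these are just evaluations of pow(x, 1.5) from the patch above):

  elapsed_xlogs = 0.10  ->  corrected 0.032
  elapsed_xlogs = 0.25  ->  corrected 0.125
  elapsed_xlogs = 0.50  ->  corrected 0.354
  elapsed_xlogs = 0.90  ->  corrected 0.854

That is, the correction is strongest early in the cycle, where the 
full-page-write burst inflates WAL consumption, and nearly vanishes toward 
the end.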

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

2015-05-11 Thread digoal zhou
   PostgreSQL (>= 9.4) tries to spread buffer writes smoothly across a
checkpoint_completion_target fraction of the cycle (checkpoint_timeout or
checkpoint_segments), but when we use synchronous_commit=off there is a
little problem with the checkpoint_segments target: the xlog is written
fast (because of the full page write the first time a page is touched
after a checkpoint), so the checkpointer cannot sleep and the buffer
writes are not smooth.
   Here is a test:
# stap -DMAXSKIPPED=10 -v 1 -e '
global s_var, e_var, stat_var;

/* probe smgr__md__read__start(ForkNumber, BlockNumber, Oid, Oid, Oid,
int); */
probe process("/opt/pgsql/bin/postgres").mark("smgr__md__read__start") {
  s_var[pid(),1] = gettimeofday_us()
}

/* probe smgr__md__read__done(ForkNumber, BlockNumber, Oid, Oid, Oid, int,
int, int); */
probe process("/opt/pgsql/bin/postgres").mark("smgr__md__read__done") {
  e_var[pid(),1] = gettimeofday_us()
  if ( s_var[pid(),1] > 0 )
    stat_var[pid(),1] <<< e_var[pid(),1] - s_var[pid(),1]
}

/* probe smgr__md__write__start(ForkNumber, BlockNumber, Oid, Oid, Oid,
int); */
probe process("/opt/pgsql/bin/postgres").mark("smgr__md__write__start") {
  s_var[pid(),2] = gettimeofday_us()
}

/* probe smgr__md__write__done(ForkNumber, BlockNumber, Oid, Oid, Oid, int,
int, int); */
probe process("/opt/pgsql/bin/postgres").mark("smgr__md__write__done") {
  e_var[pid(),2] = gettimeofday_us()
  if ( s_var[pid(),2] > 0 )
    stat_var[pid(),2] <<< e_var[pid(),2] - s_var[pid(),2]
}

probe process("/opt/pgsql/bin/postgres").mark("buffer__sync__start") {
  printf("buffer__sync__start num_buffers: %d, dirty_buffers: %d\n", $NBuffers, $num_to_write)
}

probe process("/opt/pgsql/bin/postgres").mark("checkpoint__start") {
  printf("checkpoint start\n")
}

probe process("/opt/pgsql/bin/postgres").mark("checkpoint__done") {
  printf("checkpoint done\n")
}

probe timer.s(1) {
  foreach ([v1,v2] in stat_var+) {
    if ( @count(stat_var[v1,v2]) > 0 ) {
      printf("r1_or_w2 %d, pid: %d, min: %d, max: %d, avg: %d, sum: %d, count: %d\n", v2, v1, @min(stat_var[v1,v2]), @max(stat_var[v1,v2]), @avg(stat_var[v1,v2]), @sum(stat_var[v1,v2]), @count(stat_var[v1,v2]))
    }
  }

  printf("--end-\n")
  delete s_var
  delete e_var
  delete stat_var
}'



Use the test table and data:
create table tbl(id int primary key, info text, crt_time timestamp);
insert into tbl select generate_series(1,5000), now(), now();


Use pgbench to test it.
$ vi test.sql
\setrandom id 1 5000
update tbl set crt_time=now() where id = :id ;


$ pgbench -M prepared -n -r -f ./test.sql -P 1 -c 28 -j 28 -T 1
When the on-schedule checkpoint occurs, the tps:
progress: 255.0 s, 58152.2 tps, lat 0.462 ms stddev 0.504
progress: 256.0 s, 31382.8 tps, lat 0.844 ms stddev 2.331
progress: 257.0 s, 14615.5 tps, lat 1.863 ms stddev 4.554
progress: 258.0 s, 16258.4 tps, lat 1.652 ms stddev 4.139
progress: 259.0 s, 17814.7 tps, lat 1.526 ms stddev 4.035
progress: 260.0 s, 14573.8 tps, lat 1.825 ms stddev 5.592
progress: 261.0 s, 16736.6 tps, lat 1.600 ms stddev 5.018
progress: 262.0 s, 19060.5 tps, lat 1.448 ms stddev 4.818
progress: 263.0 s, 20553.2 tps, lat 1.290 ms stddev 4.146
progress: 264.0 s, 26223.0 tps, lat 1.042 ms stddev 3.711
progress: 265.0 s, 31953.0 tps, lat 0.836 ms stddev 2.837
progress: 266.0 s, 43396.1 tps, lat 0.627 ms stddev 1.615
progress: 267.0 s, 50487.8 tps, lat 0.533 ms stddev 0.647
progress: 268.0 s, 53537.7 tps, lat 0.502 ms stddev 0.598
progress: 269.0 s, 54259.3 tps, lat 0.496 ms stddev 0.624
progress: 270.0 s, 56139.8 tps, lat 0.479 ms stddev 0.524

The parameters for the on-schedule checkpoint:
checkpoint_segments = 512
checkpoint_timeout = 5min
checkpoint_completion_target = 0.9

stap's output:
There are 156467 dirty blocks. We can see the buffer writes per second:
the writes are not spread smoothly against the time target, but against
the xlog target.
The ideal smooth rate would be 156467/(5*60*0.9) = 579.5 writes per second.


checkpoint start
buffer__sync__start num_buffers: 262144, dirty_buffers: 156467
r1_or_w2 2, pid: 19848, min: 41, max: 1471, avg: 49, sum: 425291, count: 8596
--end-
r1_or_w2 2, pid: 19848, min: 41, max: 153, avg: 49, sum: 450597, count: 9078
--end-
r1_or_w2 2, pid: 19848, min: 41, max: 643, avg: 51, sum: 429193, count: 8397
--end-
r1_or_w2 2, pid: 19848, min: 41, max: 1042, avg: 55, sum: 449091, count: 8097
--end-
r1_or_w2 2, pid: 19848, min: 41, max: 254, avg: 52, sum: 296668, count: 5617
--end-
r1_or_w2 2, pid: 19848, min: 39, max: 171, avg: 54, sum: 321027, count: 5851
--end-
r1_or_w2 2, pid: 19848, min: 41, max: 138, avg: 60, sum: 300056, count: 4953
--end-
r1_or_w2 2, pid: 19848, min: 42, max: