Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-08-27 Thread Greg Smith

On 7/29/13 2:04 AM, KONDO Mitsumasa wrote:

I think that it is almost the same as a small dirty_background_ratio or
dirty_background_bytes.


The main difference here is that all writes pushed out this way will be 
to a single 1GB relation chunk.  The odds are better that multiple 
writes will combine, and that the I/O will involve a lower than average 
amount of random seeking.  Whereas shrinking the size of the write cache 
always results in more random seeking.



The essential improvement is not the dirty page size in fsync() but the
scheduling of the fsync phase.
I can't understand why postgres does not consider scheduling of the fsync
phase.


Because it cannot get the sort of latency improvements I think people 
want.  I proved to myself it's impossible during the last 9.2 CF, when I 
submitted several fsync scheduling changes.


By the time you get to the fsync phase of the checkpoint, on a system that's 
always writing heavily there is way too much backlog to possibly cope with by 
then.  There just isn't enough time left before the checkpoint should 
end to write everything out.  You have to force writes to actual disk to 
start happening earlier to keep a predictable schedule.  Basically, the 
longer you go without issuing an fsync, the more uncertainty there is 
around how long it might take to fire.  My proposal lets someone keep 
all I/O from ever reaching the point where the uncertainty is that high.


In the simplest to explain case, imagine that a checkpoint includes a 
1GB relation segment that is completely dirty in shared_buffers.  When a 
checkpoint hits this, it will have 1GB of I/O to push out.


If you have waited this long to fsync the segment, the problem is now 
too big to fix by checkpoint time.  Even if the 1GB of writes are 
themselves nicely ordered and grouped on disk, the concurrent background 
activity is going to chop the combination up into more random I/O than 
the ideal.


Regular consumer disks have a worst case random I/O throughput of less 
than 2MB/s.  My observed progress rates for such systems show you're 
lucky to get 10MB/s of writes out of them.  So how long will the dirty 
1GB in the segment take to write?  1GB @ 10MB/s = 102.4 *seconds*.  And 
that's exactly what I saw whenever I tried to play with checkpoint sync 
scheduling.  No matter what you do there, periodically you'll hit a 
segment that has over a minute of dirty data accumulated, and 60-second 
latency pauses result.  By the time you've reached the checkpoint, you're 
dead when you call fsync on that relation.  You *must* hit that segment 
with fsync more often than once per checkpoint to achieve reasonable 
latency.


With this linear slider idea, I might tune things such that no segment will 
ever accumulate more than 256MB of writes before being hit with an fsync 
instead.  I can't guarantee that will work usefully, but the shape of the 
idea seems to match the problem.
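
To make the shape of that idea concrete, here is a minimal, hypothetical
sketch of the bookkeeping involved.  None of these names (SegmentWriteState,
slider_fsync_bytes, segment_note_write) exist in the PostgreSQL source; they
are invented for illustration, and a real implementation would have to hook
into the existing fsync absorb machinery rather than calling fsync() directly:

    #include <stddef.h>
    #include <unistd.h>

    /* Hypothetical per-segment bookkeeping for the "linear slider" idea. */
    typedef struct SegmentWriteState
    {
        int     fd;                 /* open descriptor for the 1GB segment */
        size_t  bytes_since_sync;   /* writes absorbed since the last fsync */
    } SegmentWriteState;

    /* Tunable threshold; the proposed GUC would map onto something like
     * this, anywhere from a few MB up to the full 1GB segment size. */
    static size_t slider_fsync_bytes = 256 * 1024 * 1024;

    /* Called whenever a backend or the background writer pushes one page
     * of this segment out to the OS. */
    static void
    segment_note_write(SegmentWriteState *seg, size_t nbytes)
    {
        seg->bytes_since_sync += nbytes;
        if (seg->bytes_since_sync >= slider_fsync_bytes)
        {
            /* Force the accumulated writes out early, so the checkpoint's
             * final fsync never faces more than slider_fsync_bytes of
             * backlog. */
            fsync(seg->fd);
            seg->bytes_since_sync = 0;
        }
    }

With a 256MB threshold, the worst-case backlog behind any single fsync drops
from the full 1GB (around 100 seconds at the 10MB/s figure above) to roughly
a quarter of that.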



Taking it all together, my checkpoint proposal is:
* write phase
   - Almost the same, but taking the fsync-phase schedule into account.
   - To handle the case of background writes by the OS, sort the buffers
before starting the checkpoint writes.


This cannot work for the reasons I've outlined here.  I guarantee you I 
will easily find a test workload where it performs worse than what's 
happening right now.  If you want to play with this to learn more about 
the trade-offs involved, that's fine, but expect me to vote against 
accepting any change of this form.  I would prefer you to not submit 
them because it will waste a large amount of reviewer time to reach that 
conclusion yet again.  And I'm not going to be that reviewer.



* fsync phase
   - Take both the checkpoint schedule and the write-phase schedule into account.
   - Execute sync_file_range() over separate small ranges with sleeps in
between, then a final fsync().


If you can figure out how to use sync_file_range() to fine-tune how much 
fsync work is happening at any time, that would be useful on all the 
platforms that support it.  I haven't tried it just because that looked 
to me like a large job of refactoring the entire fsync absorb mechanism, 
and I've never had enough funding to take it on.  That approach has a 
lot of good properties, if it could be made to work without a lot of 
code changes.
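
For reference, a rough sketch of the kind of sync_file_range() loop being
discussed, assuming Linux (the call is non-portable); the chunk size and
sleep interval are invented for illustration, and the real work would be
wiring something like this into the fsync absorb queue rather than a
standalone helper:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    /*
     * Push one segment's dirty pages out in small slices, sleeping between
     * them, then finish with a real fsync().
     */
    static int
    sync_segment_in_chunks(int fd, off_t total_len)
    {
        const off_t chunk = 32 * 1024 * 1024;   /* 32MB slices, for example */

        for (off_t offset = 0; offset < total_len; offset += chunk)
        {
            /* Start writeback of this slice and wait for it, without the
             * metadata flush that a full fsync() would also force. */
            if (sync_file_range(fd, offset, chunk,
                                SYNC_FILE_RANGE_WRITE |
                                SYNC_FILE_RANGE_WAIT_AFTER) != 0)
                return -1;

            usleep(100 * 1000);     /* spread the I/O out over time */
        }

        /* A final fsync() is still needed to cover file metadata and
         * anything the kernel missed. */
        return fsync(fd);
    }

The appeal is that each sync_file_range() call only has a bounded amount of
dirty data behind it, so the worst-case stall is much smaller than handing
the kernel a whole segment's backlog in one fsync().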


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-08-22 Thread Jim Nasby

On 7/26/13 7:32 AM, Tom Lane wrote:

Greg Smith g...@2ndquadrant.com writes:

On 7/26/13 5:59 AM, Hannu Krosing wrote:

Well, SSD disks do it in the way proposed by didier (AFAIK), by putting
random fs pages on one large disk page and having an extra index layer
for resolving random-to-sequential ordering.



If your solution to avoiding random writes now is to do sequential ones
into a buffer, you'll pay for it by having more expensive random reads
later.


What I'd point out is that that is exactly what WAL does for us, ie
convert a bunch of random writes into sequential writes.  But sooner or
later you have to put the data where it belongs.


FWIW, at RICon East there was someone from Seagate who gave a presentation. 
One of his points was that even spinning rust is moving to the point where the 
drive itself has to do some kind of write log.  He noted that modern filesystems 
do the same thing, and the overlap is probably stupid (I pointed out that the 
most degenerate case is the logging database on the logging filesystem on the 
logging drive...)

It'd be interesting for Postgres to work with drive manufacturers to study ways 
to get rid of the extra layers of stupid...
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-29 Thread KONDO Mitsumasa

(2013/07/24 1:13), Greg Smith wrote:

On 7/23/13 10:56 AM, Robert Haas wrote:

On Mon, Jul 22, 2013 at 11:48 PM, Greg Smith g...@2ndquadrant.com wrote:

We know that a 1GB relation segment can take a really long time to write
out.  That could include up to 128 changed 8K pages, and we allow all of
them to get dirty before any are forced to disk with fsync.


By my count, it can include up to 131,072 changed 8K pages.


Even better!  I can pinpoint exactly what time last night I got tired enough to
start making trivial mistakes.  Everywhere I said 128 it's actually 131,072,
which just changes the range of the GUC I proposed.
I think that it is almost the same as a small dirty_background_ratio or 
dirty_background_bytes.

This method will give very bad performance, and many fsync() calls may cause the 
long-fsync situation which you described in the past.  My colleagues who are kernel 
experts say that while fsync() is executing, if other processes write to the same 
file a lot, the fsync() call occasionally does not return.  So too many fsyncs on a 
large file are very dangerous.  Moreover, fsync() also writes metadata, which is 
the worst for performance.

The essential improvement is not the dirty page size in fsync() but the scheduling 
of the fsync phase.

I can't understand why postgres does not consider scheduling of the fsync phase. 
When dirty_background_ratio is big, the write phase does not write to disk at all; 
therefore, fsync() is too heavy in the fsync phase.



Getting the number right really highlights just how bad the current situation
is.  Would you expect the database to dump up to 128K writes into a file and 
then
have low latency when it's flushed to disk with fsync?  Of course not.

I think this problem can be improved by using sync_file_range() in the fsync phase, 
and by adding checkpoint scheduling to the fsync phase: execute sync_file_range() 
over small ranges with sleeps, then execute the final fsync().  I think it is better 
than your proposal.
If a system does not support the sync_file_range() system call, it would only execute 
fsync and sleep, which is the same as the method you and I posted in the past.


Taking it all together, my checkpoint proposal is:

* write phase
  - Almost the same, but taking the fsync-phase schedule into account.
  - To handle the case of background writes by the OS, sort the buffers before 
starting the checkpoint writes.


* fsync phase
  - Take both the checkpoint schedule and the write-phase schedule into account.
  - Execute sync_file_range() over separate small ranges with sleeps in between, 
then a final fsync().

And, if I can, a method that does not write buffers to a file while fsync() is 
being called on that file.  I think it may be quite difficult.

Best regards,
--
Mitsumasa KONDO
NTT Open Source Software Center






Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-26 Thread Greg Smith

On 7/25/13 6:02 PM, didier wrote:

It was surely already discussed, but why isn't postgresql writing
its cache sequentially to a temporary file?


If you do that, reads of the data will have to traverse that temporary 
file to assemble their data.  You'll make every later reader pay the 
random I/O penalty that's being avoided right now.  Checkpoints are 
already postponing these random writes as long as possible. You have to 
take care of them eventually though.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-26 Thread Hannu Krosing
On 07/26/2013 11:42 AM, Greg Smith wrote:
 On 7/25/13 6:02 PM, didier wrote:
 It was surely already discussed, but why isn't postgresql writing
 its cache sequentially to a temporary file?

 If you do that, reads of the data will have to traverse that temporary
 file to assemble their data.  You'll make every later reader pay the
 random I/O penalty that's being avoided right now.  Checkpoints are
 already postponing these random writes as long as possible. You have
 to take care of them eventually though.

Well, SSD disks do it in the way proposed by didier (AFAIK), by putting
random fs pages on one large disk page and having an extra index layer
for resolving random-to-sequential ordering.

I would not dismiss the idea without more tests and discussion.

We could have a system where the checkpoint does sequential writes of dirty
wal buffers to alternating synced holding files (a checkpoint log :) )
and only the background writer does random writes, with no forced sync at all.


-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ





Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-26 Thread Hannu Krosing
On 07/26/2013 11:42 AM, Greg Smith wrote:
 On 7/25/13 6:02 PM, didier wrote:
 It was surely already discussed, but why isn't postgresql writing
 its cache sequentially to a temporary file?

 If you do that, reads of the data will have to traverse that temporary
 file to assemble their data.  
In case of crash recovery, a sequential read of this file could be
performed as a first step.

This should work fairly well in most cases, at least when shared_buffers
at recovery time is not smaller than the latest run of checkpoint-written
dirty buffers.




-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ





Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-26 Thread Greg Smith

On 7/26/13 5:59 AM, Hannu Krosing wrote:

Well, SSD disks do it in the way proposed by didier (AFAIK), by putting
random fs pages on one large disk page and having an extra index layer
for resolving random-to-sequential ordering.


If your solution to avoiding random writes now is to do sequential ones 
into a buffer, you'll pay for it by having more expensive random reads 
later.  In the SSD write buffer case, that works only because those 
random reads are very cheap.  Do the same thing on a regular drive, and 
you'll be paying a painful penalty *every* time you read in return for 
saving work *once* when you write.  That only makes sense when your 
workload is near write-only.


It's possible to improve on this situation by having some sort of 
background process that goes back and cleans up the random data, 
converting it back into sequentially ordered writes again.  SSD 
controllers also have firmware that does this sort of work, and Postgres 
might do it as part of vacuum cleanup.  But note that such work faces 
exactly the same problems as writing the data out in the first place.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-26 Thread Tom Lane
Greg Smith g...@2ndquadrant.com writes:
 On 7/26/13 5:59 AM, Hannu Krosing wrote:
 Well, SSD disks do it in the way proposed by didier (AFAIK), by putting
 random fs pages on one large disk page and having an extra index layer
 for resolving random-to-sequential ordering.

 If your solution to avoiding random writes now is to do sequential ones 
 into a buffer, you'll pay for it by having more expensive random reads 
 later.

What I'd point out is that that is exactly what WAL does for us, ie
convert a bunch of random writes into sequential writes.  But sooner or
later you have to put the data where it belongs.

regards, tom lane




Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-26 Thread didier
Hi,


On Fri, Jul 26, 2013 at 11:42 AM, Greg Smith g...@2ndquadrant.com wrote:

 On 7/25/13 6:02 PM, didier wrote:

 It was surely already discussed, but why isn't postgresql writing
 its cache sequentially to a temporary file?


 If you do that, reads of the data will have to traverse that temporary
 file to assemble their data.  You'll make every later reader pay the random
 I/O penalty that's being avoided right now.  Checkpoints are already
 postponing these random writes as long as possible. You have to take care
 of them eventually though.


No, the log file is only used at recovery time.

In the checkpoint code:
- loop over the cache, marking dirty buffers with BM_CHECKPOINT_NEEDED as in
the current code
- other workers can't write out and evict these marked buffers to disk;
there's a race with fsync.
- the checkpoint fsyncs now, or after the next step.
- the checkpoint loops again, saving these buffers to the log and clearing
BM_CHECKPOINT_NEEDED but *not* clearing BM_DIRTY; of course many buffers
will be written again, as they are when a checkpoint isn't running.
- checkpoint done.

During recovery you have to load the log into the cache first, before applying WAL.
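
A rough sketch of what one record of that sequential checkpoint log might
look like, with entirely hypothetical names and record layout (this is not
based on actual PostgreSQL buffer manager code, only on the steps above):

    #include <stdint.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    /* Hypothetical record appended sequentially to the checkpoint log:
     * enough identity to put the page back where it belongs at recovery
     * time, plus the page image itself. */
    typedef struct CkptLogRecord
    {
        uint32_t relfilenode;       /* which relation */
        uint32_t segno;             /* which 1GB segment */
        uint32_t blockno;           /* which 8K block within it */
        char     page[BLCKSZ];      /* the dirty page image */
    } CkptLogRecord;

    /* Append one dirty buffer to the log.  The buffer stays BM_DIRTY, so
     * backends or the background writer still write it to its real
     * location later, outside the checkpoint. */
    static int
    ckpt_log_append(int log_fd, const CkptLogRecord *rec)
    {
        return write(log_fd, rec, sizeof(*rec)) == sizeof(*rec) ? 0 : -1;
    }

    /* One fsync of the log file at the end replaces the per-segment
     * fsyncs that the current checkpoint code issues. */
    static int
    ckpt_log_finish(int log_fd)
    {
        return fsync(log_fd);
    }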

Didier


Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-26 Thread Greg Smith

On 7/26/13 9:14 AM, didier wrote:

During recovery you have to load the log in cache first before applying WAL.


Checkpoints exist to bound recovery time after a crash.  That is their 
only purpose.  What you're suggesting moves a lot of work into the 
recovery path, which will slow recovery down.


More work at recovery time means someone who uses the default of 
checkpoint_timeout='5 minutes', expecting that crash recovery won't take 
very long, will discover that it now takes longer.  They'll be 
forced to shrink the value to get the same recovery time as they do 
currently.  You might need to make checkpoint_timeout 3 minutes instead, 
if crash recovery now has all this extra work to deal with.  And when 
the time between checkpoints drops, it reduces the fundamental 
efficiency of checkpoint processing.  You will end up writing out 
more data in the end.


The interval between checkpoints and recovery time are all related.  If 
you let any one side of the current requirements slip, it makes the rest 
easier to deal with.  Those are all trade-offs though, not improvements. 
 And this particular one is already an option.


If you want less checkpoint I/O per capita and don't care about recovery 
time, you don't need a code change to get it.  Just make 
checkpoint_timeout huge.  A lot of checkpoint I/O issues go away if you 
only do a checkpoint per hour, because instead of random writes you're 
getting sequential ones to the WAL.  But when you crash, expect to be 
down for a significant chunk of an hour, as you go back to sort out all 
of the work postponed before.
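
As a rough sketch of that tuning in a 9.x-era postgresql.conf (the values
here are illustrative, not recommendations):

    # Trade recovery time for less checkpoint I/O: checkpoint rarely, and
    # spread what remains over most of the interval.
    checkpoint_timeout = 1h              # the upper limit in this era's releases
    checkpoint_segments = 256            # keep WAL volume from forcing earlier checkpoints
    checkpoint_completion_target = 0.9   # spread checkpoint writes across the interval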


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-26 Thread Greg Smith

On 7/26/13 8:32 AM, Tom Lane wrote:

What I'd point out is that that is exactly what WAL does for us, ie
convert a bunch of random writes into sequential writes.  But sooner or
later you have to put the data where it belongs.


Hannu was observing that SSDs often don't do that at all.  They can 
maintain logical-to-physical translation tables that track where each 
block was written, forever.  When read seeks are really inexpensive, 
the only pressure to reorder blocks is wear leveling.


That doesn't really help with regular drives though, where the low seek 
time assumption doesn't play out so well.  The whole idea of writing 
things sequentially and then sorting them out later was all the rage in 
2001 for ext3 on Linux, as part of the data=journal mount option.  You can 
go back and see that people were confused but excited about the 
performance at 
http://www.ibm.com/developerworks/linux/library/l-fs8/index.html


Spoiler:  if you use a workload that has checkpoint issues, it doesn't 
help PostgreSQL latency.  Just like using a large write cache, you gain 
some burst performance, but eventually you pay for it with extra latency 
somewhere.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-26 Thread didier
Hi,


On Fri, Jul 26, 2013 at 3:41 PM, Greg Smith g...@2ndquadrant.com wrote:

 On 7/26/13 9:14 AM, didier wrote:

 During recovery you have to load the log in cache first before applying
 WAL.


 Checkpoints exist to bound recovery time after a crash.  That is their
 only purpose.  What you're suggesting moves a lot of work into the recovery
 path, which will slow recovery down.

 Yes, it's slower, but you're sequentially reading only one file, at most the
size of your buffer cache; moreover, it takes constant time.

Let's say you make a checkpoint and crash just after, with a next-to-empty
WAL.

Now recovery is very fast, but you have to repopulate your cache with
random reads driven by incoming requests.

With the snapshot it's slower, but you read, sequentially again, a lot of
hot cache data you will need later when the db starts serving requests.

Of course the worst case is a crash just before a checkpoint: most of
the snapshot data is stale and will be overwritten by WAL replay.

But if the WAL recovery is CPU bound, loading from the snapshot may be
done concurrently while replaying the WAL.

More work at recovery time means someone who uses the default of
 checkpoint_timeout='5 minutes', expecting that crash recovery won't take
 very long, will discover that it now takes longer.  They'll be forced
 to shrink the value to get the same recovery time as they do currently.
  You might need to make checkpoint_timeout 3 minutes instead, if crash
 recovery now has all this extra work to deal with.  And when the time
 between checkpoints drops, it reduces the fundamental efficiency of
 checkpoint processing.  You will end up writing out more data in the
 end.

Yes, it's a trade-off: now you're paying the price at checkpoint time, every
time; with the log you're paying only once, at recovery.


 The interval between checkpoints and recovery time are all related.  If
 you let any one side of the current requirements slip, it makes the rest
 easier to deal with.  Those are all trade-offs though, not improvements.
  And this particular one is already an option.

 If you want less checkpoint I/O per capita and don't care about recovery
 time, you don't need a code change to get it.  Just make checkpoint_timeout
 huge.  A lot of checkpoint I/O issues go away if you only do a checkpoint
 per hour, because instead of random writes you're getting sequential ones
 to the WAL.  But when you crash, expect to be down for a significant chunk
 of an hour, as you go back to sort out all of the work postponed before.

It's not the same: it's a snapshot saved and loaded in constant time, unlike
the WAL log.

Didier


Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-25 Thread didier
Hi


On Tue, Jul 23, 2013 at 5:48 AM, Greg Smith g...@2ndquadrant.com wrote:

 Recently I've been dismissing a lot of suggested changes to checkpoint
 fsync timing without suggesting an alternative.  I have a simple one in
 mind that captures the biggest problem I see:  that the number of backend
 and checkpoint writes to a file are not connected at all.

 We know that a 1GB relation segment can take a really long time to write
 out.  That could include up to 128 changed 8K pages, and we allow all of
 them to get dirty before any are forced to disk with fsync.

 It was surely already discussed, but why isn't postgresql writing its cache
sequentially to a temporary file?  With storage random speed at least five to
ten times slower, it could help a lot.
Thanks

Didier


Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-25 Thread Robert Haas
On Thu, Jul 25, 2013 at 6:02 PM, didier did...@gmail.com wrote:
 It was surely already discussed, but why isn't postgresql writing its cache
 sequentially to a temporary file?  With storage random speed at least five to
 ten times slower, it could help a lot.
 Thanks

Sure, that's what the WAL does.  But you still have to checkpoint eventually.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-25 Thread didier
Hi,


 Sure, that's what the WAL does.  But you still have to checkpoint
 eventually.

 Sure, when you run pg_ctl stop.
Unlike the WAL, it only needs two files of shared_buffers size.

I did bogus tests by replacing mask |= BM_PERMANENT; with mask = -1 in
BufferSync(), and simulating a checkpoint with a periodic dd if=/dev/zero
of=foo conv=fsync

On saturated storage with %usage locked solid at 100%, I got up to a 30%
speed improvement and fsync latency down by one order of magnitude; some
fsyncs were still slow, of course, if buffers were already in the OS cache.

But it's the upper bound; it was done on slow storage with bad ratios:
(OS cache write)/(disk sequential write) in the 50 range, (sequential
write)/(effective random write) in the 10 range, and a proper implementation
would have a 'little' more work to do... (only the checkpoint task can write
BM_CHECKPOINT_NEEDED buffers, keeping them dirty, and so on)

Didier


Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-24 Thread Robert Haas
On Tue, Jul 23, 2013 at 12:13 PM, Greg Smith g...@2ndquadrant.com wrote:
 On 7/23/13 10:56 AM, Robert Haas wrote:
 On Mon, Jul 22, 2013 at 11:48 PM, Greg Smith g...@2ndquadrant.com wrote:

 We know that a 1GB relation segment can take a really long time to write
 out.  That could include up to 128 changed 8K pages, and we allow all of
 them to get dirty before any are forced to disk with fsync.

 By my count, it can include up to 131,072 changed 8K pages.

 Even better!  I can pinpoint exactly what time last night I got tired enough
 to start making trivial mistakes.  Everywhere I said 128 it's actually
 131,072, which just changes the range of the GUC I proposed.

 Getting the number right really highlights just how bad the current
 situation is.  Would you expect the database to dump up to 128K writes into
 a file and then have low latency when it's flushed to disk with fsync?  Of
 course not.  But that's the job the checkpointer process is trying to do
 right now.  And it's doing it blind--it has no idea how many dirty pages
 might have accumulated before it started.

 I'm not exactly sure how best to use the information collected.  fsync every
 N writes is one approach.  Another is to use accumulated writes to predict
 how long fsync on that relation should take.  Whenever I tried to spread
 fsync calls out before, the scale of the piled up writes from backends was
 the input I really wanted available.  The segment write count gives an
 alternate way to sort the blocks too, you might start with the heaviest hit
 ones.

 In all these cases, the fundamental I keep coming back to is wanting to cue
 off past write statistics.  If you want to predict relative I/O delay times
 with any hope of accuracy, you have to start the checkpoint knowing
 something about the backend and background writer activity since the last
 one.

So, I don't think this is a bad idea; in fact, I think it'd be a good
thing to explore.  The hard part is likely to be convincing ourselves
of anything about how well or poorly it works on arbitrary hardware
under arbitrary workloads, but we've got to keep trying things until
we find something that works well, so why not this?

One general observation is that there are two bad things that happen
when we checkpoint.  One is that we force all of the data in RAM out
to disk, and the other is that we start doing lots of FPIs.  Both of
these things harm throughput.  Your proposal allows the user to make
the first of those behaviors more frequent without making the second
one more frequent.  That idea seems promising, and it also seems to
admit of many variations.  For example, instead of issuing an fsync
after N OS writes to a particular file, we could fsync the file
with the most writes every K seconds.  That way, if the system has
busy and idle periods, we'll effectively catch up on our fsyncs when
the system isn't that busy, and we won't bunch them up too much if
there's a sudden surge of activity.

Now that's just a shot in the dark and there might be reasons why it's
terrible, but I just generally offer it as food for thought that the
triggering event for the extra fsyncs could be chosen via a multitude
of different algorithms, and as you hack through this it might be
worth trying a few different possibilities.
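
A bare-bones sketch of one such trigger, the "fsync the file with the most
writes every K seconds" variant mentioned above, with invented names and data
structures purely to show the triggering logic (nothing here reflects actual
checkpointer code):

    #include <unistd.h>

    #define MAX_SEGMENTS 1024

    /* Hypothetical table of open segment files and how many OS writes each
     * has absorbed since it was last fsync'd. */
    typedef struct SegmentCounter
    {
        int     fd;
        long    writes_since_sync;
    } SegmentCounter;

    static SegmentCounter segments[MAX_SEGMENTS];
    static int            nsegments;

    /* Run every K seconds: sync only the segment with the most accumulated
     * writes, so fsync work catches up during idle periods instead of
     * bunching up behind the next checkpoint. */
    static void
    sync_hottest_segment(void)
    {
        int best = -1;

        for (int i = 0; i < nsegments; i++)
        {
            if (segments[i].writes_since_sync > 0 &&
                (best < 0 ||
                 segments[i].writes_since_sync > segments[best].writes_since_sync))
                best = i;
        }

        if (best >= 0)
        {
            fsync(segments[best].fd);
            segments[best].writes_since_sync = 0;
        }
    }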

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-23 Thread Peter Geoghegan
On Mon, Jul 22, 2013 at 8:48 PM, Greg Smith g...@2ndquadrant.com wrote:
 And I can't get too excited about making this as my volunteer effort when I
 consider what the resulting credit will look like.  Coding is by far the
 smallest part of work like this, first behind coming up with the design in
 the first place.  And both of those are way, way behind how long review
 benchmarking takes on something like this.  The way credit is distributed
 for this sort of feature puts coding first, design not credited at all, and
 maybe you'll see some small review credit for benchmarks.  That's completely
 backwards from the actual work ratio.  If all I'm getting out of something
 is credit, I'd at least like it to be an appropriate amount of it.

FWIW, I think that's a reasonable request.


-- 
Peter Geoghegan




Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-23 Thread Robert Haas
On Mon, Jul 22, 2013 at 11:48 PM, Greg Smith g...@2ndquadrant.com wrote:
 Recently I've been dismissing a lot of suggested changes to checkpoint fsync
 timing without suggesting an alternative.  I have a simple one in mind that
 captures the biggest problem I see:  that the number of backend and
 checkpoint writes to a file are not connected at all.

 We know that a 1GB relation segment can take a really long time to write
 out.  That could include up to 128 changed 8K pages, and we allow all of
 them to get dirty before any are forced to disk with fsync.

By my count, it can include up to 131,072 changed 8K pages.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-23 Thread Greg Smith

On 7/23/13 10:56 AM, Robert Haas wrote:

On Mon, Jul 22, 2013 at 11:48 PM, Greg Smith g...@2ndquadrant.com wrote:

We know that a 1GB relation segment can take a really long time to write
out.  That could include up to 128 changed 8K pages, and we allow all of
them to get dirty before any are forced to disk with fsync.


By my count, it can include up to 131,072 changed 8K pages.


Even better!  I can pinpoint exactly what time last night I got tired 
enough to start making trivial mistakes.  Everywhere I said 128 it's 
actually 131,072, which just changes the range of the GUC I proposed.


Getting the number right really highlights just how bad the current 
situation is.  Would you expect the database to dump up to 128K writes 
into a file and then have low latency when it's flushed to disk with 
fsync?  Of course not.  But that's the job the checkpointer process is 
trying to do right now.  And it's doing it blind--it has no idea how 
many dirty pages might have accumulated before it started.


I'm not exactly sure how best to use the information collected.  fsync 
every N writes is one approach.  Another is to use accumulated writes to 
predict how long fsync on that relation should take.  Whenever I tried 
to spread fsync calls out before, the scale of the piled up writes from 
backends was the input I really wanted available.  The segment write 
count gives an alternate way to sort the blocks too, you might start 
with the heaviest hit ones.


In all these cases, the fundamental I keep coming back to is wanting to 
cue off past write statistics.  If you want to predict relative I/O 
delay times with any hope of accuracy, you have to start the checkpoint 
knowing something about the backend and background writer activity since 
the last one.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

