Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

Bruce Momjian wrote:
 I would be interested to know if you have the background write process
 writing old dirty buffers to kernel buffers continually if the sync()
 load is diminished.  What this does is to push more dirty buffers into
 the kernel cache in hopes the OS will write those buffers on its own
 before the checkpoint does its write/sync work.  This might allow us to
 reduce sync() load while preventing the need for O_SYNC/fsync().

I tried that first. Linux 2.4 does not write them out on its own unless 
you tell it to by reducing the dirty data block aging time with update(8). 
So you have to force it to utilize the write bandwidth in the meantime. 
For that you have to call sync() or fsync() on something.

Maybe O_SYNC is not as bad an option as it seems. In my patch, the 
checkpointer flushes the buffers in LRU order, meaning it flushes the 
least recently used ones first. This has the side effect that buffers 
returned for replacement (on a cache miss, when the backend needs to 
read the block) are most likely to be flushed/clean. So it reduces the 
write load of backends and thus the probability that a backend is ever 
blocked waiting on an O_SYNC'd write().
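
A minimal sketch of the LRU-order flushing idea, in Python rather than the C of the actual patch. The class and method names (BufferPool, flush_lru, replace) are hypothetical and the "write" is only a flag flip; the point is to show why cleaning from the LRU end means a backend usually finds a clean victim, and the two counters are the kind of statistics one could gather.

```python
from collections import OrderedDict


class BufferPool:
    """Toy buffer pool ordered from LRU (first) to MRU (last)."""

    def __init__(self, page_ids):
        # page id -> dirty flag
        self.buffers = OrderedDict((page_id, False) for page_id in page_ids)
        self.checkpointer_writes = 0
        self.backend_writes = 0

    def touch(self, page_id, dirty=False):
        """A backend uses a page: move it to the MRU end, optionally dirtying it."""
        was_dirty = self.buffers.pop(page_id)
        self.buffers[page_id] = was_dirty or dirty

    def flush_lru(self, max_writes):
        """Checkpointer pass: clean up to max_writes dirty buffers, LRU first."""
        written = 0
        for page_id, dirty in list(self.buffers.items()):
            if written >= max_writes:
                break
            if dirty:
                self.buffers[page_id] = False  # stands in for write()
                written += 1
        self.checkpointer_writes += written
        return written

    def replace(self, new_page_id):
        """Cache miss: evict the LRU buffer, writing it first if it is dirty."""
        victim, dirty = self.buffers.popitem(last=False)
        if dirty:
            self.backend_writes += 1  # the backend had to write for itself
        self.buffers[new_page_id] = False
        return victim
```

With four dirty buffers and a checkpointer pass that cleans two, the next two replacements cost the backend no write; only the third one does.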

I will add some counters and gather some statistics on how often the 
backend, in comparison to the checkpointer, calls write().

 Perhaps sync() is bad partly because the checkpoint runs through all the
 dirty shared buffers and writes them all to the kernel and then issues
 sync() almost guaranteeing a flood of writes to the disk.  This method
 would find fewer dirty buffers in the shared buffer cache, and therefore
 fewer kernel writes needed by sync().

I don't understand this. Which method, and how would it reduce the number 
of page buffers the backends modify?

Jan

---

Jan Wieck wrote:
Tom Lane wrote:

 Jan Wieck [EMAIL PROTECTED] writes:
 
 How I can see the background writer operating is that he's keeping the 
 buffers in the order of the LRU chain(s) clean, because those are the 
 buffers that most likely get replaced soon. In my experimental ARC code 
 it would traverse the T1 and T2 queues from LRU to MRU, write out n1 and 
 n2 dirty buffers (n1+n2 configurable), then fsync all files that have 
 been involved in that, nap depending on where he got down the queues (to 
 increase the write rate when running low on clean buffers), and do it 
 all over again.
 
 You probably need one more knob here: how often to issue the fsyncs.
 I'm not convinced once per outer loop is a sufficient answer.
 Otherwise this is sounding pretty good.

This is definitely heading in the right direction.

I currently have a crude and ugly hacked system that does checkpoints 
every minute but stretches them out over the whole time. It writes out 
the dirty buffers in T1+T2 LRU order intermixed, stretches out the flush 
over the whole checkpoint interval and does sync()+usleep() every 32 
blocks (if it has time to do this).
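
The pacing arithmetic behind such a stretched-out checkpoint can be sketched as follows. This is an illustration only: the function name nap_per_batch is hypothetical, and it assumes the writes and the sync() themselves take no time, which is exactly what "if it has time to do this" warns is not true in practice.

```python
def nap_per_batch(dirty_buffers, interval_seconds, batch_size=32):
    """Seconds to sleep after each batch so the flush spans the interval."""
    if dirty_buffers <= 0:
        return float(interval_seconds)  # nothing to write: sleep the whole interval
    batches = -(-dirty_buffers // batch_size)  # ceiling division
    return interval_seconds / batches
```

For example, 6400 dirty buffers spread over a 60 second interval in batches of 32 means 200 batches, so a nap of about 0.3 seconds after each sync().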

This is clearly the wrong way to implement it, but ...

The same system has ARC and delayed vacuum. With normal, unmodified 
checkpoints every 300 seconds, the transaction response time for 
new_order still peaks at over 30 seconds (5 is already too much), so the 
system basically comes to a freeze during a checkpoint.

Now with this high-frequency sync()ing and checkpointing by the minute, 
the entire system load levels out really nicely. Basically it's constantly 
checkpointing. So maybe the thing we're looking for is to make the 
checkpoint process the background buffer writer process and let it 
checkpoint 'round the clock. Of course, with a bit more selectivity on 
what to fsync, and without doing a system-wide sync() every 10-500 milliseconds :-)

Jan

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #
---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?
   http://www.postgresql.org/docs/faqs/FAQ.html





Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

Bruce Momjian wrote:

 Now, O_SYNC is going to force every write to the disk.  If we have a
 transaction that has to write lots of buffers (has to write them to
 reuse the shared buffer)

So make the background writer/checkpointer keep the LRU head clean. I 
explained that 3 times now.

Jan


Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

Jan Wieck wrote:
 Bruce Momjian wrote:
 
  Now, O_SYNC is going to force every write to the disk.  If we have a
  transaction that has to write lots of buffers (has to write them to
  reuse the shared buffer)
 
 So make the background writer/checkpointer keeping the LRU head clean. I 
 explained that 3 times now.

If the background cleaner has to not just write() but write/fsync or
write/O_SYNC, it isn't going to be able to clean them fast enough.  It
creates a bottleneck where we didn't have one before.

We are trying to eliminate an I/O storm during checkpoint, but the
solutions seem to be making the non-checkpoint times slower.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073


Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

2003-11-10 Thread Tom Lane

Bruce Momjian [EMAIL PROTECTED] writes:
 Now, if we are sure that writes will happen only in the checkpoint
 process, O_SYNC would be OK, I guess, but will we ever be sure of that?

This is a performance issue, not a correctness issue.  It's okay for
backends to wait for writes as long as it happens very infrequently.
The question is whether we can design a background dirty-buffer writer
that works well enough to make it uncommon for backends to have to
write dirty buffers for themselves.  If we can, then doing all the
writes O_SYNC would not be a problem.

(One possibility that could help improve the odds is to allow a certain
amount of slop in the LRU buffer reuse policy --- that is, if you see
the buffer at the tail of the LRU list is dirty, allow one of the next
few buffers to be taken instead, if it's clean.  Or just keep separate
lists for dirty and clean buffers.)
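
The "slop" idea can be sketched like this. It is a toy illustration, not PostgreSQL code: pick_victim and the slop parameter are hypothetical names, and the buffer list is reduced to a list of dirty flags ordered from the replacement end.

```python
def pick_victim(dirty_flags, slop=4):
    """Return the index of the buffer to reuse.

    dirty_flags is ordered from the replacement end of the LRU list:
    index 0 is the buffer a strict LRU policy would take next.
    """
    if not dirty_flags:
        raise ValueError("no buffers")
    # Look at the strict-LRU candidate plus the next `slop` buffers.
    for index, dirty in enumerate(dirty_flags[:slop + 1]):
        if not dirty:
            return index
    return 0  # every candidate is dirty: fall back to strict LRU and write it
```

A larger slop lowers the chance that a backend must write a dirty buffer itself, at the price of deviating further from strict LRU order.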

regards, tom lane


Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

Jan Wieck wrote:
 Bruce Momjian wrote:
  I would be interested to know if you have the background write process
  writing old dirty buffers to kernel buffers continually if the sync()
  load is diminished.  What this does is to push more dirty buffers into
  the kernel cache in hopes the OS will write those buffers on its own
  before the checkpoint does its write/sync work.  This might allow us to
  reduce sync() load while preventing the need for O_SYNC/fsync().
 
 I tried that first. Linux 2.4 does not, as long as you don't tell it by 
 reducing the dirty data block aging time with update(8). So you have to 
 force it to utilize the write bandwidth in the meantime. For that you 
 have to call sync() or fsync() on something.
 
 Maybe O_SYNC is not as bad an option as it seems. In my patch, the 
 checkpointer flushes the buffers in LRU order, meaning it flushes the 
 least recently used ones first. This has the side effect that buffers 
 returned for replacement (on a cache miss, when the backend needs to 
 read the block) are most likely to be flushed/clean. So it reduces the 
 write load of backends and thus the probability that a backend is ever 
 blocked waiting on an O_SYNC'd write().
 
 I will add some counters and gather some statistics how often the 
 backend in comparison to the checkpointer calls write().

OK, new idea.  How about if you write() the buffers, mark them as clean
and unlock them, then issue fsync().  The advantage here is that we can
allow the buffer to be reused while we wait for the fsync to complete. 
Obviously, O_SYNC is not going to allow that.  Another idea --- if
fsync() is slow because it can't find the dirty buffers, use write() to
write the buffers, copy the buffer to local memory, mark it as clean,
then open the file with O_SYNC and write it again.  Of course, I am just
throwing out ideas here.  The big thing I am concerned about is that
reusing buffers not take too long.

  Perhaps sync() is bad partly because the checkpoint runs through all the
  dirty shared buffers and writes them all to the kernel and then issues
  sync() almost guaranteeing a flood of writes to the disk.  This method
  would find fewer dirty buffers in the shared buffer cache, and therefore
  fewer kernel writes needed by sync().
 
 I don't understand this? How would what method reduce the number of page 
 buffers the backends modify?

What I was saying is that if we only write() just before a checkpoint,
we never give the kernel a chance to write the buffers on its own.  I
figured if we wrote them earlier, the kernel might write them for us and
sync wouldn't need to do it.


Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

What bothers me a little is that you keep telling us that you have all 
that great code from SRA. Do you have any idea when they intend to share 
this with us and contribute the stuff? I mean at least some pieces 
maybe? You personally got all the code from NuSphere AKA PeerDirect even 
weeks before it got released. Did any PostgreSQL developer other than 
you ever look at the SRA code?

Jan

Bruce Momjian wrote:

scott.marlowe wrote:
On Tue, 4 Nov 2003, Tom Lane wrote:

 Jan Wieck [EMAIL PROTECTED] writes:
  What still needs to be addressed is the IO storm caused by checkpoints. I 
  see it much relaxed when stretching out the BufferSync() over most of 
  the time until the next one should occur. But the kernel sync at its 
  end still pushes the system hard against the wall.
 
 I have never been happy with the fact that we use sync(2) at all.  Quite
 aside from the I/O storm issue, sync() is really an unsafe way to do a
 checkpoint, because there is no way to be certain when it is done.  And
 on top of that, it does too much, because it forces syncing of files
 unrelated to Postgres.
 
 I would like to see us go over to fsync, or some other technique that
 gives more certainty about when the write has occurred.  There might be
 some scope that way to allow stretching out the I/O, too.
 
 The main problem with this is knowing which files need to be fsync'd.

Wasn't this a problem that the win32 port had to solve by keeping a list 
of all files that need fsyncing since Windows doesn't do sync() in the 
classical sense?  If so, then could we use that code to keep track of the 
files that need fsyncing?

Yes, I have that code from SRA.  They used threading, so they recorded
all the open files in local memory and opened/fsync/closed them for
checkpoints.  We have to store the file names in a shared area, perhaps
an area of shared memory with an overflow to a disk file.
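
As a rough single-process illustration of "record the files, then open/fsync/close them at checkpoint", here is a Python sketch. The class PendingFsyncs is hypothetical; it keeps the names in ordinary process memory, so it does not address the shared-memory and overflow question raised above, and it ignores short writes.

```python
import os


class PendingFsyncs:
    """Remember which files were written since the last checkpoint."""

    def __init__(self):
        self.paths = set()

    def write_block(self, path, offset, data):
        """Write data at offset and note that path needs an fsync later."""
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
        try:
            os.pwrite(fd, data, offset)
        finally:
            os.close(fd)
        self.paths.add(path)

    def checkpoint(self):
        """Open, fsync, and close every recorded file; return how many."""
        synced = 0
        for path in sorted(self.paths):
            fd = os.open(path, os.O_RDWR)
            try:
                os.fsync(fd)
            finally:
                os.close(fd)
            synced += 1
        self.paths.clear()
        return synced
```

Only files that were actually written get synced, which is the selectivity that a system-wide sync() lacks.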



Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

Bruce Momjian wrote:

Tom Lane wrote:
Andrew Sullivan [EMAIL PROTECTED] writes:
 On Sun, Nov 02, 2003 at 01:00:35PM -0500, Tom Lane wrote:
 real traction we'd have to go back to the take over most of RAM for
 shared buffers approach, which we already know to have a bunch of
 severe disadvantages.
 I know there are severe disadvantages in the current implementation,
 but are there in-principle severe disadvantages?
Yes.  For one, since we cannot change the size of shared memory
on-the-fly (at least not portably), there is no opportunity to trade off
memory usage dynamically between processes and disk buffers.  For
another, on many systems shared memory is subject to being swapped out.
Swapping out dirty buffers is a performance killer, because they must be
swapped back in again before they can be written to where they should
have gone.  The only way to avoid this is to keep the number of shared
buffers small enough that they all remain fairly hot (recently used)
and so the kernel won't be tempted to swap out any part of the region.
Agreed, we can't resize shared memory, but I don't think most OS's swap
out shared memory, and even if they do, they usually have a kernel

We can't resize shared memory because we allocate the whole thing in one 
big hump - which causes the shmmax problem BTW. If we allocate that in 
chunks of multiple blocks, we only have to give it a total maximum size 
to get the hash tables and other stuff right from the beginning. But the 
vast majority of memory, the buffers themselves, can be made adjustable at 
runtime.

Jan

configuration parameter to lock it into kernel memory.  All the old
unixes locked the shared memory into kernel address space and in fact
this is why many of them required a kernel recompile to increase shared
memory.  I hope the ones that have pageable shared memory have a way to
prevent it --- at least FreeBSD does, not sure about Linux.
Now, the disadvantages of large kernel cache, small PostgreSQL buffer
cache is that data has to be transferred to/from the kernel buffers, and
second, we can't control the kernel's cache replacement strategy, and
will probably not be able to in the near future, while we do control our
own buffer cache replacement strategy.
Looking at the advantages/disadvantages, a large shared buffer cache
looks pretty good to me.



Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

Bruce Momjian wrote:

 Jan Wieck wrote:
  Bruce Momjian wrote:

   Now, O_SYNC is going to force every write to the disk.  If we have a
   transaction that has to write lots of buffers (has to write them to
   reuse the shared buffer)

  So make the background writer/checkpointer keeping the LRU head clean. I 
  explained that 3 times now.

 If the background cleaner has to not just write() but write/fsync or
 write/O_SYNC, it isn't going to be able to clean them fast enough.  It
 creates a bottleneck where we didn't have one before.

 We are trying to eliminate an I/O storm during checkpoint, but the
 solutions seem to be making the non-checkpoint times slower.

It looks as if you're assuming that I am making the backends unable to 
write on their own, so that they have to wait on the checkpointer. I 
never said that.

If the checkpointer keeps the LRU heads clean, that takes write load off 
the backends. Sure, they will be able to dirty pages faster, at least 
theoretically; in practice, if you have a reasonably good cache hit rate, 
they will just find already dirty buffers where they just add some more 
dust.

If after all the checkpointer (doing write()+whateversync) is not able 
to keep up with the speed of buffers getting dirtied, the backends will 
have to do some write()'s again, because they will eat up the clean 
buffers at the LRU head and pass the checkpointer.

Also please notice another little change in behaviour. The old code just 
went through the buffer cache sequentially, possibly flushing buffers 
that got dirtied after the checkpoint started, which is way ahead of 
time (they need to be flushed for the next checkpoint, not now). That 
means that if the same buffer gets dirtied again after that, we wasted 
a full disk write on it. My new code creates a list of dirty blocks at 
the beginning of the checkpoint, and flushes only those that are still 
dirty at the time it gets to them.
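
The todo-list behaviour can be illustrated with a small Python simulation. It is a sketch, not the posted patch: checkpoint_writes and the events argument (which stands in for concurrent backend activity) are hypothetical, and a "write" is again just a flag flip.

```python
def checkpoint_writes(dirty, events):
    """Flush only buffers that were dirty when the checkpoint began.

    dirty maps buffer id -> bool and is updated in place.
    events maps a step number to a list of (buffer_id, now_dirty) changes
    applied just before that step, simulating backends running concurrently.
    Returns the buffer ids the checkpoint actually wrote.
    """
    todo = [buf for buf, is_dirty in dirty.items() if is_dirty]  # snapshot
    written = []
    for step, buf in enumerate(todo):
        for changed_buf, now_dirty in events.get(step, []):
            dirty[changed_buf] = now_dirty
        if dirty[buf]:          # still dirty when we reach it
            dirty[buf] = False  # stands in for write()
            written.append(buf)
    return written
```

A buffer dirtied after the snapshot is left for the next checkpoint, and a buffer that some backend already wrote in the meantime is skipped.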

Jan


Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

Bruce Momjian wrote:

Jan Wieck wrote:
Bruce Momjian wrote:
 I would be interested to know if you have the background write process
 writing old dirty buffers to kernel buffers continually if the sync()
 load is diminished.  What this does is to push more dirty buffers into
 the kernel cache in hopes the OS will write those buffers on its own
 before the checkpoint does its write/sync work.  This might allow us to
 reduce sync() load while preventing the need for O_SYNC/fsync().
I tried that first. Linux 2.4 does not, as long as you don't tell it by 
reducing the dirty data block aging time with update(8). So you have to 
force it to utilize the write bandwidth in the meantime. For that you 
have to call sync() or fsync() on something.

Maybe O_SYNC is not as bad an option as it seems. In my patch, the 
checkpointer flushes the buffers in LRU order, meaning it flushes the 
least recently used ones first. This has the side effect that buffers 
returned for replacement (on a cache miss, when the backend needs to 
read the block) are most likely to be flushed/clean. So it reduces the 
write load of backends and thus the probability that a backend is ever 
blocked waiting on an O_SYNC'd write().

I will add some counters and gather some statistics how often the 
backend in comparison to the checkpointer calls write().

OK, new idea.  How about if you write() the buffers, mark them as clean
and unlock them, then issue fsync().  The advantage here is that we can

Not really new. I think in my first mail I wrote that I simplified this 
new mdfsyncrecent() function by calling sync() instead. Other than 
that, the code I posted worked exactly that way.

Jan


Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

Tom Lane wrote:

Bruce Momjian [EMAIL PROTECTED] writes:
Now, if we are sure that writes will happen only in the checkpoint
process, O_SYNC would be OK, I guess, but will we ever be sure of that?
This is a performance issue, not a correctness issue.  It's okay for
backends to wait for writes as long as it happens very infrequently.
The question is whether we can design a background dirty-buffer writer
that works well enough to make it uncommon for backends to have to
write dirty buffers for themselves.  If we can, then doing all the
writes O_SYNC would not be a problem.
(One possibility that could help improve the odds is to allow a certain
amount of slop in the LRU buffer reuse policy --- that is, if you see
the buffer at the tail of the LRU list is dirty, allow one of the next
few buffers to be taken instead, if it's clean.  Or just keep separate
lists for dirty and clean buffers.)

If the checkpointer is writing in LRU order (which is the order buffers 
normally get replaced), this happening would mean that the backends have 
replaced all clean buffers at the LRU head, and that can only happen if 
the currently running checkpointer is working way too slowly. If it is 
more than 30 seconds away from its target finish time, it would be a 
good idea to restart by building a new (and now guaranteed long) todo list 
and writing faster (but starting again at the LRU head). If it's too late 
for that, stop napping, finish this checkpoint NOW and start a new one 
immediately.
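
Read as a decision rule, the reaction to a backend running into a dirty buffer at the LRU head might look like the following sketch. The function name and the returned labels are hypothetical; only the 30 second threshold comes from the message.

```python
def react_to_dirty_lru_head(seconds_to_target, restart_threshold=30.0):
    """Decide what a lagging checkpointer should do.

    seconds_to_target is the time left until the checkpoint's planned finish.
    """
    if seconds_to_target > restart_threshold:
        # Enough time left: rebuild the todo list and write at a higher rate,
        # starting again at the LRU head.
        return "restart_faster"
    # Too late for that: stop napping, finish now, start the next one at once.
    return "finish_now"
```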

Jan


Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

2003-11-10 Thread Zeugswetter Andreas SB SD


 that works well enough to make it uncommon for backends to have to
 write dirty buffers for themselves.  If we can, then doing all the
 writes O_SYNC would not be a problem.

One problem with O_SYNC would be that the OS does not group writes any 
more. So the code would need to either do its own sorting and grouping
(256k) or use aio, or you won't be able to get the maximum out of the disks.
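
The "own sorting and grouping" could be sketched as below: sort the block numbers of one file and merge adjacent blocks into runs of at most 256 kB, so each run can go out as one larger write. The function name group_writes is hypothetical, and the 8192-byte block size is PostgreSQL's default page size, assumed here.

```python
BLOCK_SIZE = 8192            # assumed: PostgreSQL's default page size
MAX_RUN_BYTES = 256 * 1024   # the grouping size mentioned in the message


def group_writes(block_numbers, block_size=BLOCK_SIZE, max_run_bytes=MAX_RUN_BYTES):
    """Sort block numbers and merge adjacent ones into (start, count) runs."""
    max_blocks = max_run_bytes // block_size
    runs = []
    for block in sorted(set(block_numbers)):
        if runs:
            start, count = runs[-1]
            if block == start + count and count < max_blocks:
                runs[-1] = (start, count + 1)  # extend the current run
                continue
        runs.append((block, 1))  # start a new run
    return runs
```

For instance, blocks 7, 3, 4, 5, 9, 8 collapse into two runs of three blocks each, and 40 consecutive blocks are split into a 32-block run and an 8-block run.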

Andreas


Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

2003-11-10 Thread Tom Lane

Zeugswetter Andreas SB SD [EMAIL PROTECTED] writes:
 One problem with O_SYNC would be that the OS does not group writes any 
 more. So the code would need to either do its own sorting and grouping
 (256k) or use aio, or you won't be able to get the maximum out of the disks.

Or just run multiple writer processes, which I believe is Oracle's
solution.

regards, tom lane


Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

Bruce Momjian [EMAIL PROTECTED] writes:
 Now, the disadvantages of large kernel cache, small PostgreSQL buffer
 cache is that data has to be transferred to/from the kernel buffers, and
 second, we can't control the kernel's cache replacement strategy, and
 will probably not be able to in the near future, while we do control our
 own buffer cache replacement strategy.

The intent of the posix_fadvise() work is to at least provide a
few hints about our I/O patterns to the kernel's buffer
cache. Although only Linux supports it (right now), that should
hopefully improve the status quo for a fairly significant portion of
our user base.
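
For readers unfamiliar with the call, this is roughly what passing such a hint looks like, shown through Python's os.posix_fadvise wrapper. The helper name advise_sequential and the choice of POSIX_FADV_SEQUENTIAL are assumptions for illustration; the message does not say which advice values the PostgreSQL work would use.

```python
import os


def advise_sequential(fd):
    """Hint that fd will be read sequentially; return True if the hint was sent."""
    if not hasattr(os, "posix_fadvise"):
        return False  # platform without posix_fadvise()
    # Offset 0 with length 0 means "from the start to the end of the file".
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
    return True
```

The call is advisory only: the kernel is free to ignore it, and the data read or written is the same either way.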

I'd be curious to see a comparison of the cost of transferring data
from the kernel's buffers to the PG bufmgr.

-Neil



Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

2003-11-10 Thread Zeugswetter Andreas SB SD


  One problem with O_SYNC would be that the OS does not group writes any 
  more. So the code would need to either do its own sorting and grouping
  (256k) or use aio, or you won't be able to get the maximum out of the disks.
 
 Or just run multiple writer processes, which I believe is Oracle's
 solution.

That does not help, since for O_SYNC the OSes (those I know) do not group those 
writes together. Oracle allows more than one writer in order to keep more than one 
disk (subsystem) busy and to circumvent other per-process limitations (mainly on 
platforms without AIO). 

Andreas


Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

2003-11-10 Thread Larry Rosenman



--On Monday, November 10, 2003 11:40:45 -0500 Neil Conway 
[EMAIL PROTECTED] wrote:

 Bruce Momjian [EMAIL PROTECTED] writes:
  Now, the disadvantages of large kernel cache, small PostgreSQL buffer
  cache is that data has to be transferred to/from the kernel buffers, and
  second, we can't control the kernel's cache replacement strategy, and
  will probably not be able to in the near future, while we do control our
  own buffer cache replacement strategy.

 The intent of the posix_fadvise() work is to at least provide a
 few hints about our I/O patterns to the kernel's buffer
 cache. Although only Linux supports it (right now), that should
 hopefully improve the status quo for a fairly significant portion of
 our user base.

 I'd be curious to see a comparison of the cost of transferring data
 from the kernel's buffers to the PG bufmgr.

You might also look at Veritas' advisory stuff.  If you want exact doc
pointers, I can provide them, but they are in the Filesystem section
of http://www.lerctr.org:8458/

LER

-Neil



--
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 972-414-9812 E-Mail: [EMAIL PROTECTED]
US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749



Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

Zeugswetter Andreas SB SD wrote:

   One problem with O_SYNC would be that the OS does not group writes any 
   more. So the code would need to either do its own sorting and grouping
   (256k) or use aio, or you won't be able to get the maximum out of the disks.

  Or just run multiple writer processes, which I believe is Oracle's
  solution.

 That does not help, since for O_SYNC the OSes (those I know) do not group those 
 writes together. Oracle allows more than one writer in order to keep more than one 
 disk (subsystem) busy and to circumvent other per-process limitations (mainly on 
 platforms without AIO). 

Yes, I think the best way would be to let the background process write a 
bunch of pages, then fsync() the files written to. If one tends to have 
many dirty buffers to the same file, this will group them together and 
the OS can optimize that. If one really has completely random access, 
then there is nothing to group.
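
A minimal sketch of "write a bunch of pages, then fsync() the files written to", in Python. The function flush_batch and its (path, offset, data) page tuples are hypothetical; the point is only that all writes to one file are issued with plain write calls and followed by a single fsync(), which leaves the grouping to the OS. Short writes are ignored for brevity.

```python
import os
from collections import defaultdict


def flush_batch(pages):
    """Write (path, offset, data) pages, then fsync each touched file once.

    Returns the number of fsync() calls issued.
    """
    by_file = defaultdict(list)
    for path, offset, data in pages:
        by_file[path].append((offset, data))

    fsyncs = 0
    for path, writes in by_file.items():
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
        try:
            for offset, data in sorted(writes):  # ascending offset order
                os.pwrite(fd, data, offset)
            os.fsync(fd)  # one fsync covers every page written to this file
            fsyncs += 1
        finally:
            os.close(fd)
    return fsyncs
```

With many dirty pages in the same file the number of fsync() calls stays at one per file; with completely random access across many files it approaches one per page, which matches the remark that there is then nothing to group.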

Jan


Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

Larry Rosenman [EMAIL PROTECTED] writes:
 You might also look at Veritas' advisory stuff.

Thanks for the suggestion -- it looks like we can make use of
this. For the curious, the cache advisory API is documented here:

http://www.lerctr.org:8458/en/man/html.7/vxfsio.7.html
http://www.lerctr.org:8458/en/ODM_FSadmin/fssag-9.html#MARKER-9-1

Note that unlike for posix_fadvise(), the docs for this functionality
explicitly state:

Some advisories are currently maintained on a per-file, not a
per-file-descriptor, basis. This means that only one set of
advisories can be in effect for all accesses to the file. If two
conflicting applications set different advisories, both use the
last advisories that were set.

-Neil



Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

Jan Wieck wrote:
 What bothers me a little is that you keep telling us that you have all
 that great code from SRA. Do you have any idea when they intend to share
 this with us and contribute the stuff? I mean at least some pieces
 maybe? You personally got all the code from NuSphere AKA PeerDirect even
 weeks before it got released. Did any PostgreSQL developer other than
 you ever look at the SRA code?

I can get the open/fsync/write/close patch from SRA released, I think.
Let me ask them now.

Tom has seen the Win32 tarball (with SRA's approval) because he wanted
to research if threading was something we should pursue. I haven't
heard a report back from him yet. If you would like to see the tarball,
I can ask them.

Agreed, I got the PeerDirect/NuSphere code very early and it was a help.
I am sure I can get some of it released. I haven't pursued the sync
Win32 patch because it is based on a threaded backend model, so it is
different from how it needs to be done in a process model (all shared
file descriptors). However, I will need to get approval in the end
for Win32 anyway, because I need that Win32-specific part.

I just looked at the sync() call in the code and it just did _flushall:

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vccore98/html/_crt__flushall.asp

I can share this because I know it was discussed when someone (SRA?)
realized _commit() didn't force all buffers to disk. In fact, _commit
is fsync().

I think the only question was whether _flushall() fsyncs file descriptors
that have been closed. Perhaps SRA keeps the file descriptors open
until after the checkpoint, or it fsyncs closed files with dirty
buffers. Tatsuo?


Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

2003-11-10 Thread Andrew Sullivan

On Sun, Nov 09, 2003 at 08:54:25PM -0800, Joe Conway wrote:
 two servers, mounted to the same data volume, and some kind of 
 coordination between the writer processes. Anyone know if this is 
 similar to how Oracle handles RAC?

It is similar, yes, but there's some mighty powerful magic in that
"some kind of co-ordination".  What do you do when one of the
participants crashes, for instance?  

A

-- 

Andrew Sullivan                         204-4141 Yonge Street
Afilias Canada                        Toronto, Ontario Canada
[EMAIL PROTECTED]                              M2P 2A8
                                         +1 416 646 3304 x110



Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

Tom Lane wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
  Now, if we are sure that writes will happen only in the checkpoint
  process, O_SYNC would be OK, I guess, but will we ever be sure of that?
 
 This is a performance issue, not a correctness issue.  It's okay for
 backends to wait for writes as long as it happens very infrequently.
 The question is whether we can design a background dirty-buffer writer
 that works well enough to make it uncommon for backends to have to
 write dirty buffers for themselves.  If we can, then doing all the
 writes O_SYNC would not be a problem.

Agreed.  My concern is that right now we do write() in each backend. 
Those writes are probably pretty fast, probably as fast as a read() when
the buffer is already in the kernel cache.  The current discussion
involves centralizing most of the writes (centralization can be slower),
and having the writes forced to disk.  That seems like it could be a
double-killer.

 (One possibility that could help improve the odds is to allow a certain
 amount of slop in the LRU buffer reuse policy --- that is, if you see
 the buffer at the tail of the LRU list is dirty, allow one of the next
 few buffers to be taken instead, if it's clean.  Or just keep separate
 lists for dirty and clean buffers.)

Yes, I think you almost will have to split the LRU list into
dirty/clean, and that might make dirty buffers stay around longer.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

Jan Wieck wrote:
 Bruce Momjian wrote:
 
  Tom Lane wrote:
  Andrew Sullivan [EMAIL PROTECTED] writes:
   On Sun, Nov 02, 2003 at 01:00:35PM -0500, Tom Lane wrote:
   real traction we'd have to go back to the take over most of RAM for
   shared buffers approach, which we already know to have a bunch of
   severe disadvantages.
  
   I know there are severe disadvantages in the current implementation,
   but are there in-principle severe disadvantages?
  
  Yes.  For one, since we cannot change the size of shared memory
  on-the-fly (at least not portably), there is no opportunity to trade off
  memory usage dynamically between processes and disk buffers.  For
  another, on many systems shared memory is subject to being swapped out.
  Swapping out dirty buffers is a performance killer, because they must be
  swapped back in again before they can be written to where they should
  have gone.  The only way to avoid this is to keep the number of shared
  buffers small enough that they all remain fairly hot (recently used)
  and so the kernel won't be tempted to swap out any part of the region.
  
  Agreed, we can't resize shared memory, but I don't think most OS's swap
  out shared memory, and even if they do, they usually have a kernel
 
 We can't resize shared memory because we allocate the whole thing in one 
 big hump - which causes the shmmax problem BTW. If we allocate that in 
 chunks of multiple blocks, we only have to give it a total maximum size 
 to get the hash tables and other stuff right from the beginning. But the 
 vast majority of memory, the buffers themselves, can be made adjustable at 
 runtime.

That is an interesting idea.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

Jan Wieck wrote:
  If the background cleaner has to not just write() but write/fsync or
  write/O_SYNC, it isn't going to be able to clean them fast enough.  It
  creates a bottleneck where we didn't have one before.
  
  We are trying to eliminate an I/O storm during checkpoint, but the
  solutions seem to be making the non-checkpoint times slower.
  
 
 It looks as if you're assuming that I am making the backends unable to 
 write on their own, so that they have to wait on the checkpointer. I 
 never said that.
 
 If the checkpointer keeps the LRU heads clean, that lifts off write load 
 from the backends. Sure, they will be able to dirty pages faster. 
 Theoretically, because in practice if you have a reasonably good cache 
 hitrate, they will just find already dirty buffers where they just add 
 some more dust.
 
 If after all the checkpointer (doing write()+whateversync) is not able 
 to keep up with the speed of buffers getting dirtied, the backends will 
 have to do some write()'s again, because they will eat up the clean 
 buffers at the LRU head and pass the checkpointer.

Yes, there are a couple of issues here --- first, have a background
writer to write dirty pages.  This is good, no question.  The bigger
question is removing sync() and using fsync() or O_SYNC for every write
--- if we do that, the backends doing private writes will have to fsync
their writes too, meaning if the checkpointer can't keep up, we now have
backends doing slow writes too.
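
The counters mentioned earlier in the thread (how often a backend calls write() compared with the checkpointer) can be approximated with a toy model. The sketch below is only a simulation under invented assumptions: a uniform random access pattern, a 50% chance that an access modifies the page, and a cleaner that runs every clean_every accesses and cleans up to clean_batch dirty buffers starting from the LRU head. None of these parameters comes from the patch under discussion.

```python
import random
from collections import OrderedDict

def simulate(accesses, pool_size, n_pages, clean_every, clean_batch, seed=0):
    """Count writes done by backends versus a background cleaner."""
    rng = random.Random(seed)
    pool = OrderedDict()      # page -> dirty flag; first item is the LRU head
    backend_writes = 0
    cleaner_writes = 0
    for step in range(1, accesses + 1):
        page = rng.randrange(n_pages)
        modify = rng.random() < 0.5
        if page in pool:
            # Cache hit: an already dirty buffer just stays dirty.
            dirty = pool.pop(page) or modify
        else:
            # Cache miss: evict the LRU head; a dirty victim costs the
            # backend a write of its own.
            if len(pool) >= pool_size:
                _, victim_dirty = pool.popitem(last=False)
                if victim_dirty:
                    backend_writes += 1
            dirty = modify
        pool[page] = dirty    # re-insert at the most recently used end
        if clean_every and step % clean_every == 0:
            cleaned = 0
            for p in list(pool):          # walk from the LRU head
                if cleaned == clean_batch:
                    break
                if pool[p]:
                    pool[p] = False
                    cleaned += 1
            cleaner_writes += cleaned
    return backend_writes, cleaner_writes

no_cleaner = simulate(5000, 32, 256, clean_every=0, clean_batch=0)
with_cleaner = simulate(5000, 32, 256, clean_every=10, clean_batch=4)
```

Because the cleaner only clears dirty flags and never changes the replacement order, the backend write count with the cleaner can never exceed the count without it in this model. What the model does not capture is the cost of each write, which is exactly the write() versus write()+fsync question raised above.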

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

Jan Wieck wrote:
 Bruce Momjian wrote:
 
  Jan Wieck wrote:
  Bruce Momjian wrote:
  
   Now, O_SYNC is going to force every write to the disk.  If we have a
   transaction that has to write lots of buffers (has to write them to
   reuse the shared buffer)
  
  So make the background writer/checkpointer keep the LRU head clean. I 
  explained that 3 times now.
  
  If the background cleaner has to not just write() but write/fsync or
  write/O_SYNC, it isn't going to be able to clean them fast enough.  It
  creates a bottleneck where we didn't have one before.
  
  We are trying to eliminate an I/O storm during checkpoint, but the
  solutions seem to be making the non-checkpoint times slower.
  
 
 It looks as if you're assuming that I am making the backends unable to 
 write on their own, so that they have to wait on the checkpointer. I 
 never said that.

Maybe I missed it, but are those backends now doing write or write/fsync? 
If the former, that is fine.  If the latter, it does seem slower than it
used to be.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

Jan Wieck wrote:
 Bruce Momjian wrote:
 
  Jan Wieck wrote:
  Bruce Momjian wrote:
   I would be interested to know if you have the background write process
   writing old dirty buffers to kernel buffers continually if the sync()
   load is diminished.  What this does is to push more dirty buffers into
   the kernel cache in hopes the OS will write those buffers on its own
   before the checkpoint does its write/sync work.  This might allow us to
   reduce sync() load while preventing the need for O_SYNC/fsync().
  
  I tried that first. Linux 2.4 does not, as long as you don't tell it by 
  reducing the dirty data block aging time with update(8). So you have to 
  force it to utilize the write bandwidth in the meantime. For that you 
  have to call sync() or fsync() on something.
  
  Maybe O_SYNC is not as bad an option as it seems. In my patch, the 
  checkpointer flushes the buffers in LRU order, meaning it flushes the 
  least recently used ones first. This has the side effect that buffers 
  returned for replacement (on a cache miss, when the backend needs to 
  read the block) are most likely to be flushed/clean. So it reduces the 
  write load of backends and thus the probability that a backend is ever 
  blocked waiting on an O_SYNC'd write().
  
  I will add some counters and gather some statistics on how often the 
  backend, in comparison to the checkpointer, calls write().
  
  OK, new idea.  How about if you write() the buffers, mark them as clean
  and unlock them, then issue fsync().  The advantage here is that we can
 
 Not really new, I think in my first mail I wrote that I simplified this 
 new mdfsyncrecent() function by calling sync() instead ... other than 
 that the code I posted worked exactly that way.

I am confused --- I was suggesting we call fsync after we write a few
blocks for a given table, and that was going to happen between
checkpoints.  Is the sync() happening then, or only at checkpoint time?

Sorry I am lost but there seems to be an email delay in my receiving the
replies.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

Tom Lane wrote:
 Zeugswetter Andreas SB SD [EMAIL PROTECTED] writes:
  One problem with O_SYNC would be, that the OS does not group writes any 
  more. So the code would need to either do its own sorting and grouping
  (256k) or use aio, or you won't be able to get the maximum out of the disks.
 
 Or just run multiple writer processes, which I believe is Oracle's
 solution.

Yes, that might need to be the final solution because O_SYNC will be
slow.  However, that is a big-wrench solution to removing
sync() --- it would be nice if we could find a more elegant way.

In fact, one goofy idea: if the OS does a sync every 30 seconds, we could
just write() the buffers, wait 30 seconds for the OS to issue the
sync, and then recycle the WAL buffers --- again, just a crazy thought.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

2003-11-10 Thread Larry Rosenman



--On Monday, November 10, 2003 13:40:24 -0500 Neil Conway 
[EMAIL PROTECTED] wrote:

 Larry Rosenman [EMAIL PROTECTED] writes:
  You might also look at Veritas' advisory stuff.

 Thanks for the suggestion -- it looks like we can make use of
 this. For the curious, the cache advisory API is documented here:

 http://www.lerctr.org:8458/en/man/html.7/vxfsio.7.html
 http://www.lerctr.org:8458/en/ODM_FSadmin/fssag-9.html#MARKER-9-1

 Note that unlike for posix_fadvise(), the docs for this functionality
 explicitly state:

  Some advisories are currently maintained on a per-file, not a
  per-file-descriptor, basis. This means that only one set of
  advisories can be in effect for all accesses to the file. If two
  conflicting applications set different advisories, both use the
  last advisories that were set.

BTW, if ANY developer wants to play with this, I can make an account for 
them.  I have ODM installed on lerami.lerctr.org (www.lerctr.org is a 
CNAME).

LER

--
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 972-414-9812 E-Mail: [EMAIL PROTECTED]
US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749


Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

Bruce Momjian [EMAIL PROTECTED] writes:
 Another idea --- if fsync() is slow because it can't find the dirty
 buffers, use write() to write the buffers, copy the buffer to local
 memory, mark it as clean, then open the file with O_SYNC and write
 it again.

Yuck.

Do we have any idea how many kernels are out there that implement
fsync() as poorly as HPUX apparently does? I'm just wondering if we're
contemplating spending a whole lot of effort to work around a bug that
is only present on an (old?) version of HPUX. Do typical BSD derived
kernels exhibit this behavior? What about Linux? Solaris?

-Neil


Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

Jan Wieck wrote:
 Zeugswetter Andreas SB SD wrote:
 
   One problem with O_SYNC would be, that the OS does not group writes any 
   more. So the code would need to either do its own sorting and grouping
   (256k) or use aio, or you won't be able to get the maximum out of the disks.
  
  Or just run multiple writer processes, which I believe is Oracle's
  solution.
  
  That does not help, since for O_SYNC the OS'es (those I know) do not group those 
  writes together. Oracle allows more than one writer to busy more than one 
  disk(subsystem) and circumvent other per process limitations (mainly on platforms 
  without AIO). 
 
 Yes, I think the best way would be to let the background process write a 
 bunch of pages, then fsync() the files written to. If one tends to have 
 many dirty buffers to the same file, this will group them together and 
 the OS can optimize that. If one really has completely random access, 
 then there is nothing to group.

Agreed.  This might force enough stuff out to disk that the checkpoint/sync()
would be OK.  Jan, have you tested this?
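
The "write a bunch of pages, then fsync() the files written to" approach quoted above might look roughly like the sketch below. This is a hedged illustration, not code from the patch: flush_batch, the (path, block_no, data) tuples and the 8192-byte block size are assumptions made for the example, and os.pwrite is available on Unix only.

```python
import os

BLOCK_SIZE = 8192   # assumed page size for the example

def flush_batch(dirty_pages):
    """Write a batch of dirty pages, then fsync each touched file once.

    dirty_pages is an iterable of (path, block_no, data) tuples.
    Returns the number of fsync() calls issued.
    """
    fds = {}
    try:
        # Sorting by file and block groups writes to the same file
        # together, which gives the OS a chance to coalesce them.
        for path, block_no, data in sorted(dirty_pages, key=lambda p: (p[0], p[1])):
            fd = fds.get(path)
            if fd is None:
                fd = fds[path] = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
            os.pwrite(fd, data, block_no * BLOCK_SIZE)
        # One fsync per file, not one per page.
        for fd in fds.values():
            os.fsync(fd)
    finally:
        for fd in fds.values():
            os.close(fd)
    return len(fds)
```

If the dirty pages are spread over many files with one page each, this degenerates to one fsync per page, which is the "nothing to group" case mentioned above.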

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

Jan Wieck [EMAIL PROTECTED] writes:
 We can't resize shared memory because we allocate the whole thing in
 one big hump - which causes the shmmax problem BTW. If we allocate
 that in chunks of multiple blocks, we only have to give it a total
 maximum size to get the hash tables and other stuff right from the
 beginning. But the vast majority of memory, the buffers themselves, can
 be made adjustable at runtime.

Yeah, writing a palloc()-style wrapper over shm has been suggested
before (by myself among others). You could do the shm allocation in
fixed-size blocks (say, 1 MB each), and then do our own memory
management to allocate and release smaller chunks of shm when
requested. I'm not sure what it really buys us, though: sure, we can
expand the shared buffer area to some degree, but

(a) how do we know what the right size of the shared buffer
area /should/ be? It is difficult enough to avoid running
the machine out of physical memory, let alone figure out
how much memory is being used by the kernel for the buffer
cache and how much we should use ourselves. I think the
DBA needs to configure this anyway.

(b) the amount of shm we can ultimately use is finite, so we
will still need to use a lot of caution when placing
dynamically-sized data structures in shm. A shm_alloc()
might help this somewhat, but I don't see how it would
remove the fundamental problem.
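
To make the fixed-size-block idea concrete, here is a toy model of such an allocator. It is only a sketch: plain bytearrays stand in for real shared memory segments, and the ShmArena class with its alloc/release methods is invented for this example (it is not the shm_alloc() mentioned above). The max_blocks limit plays the role of the total maximum size that has to be fixed at startup.

```python
class ShmArena:
    """Toy allocator handing out chunks from fixed-size blocks."""

    def __init__(self, block_size=1 << 20, max_blocks=4):
        self.block_size = block_size
        self.max_blocks = max_blocks   # total maximum, fixed up front
        self.blocks = []               # each bytearray models one shm segment
        self.used = []                 # bump pointer per block
        self.free = {}                 # size -> released (block, offset) chunks

    def alloc(self, size):
        if size > self.block_size:
            raise ValueError("chunk larger than one block")
        # Reuse a released chunk of the same size if there is one.
        if self.free.get(size):
            return self.free[size].pop()
        # Otherwise carve the chunk out of a block with enough room left.
        for i, used in enumerate(self.used):
            if used + size <= self.block_size:
                self.used[i] += size
                return (i, used)
        # Grow by one block, but never beyond the configured maximum.
        if len(self.blocks) == self.max_blocks:
            raise MemoryError("arena is at its configured maximum")
        self.blocks.append(bytearray(self.block_size))
        self.used.append(size)
        return (len(self.blocks) - 1, 0)

    def release(self, chunk, size):
        self.free.setdefault(size, []).append(chunk)
```

The MemoryError branch is point (b) in miniature: the arena can grow at runtime, but only up to a limit somebody has to choose in advance.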

-Neil


Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

Neil Conway wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
  Another idea --- if fsync() is slow because it can't find the dirty
  buffers, use write() to write the buffers, copy the buffer to local
  memory, mark it as clean, then open the file with O_SYNC and write
  it again.
 
 Yuck.
 
 Do we have any idea how many kernels are out there that implement
 fsync() as poorly as HPUX apparently does? I'm just wondering if we're
 contemplating spending a whole lot of effort to work around a bug that
 is only present on an (old?) version of HPUX. Do typical BSD derived
 kernels exhibit this behavior? What about Linux? Solaris?

Not sure, but it almost doesn't even matter --- any solution which has
fsync/O_SYNC/sync() in a critical path, even the path of replacing dirty
buffers, is going to be too slow, I am afraid.  It doesn't matter how
fast fsync() is; a forced write to disk in that path is going to be slow.

I think Tom's only issue with HPUX is that even if fsync is out of the
critical path (background writer) it is going to consume lots of CPU
time finding those dirty buffers --- not sure how slow that would be.
If it is really slow on HPUX, we could disable the fsync's for the
background writer and just hope the OS writes those buffers aggressively.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

Bruce Momjian wrote:

Jan Wieck wrote:
Bruce Momjian wrote:

 Jan Wieck wrote:
 Bruce Momjian wrote:
 
  Now, O_SYNC is going to force every write to the disk.  If we have a
  transaction that has to write lots of buffers (has to write them to
  reuse the shared buffer)
 
 So make the background writer/checkpointer keep the LRU head clean. I 
 explained that 3 times now.
 
 If the background cleaner has to not just write() but write/fsync or
 write/O_SYNC, it isn't going to be able to clean them fast enough.  It
 creates a bottleneck where we didn't have one before.
 
 We are trying to eliminate an I/O storm during checkpoint, but the
 solutions seem to be making the non-checkpoint times slower.
 

It looks as if you're assuming that I am making the backends unable to 
write on their own, so that they have to wait on the checkpointer. I 
never said that.

Maybe I missed it, but are those backends now doing write or write/fsync? 
If the former, that is fine.  If the latter, it does seem slower than it
used to be.

In my all_performance.v4.diff they do write and the checkpointer does 
write+sync.

Jan

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #

Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

Jan Wieck wrote:
   If the background cleaner has to not just write() but write/fsync or
   write/O_SYNC, it isn't going to be able to clean them fast enough.  It
   creates a bottleneck where we didn't have one before.
   
   We are trying to eliminate an I/O storm during checkpoint, but the
   solutions seem to be making the non-checkpoint times slower.
   
  
  It looks as if you're assuming that I am making the backends unable to 
  write on their own, so that they have to wait on the checkpointer. I 
  never said that.
  
  Maybe I missed it, but are those backends now doing write or write/fsync? 
  If the former, that is fine.  If the latter, it does seem slower than it
  used to be.
 
 In my all_performance.v4.diff they do write and the checkpointer does 
 write+sync.

Again, sorry to be confusing --- it might be good to try write/fsync from
the background writer if backends can do writes on their own too without
fsync.  The additional fsync from the background writer should reduce
disk writing during sync().  (The fsync should happen with the buffer
unlocked.)

You stated you didn't see improvement when the background writer did
non-checkpoint writes unless you tuned update(8).  Adding fsync might
correct that.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

Neil Conway wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
  Another idea --- if fsync() is slow because it can't find the dirty
  buffers, use write() to write the buffers, copy the buffer to local
  memory, mark it as clean, then open the file with O_SYNC and write
  it again.
 
 Yuck.

This idea of mine will not even work unless others are prevented from
writing that data block while I am syncing it from local memory --- what
if someone modified and wrote that block before my local copy did its
O_SYNC write?  I would overwrite their new data.  It was just a crazy idea.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

Bruce Momjian wrote:

Jan Wieck wrote:
Zeugswetter Andreas SB SD wrote:

  One problem with O_SYNC would be, that the OS does not group writes any 
  more. So the code would need to either do its own sorting and grouping
  (256k) or use aio, or you won't be able to get the maximum out of the disks.
 
 Or just run multiple writer processes, which I believe is Oracle's
 solution.
 
 That does not help, since for O_SYNC the OS'es (those I know) do not group those 
 writes together. Oracle allows more than one writer to busy more than one disk(subsystem) and circumvent other per process limitations (mainly on platforms without AIO). 

Yes, I think the best way would be to let the background process write a 
bunch of pages, then fsync() the files written to. If one tends to have 
many dirty buffers to the same file, this will group them together and 
the OS can optimize that. If one really has completely random access, 
then there is nothing to group.

Agreed.  This might force enough stuff out to disk that the checkpoint/sync()
would be OK.  Jan, have you tested this?

As said, not using fsync() but sync() at that place. This only makes a 
real difference when you're not running PostgreSQL on a dedicated 
server. And yes, it really works well.

Jan

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #

Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

Bruce Momjian wrote:

Jan Wieck wrote:
  If the background cleaner has to not just write() but write/fsync or
  write/O_SYNC, it isn't going to be able to clean them fast enough.  It
  creates a bottleneck where we didn't have one before.
  
  We are trying to eliminate an I/O storm during checkpoint, but the
  solutions seem to be making the non-checkpoint times slower.
  
 
 It looks as if you're assuming that I am making the backends unable to 
 write on their own, so that they have to wait on the checkpointer. I 
 never said that.
 
 Maybe I missed it, but are those backends now doing write or write/fsync? 
 If the former, that is fine.  If the latter, it does seem slower than it
 used to be.

In my all_performance.v4.diff they do write and the checkpointer does 
write+sync.

Again, sorry to be confusing --- it might be good to try write/fsync from
the background writer if backends can do writes on their own too without
fsync.  The additional fsync from the background writer should reduce
disk writing during sync().  (The fsync should happen with the buffer
unlocked.)

No, you're not. But thank you for suggesting what I implemented.

Jan

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #

Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

Jan Wieck wrote:
 Bruce Momjian wrote:
 
  Jan Wieck wrote:
If the background cleaner has to not just write() but write/fsync or
write/O_SYNC, it isn't going to be able to clean them fast enough.  It
creates a bottleneck where we didn't have one before.

We are trying to eliminate an I/O storm during checkpoint, but the
solutions seem to be making the non-checkpoint times slower.

   
   It looks as if you're assuming that I am making the backends unable to 
   write on their own, so that they have to wait on the checkpointer. I 
   never said that.
   
   Maybe I missed it, but are those backends now doing write or write/fsync? 
   If the former, that is fine.  If the latter, it does seem slower than it
   used to be.
  
  In my all_performance.v4.diff they do write and the checkpointer does 
  write+sync.
  
  Again, sorry to be confusing --- it might be good to try write/fsync from
  the background writer if backends can do writes on their own too without
  fsync.  The additional fsync from the background writer should reduce
  disk writing during sync().  (The fsync should happen with the buffer
  unlocked.)
 
 No, you're not. But thank you for suggesting what I implemented.

OK, I did IM with Jan and I understand now --- he is using write/sync
for testing, but plans to allow several ways to force writes to disk
occasionally, probably defaulting to fsync on most platforms.  Backends
will still use write only, and a checkpoint will continue using sync().

The question still open is whether we can push most/all writes into the
background writer so we can use fsync/open instead of sync.  My point
has been that this might be hard to do with the same performance we have
now.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

Jan Wieck wrote:
 Bruce Momjian wrote:
 
  Jan Wieck wrote:
  Zeugswetter Andreas SB SD wrote:
  
One problem with O_SYNC would be, that the OS does not group writes any 
more. So the code would need to either do its own sorting and grouping
(256k) or use aio, or you won't be able to get the maximum out of the disks.
   
   Or just run multiple writer processes, which I believe is Oracle's
   solution.
   
   That does not help, since for O_SYNC the OS'es (those I know) do not group 
   those 
   writes together. Oracle allows more than one writer to busy more than one 
   disk(subsystem) and circumvent other per process limitations (mainly on 
   platforms without AIO). 
  
  Yes, I think the best way would be to let the background process write a 
  bunch of pages, then fsync() the files written to. If one tends to have 
  many dirty buffers to the same file, this will group them together and 
  the OS can optimize that. If one really has completely random access, 
  then there is nothing to group.
  
  Agreed.  This might force enough stuff out to disk the checkpoint/sync()
  would be OK.  Jan, have you tested this?
  
 
 As said, not using fsync() but sync() at that place. This only makes a 
 real difference when you're not running PostgreSQL on a dedicated 
 server. And yes, it really works well.

I talked to Jan about this.  Basically, for testing, if sync decreases
the checkpoint load, fsync/O_SYNC should do even better, hopefully, once
he has that implemented.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

Andrew Sullivan wrote:

On Sun, Nov 09, 2003 at 08:54:25PM -0800, Joe Conway wrote:
two servers, mounted to the same data volume, and some kind of 
coordination between the writer processes. Anyone know if this is 
similar to how Oracle handles RAC?

It is similar, yes, but there's some mighty powerful magic in that
some kind of co-ordination.  What do you do when one of the
participants crashes, for instance?  

What about sympathetic crash?

Jan

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #

Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

2003-11-10 Thread Tatsuo Ishii

I can get the open/fsync/write/close patch from SRA released, I think.
Let me ask them now.

I will ask my boss then come back with the result.

I just looked at the sync() call in the code and it just did _flushall:

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vccore98/html/_crt__flushall.asp

I can share this because I know it was discussed when someone (SRA?)
realized _commit() didn't force all buffers to disk. In fact, _commit
is fsync().

In SRA's code, the checkpoint thread opens each file that has been
written (if it's not already open, of course) and then fsync()s it.
--
Tatsuo Ishii

Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

2003-11-10 Thread Shridhar Daithankar

On Tuesday 11 November 2003 00:50, Neil Conway wrote:
 Jan Wieck [EMAIL PROTECTED] writes:
  We can't resize shared memory because we allocate the whole thing in
  one big hump - which causes the shmmax problem BTW. If we allocate
  that in chunks of multiple blocks, we only have to give it a total
  maximum size to get the hash tables and other stuff right from the
  beginning. But the vast majority of memory, the buffers themselves, can
  be made adjustable at runtime.

 Yeah, writing a palloc()-style wrapper over shm has been suggested
 before (by myself among others). You could do the shm allocation in
 fixed-size blocks (say, 1 MB each), and then do our own memory
 management to allocate and release smaller chunks of shm when
 requested. I'm not sure what it really buys us, though: sure, we can
 expand the shared buffer area to some degree, but

Thinking of it, it can be put as follows. Postgresql needs shared memory 
between all the backends. 

If the parent postmaster mmaps anonymous memory segments and shares them with 
children, postgresql wouldn't be dependent upon any kernel resource (aka 
shared memory) anymore.

Furthermore, the parent postmaster can allocate different anonymous mappings for 
different databases. In addition to a postgresql buffer manager overhaul, this 
would make things a lot better.

note that I am not suggesting mmap to maintain files on disk. So I guess that 
should be OK. 
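
A minimal sketch of what sharing an anonymous mapping with a forked child looks like, assuming a Unix system with fork(). The share_with_child function is invented for this example; the parent plays the postmaster and the child plays a backend. In Python, mmap.mmap(-1, size) creates an anonymous mapping that is shared across fork() by default on Unix.

```python
import mmap
import os

def share_with_child(size=mmap.PAGESIZE):
    """Create an anonymous shared mapping, let a forked child write to it,
    and return what the parent sees afterwards."""
    shared = mmap.mmap(-1, size)   # anonymous, MAP_SHARED by default on Unix
    pid = os.fork()
    if pid == 0:
        # Child ("backend"): modify the shared region and exit immediately.
        shared[:5] = b"dirty"
        os._exit(0)
    # Parent ("postmaster"): wait for the child, then read its change.
    os.waitpid(pid, 0)
    data = bytes(shared[:5])
    shared.close()
    return data
```

Note that such a mapping has to exist before the fork: a mapping created later in one process is not visible to its siblings, so this alone does not make the shared area resizable at runtime.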

I tried searching for mmap on hackers. The threads seem to be very old, one 
from 1998. With so many proposals for rewriting core stuff, does this have any 
chance?

 Just a thought.

 Shridhar


Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

2003-11-09 Thread Greg Stark

Manfred Spraul [EMAIL PROTECTED] writes:

 Greg Stark wrote:
 
 I'm assuming fsync syncs writes issued by other processes on the same file,
 which isn't necessarily true though.
 
 It was already pointed out that we can't rely on that assumption.
 
 
 So the NetBSD and Sun developers I checked with both asserted fsync does in
 fact guarantee this. And SUSv2 seems to back them up:
 

 At least Linux had one problem: fsync() syncs the inode to disk, but not the
 directory entry: if you rename a file, open it, write to it, fsync, and the
 computer crashes, then it's not guaranteed that the file rename is on the disk.
 I think only the old ext2 is affected, not the journaling filesystems.

That's true. But why would postgres ever have to worry about renamed
files being synced? Tables and indexes don't get their files renamed
typically. WAL logs?
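
For the cases where a rename does have to survive a crash, the usual workaround on Linux is to fsync() the containing directory after the rename. The sketch below assumes a Unix system where a directory can be opened read-only and passed to fsync(); durable_rename is a name invented for this example, not an existing PostgreSQL function.

```python
import os

def durable_rename(src, dst):
    """Rename src to dst and flush the directory entry to disk.

    fsync() on the file itself covers its data and inode, but not
    necessarily the directory entry, so the directory is synced too.
    Assumes src and dst live in the same directory.
    """
    os.rename(src, dst)
    dir_fd = os.open(os.path.dirname(os.path.abspath(dst)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```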

-- 
greg


Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

2003-11-09 Thread Tom Lane

Greg Stark [EMAIL PROTECTED] writes:
 Tom Lane [EMAIL PROTECTED] writes:
 You want to find, open, and fsync() every file in the database cluster
 for every checkpoint?  Sounds like a non-starter to me.

 Except a) this is outside any critical path, and b) only done every few
 minutes and c) the fsync calls on files with no dirty buffers ought to be
 cheap, at least as far as i/o.

The directory search and opening of the files is in itself nontrivial
overhead ... particularly on systems where open(2) isn't speedy, such
as Solaris.  I also disbelieve your assumption that fsync'ing a file
that doesn't need it will be free.  That depends entirely on what sort
of indexes the OS keeps on its buffer cache.  There are Unixen where
fsync requires a scan through the entire buffer cache because there is
no data structure that permits finding associated buffers any more
efficiently than that.  (IIRC, the HPUX system I'm typing this on is
like that.)  On those sorts of systems, we'd be way better off to use
O_SYNC or O_DSYNC on all our writes than to invoke multiple fsyncs.
Check the archives --- this was all gone into in great detail when we
were testing alternative methods for fsyncing the WAL files.

 So the NetBSD and Sun developers I checked with both asserted fsync does in
 fact guarantee this. And SUSv2 seems to back them up:

 The fsync() function can be used by an application to indicate that all
 data for the open file description named by fildes is to be transferred to
 the storage device associated with the file described by fildes in an
 implementation-dependent manner.

The question here is what is meant by data for the open file
description.  If it said all data for the file referenced by the open
FD then I would agree that the spec says what you claim.  As is, I
think it would be entirely within the spec for the OS to dump only
buffers that had been dirtied through that particular FD.  Notice that
the last part of the sentence is careful to respect the distinction
between the FD and the file; why isn't the first part?

regards, tom lane


Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

2003-11-09 Thread Manfred Spraul

Greg Stark wrote:

I'm assuming fsync syncs writes issued by other processes on the same file,
which isn't necessarily true though.
 

It was already pointed out that we can't rely on that assumption.
   

So the NetBSD and Sun developers I checked with both asserted fsync does in
fact guarantee this. And SUSv2 seems to back them up:
 

At least Linux had one problem: fsync() syncs the inode to disk, but not 
the directory entry: if you rename a file, open it, write to it, fsync, 
and the computer crashes, then it's not guaranteed that the file rename 
is on the disk.
I think only the old ext2 is affected, not the journaling filesystems.
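
For reference, the usual way to close the gap described above is to fsync the containing directory after the rename. The following is a minimal sketch, assuming a POSIX system where a directory can be opened read-only and passed to fsync(); the names durable_rename and demo and the file names are made up for illustration and are not taken from PostgreSQL.

```python
import os
import tempfile


def durable_rename(src, dst):
    """Rename src to dst, then fsync the directory that holds dst (POSIX only)."""
    os.rename(src, dst)
    # fsync on the file covers its data and inode; the directory entry created
    # by the rename is flushed by fsyncing the directory itself.
    dir_fd = os.open(os.path.dirname(os.path.abspath(dst)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


def demo():
    """Write a file, fsync it, rename it durably; report whether the rename is visible."""
    with tempfile.TemporaryDirectory() as d:
        old = os.path.join(d, "seg.tmp")      # hypothetical file names
        new = os.path.join(d, "seg.0001")
        fd = os.open(old, os.O_WRONLY | os.O_CREAT, 0o600)
        try:
            os.write(fd, b"payload")
            os.fsync(fd)                      # file data and inode
        finally:
            os.close(fd)
        durable_rename(old, new)              # directory entry
        return os.path.exists(new) and not os.path.exists(old)
```

The check in demo() only confirms that the rename is visible to the running system; whether it survives a crash depends on the filesystem, which is the point under discussion.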

--
   Manfred

Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

2003-11-09 Thread Stephen

The delay patch worked so well, I couldn't resist asking if a similar patch
could be added for the COPY command (pg_dump). It's just an extension of the
same idea. On a large DB, backups can take a very long time while consuming a
lot of I/O, slowing down other select and write operations. We operate on a
backup window during the low-traffic period at night. It would be nice to be
able to run pg_dump *anytime* and no longer need to worry about the backup
window. Backups will take longer to run, but as in the case of VACUUM, it's a
win for many people to be able to let it run in the background through the
whole day. The delay should be optional and default to zero so those who
wish to back up immediately can still do it. The way I see it, routine
backups and vacuums should be ubiquitous once properly configured.

Regards,

Stephen


Tom Lane [EMAIL PROTECTED] wrote in message
news:[EMAIL PROTECTED]
 Jan Wieck [EMAIL PROTECTED] writes:
  I am currently looking at implementing ARC as a replacement strategy. I
  don't have anything that works yet, so I can't really tell what the
  result would be and it might turn out that we want both features.

 It's likely that we would.  As someone (you?) already pointed out,
 VACUUM has bad side-effects both in terms of cache flushing and in
 terms of sheer I/O load.  Those effects require different fixes AFAICS.

 One thing that bothers me here is that I don't see how adjusting our
 own buffer replacement strategy is going to do much of anything when
 we cannot control the kernel's buffer replacement strategy.  To get any
 real traction we'd have to go back to the take over most of RAM for
 shared buffers approach, which we already know to have a bunch of
 severe disadvantages.

 regards, tom lane


Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

2003-11-09 Thread Jan Wieck

Tom Lane wrote:

 Jan Wieck [EMAIL PROTECTED] writes:

  How I can see the background writer operating is that he's keeping the 
  buffers in the order of the LRU chain(s) clean, because those are the 
  buffers that most likely get replaced soon. In my experimental ARC code 
  it would traverse the T1 and T2 queues from LRU to MRU, write out n1 and 
  n2 dirty buffers (n1+n2 configurable), then fsync all files that have 
  been involved in that, nap depending on where he got down the queues (to 
  increase the write rate when running low on clean buffers), and do it 
  all over again.

 You probably need one more knob here: how often to issue the fsyncs.
 I'm not convinced once per outer loop is a sufficient answer.
 Otherwise this is sounding pretty good.

This is definitely heading in the right direction.

I currently have a crude and ugly hacked system that does checkpoints 
every minute but stretches them out over the whole time. It writes out 
the dirty buffers in T1+T2 LRU order intermixed, stretches out the flush 
over the whole checkpoint interval and does sync()+usleep() every 32 
blocks (if it has time to do this).

This is clearly the wrong way to implement it, but ...

The same system has ARC and delayed vacuum. With normal, unmodified 
checkpoints every 300 seconds, the transaction response time for 
new_order still peaks at over 30 seconds (5 is already too much) so the 
system basically comes to a freeze during a checkpoint.

Now with this high-frequency sync()ing and checkpointing by the minute, 
the entire system load levels out really nicely. Basically it's constantly 
checkpointing. So maybe the thing we're looking for is to make the 
checkpoint process the background buffer writer process and let it 
checkpoint 'round the clock. Of course, with a bit more selectivity on 
what to fsync and not doing system wide sync() every 10-500 milliseconds :-)
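
The background-writer round quoted at the top of this message can be sketched roughly as follows. This is only an illustration under stated assumptions: a single ordered dictionary stands in for the T1 and T2 queues, writer_round and nap_seconds are hypothetical names, and the nap policy (a shorter nap when the dirty buffers were found close to the LRU end) is one reading of "nap depending on where he got down the queues", not the code from the experimental patch.

```python
import os
from collections import OrderedDict


def writer_round(lru, fds, max_writes, page_size=8192):
    """Write up to max_writes dirty pages starting at the LRU end, then fsync.

    lru maps (file_name, page_no) -> [data, dirty], ordered from LRU to MRU.
    fds maps file_name -> an open, writable file descriptor.
    Returns (pages_written, scan_depth), where scan_depth is the number of
    buffers examined before the write quota was reached.
    """
    touched = set()
    written = 0
    depth = 0
    for (name, page_no), entry in lru.items():
        if written >= max_writes:
            break
        depth += 1
        data, dirty = entry
        if dirty:
            os.pwrite(fds[name], data, page_no * page_size)
            entry[1] = False               # the buffer is clean now
            touched.add(name)
            written += 1
    for name in touched:                   # one fsync per file involved in this round
        os.fsync(fds[name])
    return written, depth


def nap_seconds(depth, queue_len, max_nap=0.5):
    """Assumed policy: a shallow scan means few clean buffers at the LRU end,
    so nap for a shorter time and raise the write rate."""
    if queue_len == 0:
        return max_nap
    return max_nap * depth / queue_len
```

A driver loop would call writer_round(), sleep for nap_seconds(), and repeat. How often to issue the fsyncs is the extra knob Tom asks for above; here it is fixed at once per round.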

Jan

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #

Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

Tom Lane wrote:
 Greg Stark [EMAIL PROTECTED] writes:
  Tom Lane [EMAIL PROTECTED] writes:
  You want to find, open, and fsync() every file in the database cluster
  for every checkpoint?  Sounds like a non-starter to me.
 
  Except a) this is outside any critical path, and b) only done every few
  minutes and c) the fsync calls on files with no dirty buffers ought to be
  cheap, at least as far as i/o.
 
 The directory search and opening of the files is in itself nontrivial
 overhead ... particularly on systems where open(2) isn't speedy, such
 as Solaris.  I also disbelieve your assumption that fsync'ing a file
 that doesn't need it will be free.  That depends entirely on what sort
 of indexes the OS keeps on its buffer cache.  There are Unixen where
 fsync requires a scan through the entire buffer cache because there is
 no data structure that permits finding associated buffers any more
 efficiently than that.  (IIRC, the HPUX system I'm typing this on is
 like that.)  On those sorts of systems, we'd be way better off to use
 O_SYNC or O_DSYNC on all our writes than to invoke multiple fsyncs.
 Check the archives --- this was all gone into in great detail when we
 were testing alternative methods for fsyncing the WAL files.

Not sure on this one --- let's look at our options:
O_SYNC
fsync
sync

Now, O_SYNC is going to force every write to the disk.  If we have a
transaction that has to write lots of buffers (has to write them to
reuse the shared buffer), it will have to wait for every buffer to hit
disk before the write returns --- this seems terrible to me and gives
the drive no way to group adjacent writes.  Even on HPUX, which has poor
fsync dirty buffer detection, if the fsync is outside the main
processing loop (checkpoint process), isn't fsync better than O_SYNC? 
Now, if we are sure that writes will happen only in the checkpoint
process, O_SYNC would be OK, I guess, but will we ever be sure of that?
I can't imagine a checkpoint process keeping up with lots of active
backends, especially if the writes use O_SYNC.  The problem is that
instead of having backends write everything to kernel buffers, we are all
of a sudden forcing all writes of dirty buffers to disk.  sync() starts
to look very attractive compared to that option.

fsync is better in that we can force it after a number of writes, and
can delay it, so we can write a buffer and reuse it, then later issue
the fsync.  That is a win, though it doesn't allow the drive to group
adjacent writes in different files.  Sync of course allows grouping of
all writes by the drive, but writes all non-PostgreSQL dirty buffers
too.  Ideally, we would have an fsync() where we could pass it a list of
our files and it would do all of them optimally.

From what I have heard so far, sync() still seems like the most
efficient method.  I know it only schedules the writes, but with a sleep
after it, it seems like maybe the best bet.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073


Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

Tom Lane wrote:
 Jan Wieck [EMAIL PROTECTED] writes:
  What still needs to be addressed is the IO storm caused by checkpoints. I 
  see it much relaxed when stretching out the BufferSync() over most of 
  the time until the next one should occur. But the kernel sync at its 
  end still pushes the system hard against the wall.
 
 I have never been happy with the fact that we use sync(2) at all.  Quite
 aside from the I/O storm issue, sync() is really an unsafe way to do a
 checkpoint, because there is no way to be certain when it is done.  And
 on top of that, it does too much, because it forces syncing of files
 unrelated to Postgres.
 
 I would like to see us go over to fsync, or some other technique that
 gives more certainty about when the write has occurred.  There might be
 some scope that way to allow stretching out the I/O, too.
 
 The main problem with this is knowing which files need to be fsync'd.
 The only idea I have come up with is to move all buffer write operations
 into a background writer process, which could easily keep track of
 every file it's written into since the last checkpoint.  This could cause
 problems though if a backend wants to acquire a free buffer and there's
 none to be had --- do we want it to wait for the background process to
 do something?  We could possibly say that backends may write dirty
 buffers for themselves, but only if they fsync them immediately.  As
 long as this path is seldom taken, the extra fsyncs shouldn't be a big
 performance problem.
 
 Actually, once you build it this way, you could make all writes
 synchronous (open the files O_SYNC) so that there is never any need for
 explicit fsync at checkpoint time.  The background writer process would
 be the one incurring the wait in most cases, and that's just fine.  In
 this way you could directly control the rate at which writes are issued,
 and there's no I/O storm at all.  (fsync could still cause an I/O storm
 if there's lots of pending writes in a single file.)

This outlines the same issue --- a very active backend might dirty 5k
buffers --- if those 5k buffers have to be written using O_SYNC, it will
take much longer than doing 5k buffer writes and doing an fsync() or
sync() at the end.

Having another process do the writing does allow some parallelism, but
people don't seem to care about buffers having to be read in from the
kernel buffer cache. So what big benefit do we get by having someone
else write into the kernel buffer cache, except allowing a central place
to fsync? And is it worth it, considering that it might be impossible to
configure a system where the writer process can keep up with all the
backends?

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073


Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

scott.marlowe wrote:
 On Tue, 4 Nov 2003, Tom Lane wrote:
 
  Jan Wieck [EMAIL PROTECTED] writes:
   What still needs to be addressed is the IO storm caused by checkpoints. I 
   see it much relaxed when stretching out the BufferSync() over most of 
   the time until the next one should occur. But the kernel sync at its 
   end still pushes the system hard against the wall.
  
  I have never been happy with the fact that we use sync(2) at all.  Quite
  aside from the I/O storm issue, sync() is really an unsafe way to do a
  checkpoint, because there is no way to be certain when it is done.  And
  on top of that, it does too much, because it forces syncing of files
  unrelated to Postgres.
  
  I would like to see us go over to fsync, or some other technique that
  gives more certainty about when the write has occurred.  There might be
  some scope that way to allow stretching out the I/O, too.
  
  The main problem with this is knowing which files need to be fsync'd.
 
 Wasn't this a problem that the win32 port had to solve by keeping a list 
 of all files that need fsyncing since Windows doesn't do sync() in the 
 classical sense?  If so, then could we use that code to keep track of the 
 files that need fsyncing?

Yes, I have that code from SRA.  They used threading, so they recorded
all the open files in local memory and opened/fsync/closed them for
checkpoints.  We have to store the file names in a shared area, perhaps
an area of shared memory with an overflow to a disk file.
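
The bookkeeping described above could look roughly like the sketch below. It is not the SRA code: PendingFsyncs is a hypothetical helper, and a plain in-process set stands in for the shared-memory area (with its overflow file) that real backends would need. It also assumes that fsync() on a freshly opened descriptor flushes writes issued through other descriptors, which is exactly the guarantee being questioned elsewhere in this thread.

```python
import os


class PendingFsyncs:
    """Remember which files were written since the last checkpoint (hypothetical helper)."""

    def __init__(self):
        self._paths = set()

    def note_write(self, path):
        """Called by whoever writes a dirty buffer into the file at path."""
        self._paths.add(path)

    def checkpoint(self):
        """Open, fsync and close every remembered file; return how many were synced."""
        paths, self._paths = self._paths, set()
        for path in sorted(paths):
            fd = os.open(path, os.O_RDWR)
            try:
                os.fsync(fd)
            finally:
                os.close(fd)
        return len(paths)
```

Repeated writes to the same file collapse into one entry, so the checkpoint issues one fsync per file rather than one per buffer.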

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073


Re: [HACKERS] Experimental patch for inter-page delay in VACUUM


I would be interested to know if you have the background write process
writing old dirty buffers to kernel buffers continually if the sync()
load is diminished.  What this does is to push more dirty buffers into
the kernel cache in hopes the OS will write those buffers on its own
before the checkpoint does its write/sync work.  This might allow us to
reduce sync() load while preventing the need for O_SYNC/fsync().

Perhaps sync() is bad partly because the checkpoint runs through all the
dirty shared buffers and writes them all to the kernel and then issues
sync() almost guaranteeing a flood of writes to the disk.  This method
would find fewer dirty buffers in the shared buffer cache, and therefore
fewer kernel writes needed by sync().

---

Jan Wieck wrote:
 Tom Lane wrote:
 
  Jan Wieck [EMAIL PROTECTED] writes:
  
  How I can see the background writer operating is that he's keeping the 
  buffers in the order of the LRU chain(s) clean, because those are the 
  buffers that most likely get replaced soon. In my experimental ARC code 
  it would traverse the T1 and T2 queues from LRU to MRU, write out n1 and 
  n2 dirty buffers (n1+n2 configurable), then fsync all files that have 
  been involved in that, nap depending on where he got down the queues (to 
  increase the write rate when running low on clean buffers), and do it 
  all over again.
  
  You probably need one more knob here: how often to issue the fsyncs.
  I'm not convinced once per outer loop is a sufficient answer.
  Otherwise this is sounding pretty good.
 
 This is definitely heading in the right direction.
 
 I currently have a crude and ugly hacked system that does checkpoints 
 every minute but stretches them out over the whole time. It writes out 
 the dirty buffers in T1+T2 LRU order intermixed, stretches out the flush 
 over the whole checkpoint interval and does sync()+usleep() every 32 
 blocks (if it has time to do this).
 
 This is clearly the wrong way to implement it, but ...
 
 The same system has ARC and delayed vacuum. With normal, unmodified 
 checkpoints every 300 seconds, the transaction response time for 
 new_order still peaks at over 30 seconds (5 is already too much) so the 
 system basically comes to a freeze during a checkpoint.
 
 Now with this high-frequency sync()ing and checkpointing by the minute, 
 the entire system load levels out really nicely. Basically it's constantly 
 checkpointing. So maybe the thing we're looking for is to make the 
 checkpoint process the background buffer writer process and let it 
 checkpoint 'round the clock. Of course, with a bit more selectivity on 
 what to fsync and not doing system wide sync() every 10-500 milliseconds :-)
 
 
 Jan
 

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073


Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

Tom Lane wrote:
 Jan Wieck [EMAIL PROTECTED] writes:
  That is part of the idea. The whole idea is to issue physical writes 
  at a fairly steady rate without increasing the number of them 
  substantially or interfering with the drive's opinion about their order too 
  much. I think O_SYNC for random access can be in conflict with write 
  reordering.
 
 Good point.  But if we issue lots of writes without fsync then we still
 have the problem of a write storm when the fsync finally occurs, while
 if we fsync too often then we constrain the write order too much.  There
 will need to be some tuning here.

I know the BSDs have trickle sync --- if we write the dirty buffers to
kernel buffers many seconds before our checkpoint, the kernel might
write them to disk for us and sync() will not need to do it.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073


Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

Tom Lane wrote:
 Andrew Sullivan [EMAIL PROTECTED] writes:
  On Sun, Nov 02, 2003 at 01:00:35PM -0500, Tom Lane wrote:
  real traction we'd have to go back to the take over most of RAM for
  shared buffers approach, which we already know to have a bunch of
  severe disadvantages.
 
  I know there are severe disadvantages in the current implementation,
  but are there in-principle severe disadvantages?
 
 Yes.  For one, since we cannot change the size of shared memory
 on-the-fly (at least not portably), there is no opportunity to trade off
 memory usage dynamically between processes and disk buffers.  For
 another, on many systems shared memory is subject to being swapped out.
 Swapping out dirty buffers is a performance killer, because they must be
 swapped back in again before they can be written to where they should
 have gone.  The only way to avoid this is to keep the number of shared
 buffers small enough that they all remain fairly hot (recently used)
 and so the kernel won't be tempted to swap out any part of the region.

Agreed, we can't resize shared memory, but I don't think most OSes swap
out shared memory, and even if they do, they usually have a kernel
configuration parameter to lock it into kernel memory.  All the old
unixes locked the shared memory into kernel address space and in fact
this is why many of them required a kernel recompile to increase shared
memory.  I hope the ones that have pageable shared memory have a way to
prevent it --- at least FreeBSD does, not sure about Linux.

Now, the disadvantages of a large kernel cache and a small PostgreSQL
buffer cache are, first, that data has to be transferred to/from the kernel
buffers, and second, that we can't control the kernel's cache replacement
strategy, and will probably not be able to in the near future, while we do
control our own buffer cache replacement strategy.

Looking at the advantages/disadvantages, a large shared buffer cache
looks pretty good to me.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073


Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

2003-11-09 Thread Joe Conway

Bruce Momjian wrote:
 Having another process do the writing does allow some parallelism, but
 people don't seem to care about buffers having to be read in from the
 kernel buffer cache. So what big benefit do we get by having someone
 else write into the kernel buffer cache, except allowing a central place
 to fsync? And is it worth it, considering that it might be impossible to
 configure a system where the writer process can keep up with all the
 backends?

This might be far fetched, but I wonder if having a writer process opens 
up the possibility of running PostgreSQL in a cluster? I'm thinking of 
two servers, mounted to the same data volume, and some kind of 
coordination between the writer processes. Anyone know if this is 
similar to how Oracle handles RAC?

Joe


Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

2003-11-09 Thread Joe Conway

Bruce Momjian wrote:
 Agreed, we can't resize shared memory, but I don't think most OSes swap
 out shared memory, and even if they do, they usually have a kernel
 configuration parameter to lock it into kernel memory.  All the old
 unixes locked the shared memory into kernel address space and in fact
 this is why many of them required a kernel recompile to increase shared
 memory.  I hope the ones that have pageable shared memory have a way to
 prevent it --- at least FreeBSD does, not sure about Linux.

I'm pretty sure at least Linux, Solaris, and HPUX all work this way -- 
otherwise Oracle would have the same problem with their SGA, which is 
kept in shared memory.

Joe


Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

2003-11-05 Thread Zeugswetter Andreas SB SD


  The only idea I have come up with is to move all buffer write operations
  into a background writer process, which could easily keep track of
  every file it's written into since the last checkpoint.  
 
 I fear this approach. It seems to limit a lot of design flexibility later. But
 I can't come up with any concrete way it limits things so perhaps that
 instinct is just fud.

A lot of modern disk subsystems can only be saturated with more than one parallel 
IO request. So it would at least need a tuneable number of parallel writer processes,
or one writer that uses AIO to dump all outstanding IO requests out at once.
(Optimal would be all, in reality it would need to be batched into groups of 
n pages, since most systems have a max aio request queue size, e.g. 8192).
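
As a rough illustration of the batching idea, the sketch below splits the outstanding pages into groups of at most n and keeps several writes in flight per group. It is a stand-in only: a thread pool substitutes for either AIO or multiple writer processes, and batches, flush_parallel and the default sizes are names and values chosen for this example.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def batches(items, n):
    """Yield consecutive groups of at most n items."""
    for start in range(0, len(items), n):
        yield items[start:start + n]


def flush_parallel(fd, dirty_pages, workers=4, batch_size=8192, page_size=8192):
    """Write (page_no, data) pairs with several writes in flight, one batch at a time.

    Returns the number of pages written.
    """
    written = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for group in batches(dirty_pages, batch_size):
            futures = [
                pool.submit(os.pwrite, fd, data, page_no * page_size)
                for page_no, data in group
            ]
            for future in futures:
                future.result()            # wait for the batch and re-raise any error
                written += 1
    os.fsync(fd)                           # single flush once everything is queued
    return written
```

Using positional writes (pwrite) lets the workers share one descriptor without racing on a common file offset.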

Andreas


Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

2003-11-04 Thread Zeugswetter Andreas SB SD


 Or...  It seems to me that we have been observing something on the order 
 of 10x-20x slowdown for vacuuming a table.  I think this is WAY 
 overcompensating for the original problems, and would cause its own 
 problem as mentioned above.   Since the granularity of delay seems to be 
 the problem can we do more work between delays? Instead of sleeping 
 after every page (I assume this is what it's doing) perhaps we should 
 sleep every 10 pages,

I also think doing more than one page per sleep is advantageous since
it would still allow the OS to do its readahead optimizations.
I suspect those would fall flat if only one page is fetched per sleep.

Andreas


Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

2003-11-04 Thread Andreas Pflug

Zeugswetter Andreas SB SD wrote:

Or...  It seems to me that we have been observing something on the order 
of 10x-20x slowdown for vacuuming a table.  I think this is WAY 
overcompensating for the original problems, and would cause its own 
problem as mentioned above.   Since the granularity of delay seems to be 
the problem can we do more work between delays? Instead of sleeping 
after every page (I assume this is what it's doing) perhaps we should 
sleep every 10 pages,
   

I also think doing more than one page per sleep is advantageous since
it would still allow the OS to do its readahead optimizations.
I suspect those would fall flat if only one page is fetched per sleep.
 

So maybe the setting shouldn't be "n ms wait between vacuum actions" but 
"vacuum pages to handle before sleeping 10 ms".
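
A setting of that shape could behave like the sketch below. It is illustrative only: throttled_scan and its parameters are invented for this example, and the process callback stands in for whatever VACUUM does with a page.

```python
import time


def throttled_scan(pages, pages_per_nap, nap_seconds=0.010,
                   sleep=time.sleep, process=lambda page: None):
    """Process pages, sleeping once after every pages_per_nap pages.

    pages_per_nap = 0 disables the delay. Returns the number of naps taken.
    """
    naps = 0
    for count, page in enumerate(pages, start=1):
        process(page)
        if pages_per_nap > 0 and count % pages_per_nap == 0:
            sleep(nap_seconds)
            naps += 1
    return naps
```

With pages_per_nap = 10 and a 10 ms nap, the average added delay is 1 ms per page, while the OS still sees runs of ten consecutive reads between naps.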

Regards,
Andreas



Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

Jan Wieck [EMAIL PROTECTED] writes:
 What still needs to be addressed is the IO storm caused by checkpoints. I 
 see it much relaxed when stretching out the BufferSync() over most of 
 the time until the next one should occur. But the kernel sync at its 
 end still pushes the system hard against the wall.

I have never been happy with the fact that we use sync(2) at all.  Quite
aside from the I/O storm issue, sync() is really an unsafe way to do a
checkpoint, because there is no way to be certain when it is done.  And
on top of that, it does too much, because it forces syncing of files
unrelated to Postgres.

I would like to see us go over to fsync, or some other technique that
gives more certainty about when the write has occurred.  There might be
some scope that way to allow stretching out the I/O, too.

The main problem with this is knowing which files need to be fsync'd.
The only idea I have come up with is to move all buffer write operations
into a background writer process, which could easily keep track of
every file it's written into since the last checkpoint.  This could cause
problems though if a backend wants to acquire a free buffer and there's
none to be had --- do we want it to wait for the background process to
do something?  We could possibly say that backends may write dirty
buffers for themselves, but only if they fsync them immediately.  As
long as this path is seldom taken, the extra fsyncs shouldn't be a big
performance problem.

Actually, once you build it this way, you could make all writes
synchronous (open the files O_SYNC) so that there is never any need for
explicit fsync at checkpoint time.  The background writer process would
be the one incurring the wait in most cases, and that's just fine.  In
this way you could directly control the rate at which writes are issued,
and there's no I/O storm at all.  (fsync could still cause an I/O storm
if there's lots of pending writes in a single file.)
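
A minimal sketch of the O_SYNC variant, assuming a single writer that paces itself with a fixed sleep between writes. paced_sync_writes and writes_per_second are hypothetical names, and a real background writer would choose its rate from buffer pressure rather than from a constant.

```python
import os
import time


def paced_sync_writes(path, pages, writes_per_second, page_size=8192, sleep=time.sleep):
    """Write (page_no, data) pairs through an O_SYNC descriptor at a fixed rate.

    Returns the number of pages written.
    """
    # With O_SYNC each write returns only after the data has been handed to
    # the device, so no fsync is needed afterwards. Where the flag is not
    # defined this falls back to 0, that is, an ordinary buffered write.
    flags = os.O_WRONLY | os.O_CREAT | getattr(os, "O_SYNC", 0)
    interval = 1.0 / writes_per_second
    fd = os.open(path, flags, 0o600)
    written = 0
    try:
        for page_no, data in pages:
            os.pwrite(fd, data, page_no * page_size)
            written += 1
            sleep(interval)                # the writer, not a backend, absorbs the wait
    finally:
        os.close(fd)
    return written
```

Because every write is synchronous, the write rate is whatever the loop makes it, which is the direct control over I/O that the paragraph above describes.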

regards, tom lane


Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

2003-11-04 Thread Andrew Dunstan

Tom Lane wrote:

Jan Wieck [EMAIL PROTECTED] writes:
 

What still needs to be addressed is the IO storm caused by checkpoints. I 
see it much relaxed when stretching out the BufferSync() over most of 
the time until the next one should occur. But the kernel sync at its 
end still pushes the system hard against the wall.
   

I have never been happy with the fact that we use sync(2) at all.  Quite
aside from the I/O storm issue, sync() is really an unsafe way to do a
checkpoint, because there is no way to be certain when it is done.  And
on top of that, it does too much, because it forces syncing of files
unrelated to Postgres.

I would like to see us go over to fsync, or some other technique that
gives more certainty about when the write has occurred.  There might be
some scope that way to allow stretching out the I/O, too.

The main problem with this is knowing which files need to be fsync'd.
The only idea I have come up with is to move all buffer write operations
into a background writer process, which could easily keep track of
every file it's written into since the last checkpoint.  This could cause
problems though if a backend wants to acquire a free buffer and there's
none to be had --- do we want it to wait for the background process to
do something?  We could possibly say that backends may write dirty
buffers for themselves, but only if they fsync them immediately.  As
long as this path is seldom taken, the extra fsyncs shouldn't be a big
performance problem.

Actually, once you build it this way, you could make all writes
synchronous (open the files O_SYNC) so that there is never any need for
explicit fsync at checkpoint time.  The background writer process would
be the one incurring the wait in most cases, and that's just fine.  In
this way you could directly control the rate at which writes are issued,
and there's no I/O storm at all.  (fsync could still cause an I/O storm
if there's lots of pending writes in a single file.)
 

Or maybe fdatasync() would be slightly more efficient - do we care about 
flushing metadata that much?
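
For illustration, a data-only flush with a fallback might be written as below. flush_data is a made-up helper name. Note that fdatasync() still has to flush metadata needed to read the data back, such as a changed file size, so the saving mainly applies to writes into already-allocated blocks.

```python
import os


def flush_data(fd):
    """Flush file data with fdatasync() where the platform has it, else fsync().

    Returns the name of the call that was used.
    """
    sync = getattr(os, "fdatasync", os.fsync)   # fdatasync is not available everywhere
    sync(fd)
    return sync.__name__
```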

cheers

andrew




Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

Jan Wieck [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 I have never been happy with the fact that we use sync(2) at all.

 Sure does it do too much. But together with the other layer of 
 indirection, the virtual file descriptor pool, what is the exact 
 guaranteed behaviour of
  write(); close(); open(); fsync();
 cross platform?

That isn't guaranteed, which is why we have to use sync() at the
moment.  To go over to fsync or O_SYNC we'd need more control over which
file descriptors are used to issue writes.  Which is why I was thinking
about moving the writes to a centralized writer process.
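
A minimal Python sketch of the bookkeeping described above, assuming a single writer process owns every data-file write. The names `TrackingWriter`, `write_page` and `checkpoint` are hypothetical and are not PostgreSQL code. The point is only that the process that issued the writes remembers which descriptors it used, so the checkpoint can fsync exactly those files and nothing else.

```python
import os


class TrackingWriter:
    """Hypothetical single writer that remembers which files it has dirtied."""

    def __init__(self):
        self._fds = {}        # path -> descriptor kept open for reuse
        self._dirty = set()   # paths written since the last checkpoint

    def write_page(self, path, offset, data):
        fd = self._fds.get(path)
        if fd is None:
            fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
            self._fds[path] = fd
        os.lseek(fd, offset, os.SEEK_SET)
        os.write(fd, data)
        self._dirty.add(path)

    def checkpoint(self):
        """fsync only the files written since the last checkpoint."""
        synced = sorted(self._dirty)
        for path in synced:
            # Same descriptor that issued the writes, so no reliance on
            # write(); close(); open(); fsync(); behaving portably.
            os.fsync(self._fds[path])
        self._dirty.clear()
        return synced

    def close(self):
        for fd in self._fds.values():
            os.close(fd)
        self._fds.clear()
```

Files that saw no writes during the interval (system catalogs, for instance) never enter the dirty set, so they cost nothing at checkpoint time.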

 Actually, once you build it this way, you could make all writes
 synchronous (open the files O_SYNC) so that there is never any need for
 explicit fsync at checkpoint time.

 Yes, but then the configuration leans more towards taking over the RAM 

Why?  The idea is to try to issue writes at a fairly steady rate, which
strikes me as much better than the current behavior.  I don't see why it
would force you to have large numbers of buffers available.  You'd want
a few thousand, no doubt, but that's not a large number.
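
A rough Python sketch of the steady-rate idea, assuming the writer opens its files with O_SYNC and paces itself. The function name and its parameters are hypothetical. `os.O_SYNC` does not exist on every platform, so the sketch falls back to a plain open there, in which case the writes are not synchronous.

```python
import os
import time


def paced_sync_writes(path, pages, pages_per_second, sleep=time.sleep):
    """Write pages through an O_SYNC descriptor at a bounded rate.

    Returns the number of pages written.  `sleep` is injectable for testing.
    """
    flags = os.O_WRONLY | os.O_CREAT | getattr(os, "O_SYNC", 0)
    fd = os.open(path, flags, 0o600)
    interval = 1.0 / pages_per_second
    written = 0
    try:
        for page in pages:
            # With O_SYNC honored, write() returns only after the data
            # has been handed to stable storage.
            os.write(fd, page)
            written += 1
            # The writer, not a later sync(), decides the I/O rate.
            sleep(interval)
    finally:
        os.close(fd)
    return written
```

Because each write completes before the next is issued, nothing accumulates in the kernel cache for a checkpoint to flush in one burst.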

regards, tom lane


Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

2003-11-04 Thread Matthew T. O'Connor

Ang Chin Han wrote:

Christopher Browne wrote:

Centuries ago, Nostradamus foresaw when Stephen 
[EMAIL PROTECTED] would write:

As it turns out, with vacuum_page_delay = 0, VACUUM took 1m20s (80 sec)
to complete; with vacuum_page_delay = 1 and vacuum_page_delay = 10,
both VACUUMs completed in 18m3s (about 1080 sec). A factor of 13!
This is for a single 350 MB table.

While it is unfortunate that the minimum quantum seems to commonly be
10ms, it doesn't strike me as an enormous difficulty from a practical
perspective.

If we can't lower the minimum quantum, we could always vacuum 2 pages
before sleeping 10ms, effectively sleeping 5ms per page.

Right, I think this is what Jan has done already.
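
The "N pages per nap" idea can be sketched as follows. This is an illustration in Python, not the patch itself; `group_size` and `delay_ms` are stand-ins for the group-size and delay settings discussed in this thread, and `process_page` is a hypothetical callback.

```python
import time


def vacuum_with_delay(num_pages, group_size, delay_ms, process_page,
                      sleep=time.sleep):
    """Process pages, napping delay_ms after every group_size pages.

    Returns the number of naps taken.  A delay of 0 disables napping.
    """
    naps = 0
    for page_no in range(num_pages):
        process_page(page_no)
        if delay_ms > 0 and (page_no + 1) % group_size == 0:
            sleep(delay_ms / 1000.0)
            naps += 1
    return naps
```

With `group_size=2` and `delay_ms=10`, ten pages cost five naps, an average of 5ms of sleep per page even though no single sleep is shorter than 10ms.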

What would be interesting would be pg_autovacuum changing these values 
per table, depending on current I/O load.

Hmmm. Looks like there's a lot of interesting things pg_autovacuum can 
do:
1. When on low I/O load, running multiple vacuums on different, 
smaller tables on full speed, careful to note that these vacuums will 
increase the I/O load as well.
2. When on high I/O load, vacuum big, busy tables slowly.

I'm not sure how practical any of this is.  How will pg_autovacuum
surmise the current I/O load of the system, keeping in mind that
postgres is not the only cause of I/O?  Also, the optimum delay for a
long-running vacuum might change dramatically while it's running.  If
there is a way to judge the current I/O load, it might be better for
vacuum to auto-tune itself while it's running, perhaps based on some
hints given to it by pg_autovacuum or manually by a user.  For example,
a delay hint of 0 should always mean zero delay no matter what.  A delay
hint of 1 would scale up more slowly than a delay hint of 2, which would
scale up more slowly than 5, etc.

Of course this is all wild conjecture if there isn't an easy way to 
surmise the system I/O load.  Thoughts?
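
One possible shape for such a hint, sketched in Python. Everything here is an assumption: `io_load` is a load estimate between 0.0 and 1.0 that something else would have to supply (the open question above), and the linear scaling and `max_scale_ms` are arbitrary choices made only to show the behaviour of the hints.

```python
def effective_delay_ms(delay_hint, io_load, max_scale_ms=10.0):
    """Scale a per-table delay hint by the current I/O load.

    delay_hint: 0 means "never sleep"; larger hints ramp up faster.
    io_load:    assumed load estimate, clamped to 0.0 (idle) .. 1.0 (saturated).
    """
    if delay_hint <= 0:
        return 0.0
    load = min(max(io_load, 0.0), 1.0)
    return delay_hint * load * max_scale_ms
```

At half load a hint of 1 gives 5ms and a hint of 2 gives 10ms, while a hint of 0 stays at zero regardless of load.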




Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

Andrew Dunstan [EMAIL PROTECTED] writes:
 Actually, once you build it this way, you could make all writes
 synchronous (open the files O_SYNC) so that there is never any need for
 explicit fsync at checkpoint time.
 
 Or maybe fdatasync() would be slightly more efficient - do we care about 
 flushing metadata that much?

We don't, but it would just obscure the discussion to spell out fsync,
or fdatasync where available ...
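
For reference, "fsync, or fdatasync where available" amounts to something like the following Python sketch. The helper name `flush_data` is hypothetical. fdatasync() may skip metadata that is not needed to read the data back (timestamps, for example), which is why it can be slightly cheaper.

```python
import os


def flush_data(fd):
    """Flush file data to disk, using fdatasync() where the platform has it.

    Returns the name of the call that was used.
    """
    sync = getattr(os, "fdatasync", os.fsync)
    sync(fd)
    return sync.__name__
```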

regards, tom lane


Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

2003-11-04 Thread Jan Wieck

Tom Lane wrote:

Jan Wieck [EMAIL PROTECTED] writes:
What still needs to be addressed is the IO storm caused by checkpoints. I
see it much relaxed when stretching out the BufferSync() over most of
the time until the next one should occur. But the kernel sync at its
end still pushes the system hard against the wall.

I have never been happy with the fact that we use sync(2) at all.  Quite
aside from the I/O storm issue, sync() is really an unsafe way to do a
checkpoint, because there is no way to be certain when it is done.  And
on top of that, it does too much, because it forces syncing of files
unrelated to Postgres.

Sure, it does too much. But together with the other layer of
indirection, the virtual file descriptor pool, what is the exact
guaranteed behaviour of

write(); close(); open(); fsync();

cross platform?


Actually, once you build it this way, you could make all writes
synchronous (open the files O_SYNC) so that there is never any need for
explicit fsync at checkpoint time.  The background writer process would
be the one incurring the wait in most cases, and that's just fine.  In
this way you could directly control the rate at which writes are issued,
and there's no I/O storm at all.  (fsync could still cause an I/O storm
if there's lots of pending writes in a single file.)

Yes, but then the configuration leans more towards taking over the RAM
again, and we had better have a much improved cache strategy before that.

Jan

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #

Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

2003-11-04 Thread Greg Stark

Jan Wieck [EMAIL PROTECTED] writes:

  vacuum_page_per_delay = 2
  vacuum_time_per_delay = 10
 
 That's exactly what I did ... look at the combined experiment posted under
 subject Experimental ARC implementation. The two parameters are named
 vacuum_page_groupsize and vacuum_page_delay.

FWIW this seems like a good idea for other reasons too: the hard drive and the
kernel are going to read multiple sequential blocks anyway, whether you sleep
on them or not. Better to read enough blocks to take advantage of the
readahead without saturating the drive, then sleep to let those buffers age
out. If you read one block and then sleep, the readahead buffers may get aged
out and have to be fetched again, which would actually increase the amount of
i/o bandwidth used.

I would expect that a much higher vacuum_page_per_delay would probably not have a
noticeable effect and would be much faster. Something like

vacuum_page_per_delay = 128
vacuum_time_per_delay = 100

Or more likely, something in-between.
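
To put rough numbers on the two settings, here is a small Python calculation of the upper bound each one places on scan throughput. It assumes the default 8 kB block size and ignores the time spent on the I/O itself, and it assumes the sleep really lasts the requested time, which the 10ms minimum quantum mentioned earlier in the thread may not allow.

```python
PAGE_SIZE = 8192  # assumption: default 8 kB block size


def max_scan_rate_mib(pages_per_delay, delay_ms, page_size=PAGE_SIZE):
    """Upper bound on scan throughput in MiB/s for a pages-per-nap setting."""
    pages_per_second = pages_per_delay * 1000.0 / delay_ms
    return pages_per_second * page_size / (1024 * 1024)
```

Under these assumptions 2 pages per 10ms caps the scan at about 1.56 MiB/s, and 128 pages per 100ms caps it at 10 MiB/s.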

-- 
greg



Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

2003-11-04 Thread Jan Wieck

Tom Lane wrote:

Jan Wieck [EMAIL PROTECTED] writes:
Tom Lane wrote:
I have never been happy with the fact that we use sync(2) at all.

Sure, it does too much. But together with the other layer of
indirection, the virtual file descriptor pool, what is the exact
guaranteed behaviour of
 write(); close(); open(); fsync();
cross platform?

That isn't guaranteed, which is why we have to use sync() at the
moment.  To go over to fsync or O_SYNC we'd need more control over which
file descriptors are used to issue writes.  Which is why I was thinking
about moving the writes to a centralized writer process.

Actually, once you build it this way, you could make all writes
synchronous (open the files O_SYNC) so that there is never any need for
explicit fsync at checkpoint time.

Yes, but then the configuration leans more towards taking over the RAM

Why?  The idea is to try to issue writes at a fairly steady rate, which
strikes me as much better than the current behavior.  I don't see why it
would force you to have large numbers of buffers available.  You'd want
a few thousand, no doubt, but that's not a large number.

That is part of the idea. The whole idea is to issue physical writes
at a fairly steady rate without increasing the number of them
substantially or interfering with the drive's opinion about their order
too much. I think O_SYNC for random access can be in conflict with write
reordering.

How I see the background writer operating is that it keeps the
buffers in the order of the LRU chain(s) clean, because those are the
buffers that most likely get replaced soon. In my experimental ARC code
it would traverse the T1 and T2 queues from LRU to MRU, write out n1 and
n2 dirty buffers (n1+n2 configurable), then fsync all files that have
been involved in that, nap depending on how far down the queues it got (to
increase the write rate when running low on clean buffers), and do it
all over again.
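
The loop described above might look roughly like this in Python. This is a sketch under assumptions, not the experimental ARC code: buffers are plain dicts, `write_buffer` and `sync_file` are hypothetical callbacks (the latter standing in for fsync on the file's descriptor), and the nap formula is only one reading of "nap depending on how far down the queues it got".

```python
def writer_pass(t1, t2, n1, n2, write_buffer, sync_file):
    """One pass of a hypothetical background writer over two LRU queues.

    t1, t2: lists of buffers ordered from LRU to MRU; each buffer is a
            dict with keys "file" and "dirty".
    Writes up to n1 dirty buffers from t1 and n2 from t2, then syncs
    every file involved.  Returns (buffers_written, deepest_position).
    """
    touched = set()
    written = 0
    deepest = 0
    for queue, limit in ((t1, n1), (t2, n2)):
        cleaned = 0
        for pos, buf in enumerate(queue):
            if cleaned == limit:
                break
            if buf["dirty"]:
                write_buffer(buf)
                buf["dirty"] = False
                touched.add(buf["file"])
                cleaned += 1
                deepest = max(deepest, pos)
        written += cleaned
    for name in sorted(touched):
        sync_file(name)          # stand-in for fsync() on that file
    return written, deepest


def nap_seconds(deepest, queue_len, max_nap=0.2):
    """Shorter nap when dirty buffers sit close to the LRU end."""
    if queue_len <= 1:
        return 0.0
    return max_nap * deepest / (queue_len - 1)
```

If the writer finds its quota of dirty buffers right at the LRU end, few clean buffers are left for replacement and the nap shrinks towards zero; if it has to walk far down the queue, the LRU end is mostly clean and it can sleep longer.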

That way, everyone else doing a write must issue an fsync too because 
it's not guaranteed that the fsync of one process flushes the writes of 
another. But as you said, if that is a relatively rare operation for a 
regular backend, it won't hurt.

Jan

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #

Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

2003-11-04 Thread Greg Stark


Tom Lane [EMAIL PROTECTED] writes:

 I would like to see us go over to fsync, or some other technique that
 gives more certainty about when the write has occurred.  There might be
 some scope that way to allow stretching out the I/O, too.
 
 The main problem with this is knowing which files need to be fsync'd.

Why could the postmaster not just fsync *every* file? Does any OS make it a
slow operation to fsync a file that has no pending writes? Would we even care?
It would mean the checkpoint would take longer but not actually issue any
extra i/o.

I'm assuming fsync syncs writes issued by other processes on the same file,
which isn't necessarily true though. Otherwise every process would have to
fsync every file descriptor it has open.

 The only idea I have come up with is to move all buffer write operations
 into a background writer process, which could easily keep track of
 every file it's written into since the last checkpoint.  

I fear this approach. It seems to limit a lot of design flexibility later. But
I can't come up with any concrete way it limits things so perhaps that
instinct is just FUD.

It also can become a point of contention. At least on Oracle you often need
multiple such processes to keep up with the i/o bandwidth.

 Actually, once you build it this way, you could make all writes synchronous
 (open the files O_SYNC) so that there is never any need for explicit fsync
 at checkpoint time.

Or use aio to write ahead as much as you want and then just make the checkpoint
block until all the writes are completed. You don't actually need to rush them
at all, just know when they're done. That would completely eliminate the i/o
storm without changing the actual pattern of writes at all.
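
The shape of that idea, sketched in Python. The standard library has no POSIX aio interface, so a thread pool is swapped in here as a stand-in; the class and method names are hypothetical, and "completed" is assumed to mean that the write and an fsync of its descriptor have both returned. `os.pwrite` is Unix-only.

```python
import os
from concurrent.futures import ThreadPoolExecutor, wait


class WriteAheadQueue:
    """Issue writes in the background; a checkpoint only waits for them."""

    def __init__(self, workers=2):
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._pending = []

    @staticmethod
    def _write_sync(fd, offset, data):
        os.pwrite(fd, data, offset)   # positional write, safe across threads
        os.fsync(fd)
        return len(data)

    def submit_write(self, fd, offset, data):
        future = self._pool.submit(self._write_sync, fd, offset, data)
        self._pending.append(future)

    def checkpoint(self):
        """Block until every write issued so far is done; return bytes written."""
        if not self._pending:
            return 0
        done, _ = wait(self._pending)
        self._pending = []
        return sum(f.result() for f in done)

    def close(self):
        self._pool.shutdown(wait=True)
```

The checkpoint issues no I/O of its own here; it only learns when the earlier writes have finished.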

-- 
greg



Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

2003-11-04 Thread scott.marlowe

On Tue, 4 Nov 2003, Tom Lane wrote:

 Jan Wieck [EMAIL PROTECTED] writes:
  What still needs to be addressed is the IO storm caused by checkpoints. I 
  see it much relaxed when stretching out the BufferSync() over most of 
  the time until the next one should occur. But the kernel sync at its 
  end still pushes the system hard against the wall.
 
 I have never been happy with the fact that we use sync(2) at all.  Quite
 aside from the I/O storm issue, sync() is really an unsafe way to do a
 checkpoint, because there is no way to be certain when it is done.  And
 on top of that, it does too much, because it forces syncing of files
 unrelated to Postgres.
 
 I would like to see us go over to fsync, or some other technique that
 gives more certainty about when the write has occurred.  There might be
 some scope that way to allow stretching out the I/O, too.
 
 The main problem with this is knowing which files need to be fsync'd.

Wasn't this a problem that the win32 port had to solve by keeping a list 
of all files that need fsyncing since Windows doesn't do sync() in the 
classical sense?  If so, then could we use that code to keep track of the 
files that need fsyncing?



Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

Greg Stark [EMAIL PROTECTED] writes:
 Tom Lane [EMAIL PROTECTED] writes:
 The main problem with this is knowing which files need to be fsync'd.

 Why could the postmaster not just fsync *every* file?

You want to find, open, and fsync() every file in the database cluster
for every checkpoint?  Sounds like a non-starter to me.  In typical
situations I'd expect there to be lots of files that have no writes
during any given checkpoint interval (system catalogs for instance).

 I'm assuming fsync syncs writes issued by other processes on the same file,
 which isn't necessarily true though.

It was already pointed out that we can't rely on that assumption.

 Or use aio to write ahead as much as you want and then just make the checkpoint
 block until all the writes are completed. You don't actually need to rush them
 at all, just know when they're done.

If the objective is to avoid an i/o storm, I don't think this does it.
The system could easily delay most of the writes until the next syncer()
pass.

regards, tom lane


Re: [HACKERS] Experimental patch for inter-page delay in VACUUM

2003-11-04 Thread Stephen

I don't mind the long delay as long as we have a choice, as we clearly do in
this case, to set vacuum_page_delay=WHATEVER. Of course, if VACUUM can be
improved with better code placement for the delays or buffer replacement
policies then I'm all for it. Right now, I'm pretty satisfied with the
responsiveness on large DBs using a vacuum_page_delay of 10ms.

Any idea whether this patch will be included in 7.4 before the final release?

Stephen


Andrew Dunstan [EMAIL PROTECTED] wrote in message
news:[EMAIL PROTECTED]

 Not surprising, I should have thought. Why would you care that much?
 The idea as I understand it is to improve the responsiveness of things
 happening alongside vacuum (real work). I normally run vacuum when I
 don't expect anything else much to be happening - but I don't care how
 long it takes (within reason), especially if it isn't going to interfere
 with other uses.

 cheers

 andrew





Re: [HACKERS] Experimental patch for inter-page delay in VACUUM