Re: [HACKERS] Potential Large Performance Gain in WAL synching

2002-10-04 Thread Curtis Faith

I wrote:
  ... most file systems can't process fsync's
  simultaneous with other writes, so those writes block because the file
  system grabs its own internal locks.


tom lane replies:
 Oh?  That would be a serious problem, but I've never heard that asserted
 before.  Please provide some evidence.

Well I'm basing this on past empirical testing and having read some man
pages that describe fsync under this exact scenario. I'll have to write
a test to prove this one way or another. I'll also try to look into
the Linux/BSD source for the common file systems used for PostgreSQL.

 On a filesystem that does have that kind of problem, can't you avoid it
 just by using O_DSYNC on the WAL files?  Then there's no need to call
 fsync() at all, except during checkpoints (which actually issue sync()
 not fsync(), anyway).


No, they're not exactly the same thing. Consider:

Process A                     File System
---------                     -----------
Writes index buffer           .idling...
Writes entry to log cache     .
Writes another index buffer   .
Writes another log entry      .
Writes tuple buffer           .
Writes another log entry      .
Index scan                    .
Large table sort              .
Writes tuple buffer           .
Writes another log entry      .
Writes                        .
Writes another index buffer   .
Writes another log entry      .
Writes another index buffer   .
Writes another log entry      .
Index scan                    .
Large table sort              .
Commit                        .
File Write Log Entry          .
.idling...                    Write to cache
File Write Log Entry          .idling...
.idling...                    Write to cache
File Write Log Entry          .idling...
.idling...                    Write to cache
File Write Log Entry          .idling...
.idling...                    Write to cache
Write Commit Log Entry        .idling...
.idling...                    Write to cache
Call fsync                    .idling...
.idling...                    Write all buffers to device.
.DONE.

In this case, Process A is waiting for all the buffers to write
at the end of the transaction.

With asynchronous I/O this becomes:

Process A                     File System
---------                     -----------
Writes index buffer           .idling...
Writes entry to log cache     Queue up write - move head to cylinder
Writes another index buffer   Write log entry to media
Writes another log entry      Immediate write to cylinder since head is
                              still there.
Writes tuple buffer           .
Writes another log entry      Queue up write - move head to cylinder
Index scan                    .busy with scan...
Large table sort              Write log entry to media
Writes tuple buffer           .
Writes another log entry      Queue up write - move head to cylinder
Writes                        .
Writes another index buffer   Write log entry to media
Writes another log entry      Queue up write - move head to cylinder
Writes another index buffer   .
Writes another log entry      Write log entry to media
Index scan                    .
Large table sort              Write log entry to media
Commit                        .
Write Commit Log Entry        Immediate write to cylinder since head is
                              still there.
.DONE.

Effectively the real work of writing the cache is done while the CPU
for the process is busy doing index scans, sorts, etc. With the WAL
log on another device and SCSI I/O the log writing should almost always be
done except for the final commit write.

  Whether by threads or multiple processes, there is the same
 contention on
  the file through multiple writers. The file system can decide to reorder
  writes before they start but not after. If a write comes after a
  fsync starts it will have to wait on that fsync.

 AFAICS we cannot allow the filesystem to reorder writes of WAL blocks,
 on safety grounds (we want to be sure we have a consistent WAL up to the
 end of what we've written).  Even if we can allow some reordering when a
 single transaction puts out a large volume of WAL data, I fail to see
 where any large gain is going to come from.  We're going to be issuing
 those writes sequentially and that ought to match the disk layout about
 as well as can be hoped anyway.

My comment applied to the reads and writes of other processes, not to the
WAL log. In my original email, recall I mentioned using the O_APPEND
open flag, which will ensure that all log entries are done sequentially.

  Likewise a given process's writes can NEVER be reordered if they are
  submitted synchronously, as is done in the calls to flush the log as
  well as the dirty pages in the buffer in the current code.

 We do not fsync buffer pages; in fact a transaction commit doesn't write
 buffer pages at all.  I think the above is just a misunderstanding of
 what's really happening.

Re: [HACKERS] Potential Large Performance Gain in WAL synching

2002-10-04 Thread Neil Conway

Curtis Faith [EMAIL PROTECTED] writes:
 It looks to me like BufferAlloc will simply result in a call to
 BufferReplace > smgrblindwrt > write for md storage manager objects.
 
 This means that a process will block while the write of dirty cache
 buffers takes place.

I think Tom was suggesting that when a buffer is written out, the
write() call only pushes the data down into the filesystem's buffer --
which is free to then write the actual blocks to disk whenever it
chooses to. In other words, the write() returns, the backend process
can continue with what it was doing, and at some later time the blocks
that we flushed from the Postgres buffer will actually be written to
disk. So in some sense of the word, that I/O is asynchronous.

Cheers,

Neil

-- 
Neil Conway [EMAIL PROTECTED] || PGP Key ID: DB3C29FC


---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster



Re: [HACKERS] Potential Large Performance Gain in WAL synching

2002-10-04 Thread Greg Copeland

On Fri, 2002-10-04 at 18:03, Neil Conway wrote:
 Curtis Faith [EMAIL PROTECTED] writes:
  It looks to me like BufferAlloc will simply result in a call to
  BufferReplace > smgrblindwrt > write for md storage manager objects.
  
  This means that a process will block while the write of dirty cache
  buffers takes place.
 
 I think Tom was suggesting that when a buffer is written out, the
 write() call only pushes the data down into the filesystem's buffer --
 which is free to then write the actual blocks to disk whenever it
 chooses to. In other words, the write() returns, the backend process
 can continue with what it was doing, and at some later time the blocks
 that we flushed from the Postgres buffer will actually be written to
 disk. So in some sense of the word, that I/O is asynchronous.


Isn't that true only as long as there is buffer space available?  When
there isn't buffer space available, it seems the window for blocking comes
into play.  So I guess you could say it is optimally asynchronous and
worst case synchronous.  I think the worst case situation is the one
he's trying to address.

At least that's how I interpret it.

Greg






Re: [HACKERS] Potential Large Performance Gain in WAL synching

2002-10-04 Thread Tom Lane

Neil Conway [EMAIL PROTECTED] writes:
 Curtis Faith [EMAIL PROTECTED] writes:
 It looks to me like BufferAlloc will simply result in a call to
 BufferReplace > smgrblindwrt > write for md storage manager objects.
 
 This means that a process will block while the write of dirty cache
 buffers takes place.

 I think Tom was suggesting that when a buffer is written out, the
 write() call only pushes the data down into the filesystem's buffer --
 which is free to then write the actual blocks to disk whenever it
 chooses to.

Exactly --- in all Unix systems that I know of, a write() is
asynchronous unless one takes special pains (like opening the file
with O_SYNC).  Pushing the data from userspace to the kernel disk
buffers does not count as I/O in my mind.

I am quite concerned about Curtis' worries about fsync, though.
There's not any fundamental reason for fsync to block other operations,
but that doesn't mean that it's been implemented reasonably everywhere
:-(.  We need to take a look at that.

regards, tom lane




Re: [HACKERS] Potential Large Performance Gain in WAL synching

2002-10-04 Thread Curtis Faith

I re-sent this since it didn't seem to get to the list.

After some research I still hold that fsync blocks, at least on
FreeBSD. Am I missing something?

Here's the evidence:

Code from: /usr/src/sys/syscalls/vfs_syscalls

int
fsync(p, uap)
	struct proc *p;
	struct fsync_args /* {
		syscallarg(int) fd;
	} */ *uap;
{
	register struct vnode *vp;
	struct file *fp;
	vm_object_t obj;
	int error;

	if ((error = getvnode(p->p_fd, SCARG(uap, fd), &fp)) != 0)
		return (error);
	vp = (struct vnode *)fp->f_data;
	vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, p);
	if (VOP_GETVOBJECT(vp, &obj) == 0)
		vm_object_page_clean(obj, 0, 0, 0);
	if ((error = VOP_FSYNC(vp, fp->f_cred, MNT_WAIT, p)) == 0 &&
	    vp->v_mount && (vp->v_mount->mnt_flag & MNT_SOFTDEP) &&
	    bioops.io_fsync)
		error = (*bioops.io_fsync)(vp);
	VOP_UNLOCK(vp, 0, p);
	return (error);
}

Notice the calls to:

vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, p);
..
VOP_UNLOCK(vp, 0, p);

surrounding the call to VOP_FSYNC.

From the man pages for VOP_UNLOCK:


HEADER STUFF .


 int
 VOP_LOCK(struct vnode *vp, int flags, struct proc *p);

 int
 VOP_UNLOCK(struct vnode *vp, int flags, struct proc *p);

 int
 VOP_ISLOCKED(struct vnode *vp, struct proc *p);

 int
 vn_lock(struct vnode *vp, int flags, struct proc *p);



DESCRIPTION
 These calls are used to serialize access to the filesystem, such as to
 prevent two writes to the same file from happening at the same time.

 The arguments are:

 vp the vnode being locked or unlocked

 flags  One of the lock request types:

  LK_SHARED Shared lock
  LK_EXCLUSIVE  Exclusive lock
  LK_UPGRADEShared-to-exclusive upgrade
  LK_EXCLUPGRADEFirst shared-to-exclusive upgrade
  LK_DOWNGRADE  Exclusive-to-shared downgrade
  LK_RELEASERelease any type of lock
  LK_DRAIN  Wait for all lock activity to end

The lock type may be or'ed with these lock flags:

  LK_NOWAITDo not sleep to wait for lock
  LK_SLEEPFAIL Sleep, then return failure
  LK_CANRECURSEAllow recursive exclusive lock
  LK_REENABLE  Lock is to be reenabled after drain
  LK_NOPAUSE   No spinloop

The lock type may be or'ed with these control flags:

  LK_INTERLOCKSpecify when the caller already has a simple
  lock (VOP_LOCK will unlock the simple lock
  after getting the lock)
  LK_RETRYRetry until locked
  LK_NOOBJDon't create object

 p  process context to use for the locks

 Kernel code should use vn_lock() to lock a vnode rather than calling
 VOP_LOCK() directly.




Re: [HACKERS] Potential Large Performance Gain in WAL synching

2002-10-04 Thread Tom Lane

Curtis Faith [EMAIL PROTECTED] writes:
 After some research I still hold that fsync blocks, at least on
 FreeBSD. Am I missing something?

 Here's the evidence:
 [ much snipped ]
 vp = (struct vnode *)fp->f_data;
 vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, p);

Hm, I take it a vnode is what's usually called an inode, ie the unique
identification data for a specific disk file?

This is kind of ugly in general terms but I'm not sure that it really
hurts Postgres.  In our present scheme, the only files we ever fsync()
are WAL log files, not data files.  And in normal operation there is
only one WAL writer at a time, and *no* WAL readers.  So an exclusive
kernel-level lock on a WAL file while we fsync really shouldn't create
any problem for us.  (Unless this indirectly blocks other operations
that I'm missing?)

As I commented before, I think we could do with an extra process to
issue WAL writes in places where they're not in the critical path for
a foreground process.  But that seems to be orthogonal from this issue.

regards, tom lane




Re: [HACKERS] Potential Large Performance Gain in WAL synching

2002-10-04 Thread Curtis Faith


Bruce Momjian wrote:
 I am again confused.  When we do write(), we don't have to lock
 anything, do we?  (Multiple processes can write() to the same file just
 fine.)  We do block the current process, but we have nothing else to do
 until we know it is written/fsync'ed.  Does aio more easily allow the
 kernel to order those writes?  Is that the issue?  Well, certainly the
 kernel already orders the writes.  Just because we write() doesn't mean
 it goes to disk.  Only fsync() or the kernel do that.

We don't have to lock anything, but most file systems can't process
fsync's
simultaneous with other writes, so those writes block because the file
system grabs its own internal locks. The fsync call is more
contentious than typical writes because its duration is usually
longer, so it holds the locks longer, over more pages and structures.
That is the real issue: the contention caused by fsync'ing very frequently,
which blocks other writers and readers.

For the buffer manager, the blocking of readers is probably even more
problematic when the cache is a small percentage (say < 10% to 15%) of
the total database size because most leaf node accesses will result in
a read. Each of these reads will have to wait on the fsync as well. Again,
a very well written file system probably can minimize this but I've not
seen any.

Further comment on:
We do block the current process, but we have nothing else to do
until we know it is written/fsync'ed.

Writing out a bunch of calls at the end, after having consumed a lot
of CPU cycles, and then waiting is not as efficient as writing them out
while those CPU cycles are being used. We are currently wasting the
time it takes for a given process to write.

The thinking probably has been that this is no big deal because other
processes, say B, C and D can use the CPU cycles while process A blocks.
This is true UNLESS the other processes are blocking on reads or
writes caused by process A doing the final writes and fsync.

 Yes, but Oracle is threaded, right, so, yes, they clearly could win with
 it.  I read the second URL and it said we could issue separate writes
 and have them be done in an optimal order.  However, we use the file
 system, not raw devices, so don't we already have that in the kernel
 with fsync()?

Whether by threads or multiple processes, there is the same contention on
the file through multiple writers. The file system can decide to reorder
writes before they start but not after. If a write comes after a
fsync starts it will have to wait on that fsync.

Likewise a given process's writes can NEVER be reordered if they are
submitted synchronously, as is done in the calls to flush the log as
well as the dirty pages in the buffer in the current code.

 Probably.  Having seen the Informix 5/7 debacle, I don't want to fall
 into the trap where we add stuff that just makes things faster on
 SMP/threaded systems when it makes our code _slower_ on single CPU
 systems, which is exactly what Informix did in Informix 7, and we know
 how that ended (lost customers, bought by IBM).  I don't think that's
 going to happen to us, but I thought I would mention it.

Yes, I hate improvements that make things worse for most people. Any
changes I'd contemplate would be simply another configuration driven
optimization that could be turned off very easily.

- Curtis





Re: [HACKERS] Potential Large Performance Gain in WAL synching

2002-10-04 Thread Bruce Momjian

Curtis Faith wrote:
 Bruce Momjian wrote:
  I may be missing something here, but other backends don't block while
  one writes to WAL.
 
 I don't think they'll block until they get to the fsync or XLogWrite
 call while another transaction is fsync'ing.
 
 I'm no Unix filesystem expert but I don't see how the OS can
 handle multiple writes and fsyncs to the same file descriptors without
 blocking other processes from writing at the same time. It may be that
 there are some clever data structures they use but I've not seen huge
 praise for most of the file systems. A well written file system could
 minimize this contention but I'll bet it's there with most of the ones
 that PostgreSQL most commonly runs on.
 
 I'll have to write a test and see if there really is a problem.

Yes, I can see some contention, but what does aio solve?

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073




Re: [HACKERS] Potential Large Performance Gain in WAL synching

2002-10-04 Thread Curtis Faith


I wrote:
  I'm no Unix filesystem expert but I don't see how the OS can
  handle multiple writes and fsyncs to the same file descriptors without
  blocking other processes from writing at the same time. It may be that
  there are some clever data structures they use but I've not seen huge
  praise for most of the file systems. A well written file system could
  minimize this contention but I'll bet it's there with most of the ones
  that PostgreSQL most commonly runs on.
  
  I'll have to write a test and see if there really is a problem.

Bruce Momjian wrote:

 Yes, I can see some contention, but what does aio solve?
 

Well, theoretically, aio lets the file system handle the writes without
requiring any locks to be held by the processes issuing those writes.
The disk I/O scheduler can therefore issue the writes using spinlocks or
something very fast, since it controls the timing of each of the actual
writes. In some systems this is handled by the kernel and can be very
fast.

I suspect that with large RAID controllers or intelligent disk systems
like EMC this is even more important because they should be able to
handle a much higher level of concurrent i/o.

Now whether or not the common file systems handle this well, I can't say.

Take a look at some comments on how Oracle uses asynchronous I/O:

http://www.ixora.com.au/notes/redo_write_multiplexing.htm
http://www.ixora.com.au/notes/asynchronous_io.htm
http://www.ixora.com.au/notes/raw_asynchronous_io.htm

It seems that OS support for this will likely increase and that this
issue will become more and more important as users contemplate SMP systems
or if threading is added to certain PostgreSQL subsystems.

It might be easier for me to implement the change I propose and then
see what kind of difference it makes.

I wanted to run the idea past this group first. We can all postulate
whether or not it will work, but we won't know unless we try it. My real
question is what happens in the event that it does work.

I've had very good luck implementing this sort of thing for other systems
but I don't yet know the range of i/o requests that PostgreSQL makes.

Assuming we can demonstrate no detrimental effects on system reliability,
and that the change is implemented in such a way that it can be turned
on or off easily, will a 50% or better increase in speed for updates
justify the sort of change I am proposing? 20%? 10%?

- Curtis




Re: [HACKERS] Potential Large Performance Gain in WAL synching

2002-10-04 Thread Bruce Momjian

Curtis Faith wrote:
  Yes, I can see some contention, but what does aio solve?
  
 
 Well, theoretically, aio lets the file system handle the writes without
 requiring any locks to be held by the processes issuing those writes.
 The disk I/O scheduler can therefore issue the writes using spinlocks or
 something very fast, since it controls the timing of each of the actual
 writes. In some systems this is handled by the kernel and can be very
 fast.

I am again confused.  When we do write(), we don't have to lock
anything, do we?  (Multiple processes can write() to the same file just
fine.)  We do block the current process, but we have nothing else to do
until we know it is written/fsync'ed.  Does aio more easily allow the
kernel to order those writes?  Is that the issue?  Well, certainly the
kernel already orders the writes.  Just because we write() doesn't mean
it goes to disk.  Only fsync() or the kernel do that.

 
 I suspect that with large RAID controllers or intelligent disk systems
 like EMC this is even more important because they should be able to
 handle a much higher level of concurrent i/o.
 
 Now whether or not the common file systems handle this well, I can't say,
 
 Take a look at some comments on how Oracle uses asynchronous I/O
 
 http://www.ixora.com.au/notes/redo_write_multiplexing.htm
 http://www.ixora.com.au/notes/asynchronous_io.htm
 http://www.ixora.com.au/notes/raw_asynchronous_io.htm

Yes, but Oracle is threaded, right, so, yes, they clearly could win with
it.  I read the second URL and it said we could issue separate writes
and have them be done in an optimal order.  However, we use the file
system, not raw devices, so don't we already have that in the kernel
with fsync()?

 It seems that OS support for this will likely increase and that this
 issue will become more and more important as users contemplate SMP systems
 or if threading is added to certain PostgreSQL subsystems.

Probably.  Having seen the Informix 5/7 debacle, I don't want to fall
into the trap where we add stuff that just makes things faster on
SMP/threaded systems when it makes our code _slower_ on single CPU
systems, which is exactly what Informix did in Informix 7, and we know
how that ended (lost customers, bought by IBM).  I don't think that's
going to happen to us, but I thought I would mention it.

 Assuming we can demonstrate no detrimental effects on system reliability
 and that the change is implemented in such a way that it can be turned
 on or off easily, will a 50% or better increase in speed for updates
 justify the sort of change I am proposing? 20%? 10%?

Yea, let's see what boost we get, and the size of the patch, and we can
review it.  It is certainly worth researching.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073




Re: [HACKERS] Potential Large Performance Gain in WAL synching

2002-10-04 Thread Tom Lane

Curtis Faith [EMAIL PROTECTED] writes:
 ... most file systems can't process fsync's
 simultaneous with other writes, so those writes block because the file
 system grabs its own internal locks.

Oh?  That would be a serious problem, but I've never heard that asserted
before.  Please provide some evidence.

On a filesystem that does have that kind of problem, can't you avoid it
just by using O_DSYNC on the WAL files?  Then there's no need to call
fsync() at all, except during checkpoints (which actually issue sync()
not fsync(), anyway).

 Whether by threads or multiple processes, there is the same contention on
 the file through multiple writers. The file system can decide to reorder
 writes before they start but not after. If a write comes after a
 fsync starts it will have to wait on that fsync.

AFAICS we cannot allow the filesystem to reorder writes of WAL blocks,
on safety grounds (we want to be sure we have a consistent WAL up to the
end of what we've written).  Even if we can allow some reordering when a
single transaction puts out a large volume of WAL data, I fail to see
where any large gain is going to come from.  We're going to be issuing
those writes sequentially and that ought to match the disk layout about
as well as can be hoped anyway.

 Likewise a given process's writes can NEVER be reordered if they are
 submitted synchronously, as is done in the calls to flush the log as
 well as the dirty pages in the buffer in the current code.

We do not fsync buffer pages; in fact a transaction commit doesn't write
buffer pages at all.  I think the above is just a misunderstanding of
what's really happening.  We have synchronous WAL writing, agreed, but
we want that AFAICS.  Data block writes are asynchronous (between
checkpoints, anyway).

There is one thing in the current WAL code that I don't like: if the WAL
buffers fill up then everybody who would like to make WAL entries is
forced to wait while some space is freed, which means a write, which is
synchronous if you are using O_DSYNC.  It would be nice to have a
background process whose only task is to issue write()s as soon as WAL
pages are filled, thus reducing the probability that foreground
processes have to wait for WAL writes (when they're not committing that
is).  But this could be done portably with one more postmaster child
process; I see no real need to dabble in aio_write.

regards, tom lane




Re: [HACKERS] Potential Large Performance Gain in WAL synching

2002-10-04 Thread Zeugswetter Andreas SB SD


  ... most file systems can't process fsync's
  simultaneous with other writes, so those writes block because the file
  system grabs its own internal locks.
 
 Oh?  That would be a serious problem, but I've never heard that asserted
 before.  Please provide some evidence.
 
 On a filesystem that does have that kind of problem, can't you avoid it
 just by using O_DSYNC on the WAL files?

To make this competitive, the WAL writes would need to be improved to 
do more than one block (up to 256k or 512k per write) with one write call 
(if that much is to be written for this tx to be able to commit).
This should actually not be too difficult since the WAL buffer is already 
contiguous memory.

If that is done, then I bet O_DSYNC will beat any other config we currently 
have.

With this, a separate disk for WAL and large transactions you should be able 
to see your disks hit the max IO figures they are capable of :-)

Andreas




Re: [HACKERS] Potential Large Performance Gain in WAL synching

2002-10-04 Thread Tom Lane

Zeugswetter Andreas SB SD [EMAIL PROTECTED] writes:
 To make this competitive, the WAL writes would need to be improved to 
 do more than one block (up to 256k or 512k per write) with one write call 
 (if that much is to be written for this tx to be able to commit).
 This should actually not be too difficult since the WAL buffer is already 
 contiguous memory.

Hmmm ... if you were willing to dedicate a half meg or meg of shared
memory for WAL buffers, that's doable.  I was originally thinking of
having the (still hypothetical) background process wake up every time a
WAL page was completed and available to write.  But it could be set up
so that there is some slop, and it only wakes up when the number of
writable pages exceeds N, for some N that's still well less than the
number of buffers.  Then it could write up to N sequential pages in a
single write().

However, this would only be a win if you had few and large transactions.
Any COMMIT will force a write of whatever we have so far, so the idea of
writing hundreds of K per WAL write can only work if it's hundreds of K
between commit records.  Is that a common scenario?  I doubt it.

If you try to set it up that way, then it's more likely that what will
happen is the background process seldom awakens at all, and each
committer effectively becomes responsible for writing all the WAL
traffic since the last commit.  Wouldn't that lose compared to someone
else having written the previous WAL pages in background?

We could certainly build the code to support this, though, and then
experiment with different values of N.  If it turns out N==1 is best
after all, I don't think we'd have wasted much code.

regards, tom lane




Re: [HACKERS] Potential Large Performance Gain in WAL synching

2002-10-04 Thread Zeugswetter Andreas SB SD


 Hmmm ... if you were willing to dedicate a half meg or meg of shared
 memory for WAL buffers, that's doable.

Yup, configuring Informix to three 2 Mb buffers (LOGBUF 2048) here. 

 However, this would only be a win if you had few and large transactions.
 Any COMMIT will force a write of whatever we have so far, so the idea of
 writing hundreds of K per WAL write can only work if it's hundreds of K
 between commit records.  Is that a common scenario?  I doubt it.

It should help most for data loading, or mass updating, yes.

Andreas




Re: [HACKERS] Potential Large Performance Gain in WAL synching

2002-10-04 Thread Giles Lean


Curtis Faith writes:

 I'm no Unix filesystem expert but I don't see how the OS can handle
 multiple writes and fsyncs to the same file descriptors without
 blocking other processes from writing at the same time.

Why not?  Other than the necessary synchronisation for attributes such
as file size and modification times, multiple processes can readily
write to different areas of the same file at the same time.

fsync() may not return until after the buffers it schedules are
written, but it doesn't have to block subsequent writes to different
buffers in the file either.  (Note too Tom Lane's responses about
when fsync() is used and not used.)

 I'll have to write a test and see if there really is a problem.

Please do.  I expect you'll find things aren't as bad as you fear.

In another posting, you write:

 Hmm, I keep hearing that buffer block writes are asynchronous but I don't
 read that in the code at all. There are simple write calls with files
 that are not opened with O_NOBLOCK, so they'll be done synchronously. The
 code for this is relatively straightforward (once you get past the
 storage manager abstraction) so I don't see what I might be missing.

There is a confusion of terminology here: the write() is synchronous
from the point of the application only in that the data is copied into
kernel buffers (or pages remapped, or whatever) before the system call
returns.  For files opened with O_DSYNC the write() would wait for the
data to be written to disk.  Thus O_DSYNC is synchronous I/O, but
there is no equivalently easy name for the regular flush to disk
after write() returns that the Unix kernel has done ~forever.

The asynchronous I/O that you mention (aio) is a third thing,
different from both regular write() and write() with O_DSYNC. I
understand that with aio the data is not even transferred to the
kernel before the aio_write() call returns, but I've never programmed
with aio and am not 100% sure how it works.

Regards,

Giles






Re: [HACKERS] Potential Large Performance Gain in WAL synching

2002-10-04 Thread Bruce Momjian

Tom Lane wrote:
 Curtis Faith [EMAIL PROTECTED] writes:
  After some research I still hold that fsync blocks, at least on
  FreeBSD. Am I missing something?
 
  Here's the evidence:
  [ much snipped ]
  vp = (struct vnode *)fp->f_data;
  vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, p);
 
 Hm, I take it a vnode is what's usually called an inode, ie the unique
 identification data for a specific disk file?

Yes, Virtual Inode.  I think it is virtual because it is used for NFS,
where the handle really isn't an inode.

 This is kind of ugly in general terms but I'm not sure that it really
 hurts Postgres.  In our present scheme, the only files we ever fsync()
 are WAL log files, not data files.  And in normal operation there is
 only one WAL writer at a time, and *no* WAL readers.  So an exclusive
 kernel-level lock on a WAL file while we fsync really shouldn't create
 any problem for us.  (Unless this indirectly blocks other operations
 that I'm missing?)

I think the small issue is:

proc1   proc2
write
fsync   write
        fsync

Proc2 has to wait for proc1's fsync to finish, but the write is so short
compared to the fsync that I don't see an issue.  Now, if someone would
come up with code that did only one fsync for the above case, that would
be a big win.
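
One hedged sketch of what such coalescing could look like: track how far the log file is known to be flushed, so a committer whose records are already covered by somebody else's fsync can skip its own. All names here (GroupCommitLog, flushed_to, and so on) are illustrative, not PostgreSQL code:

```python
import os
import tempfile
import threading

class GroupCommitLog:
    def __init__(self, path):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self.lock = threading.Lock()
        self.write_pos = 0      # bytes appended so far (a toy "LSN")
        self.flushed_to = 0     # bytes known to be durable
        self.fsync_calls = 0

    def commit(self, record):
        with self.lock:
            os.write(self.fd, record)
            self.write_pos += len(record)
            my_lsn = self.write_pos
        # Other backends may commit, and flush further, in this window.
        with self.lock:
            if self.flushed_to >= my_lsn:
                return          # someone else's fsync already covered us
            os.fsync(self.fd)
            self.flushed_to = self.write_pos
            self.fsync_calls += 1

path = os.path.join(tempfile.mkdtemp(), "wal.log")
log = GroupCommitLog(path)
threads = [threading.Thread(target=log.commit, args=(b"commit\n",))
           for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# fsync_calls is at most 20, and typically fewer when commits overlap.
```

Every commit still returns only after its record is durable; the win is that overlapping commits can share one fsync instead of issuing one apiece.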

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073




Re: [HACKERS] Potential Large Performance Gain in WAL synching

2002-10-03 Thread Tom Lane

Curtis Faith [EMAIL PROTECTED] writes:
 So, why don't we use files opened with O_DSYNC | O_APPEND for the WAL log
 and then use aio_write for all log writes?

We already offer an O_DSYNC option.  It's not obvious to me what
aio_write brings to the table (aside from loss of portability).
You still have to wait for the final write to complete, no?

 2) Allow transactions to complete and do work while other threads are
 waiting on the completion of the log write.

I'm missing something.  There is no useful work that a transaction can
do between writing its commit record and reporting completion, is there?
It has to wait for that record to hit disk.

regards, tom lane




Re: [HACKERS] Potential Large Performance Gain in WAL synching

2002-10-03 Thread Curtis Faith


tom lane replies:
 Curtis Faith [EMAIL PROTECTED] writes:
  So, why don't we use files opened with O_DSYNC | O_APPEND for 
 the WAL log
  and then use aio_write for all log writes?
 
 We already offer an O_DSYNC option.  It's not obvious to me what
 aio_write brings to the table (aside from loss of portability).
 You still have to wait for the final write to complete, no?

Well, for starters, by the time the write that includes the commit
log entry is issued, much of the log for the transaction will already
be on disk, or in a controller on its way there.

I don't see any O_NONBLOCK or O_NDELAY references in the sources,
so it looks like the log writes are blocking. If I read correctly,
XLogInsert calls XLogWrite, which calls write(), which blocks. If these
assumptions are correct, there should be some significant gain here, but I
won't know how much until I try to change it. This issue only affects the
speed of a given back-end's transaction processing.

The REAL issue, and the one that will greatly affect total system
throughput, is contention on the file locks. Since fsync needs to
obtain a write lock on the file descriptor, as do the write() calls that
originate from XLogWrite as the writes go out to disk, other
back-ends will block while another transaction is committing if the
log cache fills to the point where their XLogInsert results in an
XLogWrite call to flush the log cache. I'd guess this means that one
won't gain much by adding back-end processes past three or four
if there are a lot of inserts or updates.

The method I propose does not result in any blocking because of writes
other than the final commit's write, and it has the very significant
advantage of allowing other transactions (from other back-ends) to
continue until they enter commit (and then block waiting for their final
commit write to complete).

  2) Allow transactions to complete and do work while other threads are
  waiting on the completion of the log write.
 
 I'm missing something.  There is no useful work that a transaction can
 do between writing its commit record and reporting completion, is there?
 It has to wait for that record to hit disk.

The key here is that a thread that has not committed and therefore is
not blocking can do work while other threads (should have said back-ends 
or processes) are waiting on their commit writes.

- Curtis

P.S. If I am right in my assumptions about the way the current system
works, I'll bet the change would speed up inserts in Shridhar's huge
database test by at least a factor of two or three, perhaps even an
order of magnitude. :-)




Re: [HACKERS] Potential Large Performance Gain in WAL synching

2002-10-03 Thread Bruce Momjian

Curtis Faith wrote:
 The method I propose does not result in any blocking because of writes
 other than the final commit's write and it has the very significant
 advantage of allowing other transactions (from other back-ends) to
 continue until they enter commit (and blocking waiting for their final
 commit write to complete).
 
   2) Allow transactions to complete and do work while other threads are
   waiting on the completion of the log write.
  
  I'm missing something.  There is no useful work that a transaction can
  do between writing its commit record and reporting completion, is there?
  It has to wait for that record to hit disk.
 
 The key here is that a thread that has not committed and therefore is
 not blocking can do work while other threads (should have said back-ends 
 or processes) are waiting on their commit writes.

I may be missing something here, but other backends don't block while
one writes to WAL.  Remember, we are process-based, not thread-based,
so the write() call only blocks the one session.  If you had threads,
and you did a write() call that blocked other threads, I can see where
your idea would be good, and where async I/O becomes an advantage.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073




Re: [HACKERS] Potential Large Performance Gain in WAL synching

2002-10-03 Thread Tom Lane

Curtis Faith [EMAIL PROTECTED] writes:
 The REAL issue and the one that will greatly affect total system
 throughput is that of contention on the file locks. Since fsync needs to
 obtain a write lock on the file descriptor, as do the write calls which
 originate from XLogWrite as the writes are written to the disk, other
 back-ends will block while another transaction is committing if the
 log cache fills to the point where their XLogInsert results in an
 XLogWrite call to flush the log cache.

But that's exactly *why* we have a log cache: to ensure we can buffer a
reasonable amount of log data between XLogFlush calls.  If the above
scenario is really causing a problem, doesn't that just mean you need
to increase wal_buffers?
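
For reference, that tuning lives in postgresql.conf; the value below is purely illustrative, not a recommendation:

```
# postgresql.conf -- illustrative value only
wal_buffers = 32        # WAL disk-page buffers in shared memory
```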

regards, tom lane




Re: [HACKERS] Potential Large Performance Gain in WAL synching

2002-10-03 Thread Curtis Faith

I wrote:
  The REAL issue and the one that will greatly affect total system
  throughput is that of contention on the file locks. Since fsync
  needs to obtain a write lock on the file descriptor, as do the write
  calls which originate from XLogWrite as the writes are written to
  the disk, other back-ends will block while another transaction is
  committing if the log cache fills to the point where their
  XLogInsert results in an XLogWrite call to flush the log cache.

tom lane wrote:
 But that's exactly *why* we have a log cache: to ensure we can buffer a
 reasonable amount of log data between XLogFlush calls.  If the above
 scenario is really causing a problem, doesn't that just mean you need
 to increase wal_buffers?

Well, in cases where there are a lot of small transactions, the contention
will come not from XLogWrite calls triggered by a full cache but from
XLogWrite calls at transaction commit, which happen very frequently.
I think this will have a detrimental effect on very high update frequency
performance.

So while larger WAL caches will help when the cache flushes because it is
full, I don't think they will make any difference for the potentially more
common case of flushes at transaction commit.

- Curtis





Re: [HACKERS] Potential Large Performance Gain in WAL synching

2002-10-03 Thread Curtis Faith

Bruce Momjian wrote:
 I may be missing something here, but other backends don't block while
 one writes to WAL.

I don't think they block until they reach their own fsync or XLogWrite
call while another transaction is fsync'ing.

I'm no Unix filesystem expert, but I don't see how the OS can
handle multiple writes and fsyncs to the same file descriptor without
blocking other processes from writing at the same time. It may be that
there are some clever data structures in use, but I've not seen huge
praise for most of the file systems. A well-written file system could
minimize this contention, but I'll bet it's there in most of the ones
PostgreSQL most commonly runs on.

I'll have to write a test and see if there really is a problem.
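
One possible shape for such a test (a sketch only; the counts and chunk size are arbitrary choices): fork a child that streams plain write()s into the same file while the parent does write()+fsync() in a loop, then compare the parent's elapsed time against a run without the concurrent child.

```python
import os
import tempfile
import time

path = os.path.join(tempfile.mkdtemp(), "contention.log")
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)

N_WRITES, N_FSYNCS, CHUNK = 200, 50, b"x" * 512

pid = os.fork()
if pid == 0:
    # Child: write-only workload, no fsync.
    for _ in range(N_WRITES):
        os.write(fd, CHUNK)
    os._exit(0)

# Parent: commit-style workload, fsync after every write. If fsync
# takes an exclusive kernel lock on the file, the child's writes
# should stall measurably while the parent flushes.
start = time.time()
for _ in range(N_FSYNCS):
    os.write(fd, CHUNK)
    os.fsync(fd)
elapsed = time.time() - start

os.waitpid(pid, 0)
os.close(fd)
# Compare `elapsed` (and the child's wall time) against a baseline run
# without the concurrent process to estimate contention on this filesystem.
```

Whether the stall actually shows up is exactly the open question in this thread; the numbers would differ per filesystem and kernel.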

- Curtis


