Re: [HACKERS] Potential Large Performance Gain in WAL synching
I wrote:
> ... most file systems can't process fsync's simultaneous with other
> writes, so those writes block because the file system grabs its own
> internal locks.

tom lane replies:
> Oh? That would be a serious problem, but I've never heard that asserted
> before. Please provide some evidence.

Well, I'm basing this on past empirical testing and having read some man pages that describe fsync under this exact scenario. I'll have to write a test to prove this one way or another. I'll also try to look into the Linux/BSD source for the common file systems used for PostgreSQL.

> On a filesystem that does have that kind of problem, can't you avoid it
> just by using O_DSYNC on the WAL files? Then there's no need to call
> fsync() at all, except during checkpoints (which actually issue sync()
> not fsync(), anyway).

No, they're not exactly the same thing. Consider:

    Process A                      File System
    ---------                      -----------
    Writes index buffer            .idling...
    Writes entry to log cache      .
    Writes another index buffer    .
    Writes another log entry       .
    Writes tuple buffer            .
    Writes another log entry       .
    Index scan                     .
    Large table sort               .
    Writes tuple buffer            .
    Writes another log entry       .
    Writes another index buffer    .
    Writes another log entry       .
    Writes another index buffer    .
    Writes another log entry       .
    Index scan                     .
    Large table sort               .
    Commit                         .
    File Write Log Entry           .idling...
    .idling...                     Write to cache
    File Write Log Entry           .idling...
    .idling...                     Write to cache
    File Write Log Entry           .idling...
    .idling...                     Write to cache
    File Write Log Entry           .idling...
    .idling...                     Write to cache
    Write Commit Log Entry         .idling...
    .idling...                     Write to cache
    Call fsync                     .idling...
    .idling...                     Write all buffers to device.
    .DONE.

In this case, Process A is waiting for all the buffers to write at the end of the transaction. With asynchronous I/O this becomes:

    Process A                      File System
    ---------                      -----------
    Writes index buffer            .idling...
    Writes entry to log cache      Queue up write - move head to cylinder
    Writes another index buffer    Write log entry to media
    Writes another log entry       Immediate write to cylinder since head
                                   is still there.
    Writes tuple buffer            .
    Writes another log entry       Queue up write - move head to cylinder
    Index scan                     .busy with scan...
    Large table sort               Write log entry to media
    Writes tuple buffer            .
    Writes another log entry       Queue up write - move head to cylinder
    Writes another index buffer    Write log entry to media
    Writes another log entry       Queue up write - move head to cylinder
    Writes another index buffer    .
    Writes another log entry       Write log entry to media
    Index scan                     .
    Large table sort               Write log entry to media
    Commit                         .
    Write Commit Log Entry         Immediate write to cylinder since head
                                   is still there.
    .DONE.

Effectively the real work of writing the cache is done while the CPU for the process is busy doing index scans, sorts, etc. With the WAL log on another device and SCSI I/O, the log writing should almost always be done except for the final commit write.

> > Whether by threads or multiple processes, there is the same contention
> > on the file through multiple writers. The file system can decide to
> > reorder writes before they start but not after. If a write comes after
> > a fsync starts it will have to wait on that fsync.
>
> AFAICS we cannot allow the filesystem to reorder writes of WAL blocks,
> on safety grounds (we want to be sure we have a consistent WAL up to the
> end of what we've written). Even if we can allow some reordering when a
> single transaction puts out a large volume of WAL data, I fail to see
> where any large gain is going to come from. We're going to be issuing
> those writes sequentially and that ought to match the disk layout about
> as well as can be hoped anyway.

My comment was applying to reads and writes of other processes, not the WAL log. In my original email, recall I mentioned using the O_APPEND open flag, which will ensure that all log entries are done sequentially.
> Likewise a given process's writes can NEVER be reordered if they are
> submitted synchronously, as is done in the calls to flush the log as
> well as the dirty pages in the buffer in the current code.

We do not fsync buffer pages; in fact a transaction commit doesn't write buffer pages at all. I think the above is just a misunderstanding of what's really happening.
Re: [HACKERS] Potential Large Performance Gain in WAL synching
Curtis Faith [EMAIL PROTECTED] writes:
> It looks to me like BufferAlloc will simply result in a call chain of
> BufferReplace -> smgrblindwrt -> write for md storage manager objects.
> This means that a process will block while the write of dirty cache
> buffers takes place.

I think Tom was suggesting that when a buffer is written out, the write() call only pushes the data down into the filesystem's buffer -- which is free to then write the actual blocks to disk whenever it chooses to. In other words, the write() returns, the backend process can continue with what it was doing, and at some later time the blocks that we flushed from the Postgres buffer will actually be written to disk. So in some sense of the word, that I/O is asynchronous.

Cheers,

Neil

-- 
Neil Conway [EMAIL PROTECTED] || PGP Key ID: DB3C29FC

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Potential Large Performance Gain in WAL synching
On Fri, 2002-10-04 at 18:03, Neil Conway wrote:
> Curtis Faith [EMAIL PROTECTED] writes:
> > It looks to me like BufferAlloc will simply result in a call chain of
> > BufferReplace -> smgrblindwrt -> write for md storage manager objects.
> > This means that a process will block while the write of dirty cache
> > buffers takes place.
>
> I think Tom was suggesting that when a buffer is written out, the
> write() call only pushes the data down into the filesystem's buffer --
> which is free to then write the actual blocks to disk whenever it
> chooses to. In other words, the write() returns, the backend process can
> continue with what it was doing, and at some later time the blocks that
> we flushed from the Postgres buffer will actually be written to disk. So
> in some sense of the word, that I/O is asynchronous.

Isn't that true only as long as there is buffer space available? When there isn't buffer space available, it seems the window for blocking comes into play. So I guess you could say it is optimally asynchronous and worst case synchronous. I think the worst case situation is the one he's trying to address. At least that's how I interpret it.

Greg
Re: [HACKERS] Potential Large Performance Gain in WAL synching
Neil Conway [EMAIL PROTECTED] writes:
> Curtis Faith [EMAIL PROTECTED] writes:
> > It looks to me like BufferAlloc will simply result in a call chain of
> > BufferReplace -> smgrblindwrt -> write for md storage manager objects.
> > This means that a process will block while the write of dirty cache
> > buffers takes place.
>
> I think Tom was suggesting that when a buffer is written out, the
> write() call only pushes the data down into the filesystem's buffer --
> which is free to then write the actual blocks to disk whenever it
> chooses to.

Exactly --- in all Unix systems that I know of, a write() is asynchronous unless one takes special pains (like opening the file with O_SYNC). Pushing the data from userspace to the kernel disk buffers does not count as I/O in my mind.

I am quite concerned about Curtis' worries about fsync, though. There's not any fundamental reason for fsync to block other operations, but that doesn't mean that it's been implemented reasonably everywhere :-(. We need to take a look at that.

			regards, tom lane
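Tom's distinction -- a plain write() completes once the data reaches the kernel's buffers, before any disk I/O has necessarily happened -- can be observed from user space. Below is a minimal sketch of that behaviour; the file name and payload are arbitrary, not anything from the PostgreSQL tree:

```python
import os
import tempfile

# A plain write() returns as soon as the data is copied into the kernel's
# buffer cache; no disk I/O has necessarily happened yet. POSIX
# read-after-write coherence means a second reader sees the data
# immediately, even before any fsync() is issued.
path = os.path.join(tempfile.mkdtemp(), "waltest")

fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
payload = b"commit record\n"
written = os.write(fd, payload)      # returns once the kernel has the data

# A second descriptor sees the data before any fsync() has been issued.
rfd = os.open(path, os.O_RDONLY)
seen_before_fsync = os.read(rfd, 100)

os.fsync(fd)                         # only now is the data forced to media
os.close(fd)
os.close(rfd)
```

The point of the sketch is only the ordering: the read-back succeeds between the write() and the fsync(), which is exactly why pushing data to the kernel "does not count as I/O".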
Re: [HACKERS] Potential Large Performance Gain in WAL synching
I resent this since it didn't seem to get to the list.

After some research I still hold that fsync blocks, at least on FreeBSD. Am I missing something? Here's the evidence:

Code from: /usr/src/sys/syscalls/vfs_syscalls

    int
    fsync(p, uap)
        struct proc *p;
        struct fsync_args /* {
            syscallarg(int) fd;
        } */ *uap;
    {
        register struct vnode *vp;
        struct file *fp;
        vm_object_t obj;
        int error;

        if ((error = getvnode(p->p_fd, SCARG(uap, fd), &fp)) != 0)
            return (error);
        vp = (struct vnode *)fp->f_data;
        vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, p);
        if (VOP_GETVOBJECT(vp, &obj) == 0)
            vm_object_page_clean(obj, 0, 0, 0);
        if ((error = VOP_FSYNC(vp, fp->f_cred, MNT_WAIT, p)) == 0 &&
            vp->v_mount && (vp->v_mount->mnt_flag & MNT_SOFTDEP) &&
            bioops.io_fsync)
            error = (*bioops.io_fsync)(vp);
        VOP_UNLOCK(vp, 0, p);
        return (error);
    }

Notice the calls to:

    vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, p);
    ...
    VOP_UNLOCK(vp, 0, p);

surrounding the call to VOP_FSYNC. From the man pages for VOP_UNLOCK:

    [HEADER STUFF ...]

    int VOP_LOCK(struct vnode *vp, int flags, struct proc *p);
    int VOP_UNLOCK(struct vnode *vp, int flags, struct proc *p);
    int VOP_ISLOCKED(struct vnode *vp, struct proc *p);
    int vn_lock(struct vnode *vp, int flags, struct proc *p);

DESCRIPTION
    These calls are used to serialize access to the filesystem, such as
    to prevent two writes to the same file from happening at the same
    time.
The arguments are:

    vp      the vnode being locked or unlocked

    flags   One of the lock request types:

            LK_SHARED       Shared lock
            LK_EXCLUSIVE    Exclusive lock
            LK_UPGRADE      Shared-to-exclusive upgrade
            LK_EXCLUPGRADE  First shared-to-exclusive upgrade
            LK_DOWNGRADE    Exclusive-to-shared downgrade
            LK_RELEASE      Release any type of lock
            LK_DRAIN        Wait for all lock activity to end

            The lock type may be or'ed with these lock flags:

            LK_NOWAIT       Do not sleep to wait for lock
            LK_SLEEPFAIL    Sleep, then return failure
            LK_CANRECURSE   Allow recursive exclusive lock
            LK_REENABLE     Lock is to be reenabled after drain
            LK_NOPAUSE      No spinloop

            The lock type may be or'ed with these control flags:

            LK_INTERLOCK    Specify when the caller already has a simple
                            lock (VOP_LOCK will unlock the simple lock
                            after getting the lock)
            LK_RETRY        Retry until locked
            LK_NOOBJ        Don't create object

    p       process context to use for the locks

Kernel code should use vn_lock() to lock a vnode rather than calling VOP_LOCK() directly.
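For what it's worth, the blocking question can also be probed from user space, without reading kernel source. Below is a hedged sketch of such a test: one thread fsync()s the file in a loop while another write()s to it through a second descriptor and records its slowest write. Whether the writes actually stall behind the fsyncs is precisely the kernel/filesystem behaviour under debate, so the sketch asserts nothing about timing; all names and sizes are illustrative:

```python
import os
import tempfile
import threading
import time

# One thread fsync()s the file repeatedly while the main thread issues
# 8 kB (WAL-page-sized) writes through a separate descriptor on the same
# file, recording the slowest individual write(). A large "worst" value
# would suggest writes are queueing behind the exclusive lock taken by
# fsync; a small one would suggest they are not.
path = os.path.join(tempfile.mkdtemp(), "contention")
wfd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
sfd = os.open(path, os.O_WRONLY)

stop = threading.Event()

def syncer():
    # Stand-in for a committing backend: fsync in a tight loop.
    while not stop.is_set():
        os.fsync(sfd)

block = b"x" * 8192
worst = 0.0
t = threading.Thread(target=syncer)
t.start()
for _ in range(200):
    t0 = time.monotonic()
    os.write(wfd, block)
    worst = max(worst, time.monotonic() - t0)
stop.set()
t.join()

total = os.fstat(wfd).st_size
os.close(wfd)
os.close(sfd)
```

This uses threads with separate descriptors rather than separate processes; since the locking shown above is per-vnode in the kernel, that should exercise the same path, but a fork()-based variant would be closer to the PostgreSQL situation.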
Re: [HACKERS] Potential Large Performance Gain in WAL synching
Curtis Faith [EMAIL PROTECTED] writes:
> After some research I still hold that fsync blocks, at least on FreeBSD.
> Am I missing something? Here's the evidence:
> [ much snipped ]
>     vp = (struct vnode *)fp->f_data;
>     vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, p);

Hm, I take it a vnode is what's usually called an inode, ie the unique identification data for a specific disk file?

This is kind of ugly in general terms, but I'm not sure that it really hurts Postgres. In our present scheme, the only files we ever fsync() are WAL log files, not data files. And in normal operation there is only one WAL writer at a time, and *no* WAL readers. So an exclusive kernel-level lock on a WAL file while we fsync really shouldn't create any problem for us. (Unless this indirectly blocks other operations that I'm missing?)

As I commented before, I think we could do with an extra process to issue WAL writes in places where they're not in the critical path for a foreground process. But that seems to be orthogonal from this issue.

			regards, tom lane
Re: [HACKERS] Potential Large Performance Gain in WAL synching
Bruce Momjian wrote:
> I am again confused. When we do write(), we don't have to lock anything,
> do we? (Multiple processes can write() to the same file just fine.) We
> do block the current process, but we have nothing else to do until we
> know it is written/fsync'ed. Does aio more easily allow the kernel to
> order those writes? Is that the issue? Well, certainly the kernel
> already orders the writes. Just because we write() doesn't mean it goes
> to disk. Only fsync() or the kernel do that.

We don't have to lock anything, but most file systems can't process fsync's simultaneous with other writes, so those writes block because the file system grabs its own internal locks. The fsync call is more contentious than typical writes because its duration is usually longer, so it holds the locks longer, over more pages and structures.

That is the real issue: the contention caused by fsync'ing very frequently, which blocks other writers and readers. For the buffer manager, the blocking of readers is probably even more problematic when the cache is a small percentage (say 10% to 15%) of the total database size, because most leaf node accesses will result in a read. Each of these reads will have to wait on the fsync as well. Again, a very well written file system can probably minimize this, but I've not seen any.

Further comment on:

> We do block the current process, but we have nothing else to do until we
> know it is written/fsync'ed.

Writing out a bunch of calls at the end, after having consumed a lot of CPU cycles, and then waiting is not as efficient as writing them out while those CPU cycles are being used. We are currently wasting the time it takes for a given process to write. The thinking probably has been that this is no big deal because other processes, say B, C and D, can use the CPU cycles while process A blocks. This is true UNLESS the other processes are blocking on reads or writes caused by process A doing the final writes and fsync.
> Yes, but Oracle is threaded, right, so, yes, they clearly could win with
> it. I read the second URL and it said we could issue separate writes and
> have them be done in an optimal order. However, we use the file system,
> not raw devices, so don't we already have that in the kernel with
> fsync()?

Whether by threads or multiple processes, there is the same contention on the file through multiple writers. The file system can decide to reorder writes before they start but not after. If a write comes after a fsync starts, it will have to wait on that fsync. Likewise a given process's writes can NEVER be reordered if they are submitted synchronously, as is done in the calls to flush the log as well as the dirty pages in the buffer in the current code.

> Probably. Having seen the Informix 5/7 debacle, I don't want to fall
> into the trap where we add stuff that just makes things faster on
> SMP/threaded systems when it makes our code _slower_ on single CPU
> systems, which is exactly what Informix did in Informix 7, and we know
> how that ended (lost customers, bought by IBM). I don't think that's
> going to happen to us, but I thought I would mention it.

Yes, I hate improvements that make things worse for most people. Any changes I'd contemplate would be simply another configuration-driven optimization that could be turned off very easily.

- Curtis
Re: [HACKERS] Potential Large Performance Gain in WAL synching
Curtis Faith wrote:
> Bruce Momjian wrote:
> > I may be missing something here, but other backends don't block while
> > one writes to WAL. I don't think they'll block until they get to the
> > fsync or XLogWrite call while another transaction is fsync'ing.
>
> I'm no Unix filesystem expert but I don't see how the OS can handle
> multiple writes and fsyncs to the same file descriptors without blocking
> other processes from writing at the same time. It may be that there are
> some clever data structures they use but I've not seen huge praise for
> most of the file systems. A well written file system could minimize this
> contention but I'll bet it's there with most of the ones that PostgreSQL
> most commonly runs on. I'll have to write a test and see if there really
> is a problem.

Yes, I can see some contention, but what does aio solve?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  [EMAIL PROTECTED]                    |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Re: [HACKERS] Potential Large Performance Gain in WAL synching
I wrote:
> I'm no Unix filesystem expert but I don't see how the OS can handle
> multiple writes and fsyncs to the same file descriptors without blocking
> other processes from writing at the same time. It may be that there are
> some clever data structures they use but I've not seen huge praise for
> most of the file systems. A well written file system could minimize this
> contention but I'll bet it's there with most of the ones that PostgreSQL
> most commonly runs on. I'll have to write a test and see if there really
> is a problem.

Bruce Momjian wrote:
> Yes, I can see some contention, but what does aio solve?

Well, theoretically, aio lets the file system handle the writes without requiring any locks being held by the processes issuing those writes. The disk i/o scheduler can therefore issue the writes using spinlocks or something very fast, since it controls the timing of each of the actual writes. In some systems this is handled by the kernel and can be very fast.

I suspect that with large RAID controllers or intelligent disk systems like EMC this is even more important, because they should be able to handle a much higher level of concurrent i/o. Now whether or not the common file systems handle this well, I can't say.

Take a look at some comments on how Oracle uses asynchronous I/O:

http://www.ixora.com.au/notes/redo_write_multiplexing.htm
http://www.ixora.com.au/notes/asynchronous_io.htm
http://www.ixora.com.au/notes/raw_asynchronous_io.htm

It seems that OS support for this will likely increase and that this issue will become more and more important as users contemplate SMP systems or if threading is added to certain PostgreSQL subsystems.

It might be easier for me to implement the change I propose and then see what kind of difference it makes. I wanted to run the idea past this group first. We can all postulate whether or not it will work, but we won't know unless we try it. My real issue is one of what happens in the event that it does work.
I've had very good luck implementing this sort of thing for other systems, but I don't yet know the range of i/o requests that PostgreSQL makes. Assuming we can demonstrate no detrimental effects on system reliability, and that the change is implemented in such a way that it can be turned on or off easily, will a 50% or better increase in speed for updates justify the sort of change I am proposing? 20%? 10%?

- Curtis
Re: [HACKERS] Potential Large Performance Gain in WAL synching
Curtis Faith wrote:
> > Yes, I can see some contention, but what does aio solve?
>
> Well, theoretically, aio lets the file system handle the writes without
> requiring any locks being held by the processes issuing those writes.
> The disk i/o scheduler can therefore issue the writes using spinlocks or
> something very fast since it controls the timing of each of the actual
> writes. In some systems this is handled by the kernel and can be very
> fast.

I am again confused. When we do write(), we don't have to lock anything, do we? (Multiple processes can write() to the same file just fine.) We do block the current process, but we have nothing else to do until we know it is written/fsync'ed. Does aio more easily allow the kernel to order those writes? Is that the issue? Well, certainly the kernel already orders the writes. Just because we write() doesn't mean it goes to disk. Only fsync() or the kernel do that.

> I suspect that with large RAID controllers or intelligent disk systems
> like EMC this is even more important because they should be able to
> handle a much higher level of concurrent i/o. Now whether or not the
> common file systems handle this well, I can't say. Take a look at some
> comments on how Oracle uses asynchronous I/O:
>
> http://www.ixora.com.au/notes/redo_write_multiplexing.htm
> http://www.ixora.com.au/notes/asynchronous_io.htm
> http://www.ixora.com.au/notes/raw_asynchronous_io.htm

Yes, but Oracle is threaded, right, so, yes, they clearly could win with it. I read the second URL and it said we could issue separate writes and have them be done in an optimal order. However, we use the file system, not raw devices, so don't we already have that in the kernel with fsync()?

> It seems that OS support for this will likely increase and that this
> issue will become more and more important as users contemplate SMP
> systems or if threading is added to certain PostgreSQL subsystems.

Probably.
Having seen the Informix 5/7 debacle, I don't want to fall into the trap where we add stuff that just makes things faster on SMP/threaded systems when it makes our code _slower_ on single CPU systems, which is exactly what Informix did in Informix 7, and we know how that ended (lost customers, bought by IBM). I don't think that's going to happen to us, but I thought I would mention it.

> Assuming we can demonstrate no detrimental effects on system
> reliability, and that the change is implemented in such a way that it
> can be turned on or off easily, will a 50% or better increase in speed
> for updates justify the sort of change I am proposing? 20%? 10%?

Yea, let's see what boost we get, and the size of the patch, and we can review it. It is certainly worth researching.
Re: [HACKERS] Potential Large Performance Gain in WAL synching
Curtis Faith [EMAIL PROTECTED] writes: ... most file systems can't process fsync's simultaneous with other writes, so those writes block because the file system grabs its own internal locks. Oh? That would be a serious problem, but I've never heard that asserted before. Please provide some evidence. On a filesystem that does have that kind of problem, can't you avoid it just by using O_DSYNC on the WAL files? Then there's no need to call fsync() at all, except during checkpoints (which actually issue sync() not fsync(), anyway). Whether by threads or multiple processes, there is the same contention on the file through multiple writers. The file system can decide to reorder writes before they start but not after. If a write comes after a fsync starts it will have to wait on that fsync. AFAICS we cannot allow the filesystem to reorder writes of WAL blocks, on safety grounds (we want to be sure we have a consistent WAL up to the end of what we've written). Even if we can allow some reordering when a single transaction puts out a large volume of WAL data, I fail to see where any large gain is going to come from. We're going to be issuing those writes sequentially and that ought to match the disk layout about as well as can be hoped anyway. Likewise a given process's writes can NEVER be reordered if they are submitted synchronously, as is done in the calls to flush the log as well as the dirty pages in the buffer in the current code. We do not fsync buffer pages; in fact a transaction commit doesn't write buffer pages at all. I think the above is just a misunderstanding of what's really happening. We have synchronous WAL writing, agreed, but we want that AFAICS. Data block writes are asynchronous (between checkpoints, anyway). 
There is one thing in the current WAL code that I don't like: if the WAL buffers fill up then everybody who would like to make WAL entries is forced to wait while some space is freed, which means a write, which is synchronous if you are using O_DSYNC. It would be nice to have a background process whose only task is to issue write()s as soon as WAL pages are filled, thus reducing the probability that foreground processes have to wait for WAL writes (when they're not committing that is). But this could be done portably with one more postmaster child process; I see no real need to dabble in aio_write.

			regards, tom lane
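Tom's hypothetical background WAL writer can be sketched in miniature. In the sketch below, a thread stands in for the extra postmaster child and a queue stands in for the shared WAL buffers; all names and sizes are illustrative, not PostgreSQL's:

```python
import os
import queue
import tempfile
import threading

# Backends hand completed 8 kB WAL pages to a background writer, which
# issues the write()s; foreground work then rarely waits on a write.
# Only at commit does anyone wait, for the fsync of the commit record.
PAGE = 8192
path = os.path.join(tempfile.mkdtemp(), "xlog")
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)

pages = queue.Queue()

def wal_writer():
    # Background writer: drain filled pages; None is a shutdown sentinel.
    while True:
        page = pages.get()
        if page is None:
            break
        os.write(fd, page)           # plain write; no fsync here

writer = threading.Thread(target=wal_writer)
writer.start()

# The "backend" fills pages and immediately goes back to useful work.
for i in range(16):
    pages.put(bytes([i]) * PAGE)

pages.put(None)
writer.join()
os.fsync(fd)                         # commit: flush whatever remains
wal_bytes = os.fstat(fd).st_size
os.close(fd)
```

In the real system the handoff would be shared memory rather than a queue, and the writer a separate process, but the division of labour is the same: filling pages is decoupled from pushing them to the kernel.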
Re: [HACKERS] Potential Large Performance Gain in WAL synching
> > ... most file systems can't process fsync's simultaneous with other
> > writes, so those writes block because the file system grabs its own
> > internal locks.
>
> Oh? That would be a serious problem, but I've never heard that asserted
> before. Please provide some evidence. On a filesystem that does have
> that kind of problem, can't you avoid it just by using O_DSYNC on the
> WAL files?

To make this competitive, the WAL writes would need to be improved to do more than one block (up to 256k or 512k per write) with one write call (if that much is to be written for this tx to be able to commit). This should actually not be too difficult since the WAL buffer is already contiguous memory.

If that is done, then I bet O_DSYNC will beat any other config we currently have. With this, a separate disk for WAL, and large transactions, you should be able to see your disks hit the max IO figures they are capable of :-)

Andreas
Re: [HACKERS] Potential Large Performance Gain in WAL synching
Zeugswetter Andreas SB SD [EMAIL PROTECTED] writes:
> To make this competitive, the WAL writes would need to be improved to do
> more than one block (up to 256k or 512k per write) with one write call
> (if that much is to be written for this tx to be able to commit). This
> should actually not be too difficult since the WAL buffer is already
> contiguous memory.

Hmmm ... if you were willing to dedicate a half meg or meg of shared memory for WAL buffers, that's doable. I was originally thinking of having the (still hypothetical) background process wake up every time a WAL page was completed and available to write. But it could be set up so that there is some slop, and it only wakes up when the number of writable pages exceeds N, for some N that's still well less than the number of buffers. Then it could write up to N sequential pages in a single write().

However, this would only be a win if you had few and large transactions. Any COMMIT will force a write of whatever we have so far, so the idea of writing hundreds of K per WAL write can only work if it's hundreds of K between commit records. Is that a common scenario? I doubt it. If you try to set it up that way, then it's more likely that what will happen is the background process seldom awakens at all, and each committer effectively becomes responsible for writing all the WAL traffic since the last commit. Wouldn't that lose compared to someone else having written the previous WAL pages in background?

We could certainly build the code to support this, though, and then experiment with different values of N. If it turns out N==1 is best after all, I don't think we'd have wasted much code.

			regards, tom lane
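The batching idea above -- accumulate at least N completed pages, then push them to the kernel with a single write() from the contiguous buffer instead of one write() per page -- might be sketched like this (N and the page size are illustrative, not PostgreSQL's actual configuration):

```python
import os
import tempfile

# Instead of N separate write() syscalls, issue one write() covering N
# contiguous WAL pages. The WAL buffer is contiguous memory, so a single
# slice of it can be handed to the kernel directly.
PAGE, N = 8192, 32                   # 32 pages = 256 kB per write

wal_buffer = bytearray(PAGE * N)
for i in range(N):                   # pretend N pages have been filled
    wal_buffer[i * PAGE:(i + 1) * PAGE] = bytes([i]) * PAGE

path = os.path.join(tempfile.mkdtemp(), "xlog-batched")
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)

n = os.write(fd, wal_buffer)         # one syscall covers all N pages
os.fsync(fd)                         # or rely on O_DSYNC on the open
os.close(fd)
```

With O_DSYNC on the descriptor this single call would also be the single wait for media, which is where the "256k or 512k per write" figure pays off.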
Re: [HACKERS] Potential Large Performance Gain in WAL synching
> Hmmm ... if you were willing to dedicate a half meg or meg of shared
> memory for WAL buffers, that's doable.

Yup, configuring Informix to three 2 Mb buffers (LOGBUF 2048) here.

> However, this would only be a win if you had few and large transactions.
> Any COMMIT will force a write of whatever we have so far, so the idea of
> writing hundreds of K per WAL write can only work if it's hundreds of K
> between commit records. Is that a common scenario? I doubt it.

It should help most for data loading, or mass updating, yes.

Andreas
Re: [HACKERS] Potential Large Performance Gain in WAL synching
Curtis Faith writes:
> I'm no Unix filesystem expert but I don't see how the OS can handle
> multiple writes and fsyncs to the same file descriptors without blocking
> other processes from writing at the same time.

Why not? Other than the necessary synchronisation for attributes such as file size and modification times, multiple processes can readily write to different areas of the same file at the same time. fsync() may not return until after the buffers it schedules are written, but it doesn't have to block subsequent writes to different buffers in the file either. (Note too Tom Lane's responses about when fsync() is used and not used.)

> I'll have to write a test and see if there really is a problem.

Please do. I expect you'll find things aren't as bad as you fear.

In another posting, you write:
> Hmm, I keep hearing that buffer block writes are asynchronous but I
> don't read that in the code at all. There are simple write calls with
> files that are not opened with O_NONBLOCK, so they'll be done
> synchronously. The code for this is relatively straightforward (once
> you get past the storage manager abstraction) so I don't see what I
> might be missing.

There is a confusion of terminology here: the write() is synchronous from the point of the application only in that the data is copied into kernel buffers (or pages remapped, or whatever) before the system call returns. For files opened with O_DSYNC the write() would wait for the data to be written to disk. Thus O_DSYNC is synchronous I/O, but there is no equivalently easy name for the regular flush to disk after write() returns that the Unix kernel has done ~forever.

The asynchronous I/O that you mention (aio) is a third thing, different from both regular write() and write() with O_DSYNC. I understand that with aio the data is not even transferred to the kernel before the aio_write() call returns, but I've never programmed with aio and am not 100% sure how it works.
Regards,

Giles
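The three flavours Giles distinguishes can be lined up in a short sketch. POSIX aio has no portable equivalent in this setting, so only the first two flavours plus O_DSYNC are shown; O_DSYNC itself is not defined on every platform, hence the hedged fallback to O_SYNC:

```python
import os
import tempfile

# Three ways for data to reach disk:
#   1. plain write():   returns once the kernel buffer holds the data;
#                       the kernel flushes to media on its own schedule.
#   2. write()+fsync(): the data is forced to media when fsync() returns.
#   3. O_DSYNC:         every write() itself waits for the data (though
#                       not necessarily metadata) to reach media.
d = tempfile.mkdtemp()
block = b"y" * 8192
DSYNC = getattr(os, "O_DSYNC", os.O_SYNC)   # fallback where O_DSYNC is absent

fd1 = os.open(os.path.join(d, "plain"), os.O_WRONLY | os.O_CREAT, 0o600)
os.write(fd1, block)
os.close(fd1)

fd2 = os.open(os.path.join(d, "fsynced"), os.O_WRONLY | os.O_CREAT, 0o600)
os.write(fd2, block)
os.fsync(fd2)
os.close(fd2)

fd3 = os.open(os.path.join(d, "dsync"),
              os.O_WRONLY | os.O_CREAT | DSYNC, 0o600)
n3 = os.write(fd3, block)
os.close(fd3)

sizes = [os.path.getsize(os.path.join(d, f))
         for f in ("plain", "fsynced", "dsync")]
```

All three leave the same bytes in the file; they differ only in when the caller is allowed to proceed relative to the media write, which is exactly the distinction that matters for WAL commit latency.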
Re: [HACKERS] Potential Large Performance Gain in WAL synching
Tom Lane wrote:
> Curtis Faith [EMAIL PROTECTED] writes:
> > After some research I still hold that fsync blocks, at least on
> > FreeBSD. Am I missing something? Here's the evidence:
> > [ much snipped ]
> >     vp = (struct vnode *)fp->f_data;
> >     vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, p);
>
> Hm, I take it a vnode is what's usually called an inode, ie the unique
> identification data for a specific disk file?

Yes, Virtual Inode. I think it is virtual because it is used for NFS, where the handle really isn't an inode.

> This is kind of ugly in general terms but I'm not sure that it really
> hurts Postgres. In our present scheme, the only files we ever fsync()
> are WAL log files, not data files. And in normal operation there is
> only one WAL writer at a time, and *no* WAL readers. So an exclusive
> kernel-level lock on a WAL file while we fsync really shouldn't create
> any problem for us. (Unless this indirectly blocks other operations
> that I'm missing?)

I think the small issue is:

    proc1           proc2
    write
                    write
    fsync
                    fsync

Proc2 has to wait for the fsync, but the write is so short compared to the fsync, I don't see an issue. Now, if someone would come up with code that did only one fsync for the above case, that would be a big win.
Re: [HACKERS] Potential Large Performance Gain in WAL synching
Curtis Faith [EMAIL PROTECTED] writes:
> So, why don't we use files opened with O_DSYNC | O_APPEND for the WAL
> log and then use aio_write for all log writes?

We already offer an O_DSYNC option. It's not obvious to me what aio_write brings to the table (aside from loss of portability). You still have to wait for the final write to complete, no?

> 2) Allow transactions to complete and do work while other threads are
> waiting on the completion of the log write.

I'm missing something. There is no useful work that a transaction can do between writing its commit record and reporting completion, is there? It has to wait for that record to hit disk.

			regards, tom lane
Re: [HACKERS] Potential Large Performance Gain in WAL synching
tom lane replies: Curtis Faith [EMAIL PROTECTED] writes: So, why don't we use files opened with O_DSYNC | O_APPEND for the WAL log and then use aio_write for all log writes? We already offer an O_DSYNC option. It's not obvious to me what aio_write brings to the table (aside from loss of portability). You still have to wait for the final write to complete, no?

Well, for starters, by the time the write that includes the commit log entry is issued, much of the transaction's log will already be on disk, or in a controller on its way there. I don't see any O_NONBLOCK or O_NDELAY references in the sources, so it looks like the log writes are blocking. If I read correctly, XLogInsert calls XLogWrite, which calls write, which blocks. If these assumptions are correct, there should be some significant gain here, but I won't know how much until I try to change it. This issue only affects the speed of a given back-end's transaction processing.

The REAL issue, and the one that will greatly affect total system throughput, is contention on the file locks. Since fsync needs to obtain a write lock on the file descriptor, as do the write calls that originate from XLogWrite, other back-ends will block while another transaction is committing whenever the log cache fills to the point where their XLogInsert triggers an XLogWrite call to flush the cache. I'd guess this means one won't gain much by adding more than three or four back-end processes if there are a lot of inserts or updates.

The method I propose does not block on any writes other than the final commit's write, and it has the very significant advantage of allowing other transactions (from other back-ends) to continue until they enter commit (and block waiting for their final commit write to complete).

2) Allow transactions to complete and do work while other threads are waiting on the completion of the log write.
I'm missing something. There is no useful work that a transaction can do between writing its commit record and reporting completion, is there? It has to wait for that record to hit disk.

The key here is that a back-end that has not committed, and therefore is not blocked, can do work while other back-ends are waiting on their commit writes. - Curtis

P.S. If I am right in my assumptions about the way the current system works, I'll bet the change would speed up inserts in Shridhar's huge database test by at least a factor of two or three, perhaps even an order of magnitude. :-)
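The shape of Curtis's proposal can be sketched as follows. Python has no aio_write, so a single background writer thread stands in for the kernel's AIO queue; the O_DSYNC flag (not available on every platform, hence the fallback) makes each write durable when it completes, so only the commit record's completion needs to be awaited. All file names here are invented:

```python
# Sketch of O_DSYNC | O_APPEND log writes issued asynchronously: the
# "backend" keeps working while earlier log records go to disk, and the
# only synchronous wait is on the final commit record.
import os
from concurrent.futures import ThreadPoolExecutor

O_DSYNC = getattr(os, "O_DSYNC", 0)     # durable-on-return where supported
fd = os.open("wal_aio.log",
             os.O_WRONLY | os.O_CREAT | os.O_APPEND | O_DSYNC)

# One worker pulls submissions in FIFO order, mimicking aio_write() on a
# sequentially appended log.
aio = ThreadPoolExecutor(max_workers=1)

pending = []
for i in range(4):
    rec = b"log record %d\n" % i
    pending.append(aio.submit(os.write, fd, rec))  # returns immediately
    # ... backend does index scans, sorts, etc. here, overlapping the I/O ...

commit = aio.submit(os.write, fd, b"commit\n")
commit.result()          # the ONLY blocking wait: the commit record itself
assert all(f.done() for f in pending)   # earlier writes finished under cover
aio.shutdown()
os.close(fd)
os.remove("wal_aio.log")
print("committed")
```

Because the worker is single and FIFO, the commit record cannot complete before the records submitted ahead of it, which is the ordering guarantee the real proposal gets from O_APPEND plus sequential AIO completion.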
Re: [HACKERS] Potential Large Performance Gain in WAL synching
Curtis Faith wrote: The method I propose does not block on any writes other than the final commit's write, and it has the very significant advantage of allowing other transactions (from other back-ends) to continue until they enter commit (and block waiting for their final commit write to complete). 2) Allow transactions to complete and do work while other threads are waiting on the completion of the log write. I'm missing something. There is no useful work that a transaction can do between writing its commit record and reporting completion, is there? It has to wait for that record to hit disk. The key here is that a back-end that has not committed, and therefore is not blocked, can do work while other back-ends are waiting on their commit writes.

I may be missing something here, but other backends don't block while one writes to WAL. Remember, we are process based, not thread based, so the write() call only blocks the one session. If you had threads, and you did a write() call that blocked other threads, I can see where your idea would be good, and where async I/O becomes an advantage.
Re: [HACKERS] Potential Large Performance Gain in WAL synching
Curtis Faith [EMAIL PROTECTED] writes: The REAL issue, and the one that will greatly affect total system throughput, is contention on the file locks. Since fsync needs to obtain a write lock on the file descriptor, as do the write calls that originate from XLogWrite, other back-ends will block while another transaction is committing if the log cache fills to the point where their XLogInsert results in an XLogWrite call to flush the log cache. But that's exactly *why* we have a log cache: to ensure we can buffer a reasonable amount of log data between XLogFlush calls. If the above scenario is really causing a problem, doesn't that just mean you need to increase wal_buffers? regards, tom lane
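Tom's suggestion amounts to a one-line configuration change; a sketch of the relevant postgresql.conf entry, with the value purely illustrative (the right number of buffers depends on workload and should be verified against the documentation for the version in use):

```
# postgresql.conf -- illustrative value only
wal_buffers = 16        # WAL disk-page buffers held in shared memory
```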
Re: [HACKERS] Potential Large Performance Gain in WAL synching
I wrote: The REAL issue, and the one that will greatly affect total system throughput, is contention on the file locks. Since fsync needs to obtain a write lock on the file descriptor, as do the write calls that originate from XLogWrite, other back-ends will block while another transaction is committing if the log cache fills to the point where their XLogInsert results in an XLogWrite call to flush the log cache. tom lane wrote: But that's exactly *why* we have a log cache: to ensure we can buffer a reasonable amount of log data between XLogFlush calls. If the above scenario is really causing a problem, doesn't that just mean you need to increase wal_buffers?

Well, when there are a lot of small transactions, the contention will come not from XLogWrite calls triggered by a full cache but from XLogWrite calls triggered by transaction commits, which happen very frequently. I think this will hurt performance badly at very high update rates. So while larger WAL caches will help when flushes are caused by a full cache, I don't think they will make any difference for the potentially more common case of flushes caused by transaction commits. - Curtis
Re: [HACKERS] Potential Large Performance Gain in WAL synching
Bruce Momjian wrote: I may be missing something here, but other backends don't block while one writes to WAL.

I don't think they'll block until they reach their own fsync or XLogWrite call while another transaction is fsync'ing. I'm no Unix filesystem expert, but I don't see how the OS can handle multiple writes and fsyncs on the same file without blocking other processes from writing at the same time. It may be that there are some clever data structures in use, but I've not seen huge praise for most of the file systems. A well-written file system could minimize this contention, but I'll bet it's there with most of the ones that PostgreSQL commonly runs on. I'll have to write a test and see if there really is a problem. - Curtis
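The test Curtis proposes could look something like this sketch: time a small write() to a file while another concurrent context is fsync'ing a large amount of dirty data in the same file. Threads stand in for back-end processes here (os.write and os.fsync release Python's GIL, and the kernel-level per-file locking in question is the same either way); the data size, sleep, and file name are illustrative guesses:

```python
# Probe whether a small write() stalls behind a concurrent fsync() of the
# same file.  A stall comparable to the fsync duration suggests the
# filesystem serializes writers and fsync on a per-file (vnode) lock.
import os
import threading
import time

PATH = "fsync_probe.dat"

def run_probe(dirty_bytes=64 << 20):
    fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    os.write(fd, b"x" * dirty_bytes)        # dirty a lot of pages first
    syncer = threading.Thread(target=os.fsync, args=(fd,))
    syncer.start()                          # fsync runs concurrently
    time.sleep(0.01)                        # let the fsync get under way
    t0 = time.monotonic()
    os.write(fd, b"small record\n")         # does this block behind fsync?
    stall = time.monotonic() - t0
    syncer.join()
    os.close(fd)
    os.remove(PATH)
    return stall

if __name__ == "__main__":
    print(f"small write stalled {run_probe() * 1000:.1f} ms")
```

A stall of a few milliseconds or more (versus microseconds on an idle file) would support the claim that fsync's exclusive vnode lock blocks concurrent writers; a near-zero stall would support Tom's skepticism. The result will vary by OS and filesystem, which is the point of running it on the platforms PostgreSQL targets.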