Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance Gain in WAL synching
Keep in mind that we support platforms without O_DSYNC. I am not sure whether there are any that don't have O_SYNC either, but I am fairly sure that we measured O_SYNC to be slower than fsync()s on some platforms. This measurement is quite understandable, since the current software does 8k writes, and the OS only has a chance to write bigger blocks in the write+fsync case. In the O_SYNC case you need to group bigger blocks yourself. (Bigger blocks are essential for maximum I/O.) I am still convinced that writing bigger blocks would allow the fastest solution. But reading the recent posts, the solution might simply be to change the current loop ("for each dirty 8k WAL buffer, write 8k") into one or two large write calls. Andreas ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large
On 6 Oct 2002, Greg Copeland wrote: On Sat, 2002-10-05 at 14:46, Curtis Faith wrote: 2) aio_write vs. normal write. Since, as you and others have pointed out, aio_write and write are both asynchronous, the issue becomes one of whether or not the copies to the file system buffers happen synchronously or not. Actually, I believe that write will be *mostly* asynchronous while aio_write will always be asynchronous. In a buffer-poor environment, I believe write will degrade into a synchronous operation. In an ideal situation, I think they will prove to be on par with one another with a slight bias toward aio_write. In less than ideal situations where buffer space is at a premium, I think aio_write will get the leg up. Browsed the web and came across this piece of text regarding a Linux-KAIO patch by Silicon Graphics... The asynchronous I/O (AIO) facility implements interfaces defined by the POSIX standard, although it has not been through formal compliance certification. This version of AIO is implemented with support from kernel modifications, and hence will be called KAIO to distinguish it from AIO facilities available from newer versions of glibc/librt. Because of the kernel support, KAIO is able to perform split-phase I/O to maximize concurrency of I/O at the device. With split-phase I/O, the initiating request (such as an aio_read) truly queues the I/O at the device as the first phase of the I/O request; a second phase of the I/O request, performed as part of the I/O completion, propagates results of the request. The results may include the contents of the I/O buffer on a read, the number of bytes read or written, and any error status. Preliminary experience with KAIO has shown over 35% improvement in database performance tests. Unit tests (which only perform I/O) using KAIO and raw I/O have been successful in achieving 93% saturation with 12 disks hung off 2 x 40 MB/s Ultra-Wide SCSI channels.
We believe that these encouraging results are a direct result of implementing a significant part of KAIO in the kernel using split-phase I/O while avoiding or minimizing the use of any globally contended locks. Well... in a worst-case scenario, it seems that aio_write does get a win. I personally would at least like to see an aio implementation and would be willing to help benchmark/validate any returns in performance. Surely if testing reflected a performance boost it would be considered for baseline inclusion?
Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large
On Mon, 2002-10-07 at 10:38, Antti Haapala wrote: Browsed the web and came across this piece of text regarding a Linux-KAIO patch by Silicon Graphics... Ya, I have read this before. The problem here is that I'm not aware of which AIO implementation on Linux is the forerunner, nor do I have any idea how its implementation or performance details differ from those of other implementations on other platforms. I know there are at least two aio efforts underway for Linux. There could yet be others. Attempting to cite specifics that only pertain to Linux, and then only to a specific implementation which may or may not be in general use, is questionable. Because of this I simply left it as saying that I believe my analysis is pessimistic. Anyone have any idea if Red Hat's Advanced Server uses KAIO or what? Preliminary experience with KAIO has shown over 35% improvement in database performance tests. Unit tests (which only perform I/O) using KAIO and raw I/O have been successful in achieving 93% saturation with 12 disks hung off 2 x 40 MB/s Ultra-Wide SCSI channels. We believe that these encouraging results are a direct result of implementing a significant part of KAIO in the kernel using split-phase I/O while avoiding or minimizing the use of any globally contended locks. The problem here is, I have no idea what they are comparing to (worst-case reads/writes, which we know PostgreSQL *mostly* isn't suffering from). If we assume that PostgreSQL's read/write operations are somewhat optimized (as it currently sounds like they are), I'd seriously doubt we'd see that big of a difference. On the other hand, I'm hoping that if an aio postgresql implementation does get done we'll see something like a 5%-10% performance boost. Even still, I have nothing to pin that on other than hope. If we do see a notable performance increase for Linux, I have no idea what it will do for other platforms. Then, there are all of the issues that Tom brought up about bloat/uglification and maintainability.
So, while I certainly do keep those remarks in my mind, I think it's best to simply encourage the effort (or something like it) and help determine where we really sit by means of empirical evidence. Greg
Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large
Greg Copeland [EMAIL PROTECTED] writes: Ya, I have read this before. The problem here is that I'm not aware of which AIO implementation on Linux is the forerunner, nor do I have any idea how its implementation or performance details differ from those of other implementations on other platforms. The implementation of AIO in 2.5 is the one by Ben LaHaise (not SGI). Not sure what the performance is like -- although it's been merged into 2.5 already, so someone can do some benchmarking. Can anyone suggest a good test? Keep in mind that glibc has had a user-space implementation for a little while (although I'd guess the performance to be unimpressive), so AIO would not be *that* kernel-version specific. Anyone have any idea if Red Hat's Advanced Server uses KAIO or what? RH AS uses Ben LaHaise's implementation of AIO, I believe. Cheers, Neil -- Neil Conway [EMAIL PROTECTED] || PGP Key ID: DB3C29FC
Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large
I sent this yesterday, but it seems not to have made it to the list... I have a couple of comments orthogonal to the present discussion. 1) It would be fairly easy to write log records over a network to a dedicated process on another system. If the other system has an uninterruptible power supply, this is about as safe as writing to disk. This would get rid of the need for any fsync on the log at all. There would be extra code needed on restart to get the end of the log from the other system, but it doesn't seem like much. I think this would be an attractive option to a lot of people. Most people have at least two systems, and the requirements on the logging system would be minimal. 2) It is also possible, with kernel modifications, to have special logging partitions where log records are written wherever the disk head currently is. Tzi-cker Chiueh and Lan Huang at Stony Brook (http://www.cs.sunysb.edu/~lanhuang/research.htm) have written this, although I don't think they have released any code. (A similar idea called WADS is mentioned in Gray and Reuter's book.) If the people at Red Hat are interested in having some added value for using PostgreSQL on Red Hat Linux, this would be one idea. It could also be used to speed up ext3 and other journaling file systems.
Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large
On Sun, 2002-10-06 at 04:03, Tom Lane wrote: Hannu Krosing [EMAIL PROTECTED] writes: Or its solution ;) as instead of predicting we just write all data to the log that is ready to be written. If we postpone writing, there will be hiccups when we suddenly discover that we need to write a whole lot of pages (fsync()) after idling the disk for some period. This part is exactly the same point that I've been proposing to solve with a background writer process. We don't need aio_write for that. The background writer can handle pushing completed WAL pages out to disk. The sticky part is trying to gang the writes for multiple transactions whose COMMIT records would fit into the same WAL page, and that WAL page isn't full yet. I just hoped that the kernel could be used as the background writer process and in the process also solve the multiple-commits-on-the-same-page problem. The rest of what you wrote seems like wishful thinking about how aio_write might behave :-(. I have no faith in it. Yeah, and the fact that there are several slightly different implementations of AIO even on Linux alone does not help. I have to test the SGI KAIO implementation for conformance with my wishful thinking ;) Perhaps you could ask around about AIO in Red Hat Advanced Server (is it the same AIO as SGI's, and how does it behave in the multiple-writes-on-the-same-page case), as you may have better links to Red Hat? -- Hannu
Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large
On Sat, 2002-10-05 at 14:46, Curtis Faith wrote: 2) aio_write vs. normal write. Since, as you and others have pointed out, aio_write and write are both asynchronous, the issue becomes one of whether or not the copies to the file system buffers happen synchronously or not. Actually, I believe that write will be *mostly* asynchronous while aio_write will always be asynchronous. In a buffer-poor environment, I believe write will degrade into a synchronous operation. In an ideal situation, I think they will prove to be on par with one another with a slight bias toward aio_write. In less than ideal situations where buffer space is at a premium, I think aio_write will get the leg up. The kernel doesn't need to know anything about platter rotation. It just needs to keep the disk write buffers full enough not to incur rotational latency. Which is why, in a buffer-poor environment, aio_write is generally preferred, as the write is still queued even if the buffer is full. That means it will be ready to begin placing writes into the buffer, all without the process having to wait. On the other hand, when using write, the process must wait. In a worst-case scenario, it seems that aio_write does get a win. I personally would at least like to see an aio implementation and would be willing to help benchmark/validate any returns in performance. Surely if testing reflected a performance boost it would be considered for baseline inclusion? Greg
Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large
Greg Copeland [EMAIL PROTECTED] writes: I personally would at least like to see an aio implementation and would be willing to help benchmark/validate any returns in performance. Surely if testing reflected a performance boost it would be considered for baseline inclusion? It'd be considered, but whether it'd be accepted would have to depend on the size of the performance boost, its portability (how many platforms/scenarios do you actually get a boost for), and the extent of bloat/uglification of the code. I can't personally get excited about something that only helps if your server is starved for RAM --- who runs servers that aren't fat on RAM anymore? But give it a shot if you like. Perhaps your analysis is pessimistic. regards, tom lane
Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large
On Sun, 2002-10-06 at 11:46, Tom Lane wrote: I can't personally get excited about something that only helps if your server is starved for RAM --- who runs servers that aren't fat on RAM anymore? But give it a shot if you like. Perhaps your analysis is pessimistic. I do suspect my analysis is somewhat pessimistic too, but to what degree, I have no idea. You make a good case on your memory argument but please allow me to further kick it around. I don't find it far-fetched to imagine situations where people may commit large amounts of memory for the database yet marginally starve available memory for file system buffers. Especially so on heavily I/O-bound systems or where sporadically other types of non-database file activity may occur. Now, while I continue to assure myself that it is not far-fetched, I honestly have no idea how often this type of situation will typically occur. Of course, that opens the door for simply adding more memory and/or slightly reducing the amount of memory available to the database (thus making it available elsewhere). Now, after all that's said and done, having something like aio in use would seemingly allow it to be somewhat more self-tuning from a potential performance perspective. Greg
Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance Gain in WAL synching
Curtis Faith [EMAIL PROTECTED] writes: Assume Transaction A which writes a lot of buffers and XLog entries, so the Commit forces a relatively lengthy fsync. Transactions B - E block not on the kernel lock from fsync but on the WALWriteLock. You are confusing WALWriteLock with WALInsertLock. A transaction-committing flush operation only holds the former. XLogInsert only needs the latter --- at least as long as it doesn't need to write. Thus, given adequate space in the WAL buffers, transactions B-E do not get blocked by someone else who is writing/syncing in order to commit. Now, as the code stands at the moment there is no event other than commit or full-buffers that prompts a write; that means that we are likely to run into the full-buffer case more often than is good for performance. But a background writer task would fix that. Back-end servers would not issue fsync calls. They would simply block waiting until the LogWriter had written their record to the disk, i.e. until the sync'd block # was greater than the block that contained the XLOG_XACT_COMMIT record. The LogWriter could wake up committed back-ends after its log write returns. This will pessimize performance except in the case where WAL traffic is very heavy, because it means you don't commit until the block containing your commit record is filled. What if you are the only active backend? My view of this is that backends would wait for the background writer only when they encounter a full-buffer situation, or indirectly when they are trying to do a commit write and the background guy has the WALWriteLock. The latter serialization is unavoidable: in that scenario, the background guy is writing/flushing an earlier page of the WAL log, and we *must* have that down to disk before we can declare our transaction committed. So any scheme that tries to eliminate the serialization of WAL writes will fail.
I do not, however, see any value in forcing all the WAL writes to be done by a single process, which is essentially what you're saying we should do. That just adds extra process-switch overhead that we don't really need. The log file would be opened O_DSYNC, O_APPEND every time. Keep in mind that we support platforms without O_DSYNC. I am not sure whether there are any that don't have O_SYNC either, but I am fairly sure that we measured O_SYNC to be slower than fsync()s on some platforms. The nice part is that the WALWriteLock semantics could be changed to allow the LogWriter to write to disk while WALWriteLocks are acquired by back-end servers. As I said, we already have that; you are confusing WALWriteLock with WALInsertLock. Many transactions would commit on the same fsync (now really a write with O_DSYNC) and we would get optimal write throughput for the log system. How are you going to avoid pessimizing the few-transactions case? regards, tom lane
Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance Gain in WAL synching
You are confusing WALWriteLock with WALInsertLock. A transaction-committing flush operation only holds the former. XLogInsert only needs the latter --- at least as long as it doesn't need to write. Well, that makes things better than I thought. We still end up with a disk write for each transaction, though, and I don't see how this can ever get better than (Disk RPM)/60 transactions per second, since commit fsyncs are serialized. Every fsync will have to wait almost a full revolution to reach the end of the log. As a practical matter, then, everyone will use commit_delay to improve this. This will pessimize performance except in the case where WAL traffic is very heavy, because it means you don't commit until the block containing your commit record is filled. What if you are the only active backend? We could handle this using a mechanism analogous to the current commit delay. If there are more than commit_siblings other processes running, then do the write automatically after commit_delay microseconds. This would make things no more pessimistic than the current implementation but provide the additional benefit of allowing the LogWriter to write in optimal sizes if there are many transactions. The commit_delay method won't be as good in many cases. Consider an update scenario where a larger commit delay gives better throughput. A given transaction will flush after commit_delay microseconds. The delay is very unlikely to result in a scenario where the dirty log buffers are the optimal size. As a practical matter I think this would tend to make the writes larger than they would otherwise have been and this would unnecessarily delay the commit on the transaction. I do not, however, see any value in forcing all the WAL writes to be done by a single process, which is essentially what you're saying we should do. That just adds extra process-switch overhead that we don't really need.
I don't think that an fsync will ever NOT cause the process to get switched out, so I don't see how another process doing the write would result in more overhead. The fsync'ing process will block on the fsync, so there will always be at least one process switch (probably many) while waiting for the fsync to complete, since we are talking many milliseconds for the fsync in every case. The log file would be opened O_DSYNC, O_APPEND every time. Keep in mind that we support platforms without O_DSYNC. I am not sure whether there are any that don't have O_SYNC either, but I am fairly sure that we measured O_SYNC to be slower than fsync()s on some platforms. Well, there is no reason that the logwriter couldn't be doing fsyncs instead of O_DSYNC writes in those cases. I'd leave this switchable using the current flags. Just change the semantics a bit. - Curtis
Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance
In particular, it would seriously degrade performance if the WAL file isn't on its own spindle but has to share bandwidth with data file access. If the OS is stupid I could see this happening. But if there are buffers and some sort of elevator algorithm the I/O won't happen at bad times. I agree with you, though, that writing for every single insert probably does not make sense. There should be some blocking of writes. The optimal size would have to be derived empirically. What we really want, of course, is a write on every revolution where there's something worth writing --- either we've filled a WAL block or there is a commit pending. But that just gets us back into the same swamp of how-do-you-guess-whether-more-commits-will-arrive-soon. I don't see how an extra process makes that problem any easier. The whole point of the extra process handling all the writes is so that it can write on every revolution, if there is something to write. It doesn't need to care if more commits will arrive soon. BTW, it would seem to me that aio_write() buys nothing over plain write() in terms of ability to gang writes. If we issue the write at time T and it completes at T+X, we really know nothing about exactly when in that interval the data was read out of our WAL buffers. We cannot assume that commit records that were stored into the WAL buffer during that interval got written to disk. Why would we need to make that assumption? The only thing we'd need to know is that a given write succeeded, meaning that commits before that write are done. The advantage to aio_write in this scenario is when writes cross track boundaries or when the head is in the wrong spot. If we write in reasonable blocks with aio_write the write might get to the disk before the head passes the location for the write. Consider a scenario where: Head is at file offset 10,000. Log contains blocks 12,000 - 12,500 ...time passes...
Head is now at 12,050 Commit occurs writing block 12,501 In the aio_write case the write would already have been done for blocks 12,000 to 12,050 and would be queued up for some additional blocks up to potentially 12,500. So the write for the commit could occur without an additional rotation delay. We are talking 8.5 to 20 milliseconds delay for this rotation on a single disk. I don't know how often this happens in actual practice but it might occur as often as every other time. - Curtis
Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance
Curtis Faith wrote: Back-end servers would not issue fsync calls. They would simply block waiting until the LogWriter had written their record to the disk, i.e. until the sync'd block # was greater than the block that contained the XLOG_XACT_COMMIT record. The LogWriter could wake up committed back-ends after its log write returns. The log file would be opened O_DSYNC, O_APPEND every time. The LogWriter would issue writes of the optimal size when enough data was present or of smaller chunks if enough time had elapsed since the last write. So every backend is going to wait around until its fsync gets done by the backend process? How is that a win? This is just another version of our GUC parameters: #commit_delay = 0 # range 0-10, in microseconds #commit_siblings = 5 # range 1-1000 which attempt to delay fsync if other backends are nearing commit. Pushing things out to another process isn't a win; figuring out if someone else is coming for commit is. Remember, write() is fast, fsync is slow. -- Bruce Momjian | http://candle.pha.pa.us | [EMAIL PROTECTED] | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance
pgman wrote: Curtis Faith wrote: Back-end servers would not issue fsync calls. They would simply block waiting until the LogWriter had written their record to the disk, i.e. until the sync'd block # was greater than the block that contained the XLOG_XACT_COMMIT record. The LogWriter could wake up committed back-ends after its log write returns. The log file would be opened O_DSYNC, O_APPEND every time. The LogWriter would issue writes of the optimal size when enough data was present or of smaller chunks if enough time had elapsed since the last write. So every backend is going to wait around until its fsync gets done by the backend process? How is that a win? This is just another version of our GUC parameters: #commit_delay = 0 # range 0-10, in microseconds #commit_siblings = 5 # range 1-1000 which attempt to delay fsync if other backends are nearing commit. Pushing things out to another process isn't a win; figuring out if someone else is coming for commit is. Remember, write() is fast, fsync is slow. Let me add to what I just said: While the above idea doesn't win for normal operation, because each backend waits for the fsync and we have no good way of determining whether other backends are nearing commit, a background WAL fsync process would be nice if we wanted an option between fsync on (wait for fsync before reporting commit) and fsync off (no crash recovery). We could have a mode where we did an fsync every X milliseconds, so we issue a COMMIT to the client but wait a few milliseconds before fsync'ing. Many other databases have such a mode, but we don't, and I always felt it would be valuable. It may allow us to remove the fsync option in favor of one that has _some_ crash recovery.
Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large
Bruce Momjian wrote on Sat, 05.10.2002 at 13:49: Curtis Faith wrote: Back-end servers would not issue fsync calls. They would simply block waiting until the LogWriter had written their record to the disk, i.e. until the sync'd block # was greater than the block that contained the XLOG_XACT_COMMIT record. The LogWriter could wake up committed back-ends after its log write returns. The log file would be opened O_DSYNC, O_APPEND every time. The LogWriter would issue writes of the optimal size when enough data was present or of smaller chunks if enough time had elapsed since the last write. So every backend is going to wait around until its fsync gets done by the backend process? How is that a win? This is just another version of our GUC parameters: #commit_delay = 0 # range 0-10, in microseconds #commit_siblings = 5 # range 1-1000 which attempt to delay fsync if other backends are nearing commit. Pushing things out to another process isn't a win; figuring out if someone else is coming for commit is. Exactly. If I understand correctly what Curtis is proposing, you don't have to figure it out under his scheme - you just issue a WALWait command and the WAL-writing process notifies you when your transaction's WAL is in safe storage. If the other committer was able to get his WALWait in before the actual write took place, it will be notified too; if not, it will be notified about 1/166th sec later (for a 10K rpm disk) when its write is done on the next rev of the disk platters. The writer process should just issue a continuous stream of aio_write()'s while there are any waiters and keep track of which waiters are safe to continue - thus no guessing of who's gonna commit. If supported by the platform this should use zero-copy writes - it should be safe because WAL is append-only. --- Hannu
Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance Gain in WAL synching
Bruce Momjian wrote: So every backend is going to wait around until its fsync gets done by the backend process? How is that a win? This is just another version of our GUC parameters: #commit_delay = 0 # range 0-10, in microseconds #commit_siblings = 5 # range 1-1000 which attempt to delay fsync if other backends are nearing commit. Pushing things out to another process isn't a win; figuring out if someone else is coming for commit is. It's not the same at all. My proposal makes two extremely important changes from a performance perspective. 1) WALWriteLocks are never held by processes for lengthy transactions, only for long enough to copy the log entry into the buffer. This means real work can be done by other processes while a transaction is waiting for its commit to finish. I'm sure that blocking on XLogInsert because another transaction is performing an fsync is extremely common with frequent-update scenarios. 2) The log is written using optimal write sizes, which is much better than a user-defined guess of the microseconds to delay the fsync. We should be able to get the bottleneck to be the maximum write throughput of the disk with the modifications to Tom Lane's scheme I proposed. Remember, write() is fast, fsync is slow. Okay, it's clear I missed the point about Unix write earlier :-) However, it's not just saving fsyncs that we need to worry about. It's the unnecessary blocking of other processes that are simply trying to append some log records in the course of whatever updating or inserting they are doing. They may be a long way from commit. fsync being slow is the whole reason for not wanting to have exclusive locks held for the duration of an fsync. On an SMP machine this change alone would probably speed things up by an order of magnitude (assuming there aren't any other similar locks causing the same problem). - Curtis
Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance Gain in WAL synching
Tom Lane [EMAIL PROTECTED] writes: Curtis Faith [EMAIL PROTECTED] writes: The log file would be opened O_DSYNC, O_APPEND every time. Keep in mind that we support platforms without O_DSYNC. I am not sure whether there are any that don't have O_SYNC either, but I am fairly sure that we measured O_SYNC to be slower than fsync()s on some platforms. And don't we preallocate WAL files anyway? So O_APPEND would be irrelevant? -Doug
Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance
Hannu Krosing [EMAIL PROTECTED] writes: The writer process should just issue a continuous stream of aio_write()'s while there are any waiters and keep track of which waiters are safe to continue - thus no guessing of who's gonna commit. This recipe sounds like it will eat I/O bandwidth whether we need it or not. It might be optimal in the case where activity is so heavy that we do actually need a WAL write on every disk revolution, but in any scenario where we're not maxing out the WAL disk's bandwidth, it will hurt performance. In particular, it would seriously degrade performance if the WAL file isn't on its own spindle but has to share bandwidth with data file access. What we really want, of course, is a write on every revolution where there's something worth writing --- either we've filled a WAL block or there is a commit pending. But that just gets us back into the same swamp of how-do-you-guess-whether-more-commits-will-arrive-soon. I don't see how an extra process makes that problem any easier. BTW, it would seem to me that aio_write() buys nothing over plain write() in terms of ability to gang writes. If we issue the write at time T and it completes at T+X, we really know nothing about exactly when in that interval the data was read out of our WAL buffers. We cannot assume that commit records that were stored into the WAL buffer during that interval got written to disk. The only safe assumption is that only records that were in the buffer at time T are down to disk; and that means that late arrivals lose. You can't issue aio_write immediately after the previous one completes and expect that this optimizes performance --- you have to delay it as long as you possibly can in hopes that more commit records arrive. So it comes down to being the same problem.
regards, tom lane
Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance
Curtis Faith wrote: The advantage to aio_write in this scenario is when writes cross track boundaries or when the head is in the wrong spot. If we write in reasonable blocks with aio_write the write might get to the disk before the head passes the location for the write. Consider a scenario where: Head is at file offset 10,000. Log contains blocks 12,000 - 12,500 ..time passes.. Head is now at 12,050 Commit occurs writing block 12,501 In the aio_write case the write would already have been done for blocks 12,000 to 12,050 and would be queued up for some additional blocks up to potentially 12,500. So the write for the commit could occur without an additional rotation delay. We are talking 8.5 to 20 milliseconds delay for this rotation on a single disk. I don't know how often this happens in actual practice but it might occur as often as every other time. So, you are saying that we may get back aio confirmation quicker than if we issued our own write/fsync because the OS was able to slip our flush to disk in as part of someone else's or a general fsync? I don't buy that because it is possible our write() gets in as part of someone else's fsync and our fsync becomes a no-op, meaning there aren't any dirty buffers for that file. Isn't that also possible? Also, remember the kernel doesn't know where the platter rotation is either. Only the SCSI drive can reorder the requests to match this. The OS can group based on head location, but it doesn't know much about the platter location, and it doesn't even know where the head is. Also, does aio return info when the data is in the kernel buffers or when it is actually on the disk? Simply, aio allows us to do the write and get notification when it is complete. I don't see how that helps us, and I don't see any other advantages to aio. To use aio, we need to find something that _can't_ be solved with more traditional Unix API's, and I haven't seen that yet. This aio thing is getting out of hand.
It's like we have a hammer, and everything looks like a nail, or a use for aio. -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073
Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance
So, you are saying that we may get back aio confirmation quicker than if we issued our own write/fsync because the OS was able to slip our flush to disk in as part of someone else's or a general fsync? I don't buy that because it is possible our write() gets in as part of someone else's fsync and our fsync becomes a no-op, meaning there aren't any dirty buffers for that file. Isn't that also possible? Separate out the two concepts: 1) Writing of incomplete transactions at the block level by a background LogWriter. I think it doesn't matter whether the write is aio_write or write, writing blocks when we get them should provide the benefit I outlined. Waiting till fsync could miss the opportunity to write before the head passes the end of the last durable write because the drive buffers might empty, causing up to a full rotation's delay. 2) aio_write vs. normal write. Since as you and others have pointed out aio_write and write are both asynchronous, the issue becomes one of whether or not the copies to the file system buffers happen synchronously or not. This is not a big difference but it seems to me that the OS might be able to avoid some context switches by grouping copying in the case of aio_write. I've heard anecdotal reports that this is significantly faster for some things but I don't know for certain. Also, remember the kernel doesn't know where the platter rotation is either. Only the SCSI drive can reorder the requests to match this. The OS can group based on head location, but it doesn't know much about the platter location, and it doesn't even know where the head is. The kernel doesn't need to know anything about platter rotation. It just needs to keep the disk write buffers full enough not to cause a rotational latency. It's not so much a matter of reordering as it is of getting the data into the SCSI drive before the head passes the last write's position. If the SCSI drive's buffers are kept full it can continue writing at its full throughput.
If the writes stop and the buffers empty it will need to wait up to a full rotation before it gets to the end of the log again. Also, does aio return info when the data is in the kernel buffers or when it is actually on the disk? Simply, aio allows us to do the write and get notification when it is complete. I don't see how that helps us, and I don't see any other advantages to aio. To use aio, we need to find something that _can't_ be solved with more traditional Unix API's, and I haven't seen that yet. This aio thing is getting out of hand. It's like we have a hammer, and everything looks like a nail, or a use for aio. Yes, while I think it's probably worth doing and faster, it won't help as much as just keeping the drive buffers full even if that's by using write calls. I still don't understand the opposition to aio_write. Could we just have the configuration setup determine whether one or the other is used? I don't see why we wouldn't use the faster calls if they were present and reliable on a given system. - Curtis ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance
Curtis Faith wrote: So, you are saying that we may get back aio confirmation quicker than if we issued our own write/fsync because the OS was able to slip our flush to disk in as part of someone else's or a general fsync? I don't buy that because it is possible our write() gets in as part of someone else's fsync and our fsync becomes a no-op, meaning there aren't any dirty buffers for that file. Isn't that also possible? Separate out the two concepts: 1) Writing of incomplete transactions at the block level by a background LogWriter. I think it doesn't matter whether the write is aio_write or write, writing blocks when we get them should provide the benefit I outlined. Waiting till fsync could miss the opportunity to write before the head passes the end of the last durable write because the drive buffers might empty causing up to a full rotation's delay. No question about that! The sooner we can get stuff to the WAL buffers, the more likely we will get some other transaction to do our fsync work. Any ideas on how we can do that? 2) aio_write vs. normal write. Since as you and others have pointed out aio_write and write are both asynchronous, the issue becomes one of whether or not the copies to the file system buffers happen synchronously or not. This is not a big difference but it seems to me that the OS might be able to avoid some context switches by grouping copying in the case of aio_write. I've heard anecdotal reports that this is significantly faster for some things but I don't know for certain. I suppose it is possible, but because we spend so much time in fsync, we want to focus on that. People have recommended mmap of the WAL file, and that seems like a much more direct way to handle it rather than aio. 
However, we can't control when the stuff gets sent to disk with mmap'ed WAL, or should I say we can't write to it and withhold writes to the disk file with mmap, so we would need some intermediate step, and then again, it just becomes more steps, and extra steps slow things down too. This aio thing is getting out of hand. It's like we have a hammer, and everything looks like a nail, or a use for aio. Yes, while I think it's probably worth doing and faster, it won't help as much as just keeping the drive buffers full even if that's by using write calls. I still don't understand the opposition to aio_write. Could we just have the configuration setup determine whether one or the other is used? I don't see why we wouldn't use the faster calls if they were present and reliable on a given system. We hesitate to add code relying on new features unless it is a significant win, and in the aio case, we would have different WAL disk write models for with/without aio, so it clearly could be two code paths, and with two code paths, we can't as easily improve or optimize. If we get a 2% boost out of some feature, but it later discourages us from adding a 5% optimization, it is a loss. And, in most cases, the 2% optimization is for a few platforms, while the 5% optimization is for all. This code is +15 years old, so we are looking way down the road, not just for today's hot feature. For example, Tom just improved DISTINCT by 25% by optimizing some of the sorting and function call handling. If we had more complex threaded sort code, that may not have been possible, or it may have been possible for him to optimize only one of the code paths. I can't tell you how many aio/mmap/fancy feature discussions we have had, and we obviously discuss them, but in the end, they end up being of questionable value for the risk/complexity; but, we keep talking, hoping we are wrong or some good ideas come out of it.
-- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073
Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large
On Sat, 2002-10-05 at 20:32, Tom Lane wrote: Hannu Krosing [EMAIL PROTECTED] writes: The writer process should just issue a continuous stream of aio_write()'s while there are any waiters and keep track of which waiters are safe to continue - thus no guessing of who's gonna commit. This recipe sounds like "eat I/O bandwidth whether we need it or not." It might be optimal in the case where activity is so heavy that we do actually need a WAL write on every disk revolution, but in any scenario where we're not maxing out the WAL disk's bandwidth, it will hurt performance. In particular, it would seriously degrade performance if the WAL file isn't on its own spindle but has to share bandwidth with data file access. What we really want, of course, is a write on every revolution where there's something worth writing --- either we've filled a WAL block or there is a commit pending. That's what I meant by "while there are any waiters". But that just gets us back into the same swamp of how-do-you-guess-whether-more-commits-will-arrive-soon. I don't see how an extra process makes that problem any easier. I still think that we could get gang writes automatically, if we just ask for aio_write at completion of each WAL file page and keep track of those that are written. We could also keep track of write position inside the WAL page for 1. end of last write() of each process 2. WAL file's write position at each aio_write() Then we can safely(?) assume that each backend wants only its own write()'s to be on disk before it can assume the trx has committed. If the fsync()-like request comes in at a time when the aio_write for that process's last position has completed, we can let that process continue without even a context switch. In the above scenario I assume that the kernel can do the right thing by doing multiple aio_write requests for the same page in one sweep and not doing one physical write for each aio_write.
BTW, it would seem to me that aio_write() buys nothing over plain write() in terms of ability to gang writes. If we issue the write at time T and it completes at T+X, we really know nothing about exactly when in that interval the data was read out of our WAL buffers. Yes, most likely. If we do several write's of the same pages they will hit physical disk at the same physical write. We cannot assume that commit records that were stored into the WAL buffer during that interval got written to disk. The only safe assumption is that only records that were in the buffer at time T are down to disk; and that means that late arrivals lose. I assume that if each commit record issues an aio_write, then all of those which actually reached the disk will be notified. IOW the first aio_write orders the write, but all the latecomers which arrive before the actual write will also get written and notified. You can't issue aio_write immediately after the previous one completes and expect that this optimizes performance --- you have to delay it as long as you possibly can in hopes that more commit records arrive. I guess we have quite different cases for different hardware configurations - if we have a separate disk subsystem for WAL, we may want to keep the log flowing to disk as fast as it is ready, including the writing of the last, partial page as often as new writes to it are done - as we possibly can't write more than ~250 times/sec (with 15K drives, no battery RAM) we will always have at least two context switches between writes (for a 500Hz context switch clock), and many more if processes background themselves while waiting for small transactions to commit. So it comes down to being the same problem. Or its solution ;) as instead of predicting we just write all data in the log that is ready to be written. If we postpone writing, there will be hiccups when we suddenly discover that we need to write a whole lot of pages (fsync()) after idling the disk for some period.
--- Hannu ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance
No question about that! The sooner we can get stuff to the WAL buffers, the more likely we will get some other transaction to do our fsync work. Any ideas on how we can do that? More like the sooner we get stuff out of the WAL buffers and into the disk's buffers, whether by write or aio_write. It doesn't do any good to have information in the XLog unless it gets written to the disk buffers before they empty. We hesitate to add code relying on new features unless it is a significant win, and in the aio case, we would have different WAL disk write models for with/without aio, so it clearly could be two code paths, and with two code paths, we can't as easily improve or optimize. If we get a 2% boost out of some feature, but it later discourages us from adding a 5% optimization, it is a loss. And, in most cases, the 2% optimization is for a few platforms, while the 5% optimization is for all. This code is +15 years old, so we are looking way down the road, not just for today's hot feature. I'll just have to implement it and see if it's as easy and isolated as I think it might be and would allow the same algorithm for aio_write or write. I can't tell you how many aio/mmap/fancy feature discussions we have had, and we obviously discuss them, but in the end, they end up being of questionable value for the risk/complexity; but, we keep talking, hoping we are wrong or some good ideas come out of it. I'm all in favor of keeping clean designs. I'm very pleased with how easy PostgreSQL is to read and understand given how much it does. - Curtis
Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance
Curtis Faith wrote: No question about that! The sooner we can get stuff to the WAL buffers, the more likely we will get some other transaction to do our fsync work. Any ideas on how we can do that? More like the sooner we get stuff out of the WAL buffers and into the disk's buffers, whether by write or aio_write. Does aio_write just write, or write _and_ fsync()? It doesn't do any good to have information in the XLog unless it gets written to the disk buffers before they empty. Just for clarification, we have two issues in this thread: 1) WAL memory buffers fill up, forcing a WAL write; 2) multiple commits at the same time force too many fsync's. I just wanted to throw that out. I can't tell you how many aio/mmap/fancy feature discussions we have had, and we obviously discuss them, but in the end, they end up being of questionable value for the risk/complexity; but, we keep talking, hoping we are wrong or some good ideas come out of it. I'm all in favor of keeping clean designs. I'm very pleased with how easy PostgreSQL is to read and understand given how much it does. Glad you see the situation we are in. ;-) -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073
Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance
Hannu Krosing [EMAIL PROTECTED] writes: Or its solution ;) as instead of predicting we just write all data in the log that is ready to be written. If we postpone writing, there will be hiccups when we suddenly discover that we need to write a whole lot of pages (fsync()) after idling the disk for some period. This part is exactly the same point that I've been proposing to solve with a background writer process. We don't need aio_write for that. The background writer can handle pushing completed WAL pages out to disk. The sticky part is trying to gang the writes for multiple transactions whose COMMIT records would fit into the same WAL page, and that WAL page isn't full yet. The rest of what you wrote seems like wishful thinking about how aio_write might behave :-(. I have no faith in it. regards, tom lane ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
[HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance Gain in WAL synching
It appears the fsync problem is pervasive. Here's Linux 2.4.19's version from fs/buffer.c:

    down(&inode->i_sem);
    ret = filemap_fdatasync(inode->i_mapping);
    err = file->f_op->fsync(file, dentry, 1);
    if (err && !ret)
        ret = err;
    err = filemap_fdatawait(inode->i_mapping);
    if (err && !ret)
        ret = err;
    up(&inode->i_sem);

But this is probably not a big factor as you outline below because the WALWriteLock is causing the same kind of contention. Tom Lane wrote: This is kind of ugly in general terms but I'm not sure that it really hurts Postgres. In our present scheme, the only files we ever fsync() are WAL log files, not data files. And in normal operation there is only one WAL writer at a time, and *no* WAL readers. So an exclusive kernel-level lock on a WAL file while we fsync really shouldn't create any problem for us. (Unless this indirectly blocks other operations that I'm missing?) I hope you're right but I see some very similar contention problems in the case of many small transactions because of the WALWriteLock. Assume Transaction A which writes a lot of buffers and XLog entries, so the Commit forces a relatively lengthy fsync. Transactions B - E block not on the kernel lock from fsync but on the WALWriteLock. When A finishes the fsync and subsequently releases the WALWriteLock, B unblocks and gets the WALWriteLock for its fsync for the flush. C blocks on the WALWriteLock waiting to write its XLOG_XACT_COMMIT. B releases and now C writes its XLOG_XACT_COMMIT. There now seems to be a lot of contention on the WALWriteLock. This is a shame for a system that has no locking at the logical level and therefore seems like it could be very, very fast and offer incredible concurrency. As I commented before, I think we could do with an extra process to issue WAL writes in places where they're not in the critical path for a foreground process. But that seems to be orthogonal from this issue. It's only orthogonal to the fsync-specific contention issue.
We now have to worry that the WALWriteLock semantics cause the same contention. Your idea of a separate LogWriter process could very nicely solve this problem and accomplish a few other things at the same time if we make a few enhancements. Back-end servers would not issue fsync calls. They would simply block waiting until the LogWriter had written their record to the disk, i.e. until the sync'd block # was greater than the block that contained the XLOG_XACT_COMMIT record. The LogWriter could wake up committed back-ends after its log write returns. The log file would be opened O_DSYNC, O_APPEND every time. The LogWriter would issue writes of the optimal size when enough data was present or of smaller chunks if enough time had elapsed since the last write. The nice part is that the WALWriteLock semantics could be changed to allow the LogWriter to write to disk while WALWriteLocks are acquired by back-end servers. WALWriteLocks would only be held for the brief time needed to copy the entries into the log buffer. The LogWriter would only need to grab a lock to determine the current end of the log buffer. Since it would be writing blocks that occur earlier in the cache than the XLogInsert log writers it won't need to grab a WALWriteLock before writing the cache buffers. Many transactions would commit on the same fsync (now really a write with O_DSYNC) and we would get optimal write throughput for the log system. This would handle all the issues I had and it doesn't sound like a huge change. In fact, it ends up being almost semantically identical to the aio_write suggestion I made originally, except the LogWriter is doing the background writing instead of the OS and we don't have to worry about aio implementations and portability. - Curtis