Re: [HACKERS] RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)
Added to TODO: * Delay fsync() when other backends are about to commit too BUT, do we know for sure that sleep(0) is not optimized in the library to just return? We can only do our best here. I think guessing whether other backends are _about_ to commit is pretty shaky, and sleeping every time is a waste. This seems the cleanest. Long ago you, Bruce, gave me a gift - a book about transaction processing (thanks again -:)). This sleeping before fsync in commit is described there as a standard technique. And the reason is clear. Man, the cost of fsync is very high! { write (64 bytes) + fsync() } takes ~ 1/50 sec. Yes, an additional 1/200 sec or so results in worse performance when there is only one backend running, but it greatly increases overall performance for 100 simultaneous backends. I.e., this delay is a trade-off to gain better scalability. I agree that it must be configurable, smaller or probably 0 by default, use the approximate # of simultaneously running backends for guessing (postmaster could maintain this number in shmem and backends could just read it without any locking - the exact number is not required), and be well described as a tuning parameter in the documentation. Anyway, I object to sleep(0). Vadim -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup.| Drexel Hill, Pennsylvania 19026
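Vadim's figures can be turned into a rough back-of-the-envelope model. The sketch below assumes an idealized case that nobody in the thread claims to have measured: every commit in a group waits out the same delay and then shares exactly one fsync. The function name `commits_per_second` and the model itself are illustrative assumptions.

```c
/*
 * Idealized model (an assumption, not a measurement): `group_size` commits
 * share a single fsync, and each group costs fsync_ms + delay_ms of wall time.
 * Returns the resulting commits per second.
 */
double commits_per_second(int group_size, double fsync_ms, double delay_ms)
{
    return group_size * 1000.0 / (fsync_ms + delay_ms);
}
```

With fsync_ms = 20 (1/50 sec) and delay_ms = 5 (1/200 sec), a single backend drops from 50 to 40 commits per second in this model, while 100 backends sharing one fsync would reach an upper bound of 4000. Real numbers would be lower, since not every backend commits inside the same window.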
Re: [HACKERS] RE: [COMMITTERS] pgsql/src/backend/access/transam (xact.c xlog.c)
Ok, so with CHECKPOINTS, we could move the offline log files to somewhere else so that we could archive them, in my understanding. Now question is, how we could recover from disaster like losing every table files except log files. Can we do this with WAL? If so, how can we do it? Not currently. WAL based BAR is required. I think there will be no BAR in 7.1, but it may be added in 7.1.X (no initdb will be required). Anyway BAR implementation is not in my plans. All in your hands, guys -:) Vadim Can I ask what BAR is? Backup And Restore. Vadim
Re: [HACKERS] RE: [COMMITTERS] pgsql/src/backend/access/transam (xact.c xlog.c)
At 07:05 PM 11/19/00 +0100, [EMAIL PROTECTED] wrote: Can I ask what BAR is? Backup and recovery, presumably... - Don Baccus, Portland OR [EMAIL PROTECTED] Nature photos, on-line guides, Pacific Northwest Rare Bird Alert Service and other goodies at http://donb.photo.net.
Re: [HACKERS] RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)
* Bruce Momjian [EMAIL PROTECTED] [001117 11:39]: * Bruce Momjian [EMAIL PROTECTED] [001117 11:23]: sleep(3) should conform to POSIX specification, if anyone has the reference they can check it to see what the effect of sleep(0) should be. Yes, but Posix also specifies sched_yield() which rather explicitly allows a process to yield its timeslice. No idea how well that is supported. I have it on BSDI. We could add a configure check, and use it if it is there. Another idea is to add a shared memory flag when someone enters the 'commit' section of the transaction code. That way, a backend could check to see if another process is _about_ to commit, and wait. On UnixWare, it requires the -Kthread or -Kpthread command, which then links in the threads library... I'm not sure that this is a good thing or not I would hope it just calls the function, and does not bring in thread startup stuff. I suspect it DOES bring in the thread startup and all that implies... Tread lightly. The good news is UnixWare Threads are LWP's and the kernel is multithreaded... LER -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup.| Drexel Hill, Pennsylvania 19026 -- Larry Rosenman http://www.lerctr.org/~ler Phone: +1 972-414-9812 (voice) Internet: [EMAIL PROTECTED] US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749
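The "configure check, and use it if it is there" idea could look roughly like the sketch below. `HAVE_SCHED_YIELD` is a hypothetical macro that such a configure test would define, and `yield_before_flush` is an illustrative name; neither comes from the thread. Only `sched_yield()` itself is the real POSIX call.

```c
#ifdef HAVE_SCHED_YIELD          /* hypothetical macro set by a configure check */
#include <sched.h>
#endif

/*
 * Give other runnable processes a chance before flushing.
 * Returns 0 on success, or when no yield primitive is available.
 */
int yield_before_flush(void)
{
#ifdef HAVE_SCHED_YIELD
    return sched_yield();        /* POSIX: yields the processor */
#else
    return 0;                    /* no-op fallback */
#endif
}
```

Guarding the `#include` as well as the call keeps the fallback path free of any dependency on the threads library, which is the concern raised for UnixWare above.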
Re: [HACKERS] RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)
At 02:13 PM 11/16/00 -0500, Bruce Momjian wrote: I think the default should probably be no delay, and the documentation on enabling this needs to be clear and obvious (i.e. hard to miss). I just talked to Tom Lane about this. I think a sleep(0) just before the flush would be the best. It would relinquish the cpu slice if another process is ready to run. If no other backend is running, it probably just returns. If there is another one, it gives it a chance to complete. On return from sleep(0), it can check if it still needs to flush. This would tend to bunch up flushers so they flush only once, while not delaying cases where only one backend is running. This sounds like an interesting approach, yes. - Don Baccus, Portland OR [EMAIL PROTECTED] Nature photos, on-line guides, Pacific Northwest Rare Bird Alert Service and other goodies at http://donb.photo.net.
Re: [HACKERS] RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)
* Don Baccus [EMAIL PROTECTED] [001116 13:46]: At 02:13 PM 11/16/00 -0500, Bruce Momjian wrote: I think the default should probably be no delay, and the documentation on enabling this needs to be clear and obvious (i.e. hard to miss). I just talked to Tom Lane about this. I think a sleep(0) just before the flush would be the best. It would relinquish the cpu slice if another process is ready to run. If no other backend is running, it probably just returns. If there is another one, it gives it a chance to complete. On return from sleep(0), it can check if it still needs to flush. This would tend to bunch up flushers so they flush only once, while not delaying cases where only one backend is running. This sounds like an interesting approach, yes. Question: Is sleep(0) guaranteed to at least give up control? The way I read my UnixWare 7's man page, it might not, since alarm(0) just cancels the alarm... Larry - Don Baccus, Portland OR [EMAIL PROTECTED] Nature photos, on-line guides, Pacific Northwest Rare Bird Alert Service and other goodies at http://donb.photo.net. -- Larry Rosenman http://www.lerctr.org/~ler Phone: +1 972-414-9812 (voice) Internet: [EMAIL PROTECTED] US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749
Re: [HACKERS] RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)
At 09:32 AM 11/16/00 -0800, Alfred Perlstein wrote: * Bruce Momjian [EMAIL PROTECTED] [001116 08:59] wrote: Eww, so we have this 1/200 second delay for every transaction. Seems bad to me. I think as long as it becomes a tunable this isn't a bad idea at all. Fixing it at 1/200 isn't so great because people not wrapping large amounts of inserts/updates with transaction blocks will suffer. I think the default should probably be no delay, and the documentation on enabling this needs to be clear and obvious (i.e. hard to miss). I just talked to Tom Lane about this. I think a sleep(0) just before the flush would be the best. It would relinquish the cpu slice if another process is ready to run. If no other backend is running, it probably just returns. If there is another one, it gives it a chance to complete. On return from sleep(0), it can check if it still needs to flush. This would tend to bunch up flushers so they flush only once, while not delaying cases where only one backend is running. -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup.| Drexel Hill, Pennsylvania 19026
Re: [HACKERS] RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)
At 09:32 AM 11/16/00 -0800, Alfred Perlstein wrote: * Bruce Momjian [EMAIL PROTECTED] [001116 08:59] wrote: Eww, so we have this 1/200 second delay for every transaction. Seems bad to me. I think as long as it becomes a tunable this isn't a bad idea at all. Fixing it at 1/200 isn't so great because people not wrapping large amounts of inserts/updates with transaction blocks will suffer. I think the default should probably be no delay, and the documentation on enabling this needs to be clear and obvious (i.e. hard to miss). - Don Baccus, Portland OR [EMAIL PROTECTED] Nature photos, on-line guides, Pacific Northwest Rare Bird Alert Service and other goodies at http://donb.photo.net.
Re: [HACKERS] RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)
* Bruce Momjian [EMAIL PROTECTED] [001116 11:59] wrote: At 02:13 PM 11/16/00 -0500, Bruce Momjian wrote: I think the default should probably be no delay, and the documentation on enabling this needs to be clear and obvious (i.e. hard to miss). I just talked to Tom Lane about this. I think a sleep(0) just before the flush would be the best. It would relinquish the cpu slice if another process is ready to run. If no other backend is running, it probably just returns. If there is another one, it gives it a chance to complete. On return from sleep(0), it can check if it still needs to flush. This would tend to bunch up flushers so they flush only once, while not delaying cases where only one backend is running. This sounds like an interesting approach, yes. In OS kernel design, you try to avoid process herding bottlenecks. Here, we want them herded, and giving up the CPU may be the best way to do it. Yes, but if everyone yields you're back where you started, and with 128 or more backends do you really want to cause possibly that many context switches per fsync? -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]] "I have the heart of a child; I keep it in a jar on my desk."
Re: [HACKERS] RE: [COMMITTERS] pgsql/src/backend/access/transam (xact.c xlog.c)
In OS kernel design, you try to avoid process herding bottlenecks. Here, we want them herded, and giving up the CPU may be the best way to do it. Yes, but if everyone yields you're back where you started, and with 128 or more backends do you really want to cause possibly that many context switches per fsync? You are going to kernel call/yield anyway to fsync, so why not try and if someone does the fsync, we don't need to do it. I am suggesting re-checking the need for fsync after the return from sleep(0). -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup.| Drexel Hill, Pennsylvania 19026
Re: [HACKERS] RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)
Bruce Momjian writes: The way I read my UnixWare 7's man page, it might not, since alarm(0) just cancels the alarm... Well, it certainly is a kernel call, and most OS's re-evaluate on kernel call return. In glibc, sleep(0) just does "return 0;", so if the compiler has a good day the call will disappear completely. -- Peter Eisentraut [EMAIL PROTECTED] http://yi.org/peter-e/
Re: [HACKERS] RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)
Alfred Perlstein [EMAIL PROTECTED] writes: It might make more sense to keep a private copy of the last time the file was modified per-backend by that particular backend and a timestamp of the last fsync shared globally so one can forgo the fsync if "it hasn't been dirtied by me since the last fsync" This would provide a rendezvous point for the fsync call although cost more as one would need to periodically call gettimeofday to set the modified by me timestamp as well as the post-fsync shared timestamp. That's the hard way to do it. We just need to keep track of the endpoint of the log as of the last fsync. You need to fsync (after returning from sleep()) iff your commit record position > fsync endpoint. No need to ask the kernel for time-of-day. regards, tom lane
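Tom's rule compares two positions in the log and needs no clock. A minimal sketch, reading the rule as "fsync only if my commit record ends beyond what the last fsync covered"; the `LogPos` type and the function name are hypothetical, not taken from the PostgreSQL source:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical type: a byte offset into the log. */
typedef uint64_t LogPos;

/*
 * True when the caller's commit record lies beyond the point
 * that the most recent fsync is known to have covered.
 */
bool commit_needs_fsync(LogPos commit_record_end, LogPos fsynced_up_to)
{
    return commit_record_end > fsynced_up_to;
}
```

A backend would evaluate this after waking from its sleep: if another backend's fsync already advanced the endpoint past the commit record, no further fsync is needed.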
Re: [HACKERS] RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)
* Tom Lane [EMAIL PROTECTED] [001116 13:31] wrote: Alfred Perlstein [EMAIL PROTECTED] writes: It might make more sense to keep a private copy of the last time the file was modified per-backend by that particular backend and a timestamp of the last fsync shared globally so one can forgo the fsync if "it hasn't been dirtied by me since the last fsync" This would provide a rendezvous point for the fsync call although cost more as one would need to periodically call gettimeofday to set the modified by me timestamp as well as the post-fsync shared timestamp. That's the hard way to do it. We just need to keep track of the endpoint of the log as of the last fsync. You need to fsync (after returning from sleep()) iff your commit record position > fsync endpoint. No need to ask the kernel for time-of-day. Well that breaks when you move to an overwriting storage manager, however if you use oid instead that optimization would survive the change to an overwriting storage manager. ? -Alfred
Re: [HACKERS] RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)
* Bruce Momjian [EMAIL PROTECTED] [001116 12:31] wrote: In OS kernel design, you try to avoid process herding bottlenecks. Here, we want them herded, and giving up the CPU may be the best way to do it. Yes, but if everyone yields you're back where you started, and with 128 or more backends do you really want to cause possibly that many context switches per fsync? You are going to kernel call/yield anyway to fsync, so why not try and if someone does the fsync, we don't need to do it. I am suggesting re-checking the need for fsync after the return from sleep(0). It might make more sense to keep a private copy of the last time the file was modified per-backend by that particular backend and a timestamp of the last fsync shared globally so one can forgo the fsync if "it hasn't been dirtied by me since the last fsync" This would provide a rendezvous point for the fsync call although cost more as one would need to periodically call gettimeofday to set the modified by me timestamp as well as the post-fsync shared timestamp. -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]] "I have the heart of a child; I keep it in a jar on my desk."
RE: [HACKERS] RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)
You are going to kernel call/yield anyway to fsync, so why not try and if someone does the fsync, we don't need to do it. I am suggesting re-checking the need for fsync after the return from sleep(0). It might make more sense to keep a private copy of the last time the file was modified per-backend by that particular backend and a timestamp of the last fsync shared globally so one can forgo the fsync if "it hasn't been dirtied by me since the last fsync" This would provide a rendezvous point for the fsync call although cost more as one would need to periodically call gettimeofday to set the modified by me timestamp as well as the post-fsync shared timestamp. Already made, but without timestamps. WAL maintains last byte of log written/fsynced in shmem, so XLogFlush(_last_byte_to_be_flushed_) will do nothing if data are already on disk. Vadim
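The behaviour Vadim describes, a flush call that returns at once when the requested byte is already on disk, can be illustrated with a toy model. This is not the real XLogFlush: `ToyLog` and `toy_log_flush` are hypothetical names, the struct stands in for the shared-memory state, and a counter replaces the actual fsync(). The sketch also assumes callers only ask for positions that have already been written.

```c
#include <stdint.h>

/* Toy stand-in for the log state kept in shared memory. */
typedef struct ToyLog {
    uint64_t written_upto;   /* last log byte handed to the OS */
    uint64_t flushed_upto;   /* last log byte known to be on disk */
    int      fsync_calls;    /* counts simulated fsync() calls */
} ToyLog;

/* Make sure the log is durable at least up to `upto`. */
void toy_log_flush(ToyLog *log, uint64_t upto)
{
    if (upto <= log->flushed_upto)
        return;                             /* someone already flushed this far */

    log->fsync_calls++;                     /* stand-in for fsync() on the log file */
    log->flushed_upto = log->written_upto;  /* one fsync covers everything written */
}
```

Because a single fsync covers everything written so far, a second backend whose commit record sits below the new endpoint returns without touching the disk. That is the sharing the thread is after.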
RE: [HACKERS] RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)
No. Checkpoints are to speedup after crash recovery and to remove/archive log files. With WAL server doesn't write any datafiles on commit, only commit record goes to log (and log fsync-ed). Dirty buffers remains in memory long Ok, so with CHECKPOINTS, we could move the offline log files to somewhere else so that we could archive them, in my understanding. Now question is, how we could recover from disaster like losing every table files except log files. Can we do this with WAL? If so, how can we do it? Not currently. WAL based BAR is required. I think there will be no BAR in 7.1, but it may be added in 7.1.X (no initdb will be required). Anyway BAR implementation is not in my plans. All in your hands, guys -:) Vadim
Re: [HACKERS] RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)
At 02:13 PM 11/16/00 -0500, Bruce Momjian wrote: I think the default should probably be no delay, and the documentation on enabling this needs to be clear and obvious (i.e. hard to miss). I just talked to Tom Lane about this. I think a sleep(0) just before the flush would be the best. It would relinquish the cpu slice if another process is ready to run. If no other backend is running, it probably just returns. If there is another one, it gives it a chance to complete. On return from sleep(0), it can check if it still needs to flush. This would tend to bunch up flushers so they flush only once, while not delaying cases where only one backend is running. This sounds like an interesting approach, yes. In OS kernel design, you try to avoid process herding bottlenecks. Here, we want them herded, and giving up the CPU may be the best way to do it. -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup.| Drexel Hill, Pennsylvania 19026
RE: [HACKERS] RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)
I am just suggesting that instead of flushing the log on every transaction end, just do it every X seconds. Or maybe more practical is, when the log buffer fills. And of course during checkpoints. Also before backend's going to write dirty buffer from pool to system cache - changes must be logged before reflected in data files. Vadim
Re: [HACKERS] RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)
Earlier, Vadim was talking about arranging to share fsyncs of the WAL log file across transactions (after writing your commit record to the log, sleep a few milliseconds to see if anyone else fsyncs before you do; if not, issue the fsync yourself). That would offer less-than- one-fsync-per-transaction performance without giving up any guarantees. If people feel a compulsion to have a tunable parameter, let 'em tune the length of the pre-fsync sleep ... Already implemented (without ability to tune this parameter - xact.c:CommitDelay, - yet). Currently CommitDelay is 5, so backend sleeps 1/200 sec before checking/forcing log fsync. But it returns _completed_ to the client before sleeping, right? No. Vadim
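A pre-fsync delay of this kind can be sketched with POSIX `nanosleep()`. The unit is an assumption: the thread equates a CommitDelay of 5 with 1/200 sec, which suggests milliseconds, and the sketch follows that reading. The function names are hypothetical, and the real code in xact.c may sleep by other means.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <time.h>

/* Convert a delay in milliseconds (assumed unit) to a timespec. */
struct timespec delay_to_timespec(int delay_ms)
{
    struct timespec ts;

    ts.tv_sec = delay_ms / 1000;
    ts.tv_nsec = (long) (delay_ms % 1000) * 1000000L;
    return ts;
}

/* Sleep before checking/forcing the log fsync; 0 or less means no delay. */
void pre_fsync_delay(int delay_ms)
{
    struct timespec ts;

    if (delay_ms <= 0)
        return;

    ts = delay_to_timespec(delay_ms);
    nanosleep(&ts, NULL);
}
```

As Vadim's "No" makes clear, the sleep and the flush check both happen before the client is told the commit completed, so the delay adds latency but gives up no durability.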
Re: [HACKERS] RE: [COMMITTERS] pgsql/src/backend/access/transam (xact.c xlog.c)
* Tatsuo Ishii [EMAIL PROTECTED] [001110 18:42] wrote: Yes, though we can change this. We also can implement now feature that Bruce wanted so long and so much -:) - fsync log not on each commit but each ~ 5sec, if losing some recent commits is acceptable. Sounds great. Not really, I thought an ack on a commit would mean that the data is actually in stable storage, breaking that would be pretty bad no? Or are you only talking about when someone is running with async Postgresql? Although this doesn't have an effect on my current application, when running Postgresql with sync commits and WAL can one expect the old behavior, ie. success only after data and meta data (log) are written? Probably you misunderstand what Bruce expected to have. He wished to have not-everytime-fsync as an *option*. I believe we will do strict fsync by default. -- Tatsuo Ishii
Re: [HACKERS] RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)
Bruce Momjian [EMAIL PROTECTED] writes: I have to agree with Alfred here: this does not sound like a feature, it sounds like a horrid hack. You're giving up *all* consistency guarantees for a performance gain that is really going to be pretty minimal in the WAL context. It does not give up consistency. The db is still consistent, it is just consistent from a few seconds ago, rather than commit time. No, it isn't consistent. Without the fsync you don't know what order the kernel will choose to plop down WAL log blocks in; you could end up with a corrupt log. (Actually, perhaps that could be worked around if the log blocks are suitably marked so that you can tell where the last sequentially valid one is. I haven't looked at the log structure in any detail...) regards, tom lane
Re: [HACKERS] RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)
* Bruce Momjian [EMAIL PROTECTED] [00 00:16] wrote: * Tatsuo Ishii [EMAIL PROTECTED] [001110 18:42] wrote: Yes, though we can change this. We also can implement now feature that Bruce wanted so long and so much -:) - fsync log not on each commit but each ~ 5sec, if losing some recent commits is acceptable. Sounds great. Not really, I thought an ack on a commit would mean that the data is actually in stable storage, breaking that would be pretty bad no? Or are you only talking about when someone is running with async Postgresql? The default is to sync on commit, but we need to give people options of several seconds delay for performance reasons. Informix calls it buffered logging, and it is used by most of the sites I know because it has much better performance than sync on commit. If the machine crashes five seconds after commit, many people don't have a problem with just re-entering the data. We have several critical tables and running certain updates/deletes/inserts on them in async mode worries me. Would it be possible to add a 'set' command to force a backend into fsync mode and perhaps back into non-fsync mode as well? What about setting an attribute on a table that could mean a) anyone updating me better fsync me. b) anyone updating me better fsync me as well as fsyncing anything else they touch. I swear one of these days I'm going to get more familiar with the codebase and actually submit some useful patches for the backend. :( -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
Re: [HACKERS] RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)
Bruce Momjian [EMAIL PROTECTED] writes: Not really, I thought an ack on a commit would mean that the data is actually in stable storage, breaking that would be pretty bad no? The default is to sync on commit, but we need to give people options of several seconds delay for performance reasons. Informix calls it buffered logging, and it is used by most of the sites I know because it has much better performance than sync on commit. I have to agree with Alfred here: this does not sound like a feature, it sounds like a horrid hack. You're giving up *all* consistency guarantees for a performance gain that is really going to be pretty minimal in the WAL context. Earlier, Vadim was talking about arranging to share fsyncs of the WAL log file across transactions (after writing your commit record to the log, sleep a few milliseconds to see if anyone else fsyncs before you do; if not, issue the fsync yourself). That would offer less-than-one-fsync-per-transaction performance without giving up any guarantees. If people feel a compulsion to have a tunable parameter, let 'em tune the length of the pre-fsync sleep ... regards, tom lane
RE: [HACKERS] RE: [COMMITTERS] pgsql/src/backend/access/transam (xact.c xlog.c)
Can you tell me how to use CHECKPOINT please? You shouldn't normally use it - postmaster will start backend each 3-5 minutes to do this automatically. Oh, I see. Is this the same as a SAVEPOINT? No. Checkpoints are to speedup after crash recovery and to remove/archive log files. With WAL server doesn't write any datafiles on commit, only commit record goes to log (and log fsync-ed). Dirty buffers remains in memory long Ok, so with CHECKPOINTS, we could move the offline log files to somewhere else so that we could archive them, in my understanding. Now question is, how we could recover from disaster like losing every table files except log files. Can we do this with WAL? If so, how can we do it? Is log fsynced even if I turn off -F? Yes, though we can change this. We also can implement now feature that Bruce wanted so long and so much -:) - fsync log not on each commit but each ~ 5sec, if losing some recent commits is acceptable. Sounds great. -- Tatsuo Ishii
Re: [HACKERS] RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)
* Tatsuo Ishii [EMAIL PROTECTED] [001110 18:42] wrote: Yes, though we can change this. We also can implement now feature that Bruce wanted so long and so much -:) - fsync log not on each commit but each ~ 5sec, if losing some recent commits is acceptable. Sounds great. Not really, I thought an ack on a commit would mean that the data is actually in stable storage, breaking that would be pretty bad no? Or are you only talking about when someone is running with async Postgresql? Although this doesn't have an effect on my current application, when running Postgresql with sync commits and WAL can one expect the old behavior, ie. success only after data and meta data (log) are written? Another question I had was what would the effect of a mid-fsync crash have on a system using WAL, let's say someone yanks the power while the OS in the midst of fsync, will all be ok? -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]] "I have the heart of a child; I keep it in a jar on my desk."
RE: [HACKERS] RE: [COMMITTERS] pgsql/src/backend/access/transam (xact.c xlog.c)
New CHECKPOINT command. Auto removing of offline log files and creating new file at checkpoint time. Can you tell me how to use CHECKPOINT please? You shouldn't normally use it - postmaster will start backend each 3-5 minutes to do this automatically. Is this the same as a SAVEPOINT? No. Checkpoints are to speedup after crash recovery and to remove/archive log files. With WAL server doesn't write any datafiles on commit, only commit record goes to log (and log fsync-ed). Dirty buffers remains in memory long Is log fsynced even if I turn off -F? Yes, though we can change this. We also can implement now feature that Bruce wanted so long and so much -:) - fsync log not on each commit but each ~ 5sec, if losing some recent commits is acceptable. Nevertheless, when bufmgr replaces dirty buffer it must ensure first that log record of last buffer update is on disk already and so bufmgr forces log fsync if required. This cannot be changed - rule is simple: log before applying changes to permanent storage. Vadim
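The "log before applying changes to permanent storage" rule can be shown with a toy buffer write-out. The sketch below is not bufmgr code: `ToyWal`, `ToyPage`, and `toy_write_page` are hypothetical names, and counters and flags stand in for the real fsync and page write.

```c
#include <stdint.h>

/* Toy stand-in for the shared log state. */
typedef struct ToyWal {
    uint64_t flushed_upto;    /* last log byte known to be on disk */
    int      fsync_calls;     /* counts simulated log fsyncs */
} ToyWal;

/* Toy stand-in for a buffer in the pool. */
typedef struct ToyPage {
    uint64_t last_update_pos; /* log position of the latest change to this page */
    int      dirty;           /* nonzero while the page has unwritten changes */
} ToyPage;

/* Write out a dirty page, forcing the log first if it lags behind. */
void toy_write_page(ToyWal *wal, ToyPage *page)
{
    if (page->last_update_pos > wal->flushed_upto)
    {
        wal->fsync_calls++;                        /* stand-in for fsync() on the log */
        wal->flushed_upto = page->last_update_pos;
    }
    page->dirty = 0;                               /* stand-in for writing the page */
}
```

Even if commit-time fsync were relaxed to once every few seconds, this check would still run on every buffer replacement, which is why Vadim says the rule cannot be changed.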
Re: [HACKERS] RE: [COMMITTERS] pgsql/src/backend/access/transam(xact.c xlog.c)
New CHECKPOINT command. Auto removing of offline log files and creating new file at checkpoint time. Can you tell me how to use CHECKPOINT please? Is this the same as a SAVEPOINT? No. Checkpoints are to speedup after crash recovery and to remove/archive log files. With WAL server doesn't write any datafiles on commit, only commit record goes to log (and log fsync-ed). Dirty buffers remains in memory long Is log fsynced even if I turn off -F? -- Tatsuo Ishii