[HACKERS] Re: [ADMIN] v7.1b4 bad performance
... See included png file. What kind of machine was this run on? - Thomas

Sorry to forget to mention that. SONY VAIO Z505CR/K (notebook PC), Pentium III 750MHz / 256MB memory / 20GB IDE HDD, Linux (kernel 2.2.17), configure --enable-multibyte=EUC_JP.

postgresql.conf:
fsync = on
max_connections = 128
shared_buffers = 1024
silent_mode = on
commit_delay = 0

postmaster opts for 7.0.3: -B 1024 -N 128 -S

pgbench settings: scaling factor = 1; data excludes connection establishing time; number of total transactions is always 640 (see the included script I ran for the testing):

#! /bin/sh
pgbench -i test
for i in 1 2 4 8 16 32 64 128
do
	t=`expr 640 / $i`
	pgbench -t $t -c $i test
	echo "== sync =="
	sync;sync;sync;sleep 10
	echo "== sync done =="
done

-- Tatsuo Ishii
[HACKERS] Docs generation fixed
The html docs should once again be generated automatically on postgresql.org on a twice-daily basis. Thanks to Peter E for walking me through the toolset and configuration changes... - Thomas
Re: [HACKERS] Re: [ADMIN] v7.1b4 bad performance
* Tom Lane [EMAIL PROTECTED] [010216 22:49]: "Schmidt, Peter" [EMAIL PROTECTED] writes: So, is it OK to use commit_delay=0? Certainly. In fact, I think that's about to become the default ;-) I have now experimented with several different platforms --- HPUX, FreeBSD, and two considerably different strains of Linux --- and I find that the minimum delay supported by select(2) is 10 or more milliseconds on all of them, as much as 20 msec on some popular platforms. Try it yourself (my test program is attached). Thus, our past arguments about whether a few microseconds of delay before commit are a good idea seem moot; we do not have any portable way of implementing that, and a ten millisecond delay for commit is clearly Not Good. regards, tom lane

Here is another one. UnixWare 7.1.1 on a P-III 500, 256 MB RAM:

$ cc -o tgl.test -O tgl.test.c
$ time ./tgl.test 0

real    0m0.01s
user    0m0.01s
sys     0m0.00s
$ time ./tgl.test 1

real    0m10.01s
user    0m0.00s
sys     0m0.01s
$ time ./tgl.test 2

real    0m10.01s
user    0m0.00s
sys     0m0.00s
$ time ./tgl.test 3

real    0m10.11s
user    0m0.00s
sys     0m0.01s
$ uname -a
UnixWare lerami 5 7.1.1 i386 x86at SCO UNIX_SVR5
$

-- Larry Rosenman http://www.lerctr.org/~ler Phone: +1 972-414-9812 E-Mail: [EMAIL PROTECTED] US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749
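Tom's test program was attached to the original message and is not reproduced in this archive. A minimal sketch of the kind of measurement involved might look like the following (the function name is my own, not Tom's):

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* Sleep via select() with no file descriptors and report the actual
 * elapsed time in microseconds.  On a kernel with a 100 Hz clock, even
 * a 1-usec request will not return for 10-20 msec, because the timeout
 * is rounded up to whole clock ticks. */
long select_delay_usec(long usec)
{
    struct timeval delay, before, after;

    delay.tv_sec = usec / 1000000L;
    delay.tv_usec = usec % 1000000L;

    gettimeofday(&before, NULL);
    (void) select(0, NULL, NULL, NULL, &delay);   /* pure timed sleep */
    gettimeofday(&after, NULL);

    return (after.tv_sec - before.tv_sec) * 1000000L
        + (after.tv_usec - before.tv_usec);
}
```

On a kernel with finer timer resolution, select_delay_usec(1) returns in well under a millisecond; the 10-20 msec results above are exactly the platform difference being measured.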
[HACKERS] Non-locale 7.1beta4 binaries on RedHat 6.2 test results.
Ok, after Tatsuo and Peter have both said that building without locale support should not use the locale support in the OS, and remembering my 6.5.3 experience of a year back, I decided to test it out completely. And I am wrong with respect to 7.1beta4. For 7.1beta4, disabling locale will indeed work properly, at least on RedHat 6.2.

Testing methodology:
1.) Blow out entire PGDATA tree;
2.) Initdb with locale-enabled backend;
3.) Run regression with locale-enabled binaries (locale=en_US);
4.) Rebuild without --enable-locale;
5.) Blow out entire PGDATA tree;
6.) Initdb with non-locale backend;
7.) Run regression with non-locale binaries.

Results: For --enable-locale RPM's, pg_regress --schedule=parallel_schedule produces:

parallel group (13 tests): boolean char name varchar int4 int2 oid float4 float8 text bit int8 numeric
     boolean ... ok
     char ... ok
     name ... ok
     varchar ... ok
     text ... ok
     int2 ... ok
     int4 ... ok
     int8 ... FAILED
     oid ... ok
     float4 ... ok
     float8 ... ok
     bit ... ok
     numeric ... FAILED
test strings ... ok
test numerology ... ok
parallel group (18 tests): point lseg box path polygon circle comments reltime date abstime interval time inet type_sanity tinterval timestamp oidjoins opr_sanity
     point ... ok
     lseg ... ok
     box ... ok
     path ... ok
     polygon ... ok
     circle ... ok
     date ... ok
     time ... ok
     timestamp ... ok
     interval ... ok
     abstime ... ok
     reltime ... ok
     tinterval ... ok
     inet ... ok
     comments ... ok
     oidjoins ... ok
     type_sanity ... ok
     opr_sanity ... ok
test geometry ... ok
test horology ... ok
test create_function_1 ... ok
test create_type ... ok
test create_table ... ok
test create_function_2 ... ok
test copy ... ok
parallel group (7 tests): create_aggregate create_operator triggers inherit constraints create_misc create_index
     constraints ... ok
     triggers ... ok
     create_misc ... ok
     create_aggregate ... ok
     create_operator ... ok
     create_index ... ok
     inherit ... ok
test create_view ... ok
test sanity_check ... ok
test errors ... ok
test select ... 
ok
parallel group (16 tests): select_into select_distinct_on select_distinct select_having select_implicit subselect transactions union case random arrays aggregates join portals hash_index btree_index
     select_into ... ok
     select_distinct ... ok
     select_distinct_on ... ok
     select_implicit ... FAILED
     select_having ... FAILED
     subselect ... ok
     union ... ok
     case ... ok
     join ... ok
     aggregates ... ok
     transactions ... ok
     random ... failed (ignored)
     portals ... ok
     arrays ... ok
     btree_index ... ok
     hash_index ... ok
test misc ... ok
parallel group (5 tests): portals_p2 alter_table rules foreign_key select_views
     select_views ... FAILED
     alter_table ... ok
     portals_p2 ... ok
     rules ... ok
     foreign_key ... ok
parallel group (3 tests): limit temp plpgsql
     limit ... ok
     plpgsql ... ok
     temp ... ok

With locale disabled: All 76 tests passed.

So, there's the data. This is different behavior from the 6.5.3 non-locale set I produced a year ago. Is there interest in a non-locale RPM distribution, or? The locale-enabled regression results fail due to currency format and collation errors. Diffs attached. I'm not sure I understand the select_views failure, either. Locale used was en_US. Comments? -- Lamar Owen WGCR Internet Radio 1 Peter 4:11 locale-run.diffs
Re: [HACKERS] Non-locale 7.1beta4 binaries on RedHat 6.2 test results.
Lamar Owen [EMAIL PROTECTED] writes: The locale enabled regression results fail due to currency format and collation errors. Diffs attached. I'm not sure I understand the select_views failure, either. Locale used was en_US. The select_views delta looks like a sort-order issue as well; nothing to worry about. These deltas would go away if you allowed pg_regress to build a temp installation in which it could force the locale to C. Of course, that doesn't presently work without a built source tree to install from. I wonder if it is worth adding a third operating mode to pg_regress that would build a temp PGDATA directory but use the already-installed bin/lib/share directories ... regards, tom lane
Re: [HACKERS] Non-locale 7.1beta4 binaries on RedHat 6.2 test results.
Tom Lane wrote: Lamar Owen [EMAIL PROTECTED] writes: The locale enabled regression results fail due to currency format and collation errors. Diffs attached. I'm not sure I understand the select_views failure, either. Locale used was en_US. The select_views delta looks like a sort-order issue as well; nothing to worry about. Good. I didn't see any difference -- but maybe that's because I went cross-eyed :-) These deltas would go away if you allowed pg_regress to build a temp installation in which it could force the locale to C. Of course, that doesn't presently work without a built source tree to install from. I wonder if it is worth adding a third operating mode to Possibly. If pg_regress uses a different port for the postmaster, AND a different PGDATA, you could run regression on a sandbox while a production system was running, FWIW. Since that's more of an RPM issue than a core issue, I can do that third-mode work, as I would be the direct beneficiary (unless someone else does it first, of course). Both the locale and non-locale installations were from RPM, BTW, as I wanted the least number of variables possible. -- Lamar Owen WGCR Internet Radio 1 Peter 4:11
Re: [HACKERS] Microsecond sleeps with select()
Bruce Momjian wrote: In fact, the kernel doesn't even have a way to measure microsecond timings. Linux has patches available to do microsecond timings, but they're nonportable, of course. -- Lamar Owen WGCR Internet Radio 1 Peter 4:11
Re: [HACKERS] Re: beta5 ...
On Sat, 17 Feb 2001, Bruce Momjian wrote: BTW, is 7.1 going to be a bit slower than 7.0? Or just Beta 5? Just curious. Don't mind waiting for 7.2 for the speed-up if necessary. It is possible that it will be ... the question is whether the slowdown is unbearable or not, as to whether we'll let it hold things up or not ... From reading one of Tom's emails, it looks like the changes to 'fix' the slowdown are drastic/large enough that it might not be safe (or desirable) to fix it at this late a stage in beta ... Depending on what is involved, we might put out a v7.1 for March 1st, so that ppl can feel confident about using the various features, but have a v7.1.1 that follows relatively closely on its heels that addresses the performance problem ... The easy fix is to just set the delay to zero. Looks like that will fix most of the problem. Except that Vadim had a reason for setting it to 5, and I'm loath to see that changed unless someone actually understands the ramifications other than increasing performance ... The near-committers thing may indeed be overkill, and certainly is not worth holding beta. What is this 'near-committers thing'??
Re: [HACKERS] Re: beta5 ...
The Hermit Hacker [EMAIL PROTECTED] writes: The easy fix is to just set the delay to zero. Looks like that will fix most of the problem. Except that Vadim had a reason for setting it to 5, He claimed to have seen better performance with a nonzero delay. So far none of the rest of us have been able to duplicate that. Perhaps he was using a machine where a 5-microsecond select() delay actually is 5 microseconds? If so, he's the outlier, not the rest of us ... regards, tom lane
[HACKERS] WAL and commit_delay
I want to give some background on commit_delay, its initial purpose, and possible options. First, looking at the process that happens during a commit:

write() - copy WAL dirty page to kernel disk buffer
fsync() - force WAL kernel disk buffer to disk platter

fsync() takes much longer than write(). What Vadim doesn't want is:

time    backend 1    backend 2
----    ---------    ---------
0       write()
1       fsync()      write()
2                    fsync()

This would be better as:

time    backend 1    backend 2
----    ---------    ---------
0       write()
1                    write()
2       fsync()      fsync()

This was the purpose of the commit_delay. Having two fsync()'s is not a problem because only one will see there are dirty buffers. The other will probably either return right away, or wait for the other's fsync() to complete. With the delay, it looks like:

time    backend 1    backend 2
----    ---------    ---------
0       write()
1       sleep()      write()
2       fsync()      sleep()
3                    fsync()

Which shows the second fsync() doing nothing, which is good, because there are no dirty buffers at that time. However, a very possible circumstance is:

time    backend 1    backend 2    backend 3
----    ---------    ---------    ---------
0       write()
1       sleep()      write()
2       fsync()      sleep()      write()
3                    fsync()      sleep()
4                                 fsync()

In this case, the fsync() by backend 2 does indeed do some work, because it fsyncs backend 3's write(). Frankly, I don't see how the sleep does much except delay things, because it doesn't have any smarts about when the delay is useful and when it is useless. Without that feedback, I recommend removing the entire setting. For single backends, the sleep is clearly a loser. Another situation it can not deal with is:

time    backend 1    backend 2
----    ---------    ---------
0       write()
1       sleep()
2       fsync()      write()
3                    sleep()
4                    fsync()

My solution can't deal with this either.

---

The quick fix is to remove the commit_delay code. A more elaborate performance boost would be to have each backend get feedback from other backends, so they can block and wait for other about-to-fsync backends before fsync(). This allows the write()s to bunch up before the fsync(). 
Here is the single backend case, which experiences no delays:

time    backend 1
----    ---------
0       get_shlock()
1       write()
2       rel_shlock()
3       get_exlock()
4       rel_exlock()
5       fsync()

Here is the two-backend case, which shows both write()'s completing before the fsync()'s:

time    backend 1       backend 2
----    ---------       ---------
0       get_shlock()
1       write()
2       rel_shlock()    get_shlock()
3       get_exlock()    write()
4                       rel_shlock()
5       rel_exlock()
6       fsync()         get_exlock()
7                       rel_exlock()
8                       fsync()

Contrast that with the first two-backend case presented above:

time    backend 1    backend 2
----    ---------    ---------
0       write()
1       fsync()      write()
2                    fsync()

Now, it is my understanding that instead of just shared locking around the write()'s, we could block the entire commit code, so the backend can signal to other about-to-fsync backends to wait. I believe our existing lock code can be used for the locking/unlocking. We can just lock a random, unused table like pg_log or something. -- Bruce Momjian | http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
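The payoff Bruce is arguing for --- one fsync() covering several backends' write()s --- can be sketched with ordinary pthreads primitives standing in for PostgreSQL's lock manager. All names here (backend_commit, run_backends, the counters) are invented for illustration; this is a toy model, not the proposed patch:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Toy model of write-bunching before fsync(): each "backend" appends a
 * commit record (advancing wal_end), then flushes only if its own record
 * is not already on disk.  Whoever flushes covers everything written so
 * far, so concurrent committers can share one fsync(). */
static pthread_mutex_t wal_lock = PTHREAD_MUTEX_INITIALIZER;
static long wal_end = 0;       /* commit records appended to WAL buffer */
static long flushed_to = 0;    /* records known to be on disk */
static int fsync_calls = 0;    /* counts stand-ins for real fsync() */

static void *backend_commit(void *arg)
{
    long my_record;

    (void) arg;
    pthread_mutex_lock(&wal_lock);
    my_record = ++wal_end;              /* the write() of our record */
    pthread_mutex_unlock(&wal_lock);

    pthread_mutex_lock(&wal_lock);
    if (flushed_to < my_record) {       /* not already flushed for us? */
        fsync_calls++;                  /* one fsync covers all records */
        flushed_to = wal_end;           /* ... written so far */
    }
    pthread_mutex_unlock(&wal_lock);
    return NULL;
}

/* Run n concurrent "backends", return how many fsync()s were needed. */
int run_backends(int n)
{
    pthread_t tids[64];
    int i;

    for (i = 0; i < n && i < 64; i++)
        pthread_create(&tids[i], NULL, backend_commit, NULL);
    for (i = 0; i < n && i < 64; i++)
        pthread_join(tids[i], NULL);
    return fsync_calls;
}
```

Under concurrent load, fsync_calls comes out smaller than the number of commits whenever one backend's flush overtakes another's freshly written record --- without any fixed sleep.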
Re: [HACKERS] Microsecond sleeps with select()
Bruce Momjian [EMAIL PROTECTED] writes: A comment on microsecond delays using select(). Most Unix kernels run at 100hz, meaning that they have a programmable timer that interrupts the CPU every 10 milliseconds. Right --- this probably also explains my observation that some kernels seem to add an extra 10msec to the requested sleep time. Actually they're interpreting a one-clock-tick select() delay as "wait till the next clock tick, plus one tick". The actual delay will be between one and two ticks depending on just when you went to sleep.

The BSDI code would be pselect():

	/*
	 * If poll wait was tiny, this could be zero; we will
	 * have to round it up to avoid sleeping forever.  If
	 * we retry below, the timercmp above will get us out.
	 * Note that if wait was 0, the timercmp will prevent
	 * us from getting here the first time.
	 */
	timo = hzto(atv);
	if (timo == 0)
		timo = 1;

-- Bruce Momjian | http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Re: [HACKERS] Microsecond sleeps with select()
I have been thinking some more about the s_lock() delay loop in connection with this. We currently have

/*
 * Each time we busy spin we select the next element of this array as the
 * number of microseconds to wait. This accomplishes pseudo random back-off.
 * Values are not critical but 10 milliseconds is a common platform
 * granularity.
 *
 * Total time to cycle through all 20 entries might be about .07 sec,
 * so the given value of S_MAX_BUSY results in timeout after ~70 sec.
 */
#define S_NSPINCYCLE	20
#define S_MAX_BUSY	1000 * S_NSPINCYCLE

int s_spincycle[S_NSPINCYCLE] =
{	0, 0, 0, 0, 10000, 0, 0, 0, 10000, 0,
	0, 10000, 0, 0, 10000, 0, 10000, 0, 10000, 10000
};

Having read the select(2) man page more closely, I now realize that it is *defined* not to yield the processor when the requested delay is zero: it just checks the file ready status and returns immediately.

Actually, a kernel call is something. On kernel call return, process priorities are checked and the CPU may be yielded to a higher-priority backend that perhaps just had its I/O completed. I think the 0 and 10000 are correct. They would be zero ticks and one tick. You think 5000 and 10000 would be better? I can see that. -- Bruce Momjian | http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Re: [HACKERS] Re: beta5 ...
The easy fix is to just set the delay to zero. Looks like that will fix most of the problem. Except that Vadim had a reason for setting it to 5, and I'm loath to see that changed unless someone actually understands the ramifications other than increasing performance ... See post from a few minutes ago with analysis of the purpose and actual effect of Vadim's parameter. I objected to the delay when it was introduced because of my analysis, but Vadim's argument is that 5 microseconds is a very small delay, just enough to yield the CPU. We now see that it is much longer than that. The near-committers thing may indeed be overkill, and certainly is not worth holding beta. What is this 'near-committers thing'?? Other backends about to commit. -- Bruce Momjian | http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Re: [HACKERS] Microsecond sleeps with select()
Bruce Momjian [EMAIL PROTECTED] writes: Having read the select(2) man page more closely, I now realize that it is *defined* not to yield the processor when the requested delay is zero: it just checks the file ready status and returns immediately. Actually, a kernel call is something. On kernel call return, process priorities are checked and the CPU may be yielded to a higher-priority backend that perhaps just had its I/O completed. So *if* some I/O just completed, the call *might* do what we need, which is yield the CPU. Otherwise we're just wasting cycles, and will continue to waste them until we do a select with a nonzero delay. I propose we cut out the spinning and just do a nonzero delay immediately. I think the 0 and 10000 are correct. They would be zero ticks and one tick. You think 5000 and 10000 would be better? I can see that. No, I am not suggesting that, because there is no difference between 5000 and 10000. All of this stuff probably ought to be replaced with a less-bogus mechanism (POSIX semaphores maybe?), but not in late beta. regards, tom lane
Re: [HACKERS] Microsecond sleeps with select()
So *if* some I/O just completed, the call *might* do what we need, which is yield the CPU. Otherwise we're just wasting cycles, and will continue to waste them until we do a select with a nonzero delay. I propose we cut out the spinning and just do a nonzero delay immediately. Well, any backend with a higher priority would be run ahead of the current process. The question is how that would happen. I will say that because of CPU cache issues, the system tries _not_ to change processes if the current one still needs the CPU, so the zero may be bogus. I think the 0 and 10000 are correct. They would be zero ticks and one tick. You think 5000 and 10000 would be better? I can see that. No, I am not suggesting that, because there is no difference between 5000 and 10000. All of this stuff probably ought to be replaced with a less-bogus mechanism (POSIX semaphores maybe?), but not in late beta. Good question. We have sched_yield(), but that is a threads function, or at least lives only in the pthreads library. -- Bruce Momjian | http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
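The spin-then-yield-then-sleep ladder being debated here can be sketched as follows. This is not PostgreSQL's actual s_lock() code; it uses GCC atomic builtins and invented names, and the thresholds are arbitrary placeholders:

```c
#include <assert.h>
#include <sched.h>
#include <unistd.h>

/* Sketch of a test-and-set spinlock with staged back-off: busy spin
 * briefly (cheap on a multi-CPU box), then sched_yield() to give up the
 * CPU with no timed sleep, then fall back to a ~10 msec sleep, which is
 * one clock tick on a HZ=100 kernel. */
typedef volatile int slock_t;

void s_lock_sketch(slock_t *lock)
{
    int tries = 0;

    while (__sync_lock_test_and_set(lock, 1)) {
        if (++tries < 100)
            ;                 /* busy spin: holder may release any moment */
        else if (tries < 200)
            sched_yield();    /* let another runnable process in */
        else
            usleep(10000);    /* give up for a full clock tick */
    }
}

void s_unlock_sketch(slock_t *lock)
{
    __sync_lock_release(lock);
}
```

The sched_yield() stage is what a zero-timeout select() was hoped to achieve; unlike select(0), it is defined to put the caller at the back of the run queue.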
Re: [HACKERS] WAL and commit_delay
Bruce Momjian [EMAIL PROTECTED] writes: With the delay, it looks like: time backend 1 backend 2 - - 0 write() 1 sleep() write() 2 fsync() sleep() 3 fsync() Actually ... take a close look at the code. The delay is done in xact.c between XLogInsert(commitrecord) and XLogFlush(). As near as I can tell, both the write() and the fsync() will happen in XLogFlush(). This means the delay is just plain broken: placed there, it cannot do anything except waste time. Another thing I am wondering about is why we're not using fdatasync(), where available, instead of fsync(). The whole point of preallocating the WAL files is to make fdatasync safe, no? regards, tom lane
Re: [HACKERS] WAL and commit_delay
I wrote: Actually ... take a close look at the code. The delay is done in xact.c between XLogInsert(commitrecord) and XLogFlush(). As near as I can tell, both the write() and the fsync() will happen in XLogFlush(). This means the delay is just plain broken: placed there, it cannot do anything except waste time. Uh ... scratch that ... nevermind. The point is that we've inserted our commit record into the WAL output buffer. Now we are sleeping in the hope that some other backend will do both the write and the fsync for us, and that when we eventually call XLogFlush() it will find nothing to do. So the delay is not in the wrong place. Another thing I am wondering about is why we're not using fdatasync(), where available, instead of fsync(). The whole point of preallocating the WAL files is to make fdatasync safe, no? This still looks like it'd be a win, by reducing the number of seeks needed to complete a WAL logfile flush. Right now, each XLogFlush requires writing both the file's data area and its inode. regards, tom lane
Re: [HACKERS] WAL and commit_delay
Another thing I am wondering about is why we're not using fdatasync(), where available, instead of fsync(). The whole point of preallocating the WAL files is to make fdatasync safe, no? This still looks like it'd be a win, by reducing the number of seeks needed to complete a WAL logfile flush. Right now, each XLogFlush requires writing both the file's data area and its inode. Don't we have to fsync the inode too? Actually, I was hoping sequential fsync's could sit on the WAL disk track, but I can imagine it has to seek around to hit both areas. -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup.| Drexel Hill, Pennsylvania 19026
Re: [HACKERS] WAL and commit_delay
Bruce Momjian [EMAIL PROTECTED] writes: Another thing I am wondering about is why we're not using fdatasync(), where available, instead of fsync(). The whole point of preallocating the WAL files is to make fdatasync safe, no? Don't we have to fsync the inode too? Actually, I was hoping sequential fsync's could sit on the WAL disk track, but I can imagine it has to seek around to hit both areas. That's the point: we're trying to get things set up so that successive writes/fsyncs in the WAL file do the minimum amount of seeking. The WAL code tries to preallocate the whole log file (incorrectly, but that's easily fixed, see below) so that we should not need to update the file metadata when we write into the file. I don't have fdatasync() here. How does it compare to fsync()? HPUX's man page says:

: fdatasync() causes all modified data and file attributes of fildes
: required to retrieve the data to be written to disk.

: fsync() causes all modified data and all file attributes of fildes
: (including access time, modification time and status change time) to
: be written to disk.

The implication is that the only thing you can lose after fdatasync is the highly-inessential file mod time. However, I have been told that on some implementations, fdatasync only flushes data blocks, and never writes the inode or indirect blocks. That would mean that if you had allocated new disk space to the file, fdatasync would not guarantee that that allocation was reflected on disk. This is the reason for preallocating the WAL log file (and doing a full fsync *at that time*). Then you know the inode block pointers and indirect blocks are down on disk, and so fdatasync is sufficient even if you have the cheesy version of fdatasync. Right now the WAL preallocation code (XLogFileInit) is not good enough because it does lseek to the 16MB position and then writes 1 byte there. 
On an implementation that supports holes in files (which is most Unixen) that doesn't cause physical allocation of the intervening space. We'd have to actually write zeroes into all 16MB to ensure the space is allocated ... but that's just a couple more lines of code. regards, tom lane
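The "couple more lines of code" might look roughly like this. The function name and segment-size macro are assumptions for illustration, not the actual XLogFileInit code:

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define XLOG_SEG_SIZE (16 * 1024 * 1024)   /* WAL segment size */

/* Preallocate a log segment by writing real zero blocks (not by seeking
 * past the end, which leaves a hole on most Unixen), then fsync() once
 * so the inode and indirect blocks are safely on disk.  After this, a
 * cheap fdatasync() at commit time has no metadata left to worry about. */
int xlog_file_init_sketch(const char *path)
{
    char zbuf[8192];
    long written = 0;
    int fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);

    if (fd < 0)
        return -1;
    memset(zbuf, 0, sizeof(zbuf));
    while (written < XLOG_SEG_SIZE) {
        if (write(fd, zbuf, sizeof(zbuf)) != (ssize_t) sizeof(zbuf)) {
            close(fd);
            return -1;
        }
        written += sizeof(zbuf);
    }
    if (fsync(fd) != 0) {       /* flush data AND block allocation */
        close(fd);
        return -1;
    }
    return close(fd);
}
```

The single fsync() at creation time is the expensive full-metadata flush; every later flush of the segment can then be the cheaper data-only variety.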
Re: [HACKERS] Performance lossage in checkpoint dumping
On Sat, 17 Feb 2001, Bruce Momjian wrote: No, but I haven't looked at it. I am now much more concerned with the delay, and am wondering if I should start thinking about trying my idea of looking for near-committers and post the patch to the list to see if anyone likes it for 7.1 final. Vadim will not be back in enough time to write any new code in this area, I am afraid. Near committers? *puzzled look* Umm, uh, it means backends that have entered COMMIT and will be issuing an fsync() of their own very soon. It took me a while to remember what I meant, too, because I was thinking of CVS committers. That's what I was thinking too, which was what was confusing the hell out of me ... like, a near committer ... is that the guy sitting beside you while you commit? :)
Re: [HACKERS] Performance lossage in checkpoint dumping
On Sat, 17 Feb 2001, Tom Lane wrote: The Hermit Hacker [EMAIL PROTECTED] writes: No way to group the writes so you can keep the most recent one open? Don't see an easy way, do you? No, but I haven't looked at it. I am now much more concerned with the delay, I concur. The blind write business is not important enough to hold up the release for --- for one thing, it has nothing to do with the pgbench results we're seeing, because these tests don't run long enough to include any checkpoint cycles. The commit delay, on the other hand, is a big problem. and am wondering if I should start thinking about trying my idea of looking for near-committers and post the patch to the list to see if anyone likes it for 7.1 final. Vadim will not be back in enough time to write any new code in this area, I am afraid. Near committers? *puzzled look* Processes nearly ready to commit. I'm thinking that any mechanism for detecting that might be overkill, however, especially compared to just setting commit_delay to zero by default. I've been sitting here running pgbench under various scenarios, and so far I can't find any condition where commit_delay > 0 is materially better than commit_delay=0, even under heavy load. It's either the same or much worse. Numbers to follow... Okay, if the whole commit_delay is purely meant as a performance thing, I'd say go with lowering the default to zero for v7.1, and once Vadim gets back, we can properly determine why it appears to improve performance in his case ... since I believe his OS of choice is FreeBSD, and you mentioned doing tests on it, I can't see how he'd have a more fine-grained select() than you have for testing ...
Re: [HACKERS] WAL and commit_delay
Right now the WAL preallocation code (XLogFileInit) is not good enough because it does lseek to the 16MB position and then writes 1 byte there. On an implementation that supports holes in files (which is most Unixen) that doesn't cause physical allocation of the intervening space. We'd have to actually write zeroes into all 16MB to ensure the space is allocated ... but that's just a couple more lines of code. Are OS's smart enough to not allocate zero-written blocks? Do we need to write non-zeros? -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup.| Drexel Hill, Pennsylvania 19026
Re: [HACKERS] WAL and commit_delay
* Bruce Momjian [EMAIL PROTECTED] [010217 14:46]: Right now the WAL preallocation code (XLogFileInit) is not good enough because it does lseek to the 16MB position and then writes 1 byte there. On an implementation that supports holes in files (which is most Unixen) that doesn't cause physical allocation of the intervening space. We'd have to actually write zeroes into all 16MB to ensure the space is allocated ... but that's just a couple more lines of code. Are OS's smart enough to not allocate zero-written blocks? Do we need to write non-zeros? I don't believe so; writing zeros is valid. -- Bruce Momjian | http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026 -- Larry Rosenman http://www.lerctr.org/~ler Phone: +1 972-414-9812 E-Mail: [EMAIL PROTECTED] US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749
Re: [HACKERS] WAL and commit_delay
* Bruce Momjian [EMAIL PROTECTED] [010217 14:46]: Right now the WAL preallocation code (XLogFileInit) is not good enough because it does lseek to the 16MB position and then writes 1 byte there. On an implementation that supports holes in files (which is most Unixen) that doesn't cause physical allocation of the intervening space. We'd have to actually write zeroes into all 16MB to ensure the space is allocated ... but that's just a couple more lines of code. Are OS's smart enough to not allocate zero-written blocks? Do we need to write non-zeros? I don't believe so. writing Zeros is valid. The reason I ask is because I know you get zeros when trying to read data from a file with holes, so it seems some OS could actually drop those blocks from storage. -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup.| Drexel Hill, Pennsylvania 19026
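One way to settle Bruce's question on a given system is to compare the allocated block counts of a file extended by seeking versus one filled with written zeroes. The function below is illustrative (names and sizes are my own); on typical Unix filesystems the written file gets real blocks while the seek-created one stays a hole, even though reads of both return zeroes:

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Create a ~1MB file either by writing real zeroes or by seeking to the
 * end and writing one byte (leaving a hole), and return st_blocks
 * (512-byte units).  Reads of the hole return zeroes either way, but
 * only the written file has blocks allocated on disk -- which is why
 * WAL preallocation must actually write the zeroes. */
long blocks_used(const char *path, int fill_with_writes)
{
    char zbuf[4096];
    struct stat st;
    int i;
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    memset(zbuf, 0, sizeof(zbuf));
    if (fill_with_writes) {
        for (i = 0; i < 256; i++)          /* write 1MB of real zeroes */
            (void) write(fd, zbuf, sizeof(zbuf));
    } else {
        lseek(fd, 256L * sizeof(zbuf) - 1, SEEK_SET);
        (void) write(fd, zbuf, 1);         /* 1 byte at the 1MB mark */
    }
    fsync(fd);
    fstat(fd, &st);
    close(fd);
    return (long) st.st_blocks;
}
```

On a filesystem that supports holes, blocks_used(path, 0) comes out far smaller than blocks_used(path, 1); filesystems without hole support would show them equal.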
Re: [HACKERS] WAL and commit_delay
Larry Rosenman [EMAIL PROTECTED] writes: I've written swap files and such with: dd if=/dev/zero of=SWAPFILE bs=512 count=204800 and all the blocks are allocated. I've also confirmed that writing zeroes is sufficient on HPUX (du shows that the correct amount of space is allocated, unlike the current seek-to-the-end method). Some poking around the net shows that pre-2.4 Linux kernels implement fdatasync() as fsync(), and we already knew that BSD hasn't got it at all. So distinguishing fdatasync from fsync won't be helpful for very many people yet --- but I still think we should do it. I'm playing with a test setup in which I just changed pg_fsync to call fdatasync instead of fsync, and on HPUX I'm seeing pgbench tps values around 17, as opposed to 13 yesterday. (The HPUX man page warns that these calls are inefficient for large files, and I wouldn't be surprised if a lot of the run time is now being spent in the kernel scanning through all the buffers that belong to the logfile. 2.4 Linux is apparently reasonably smart about this case, and only looks at the actually dirty buffers.) Is anyone out there running a 2.4 Linux kernel? Would you try pgbench with current sources, commit_delay=0, -B at least 1024, no -F, and see how the results change when pg_fsync is made to call fdatasync instead of fsync? (It's in src/backend/storage/file/fd.c) regards, tom lane
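The pg_fsync change Tom is testing can be sketched as below. HAVE_FDATASYNC stands in for a configure-time check, and pg_fsync_sketch is an illustrative name, not the actual function in src/backend/storage/file/fd.c:

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Sketch: use fdatasync() where the platform provides it, falling back
 * to fsync() elsewhere.  On pre-2.4 Linux fdatasync is just an alias
 * for fsync, and BSD lacks it entirely, so the fallback branch is what
 * those platforms effectively get anyway. */
#if defined(__linux__)
#define HAVE_FDATASYNC 1
#endif

int pg_fsync_sketch(int fd)
{
#ifdef HAVE_FDATASYNC
    return fdatasync(fd);   /* data + only metadata needed to read it */
#else
    return fsync(fd);       /* data + all inode attributes (extra seek) */
#endif
}
```

Because the WAL segment was preallocated and fully fsync'd at creation, the data-only flush is safe here: there is no new block allocation for fdatasync() to miss.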
Re: [HACKERS] WAL and commit_delay
On Sat, Feb 17, 2001 at 03:45:30PM -0500, Bruce Momjian wrote: Right now the WAL preallocation code (XLogFileInit) is not good enough because it does lseek to the 16MB position and then writes 1 byte there. On an implementation that supports holes in files (which is most Unixen) that doesn't cause physical allocation of the intervening space. We'd have to actually write zeroes into all 16MB to ensure the space is allocated ... but that's just a couple more lines of code. Are OS's smart enough to not allocate zero-written blocks? No, but some disks are. Writing zeroes is a bit faster on smart disks. This has no real implications for PG, but it is one of the reasons that writing zeroes doesn't really wipe a disk, for forensic purposes. Nathan Myers [EMAIL PROTECTED]
Re: [HACKERS] WAL and commit_delay
On Sat, 17 Feb 2001, Tom Lane wrote: Another thing I am wondering about is why we're not using fdatasync(), where available, instead of fsync(). The whole point of preallocating the WAL files is to make fdatasync safe, no? Linux/x86 fdatasync(2) manpage: BUGS Currently (Linux 2.0.23) fdatasync is equivalent to fsync. -- Dominic J. Eidson "Baruk Khazad! Khazad ai-menu!" - Gimli --- http://www.the-infinite.org/ http://www.the-infinite.org/~dominic/
[HACKERS] Linux 2.2 vs 2.4
Hi, Not sure if anyone will find this of interest, but I ran pgbench on my main Linux box to see what sort of performance difference might be visible between 2.2 and 2.4 kernels.

Hardware: A dual P3-450 with 384Mb of RAM and 3 SCSI disks. The pg datafiles live in a half-gig partition on the first one.

Software: Red Hat 6.1 plus all sorts of bits and pieces. PostgreSQL 7.1beta4 RPMs. pgbench hand-compiled from source for same. No options changed from defaults. (I'll look at that tomorrow -- is there anything worth changing other than commit_delay and fsync?)

Kernels: 2.2.15 + software RAID patches, 2.4.2-pre2

With 2.2.15:
pgbench -s5 -i: 1:27.78 elapsed
pgbench -s5 -t100:
clients: TPS / TPS (excluding connection establishment)
 1: 39.66 / 40.08
 2: 60.77 / 61.64
 4: 76.15 / 77.42
 8: 90.99 / 92.73
16: 71.10 / 72.15
32: 49.20 / 49.70
 1: 27.76 / 28.00
 1: 27.82 / 28.03
pgbench -v -s5 -t100:
 1: 30.73 / 30.98

And with 2.4.2-pre2:
pgbench -s5 -i: 1:17.46 elapsed
pgbench -s5 -t100:
 1: 43.57 / 44.11
 2: 62.85 / 63.86
 4: 87.24 / 89.08
 8: 86.60 / 88.38
16: 53.22 / 53.88
32: 60.28 / 61.10
 1: 35.93 / 36.33
 1: 34.82 / 35.18
pgbench -v -s5 -t100:
 1: 35.70 / 36.01

Overall, two things jump out at me. Firstly, it looks like 2.4 is mixed news for heavy pgbench users :) Low-utilisation numbers are better, but the sweet spot seems lower and narrower. Secondly, on both occasions after a run, performance has been more than 20% lower. Restarting or performing a full vacuum does not seem to help. Is there some sort of fragmentation issue here? Matthew.
Re: [HACKERS] Microsecond sleeps with select()
On Sat, Feb 17, 2001 at 12:26:31PM -0500, Tom Lane wrote:
| Bruce Momjian [EMAIL PROTECTED] writes:
| | A comment on microsecond delays using select(). Most Unix kernels
| | run at 100hz, meaning that they have a programmable timer that
| | interrupts the CPU every 10 milliseconds.
|
| Right --- this probably also explains my observation that some kernels
| seem to add an extra 10msec to the requested sleep time. Actually
| they're interpreting a one-clock-tick select() delay as "wait till the
| next clock tick, plus one tick". The actual delay will be between one
| and two ticks depending on just when you went to sleep.
| ...
| In short: s_spincycle in its current form does not do anything
| anywhere near what the author thought it would. It's wasted
| complexity. I am thinking about simplifying s_lock_sleep down to
| simple wait-one-tick-on-every-call logic. An alternative is to keep
| s_spincycle, but populate it with, say, 1, 2 and larger entries, which
| would offer some hope of actual random-backoff behavior. Either change
| would clearly be a win on single-CPU machines, and I doubt it would
| hurt on multi-CPU machines. Comments?

I don't believe that most kernels schedule only on clock ticks. They schedule on a clock tick *or* whenever the process yields, which on a loaded system may be much more frequently. The question is whether, when scheduling, the kernel considers processes that have requested to sleep less than a clock tick as "ready" once their actual request time expires. On V7 Unix, the answer was no, because the kernel had no way to measure any time shorter than a tick, so it rounded up all sleeps to "the next tick". Certainly there are machines and kernels that count time more precisely (isn't PG ported to QNX?). We do users of such kernels no favors by pretending they only count clock ticks. Furthermore, a 1ms clock tick is pretty common, e.g. on Alpha boxes. A 10ms initial delay is ten clock ticks, far longer than seems appropriate.
This argues for yielding the minimum discernable amount of time (1us) and then backing off to a less-minimal time (1ms). On systems that chug at 10ms, this is equivalent to a sleep of up-to-10ms (i.e. until the next tick), then a sequence of 10ms sleeps; on dumbOS Alphas, it's equivalent to a sequence of 1ms sleeps; and on a smartOS on an Alpha it's equivalent to a short, variable time (long enough for other runnable processes to run and yield) followed by a sequence of 1ms sleeps. (Some of the numbers above are doubled on really dumb kernels, as Tom noted.) Nathan Myers [EMAIL PROTECTED]
Re: [HACKERS] Re: WAL and commit_delay
On Sat, Feb 17, 2001 at 06:30:12PM -0500, Brent Verner wrote:
| On 17 Feb 2001 at 17:56 (-0500), Tom Lane wrote:
|
| [snipped]
|
| | Is anyone out there running a 2.4 Linux kernel? Would you try pgbench
| | with current sources, commit_delay=0, -B at least 1024, no -F, and see
| | how the results change when pg_fsync is made to call fdatasync instead
| | of fsync? (It's in src/backend/storage/file/fd.c)
|
| I've not run this requested test, but glibc-2.2 provides this bit
| of code for fdatasync, so it /appears/ to me that kernel version
| will not affect the test case.
|
| [glibc-2.2/sysdeps/generic/fdatasync.c]
|
| int
| fdatasync (int fildes)
| {
|   return fsync (fildes);
| }

In the 2.4 kernel it says (fs/buffer.c):

/* this needs further work, at the moment it is identical to fsync() */
down(&inode->i_sem);
err = file->f_op->fsync(file, dentry);
up(&inode->i_sem);

We can probably expect this to be fixed in an upcoming 2.4.x, i.e. well before 2.6. This is moot, though, if you're writing to a raw volume, which you will be if you are really serious. Then, fsync really is equivalent to fdatasync.

Nathan Myers [EMAIL PROTECTED]
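The preallocation argument from earlier in the thread can be made concrete: if the log file's length never changes after creation, a real fdatasync can skip the inode update that fsync must perform on every call. Below is a minimal sketch of that pattern, not PostgreSQL's actual WAL code; the file name and sizes are arbitrary. Where fdatasync is merely an alias for fsync (as in the glibc fallback quoted above), the sketch still works, just without the metadata-skipping benefit.

```c
/* Preallocate a log file once (making its metadata durable), then
 * flush only data blocks on each "commit". */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define LOG_SIZE (16 * 1024)    /* arbitrary demo size */

int wal_demo(const char *path)
{
    char block[512];
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    /* Preallocate: write zeros out to the full length, then fsync so
     * the file size is on disk before any commits happen. */
    memset(block, 0, sizeof(block));
    for (unsigned i = 0; i < LOG_SIZE / sizeof(block); i++)
        if (write(fd, block, sizeof(block)) != (ssize_t) sizeof(block))
            return -1;
    if (fsync(fd) != 0)
        return -1;

    /* A "commit": overwrite in place and flush data only.  The length
     * is unchanged, so fdatasync need not rewrite the inode. */
    memset(block, 'x', sizeof(block));
    if (lseek(fd, 0, SEEK_SET) == (off_t) -1 ||
        write(fd, block, sizeof(block)) != (ssize_t) sizeof(block))
        return -1;
    if (fdatasync(fd) != 0)
        return -1;
    return close(fd);
}
```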
[HACKERS] Re: Re: WAL and commit_delay
On 17 Feb 2001 at 15:53 (-0800), Nathan Myers wrote:
| On Sat, Feb 17, 2001 at 06:30:12PM -0500, Brent Verner wrote:
| | On 17 Feb 2001 at 17:56 (-0500), Tom Lane wrote:
| |
| | [snipped]
| |
| | | Is anyone out there running a 2.4 Linux kernel? Would you try pgbench
| | | with current sources, commit_delay=0, -B at least 1024, no -F, and see
| | | how the results change when pg_fsync is made to call fdatasync instead
| | | of fsync? (It's in src/backend/storage/file/fd.c)
| |
| | I've not run this requested test, but glibc-2.2 provides this bit
| | of code for fdatasync, so it /appears/ to me that kernel version
| | will not affect the test case.
| |
| | [glibc-2.2/sysdeps/generic/fdatasync.c]
| |
| | int
| | fdatasync (int fildes)
| | {
| |   return fsync (fildes);
| | }
|
| In the 2.4 kernel it says (fs/buffer.c)
|
| /* this needs further work, at the moment it is identical to fsync() */
| down(&inode->i_sem);
| err = file->f_op->fsync(file, dentry);
| up(&inode->i_sem);
|
| We can probably expect this to be fixed in an upcoming 2.4.x, i.e.
| well before 2.6.

2.4.0-ac11 already has provisions for fdatasync [fs/buffer.c]:

352 asmlinkage long sys_fsync(unsigned int fd)
353 {
...
372     down(&inode->i_sem);
373     filemap_fdatasync(inode->i_mapping);
374     err = file->f_op->fsync(file, dentry, 0);
375     filemap_fdatawait(inode->i_mapping);
376     up(&inode->i_sem);

384 asmlinkage long sys_fdatasync(unsigned int fd)
385 {
...
403     down(&inode->i_sem);
404     filemap_fdatasync(inode->i_mapping);
405     err = file->f_op->fsync(file, dentry, 1);
406     filemap_fdatawait(inode->i_mapping);
407     up(&inode->i_sem);

ext2 does use this third param of its fsync() operation to (potentially) bypass a call to ext2_sync_inode(inode)
Re: [HACKERS] Linux 2.2 vs 2.4
Matthew Kirkwood [EMAIL PROTECTED] writes:
| No options changed from defaults. (I'll look at that tomorrow -- is
| there anything worth changing other than commit_delay and fsync?)

-B for sure ... the default -B is way too small for WAL.

| Firstly, it looks like 2.4 is mixed news for heavy pgbench users :)
| Low-utilisation numbers are better, but the sweet spot seems lower
| and narrower.

Huh? With the exception of the 16-user case (possibly measurement noise), 2.4 looks better across the board, AFAICS. But see below.

| Secondly, on both occasions after a run, performance has been more
| than 20% lower.

I find that pgbench's reported performance can vary quite a bit from run to run, at least with smaller values of total transactions. I think this is because it's a bit of a crapshoot how many WAL logfile initializations occur during the run and get charged against the total time. Not to mention whatever else the machine might be doing. With longer runs (say at least 1000 total transactions) the numbers should stabilize. I wouldn't put any faith at all in tests involving less than about 1000 total transactions...

regards, tom lane
Re: [HACKERS] Microsecond sleeps with select()
[EMAIL PROTECTED] (Nathan Myers) writes: Certainly there are machines and kernels that count time more precisely (isn't PG ported to QNX?). We do users of such kernels no favors by pretending they only count clock ticks. Furthermore, a 1ms clock tick is pretty common, e.g. on Alpha boxes. Okay, I didn't know there were any popular systems that did that. This argues for yielding the minimum discernable amount of time (1us) and then backing off to a less-minimal time (1ms). Fair enough. As you say, it's the same result on machines with coarse time resolution, and it should help on smarter boxes. The main thing is that I want to change the zero entries in s_spincycle, which clearly aren't doing what the author intended. regards, tom lane
Re: [HACKERS] Re: WAL and commit_delay
[EMAIL PROTECTED] (Nathan Myers) writes:
| In the 2.4 kernel it says (fs/buffer.c)
|
| /* this needs further work, at the moment it is identical to fsync() */
| down(&inode->i_sem);
| err = file->f_op->fsync(file, dentry);
| up(&inode->i_sem);

Hmm, that's the same code that's been there since 2.0 or before. I had trawled the Linux kernel mail lists and found patch submissions from several different people to make fdatasync really work, and what I thought was an indication that at least one had been applied. Evidently not. Oh well...

regards, tom lane
Re: [HACKERS] Re: WAL and commit_delay
[EMAIL PROTECTED] (Nathan Myers) writes: I.e. yes, Linux 2.4.0 and ext2 do implement the distinction. Sorry for the misinformation. Okay ... meanwhile I've got to report the reverse: I've just confirmed that on HPUX 10.20, there is *not* a distinction between fsync and fdatasync. I was misled by what was apparently an outlier result on my first try with fdatasync plugged in ... but when I couldn't reproduce that, some digging led to the fact that the fsync and fdatasync symbols in libc are at the same place :-(. Still, using fdatasync for the WAL file seems like a forward-looking thing to do, and it'll just take another couple of lines of configure code, so I'll go ahead and plug it in. regards, tom lane
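The "couple of lines of configure code" Tom mentions would typically surface as a compile-time fallback like the following sketch. HAVE_FDATASYNC here is a stand-in for whatever symbol configure actually defines, and flush_wal_fd is a name of my own invention, not PostgreSQL's pg_fsync:

```c
/* Use fdatasync where the platform provides a real one, and fall back
 * to fsync elsewhere (e.g. HPUX 10.20, where the two libc symbols
 * turn out to be the same code). */
#include <fcntl.h>      /* open(), for callers creating a test file */
#include <unistd.h>

int flush_wal_fd(int fd)
{
#ifdef HAVE_FDATASYNC
    return fdatasync(fd);   /* data-only flush where supported */
#else
    return fsync(fd);       /* portable fallback */
#endif
}
```

On platforms where the symbols coincide, as Tom found on HPUX, the two branches cost the same; the configure check simply makes the faster path available where it exists.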