Re: [HACKERS] Rendezvous/Bonjour broken in 8.1 beta
FYI: The DNSServiceDiscovery.h is Mac OS X specific and only kept around for Mac OS X 10.2 and older compatiability. Apple has moved to a cross platform impelentation with the dns_sd.h API available for OS X, Win32, and most any posix type system. Any reference to DNSServiceDiscovery.h in pgsql should go away (unless you want to keep support for pre-10.3 Mac OS X) and be replaced with the newer dns_sd.h interfaces. On 9/8/05, Andrew Dunstan [EMAIL PROTECTED] wrote: It gets a lot worse. The Windows SDK does not export the same API. None of the DNSfoo things we refer to in postmaster.c are defined in the supplied header file (not to mention other idiocies I was able to negotiate past). Apparently Apple have never heard of code portability. -- Sandy McArthur ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] FAQ/HTML standard?
On 9/11/05, Bruno Wolff III wrote: On Sat, Sep 10, 2005 at 14:31:06 -0400, Andrew Dunstan wrote: XHTML is simply a minimal reformulation of HTML in XML, and even uses the HTML 4.01 definitions for its semantics. Given that, it's hard to see why it should be considered a bad thing. Here is the article: http://www.hixie.ch/advocacy/xhtml I prefer standards over opinions: quote 5.1. Internet Media Type XHTML Documents which follow the guidelines set forth in Appendix C, HTML Compatibility Guidelines may be labeled with the Internet Media Type text/html [RFC2854], as they are compatible with most HTML browsers. Those documents, and any other document conforming to this specification, may also be labeled with the Internet Media Type application/xhtml+xml as defined in [RFC3236]. (..) /quote http://www.w3.org/TR/xhtml1/ So just follow the compatibility guidelines (we want people to be able to read the FAQ anyway) and use text/html. Jochem ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
[HACKERS] random system table corruption ...
in the past we have faced a couple of problems with corrupted system tables. this seems to be a version independent problem which occurs on hackers' from time to time. i have checked a broken file and i have seen that the corrupted page has actually been zeroed out. my question is: are there any options to implement something which makes system tables more robust? the problem is: the described error happens only once i an while and cannot be reproduced. maybe there is a way to add some more sanity checks before the page is actually written. any suggestions? best regards, hans -- Cybertec Geschwinde Schönig GmbH Schöngrabern 134; A-2020 Hollabrunn Tel: +43/1/205 10 35 / 340 www.postgresql.at, www.cybertec.at ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] random system table corruption ...
On Sun, Sep 11, 2005 at 01:12:34PM +0200, Hans-Jürgen Schönig wrote: in the past we have faced a couple of problems with corrupted system tables. this seems to be a version independent problem which occurs on hackers' from time to time. i have checked a broken file and i have seen that the corrupted page has actually been zeroed out. Near as I can tell, the only times pages are zeroed out is if zero_damaged_pages is set (destroying the evidence) or during WAL recovery. my question is: are there any options to implement something which makes system tables more robust? the problem is: the described error happens only once i an while and cannot be reproduced. maybe there is a way to add some more sanity checks before the page is actually written. Well, the most common causes are dodgy memory. Other than that I guess you could arrange for bgwriter to check the pages it is writing. I imagine it already does check the header, checking the data requires knowledge about the actual table and attributes. And about the only thing that says I'm broken is a varlena value with a long value. As they say, the only thing sure would be to have a backup. the only thing I can imagine being really useful would be a restore mode where you feed it the schema so it can reconstruct the pg_class and pg_attribute just enough for you to dump it to reconstruct everything... You know, VACUUM FREEZE BACKUP on pg_catalog, physically copy the datafiles and offer the option to blat your catalog with an old one... -- Martijn van Oosterhout kleptog@svana.org http://svana.org/kleptog/ Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a tool for doing 5% of the work and then sitting around waiting for someone else to do the other 95% so you can sue them. pgp7YJB3n07fv.pgp Description: PGP signature
[HACKERS] Benchmarks?
Hi, I have tried to post the following question in the general malinglist, but haven't got any respons. I really hope some pgsql-developer can help. Btw. this is my first mail to this list, so please bear with me. I'm a computer science student doing a thesis on paging algorithms, both from a theoretical and practical viewpoint. I'm currently looking for some standard way of testing the performance of paging alogrithms. But I haven't been able to find any paging benchmarks anywhere (both in the literature and on the web). I would therefore like to know if the developers of postgresql are doing any kind of paging benchmarks? And in case they do, I would like to see them. Or there might be someone who has a good idea how to produce real-life sequences of requests, that I might test the algorithms against. Thanks in advance, Martin ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
[HACKERS] problem for o/p
sir, i'am getting different o/p for the same program code as well as the data base is the same .twice the output is being printed on some machines whereas correct o/p on some.the program code is create or replace function f() return int4 as' c1 cursor for select empname from emp; e1 emp.ename%type; begin open c1; loop fetch c1 into e1; exit when not found; raise notice ' 'emp name %' ', e1; end loop; close c1; return 1; end; 'language 'plpgsql'; Yahoo! India Matrimony: Find your partner now.
Re: [HACKERS] FAQ/HTML standard?
On Saturday 10 September 2005 12:10, Andrew Dunstan wrote: Is there an HTML standard that we try to follow in our HTML docs such as FAQs? If there isn't an explicit standard, may I suggest that we adopt XHTML 1.0 as the standard? Really the FAQ files need to be able to validate when viewed through the postgresql website, where we use: !DOCTYPE html PUBLIC -//W3C//DTD XHTML 1.0 Strict//EN http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd; I have corrected some markup in the past, but by no means would I say this has been done thoroughly wrt the FAQ's. FWIW I'm not against changing the source to docbook if someone wanted to be really ambitious and convert all of the FAQ's to such a scheme (including the text only ones) as long as we had a way to easily build them out of cvs and into the website. I imagine my other web cohorts would probably think similarly. -- Robert Treat Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
[HACKERS] -fPIC
So far, we have tended to use -fpic to compile position-independent code until we have received some sort of overflow that forced the use of -fPIC. Since 8.0, the makefiles to build shared libraries are also available to external modules through the pgxs system, so we cannot exactly check anymore what the fpic vs. fPIC requirement of each conceivable module is. I have just received confirmation that PL/Java needs -fPIC to compile on Alpha and S/390 on Linux, so we need to make at least that change, but maybe it's more prudent to change to -fPIC across the board now. Comments? -- Peter Eisentraut http://developer.postgresql.org/~petere/ ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] initdb profiles
Tom Lane [EMAIL PROTECTED] writes: It'd be nice to get out from under the fixed-size-shmem restriction, but I don't know any very portable way to do that. Without knowing that part of the code at all it seems to me the logical approach would be to make the fsm steal its pages out of the shared buffers allocation. That is, you specify a total amount of shared memory to allocate and Postgres decides how much of it to use for shared buffers and how much for fsm. -- greg ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] random system table corruption ...
=?ISO-8859-1?Q?Hans-J=FCrgen_Sch=F6nig?= [EMAIL PROTECTED] writes: in the past we have faced a couple of problems with corrupted system tables. this seems to be a version independent problem which occurs on hackers' from time to time. i have checked a broken file and i have seen that the corrupted page has actually been zeroed out. That sounds to me like a hardware problem --- disk or disk controller momentarily writing zeroes instead of what it should write. Have you seen this on more than one physical machine? Do you have any evidence for the implication that it only happens to system tables and not user tables? Also, you don't have zero_damaged_pages turned on by any chance? regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] -fPIC
Peter Eisentraut [EMAIL PROTECTED] writes: So far, we have tended to use -fpic to compile position-independent code until we have received some sort of overflow that forced the use of -fPIC. Since 8.0, the makefiles to build shared libraries are also available to external modules through the pgxs system, so we cannot exactly check anymore what the fpic vs. fPIC requirement of each conceivable module is. I have just received confirmation that PL/Java needs -fPIC to compile on Alpha and S/390 on Linux, so we need to make at least that change, but maybe it's more prudent to change to -fPIC across the board now. Comments? PL/Java is bigger than the whole backend? The reason for -fpic vs -fPIC (on the machines where it makes any difference at all) is that the former is faster. I'm not real thrilled by the prospect that a bloated add-on should get to dictate an across-the-board slowdown even on installations where it will never be used. I think the correct answer is for PL/Java to do s/-fpic/-fPIC/ on CFLAGS in its Makefile, rather than trying to force the same on everything else. regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] random system table corruption ...
Tom Lane wrote: =?ISO-8859-1?Q?Hans-J=FCrgen_Sch=F6nig?= [EMAIL PROTECTED] writes: in the past we have faced a couple of problems with corrupted system tables. this seems to be a version independent problem which occurs on hackers' from time to time. i have checked a broken file and i have seen that the corrupted page has actually been zeroed out. That sounds to me like a hardware problem --- disk or disk controller momentarily writing zeroes instead of what it should write. Have you seen this on more than one physical machine? Do you have any evidence for the implication that it only happens to system tables and not user tables? Also, you don't have zero_damaged_pages turned on by any chance? regards, tom lane tom, well, there is some evidence that this is not a hardware related issue. we have only seen this problem from time to time but it happened on different machines. it cannot be reproduced. it can even happen when somebody runs a script which has been called million times before. in my current scenario the page header only consists of 0x00 bytes and therefore the page checks fails when reading the system table. i have never seen this in data files up to now (at least not when the hardware was still intact). did anybody face similar problems? maybe on sun? by the way: currently the broken system is running PostgreSQL 7.4 but as I said - we have also seen that on 8.0 once. best regards, hans -- Cybertec Geschwinde Schönig GmbH Schöngrabern 134; A-2020 Hollabrunn Tel: +43/1/205 10 35 / 340 www.postgresql.at, www.cybertec.at ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] initdb profiles
Tom Lane wrote: The thought behind my suggestion was that the current max_fsm_pages default of 2 pages is enough to track free space in a database of maybe a few hundred megabytes. The other defaults are sized appropriately for machines with about that much in main memory. This doesn't seem to add up :-(. The default max_fsm_pages probably should be about ten times bigger just to bring it in balance with the other defaults ... after that we could talk about increasing the defaults across-the-board. Ok, how about this? I based the numbers on your 10*current suggestion and some linear scaling When we test connection currently, we use shared bufferes if n*5. We could add in a setting of max_fsm_pages = n * 1000 in line with the arithmetic - not sure if it's worth it. When we test n shared buffers, let's add in a max_fsm_pages setting of n * 200. Another alternative I thought might be better would be that instead of fixing the default max_fsm_pages at 2, we set the default at a fixed ratio (say 200:1) to shared_buffers. Not sure how easy that is to do via the GUC mechanism. Lastly, I would suggest that we increase the limits we try modestly - adding in 400, 350, 300, 250, 200, and 150 to the number of cennections tried, and perhaps 3000, 2500, 2000 and 1500 to number of buffers tried. These numbers aren't entirely plucked out of the air. The number of connections is picked to match then number of clients a default apache setup can have under a hybrid MPM, and the number of shared buffers is picked to be somewhat less than the 10% on a modest machine that Peter thought would be too much. cheers andrew ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] -fPIC
Tom Lane wrote: PL/Java is bigger than the whole backend? No, it's not, but the backend is not compiled as position-independent. The reason for -fpic vs -fPIC (on the machines where it makes any difference at all) is that the former is faster. I don't doubt that, but out of curiosity, considering that everyone else is using libtool, and libtool always uses -fPIC, what kind of impact does this *really* have? I think the correct answer is for PL/Java to do s/-fpic/-fPIC/ on CFLAGS in its Makefile, rather than trying to force the same on everything else. That would certainly work, but is that the kind of interface we want to offer? In the extreme case, a module could end up redefining a great deal of the shared library knowledge that it was supposed to not have to care about. -- Peter Eisentraut http://developer.postgresql.org/~petere/ ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] initdb profiles
On Sun, Sep 11, 2005 at 12:15:01PM -0400, Greg Stark wrote: It'd be nice to get out from under the fixed-size-shmem restriction, but I don't know any very portable way to do that. Without knowing that part of the code at all it seems to me the logical approach would be to make the fsm steal its pages out of the shared buffers allocation. That is, you specify a total amount of shared memory to allocate and Postgres decides how much of it to use for shared buffers and how much for fsm. FWIW, I know this is how DB2 does things, and I think Oracle's the same. We probably still want some kind of limit so it doesn't blow the buffer cache completely out. -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.comwork: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] random system table corruption ...
On Sun, Sep 11, 2005 at 01:12:34PM +0200, Hans-Jürgen Schönig wrote: in the past we have faced a couple of problems with corrupted system tables. this seems to be a version independent problem which occurs on hackers' from time to time. i have checked a broken file and i have seen that the corrupted page has actually been zeroed out. IIRC the XFS filesystem zeroes out pages that it recovers from the journal but did not have a fsync on them (AFAIK XFS journals only metadata, so page creation but not the content itself). I don't think this would be applicable to your case, because we do fsync modified files on checkpoint, and rewrite them completely from WAL images after that. But I thought I'd mention it. -- Alvaro Herrera -- Valdivia, Chile Architect, www.EnterpriseDB.com Just treat us the way you want to be treated + some extra allowance for ignorance.(Michael Brusser) ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] -fPIC
On Sun, Sep 11, 2005 at 05:49:40PM +0200, Peter Eisentraut wrote: So far, we have tended to use -fpic to compile position-independent code until we have received some sort of overflow that forced the use of -fPIC. Since 8.0, the makefiles to build shared libraries are also available to external modules through the pgxs system, so we cannot exactly check anymore what the fpic vs. fPIC requirement of each conceivable module is. I have just received confirmation that PL/Java needs -fPIC to compile on Alpha and S/390 on Linux, so we need to make at least that change, but maybe it's more prudent to change to -fPIC across the board now. Comments? Can we avoid those relocation by not exporting variables and function that shouldn't be exported and marking them static? Or is static already being used properly? Kurt ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] random system table corruption ...
Alvaro Herrera wrote: On Sun, Sep 11, 2005 at 01:12:34PM +0200, Hans-Jürgen Schönig wrote: in the past we have faced a couple of problems with corrupted system tables. this seems to be a version independent problem which occurs on hackers' from time to time. i have checked a broken file and i have seen that the corrupted page has actually been zeroed out. IIRC the XFS filesystem zeroes out pages that it recovers from the journal but did not have a fsync on them (AFAIK XFS journals only metadata, so page creation but not the content itself). I don't think this would be applicable to your case, because we do fsync modified files on checkpoint, and rewrite them completely from WAL images after that. But I thought I'd mention it. alvora, thanks a lot. we have some reports about sun systems. meanwhile i got the impression that the filesystem might be doing something wrong. i have seen that the page is not completely zeroed out. at some strange positions there are 2 bytes of crap (i have overlooked that at first glance). the first couple hundreds of bytes are crap, however. very strange ... best regards, hans -- Cybertec Geschwinde Schönig GmbH Schöngrabern 134; A-2020 Hollabrunn Tel: +43/1/205 10 35 / 340 www.postgresql.at, www.cybertec.at ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] -fPIC
* Peter Eisentraut ([EMAIL PROTECTED]) wrote: Tom Lane wrote: The reason for -fpic vs -fPIC (on the machines where it makes any difference at all) is that the former is faster. I don't doubt that, but out of curiosity, considering that everyone else is using libtool, and libtool always uses -fPIC, what kind of impact does this *really* have? I certainly wouldn't assume something done in libtool is necessairly the 'smart' approach, ever. I think the correct answer is for PL/Java to do s/-fpic/-fPIC/ on CFLAGS in its Makefile, rather than trying to force the same on everything else. That would certainly work, but is that the kind of interface we want to offer? In the extreme case, a module could end up redefining a great deal of the shared library knowledge that it was supposed to not have to care about. I don't think it's all that sane to expect a generalized build system to support every possible library compilation requirement... Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] problem for o/p
On Sat, Sep 10, 2005 at 12:11:52PM +0100, rajinder ruprai wrote: i'am getting different o/p for the same program code as well as the data base is the same .twice the output is being printed on some machines whereas correct o/p on some.the program code is [...] raise notice ' 'emp name %' ', e1; Where do the database server's logs go? You might be seeing duplicate output because the server's logs are sent to the same terminal as the client's output. BTW, this doesn't belong on pgsql-hackers. -- Michael Fuhr ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
[HACKERS] Spinlocks, yet again: a new test case
I've recently been taking another look at the test case shown here: http://archives.postgresql.org/pgsql-performance/2004-04/msg00280.php This is the infamous context swap storm problem that we've hacking at for so long. After the 8.1 buffer manager redesign, the problem of contention for the BufMgrLock is pretty much gone, and so the test case no longer exhibits a lot of semop()s. However ... performance of CVS tip on a multiprocessor still pretty much sucks :-( I started looking at this because of a discussion with a Red Hat customer, who was saying that the futex spinlock patch shown here http://archives.postgresql.org/pgsql-performance/2004-10/msg00185.php had helped him materially with a context-swap-storm problem on 8.0.3. That didn't make a lot of sense to me, because we had determined that the futex patch didn't help much for this problem (which is one reason why it never got applied). So I started digging, and I've found a few interesting things that we still need to fix. First off, the RH customer had found that the futex patch improved his query performance even though the context swap rate actually went *up* materially. So I realized that we shouldn't simply be looking at the CS rate and saying ugh, we have to get that down. It's important to find out how much work is really getting done, too. Accordingly, I generated a modified version of the test case with a shorter query that is actually meant to complete in a reasonable amount of time, and I looked at how long it took to complete multiple instances of this query running in parallel. Ideally, if you have N cpus you should be able to run up to N copies of the query in the same amount of time it takes for one copy, and then, say, 2*N copies ought to take about twice as long as N copies. We do actually seem to scale pretty linearly beyond N ... but below that, not. The files for the updated test case are attached if anyone else wants to try it. They are: test_setup.sql Run this to create the test tables test_run.sqlA version of the test query that will run pretty nearly forever (until you kill it) test_run_small.sql A one-minute-or-so test query startn.sh A shell script to run N copies of test_run_small.sql in parallel. You need this one-time setup: createdb test psql -f test_setup.sql test To perform testing, do time ./startn.sh N for N=1,2,... up to however many CPUs you have, and maybe a run with more queries than that to make sure performance doesn't drop off above that. Note the total time to complete the script for each N. You can also watch the CPU utilization and context swap rates using vmstat 1 or equivalent, and/or strace some of the backends to see what they're doing. I did my testing on a 4-way Opteron box at Red Hat, using kernel 2.6.12-1.1398_FC4smp. 8.0.3 completely sucks in this test: N, runtime: 1 45s 2 119s 4 507s (didn't try 8) With 4 processes running, it shows about 40K context swaps/sec and barely 15% CPU utilization! strace shows plenty of semop()s as expected, but also a fair number of delaying select() calls from s_lock(). I saw about one select for every hundred semops, but that was still enough to have 30 or so selects per second (per process). Since each select sleeps 10msec, that's about a third of the potential CPU usage being discarded in spinlock delays. (Even more interesting, this test case behaves much worse on Opteron than on other SMP boxes. I had previously done similar tests on a 4-way Xeon with similar-vintage kernel, and saw context swap rates over 140K/sec with about 70% CPU load. I don't have test_run_small results from that session, but clearly the spinlock code is behaving worse on the Opterons.) CVS tip eliminates essentially all the semops, as we hoped, but scalability is still not great: N, runtime: 1 36s 2 61s 4 105s 8 198s At 4 processes, CPU use is only about 30%; the rate of select() delays has gone up to 50/sec or more, so even though we fixed the bufmgr itself, we are still losing a lot of potential CPU cycles to spinlock waits. (Note: I later realized that the CVS-tip code was accidentally built with -O1, not -O2 as the 8.0.3 code had been, so these two sets of numbers are not completely comparable: with -O2 the single-query time drops to 31s in CVS tip. But the main point is all the time that's disappearing into spinlock delays.) I have some proposals for fixing this stuff, which I'll put into a separate message since this one is pretty long already. What I wanted to accomplish here is to document the test case and the problem, and that's done. Anyone have SMP boxes of other types that they can try this on? regards, tom lane drop table test_data; create table test_data(f1 int); insert into test_data values (random() * 100); insert into test_data select random() * 100 from test_data; insert into
[HACKERS] Spinlocks, yet again: analysis and proposed patches
The test case I just posted shows that our spinlock code, which we had thought largely done, is once again becoming a performance bottleneck. It's time to resurrect some of the ideas we kicked around in early 2002, and didn't pursue because we decided spinlocks weren't our major performance problem. I started investigating this by trying to understand why the futex spinlock patch had helped the Red Hat customer I mentioned, when it hadn't done anything much in last year's experiments. It turns out that this customer was using an 8-way Opteron, and the futex patch helps a lot more on that architecture than others. Some random experimentation eventually turned up part of the reason: the x86 TAS assembly code we are using is pessimal for x86_64. Specifically, s_lock.h uses this for both architectures: /* Use a non-locking test before asserting the bus lock */ __asm__ __volatile__( cmpb$0,%1\n jne1f\n lock \n xchgb%0,%1 \n 1: \n and it turns out that deleting the cmpb and jne makes things go significantly faster on Opterons. So the non-locking test is not a win at all on x86_64. As best I can tell from testing, removing it wins because it causes the rate of spinlock delays to drop significantly. Why this should be is not completely obvious. I speculate that the Opterons are capable of pulling in a copy of the cache line that contains the spinlock and then running 100 iterations of the cmpb test without ever noticing that the processor holding the lock has now released it. Without the cmpb, they are forced by the locking xchgb test to actually look at the real state of the spinlock each time. I kinda suspect that the cmpb test is a no-op or loss on all Intelish processors: it can only be a win if you expect a lot of contention for the spin lock, but in percentage terms we still have a very low conflict rate, so in most executions of the TAS macro, the cmpb is just wasted cycles. Bottom line: we definitely don't want it for x86_64, and maybe not at all, but we need more research to decide the latter. The second reason that the futex patch is helping is that when a spinlock delay does occur, it allows the delaying process to be awoken almost immediately, rather than delaying 10 msec or more as the existing code does. However, given that we are only expecting the spinlock to be held for a couple dozen instructions, using the kernel futex mechanism is huge overkill --- the in-kernel overhead to manage the futex state is almost certainly several orders of magnitude more than the delay we actually want. I looked into several other methods of doing the spinlock delay instead. I think all of these were suggested at one point or another in our earlier discussions of spinlocks: 1. Use sched_yield() if available: it does just what we want, ie, yield the processor without forcing any useless time delay before we can be rescheduled. This doesn't exist everywhere but it exists in recent Linuxen, so I tried it. It made for a beautiful improvement in my test case numbers: CPU utilization went to 100% and the context swap rate to almost nil. Unfortunately, I also saw fairly frequent stuck spinlock panics when running more queries than there were processors --- this despite increasing NUM_DELAYS to 1 in s_lock.c. So I don't trust sched_yield anymore. Whatever it's doing in Linux 2.6 isn't what you'd expect. (I speculate that it's set up to only yield the processor to other processes already affiliated to that processor. In any case, it is definitely capable of getting through 1 yields without running the guy who's holding the spinlock.) 2. Reduce the length of the select() delays in s_lock. The current code delays in quanta of 10msec, but on recent Linuxen (and I think other platforms too) the scheduling quantum is 1msec, so we can get the processor back faster if we ask for a 1msec delay. I tried this and it is definitely a win on Linux; it doesn't seem to hurt anything on older systems either, they just round the delay up to 10msec like before. So I think we should do this, even though it's only a partial answer. 3. Modify the spin loop to do a little more work between TAS() tests. In the existing code there are only about half a dozen instructions total in the normal spin loop, most of them very fast, with the result that the spinning processor does its best to monopolize the system bus with locked probe instructions. This is obviously not good, as it'll interfere with the ability of the spinlock holder to complete its work and release the lock. (The bulk of the spinlock uses are for LWLocks, and with the current data structures the LWLock's own state is usually going to be in the same cache line as the spinlock proper, so that cache grabs for the spinlock hurt the LWLock state updating as well as the eventual spinlock release.) I modified the code to use an integer modulo operation as
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
On Sun, Sep 11, 2005 at 05:59:49PM -0400, Tom Lane wrote: I kinda suspect that the cmpb test is a no-op or loss on all Intelish processors: it can only be a win if you expect a lot of contention for the spin lock, but in percentage terms we still have a very low conflict rate, so in most executions of the TAS macro, the cmpb is just wasted cycles. Bottom line: we definitely don't want it for x86_64, and maybe not at all, but we need more research to decide the latter. I think an important question is wether this is for x86_64 in general, of opteron specific. It could be that it's not the same on Intel's EM64Ts. One reason this might behave differently for opterons is that it's a cc-NUMA instead of the normal SMP. Something else to consider is the OS you're using. I've been told that Linux isn't that good in NUMA and FreeBSD might be better. Kurt ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Spinlocks, yet again: a new test case
On Sun, 11 Sep 2005, Tom Lane wrote: The files for the updated test case are attached if anyone else wants to try it. They are: test_setup.sql Run this to create the test tables test_run.sqlA version of the test query that will run pretty nearly forever (until you kill it) test_run_small.sql A one-minute-or-so test query startn.sh A shell script to run N copies of test_run_small.sql in parallel. You need this one-time setup: createdb test psql -f test_setup.sql test To perform testing, do time ./startn.sh N for N=1,2,... up to however With HEAD on a 4-way Xeon, I'm seeing: 1: 56 2: 120 4: 230 Removing the the non-locking test and branch, which you suggested down-thread: 1: 55: 2: 111: 4: 207 -- Gavin ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] statement logging / extended query protocol issues
Simon Riggs wrote: Are we sure there is just 3 cases? I haven't exhaustively checked, but I think those are the main cases. Even if case (3) is not that common, I still want to know it is occurring, to see what effect or overhead it has. I don't want it to be more verbose than the other cases when I set log_statement = all. We'll only see the output for case (3) when someone has programmed things that way by using setFetchSize. Can we put extra output in this case into log_statement = verbose only please? (1) jdbc parse jdbc bind jdbc execute LOG: statement: SELECT * from pg_proc Notice that the parse of the unnamed statement does *not* now generate a log record. What about the syntax error case? (2) jdbc parse S_1 LOG: statement: PREPARE S_1 AS SELECT * from pg_proc (perhaps this should be logged at BIND time, just like the optimization?) jdbc bind S_1 jdbc execute LOG: statement: EXECUTE unnamed [PREPARE: SELECT * from pg_proc] I do not like logging queries that the driver never sent (the driver sends neither PREPARE nor EXECUTE). I also don't see why it's useful to log the statement and portal names. Can we reword this to what I suggested previously? LOG: parse statement: SELECT * from pg_proc LOG: execute statement: SELECT * from pg_proc (3) jdbc prepare S_2 LOG: statement: PREPARE S_2 AS SELECT * from pg_proc jdbc bind S_2 to C_2 jdbc execute C_2 LOG: statement: EXECUTE C_2 ROWS 42 [PREPARE: SELECT * from pg_proc] jdbc next (after cache has run out on 42nd row) v3 protocol sends E for Execute, execution halts at 49 rows for this set of bind parameters LOG: statement: FETCH C_2 ROWS 7 Again I do not like logging synthetic queries that the driver never sent (PREPARE / EXECUTE / FETCH). BTW, if you do it this way, you could get the bizarre PREPARE S_2 AS PREPARE xyz AS SELECT .. result if the application used PREPARE itself. I think that logging the second and subsequent Executes is not normally useful and shouldn't happen when log_statement = all. In that case you don't need to log the portal name either. So for the normal case: LOG: parse statement: SELECT * from pg_proc LOG: execute statement: SELECT * from pg_proc and for the verbose case perhaps something like: LOG: parse statement: SELECT * from pg_proc LOG: execute statement (C_2, 42 rows): SELECT * from pg_proc LOG: fetch statement results (C_2, 7 rows) -O ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Kurt Roeckx [EMAIL PROTECTED] writes: On Sun, Sep 11, 2005 at 05:59:49PM -0400, Tom Lane wrote: I kinda suspect that the cmpb test is a no-op or loss on all Intelish processors: I think an important question is wether this is for x86_64 in general, of opteron specific. It could be that it's not the same on Intel's EM64Ts. Good point --- anyone have one to try? Something else to consider is the OS you're using. I've been told that Linux isn't that good in NUMA and FreeBSD might be better. It's hard to see how the OS could affect behavior at the level of processor cache operations --- unless they did something truly spectacularly stupid, like mark main memory non-cacheable. regards, tom lane ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
* Tom Lane ([EMAIL PROTECTED]) wrote: Kurt Roeckx [EMAIL PROTECTED] writes: On Sun, Sep 11, 2005 at 05:59:49PM -0400, Tom Lane wrote: I kinda suspect that the cmpb test is a no-op or loss on all Intelish processors: I think an important question is wether this is for x86_64 in general, of opteron specific. It could be that it's not the same on Intel's EM64Ts. Good point --- anyone have one to try? I've got one I can test on. I need to upgrade the kernel and some other things on it though (it's running 2.6.8 atm, and an older Debian/unstable which I should probably bring up to current). I'll work on it starting now and post results once I get some. Stephen signature.asc Description: Digital signature
Re: [HACKERS] -fPIC
Peter Eisentraut [EMAIL PROTECTED] writes: The reason for -fpic vs -fPIC (on the machines where it makes any difference at all) is that the former is faster. I don't doubt that, but out of curiosity, considering that everyone else is using libtool, and libtool always uses -fPIC, what kind of impact does this *really* have? Incidentally, Debian uses -fPIC as a matter of policy everywhere too. I think the problem is that it's hard to know in advance whether -fpic is going to cause a problem and mixing -fpic and -fPIC libraries is a problem. So it's safer to just compile all the libraries with -fPIC. -- greg ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
[HACKERS] counting disk access from index seek operation -- how to?
I need a way to tell how many pages loaded from disk for a particular index seek operation. What I did is to set a global flag to true before calling the following statement (inside index_getnext() in /backend/access/indexam.c) found = DatumGetBool(FunctionCall2(scan-fn_getnext, PointerGetDatum(scan), Int32GetDatum(direction))); then for each access to disk, I increment a counter until the above call is finished and set back the global flag. The number of page IOs is not even matching from what I got from select * from pg_stat_all_indexes --- (is there anything I need to set in postgresql.conf?) also, the number of paged IOs for a given index seek is always less than the total page loads. Is it because of the statement (following the above in /backend/access/indexam.c)? if (heap_release_fetch(scan-heapRelation, scan-xs_snapshot, heapTuple, scan-xs_cbuf, true, scan-xs_pgstat_info)) If I am running in a single user mode, is there a way to avoid using extra page IO in the above statement? It seems to me the extra page IO is caused by comparing snapshots... thanks ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] -fPIC
Greg Stark [EMAIL PROTECTED] writes: ... mixing -fpic and -fPIC libraries is a problem. Is it? I would think having two options would be essentially unworkable if so. regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
* Tom Lane ([EMAIL PROTECTED]) wrote: My proposal therefore is to do #2, #3, and #4, and to modify the TAS assembly code at least on x86_64. Together, these changes bring my test case on a 4-way Opteron from N, runtime: 1 36s 2 61s 4 105s 8 198s em64t, 2 proc + 2 HT, 3.4ghz, 4G, 2.6.12: N, runtime: 1 31s 2 47s 4 86s 8 159s to N, runtime: 1 35s 2 48s 4 58s 8 105s N, runtime: 1 32s 2 53s 4 90s 8 169s CPU utilization is definitely higher when running with the patch though. Hope this helps, happy to do additional testing if necessary. Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Stephen Frost [EMAIL PROTECTED] writes: em64t, 2 proc + 2 HT, 3.4ghz, 4G, 2.6.12: N, runtime: 1 31s 2 47s 4 86s 8 159s N, runtime: 1 32s 2 53s 4 90s 8 169s Er, which (or both) of the two patches did you apply here? regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Spinlocks, yet again: a new test case
Tom Lane wrote: Anyone have SMP boxes of other types that they can try this on? For those of us running antiques: 2x PIII 1Ghz 2G Running on FreeBSD 6.0beta4 (non-debug kernel) 8.0.3: N runtime: 1 158s 2 271s 4 567s 8.1beta1 (2005-08-28): N runtime: 1 85s 2 139s 4 220s Wow - a huge difference between the versions, I thought the handbrake must be on for 8.0.3, but apparently not, all relevant parameters are the same... regards Mark ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
* Tom Lane ([EMAIL PROTECTED]) wrote: Stephen Frost [EMAIL PROTECTED] writes: em64t, 2 proc + 2 HT, 3.4ghz, 4G, 2.6.12: N, runtime: 1 31s 2 47s 4 86s 8 159s N, runtime: 1 32s 2 53s 4 90s 8 169s Er, which (or both) of the two patches did you apply here? Applied both, sorry that wasn't clear. Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Stephen Frost [EMAIL PROTECTED] writes: * Tom Lane ([EMAIL PROTECTED]) wrote: Er, which (or both) of the two patches did you apply here? Applied both, sorry that wasn't clear. Thanks. If you've got the time, could you try the two patches separately and see what you get? regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Tom Lane [EMAIL PROTECTED] wrote My proposal therefore is to do #2, #3, and #4, and to modify the TAS assembly code at least on x86_64. Together, these changes bring my test case on a 4-way Opteron from Some changes are based on tests and heuristics, so can we make them into the configure script so the choice could be made automatically? Regards, Qingqing ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
* Tom Lane ([EMAIL PROTECTED]) wrote: Stephen Frost [EMAIL PROTECTED] writes: * Tom Lane ([EMAIL PROTECTED]) wrote: Er, which (or both) of the two patches did you apply here? Applied both, sorry that wasn't clear. Thanks. If you've got the time, could you try the two patches separately and see what you get? Sure. CVS Head: N, runtime: 1 31s 2 47s 4 86s 8 159s With just slock-no-cmpb.patch: N, runtime: 1 32s 2 39s 4 82s 8 167s With just spin-delay.patch N, runtime: 1 32s 2 52s 4 94s 8 164s With both: N, runtime: 1 32s 2 53s 4 90s 8 169s Hope that helps, Stephen signature.asc Description: Digital signature
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Tom Lane [EMAIL PROTECTED] writes: Something else to consider is the OS you're using. I've been told that Linux isn't that good in NUMA and FreeBSD might be better. It's hard to see how the OS could affect behavior at the level of processor cache operations --- unless they did something truly spectacularly stupid, like mark main memory non-cacheable. Well it could schedule processes on processors in ways that force less than optimal memory usage patterns. But maybe you should tell the Altix folk with their 32-processor 384Gb NUMA machines what you've been told about Linux not being that good in NUMA. Really, these kind of cargo cult anecdotes are pretty pointless. -- greg ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
* Stephen Frost ([EMAIL PROTECTED]) wrote: Thanks. If you've got the time, could you try the two patches separately and see what you get? Sure. [...] Just to be clear- this was from a completely default 'make install' using the Debian configure options from 8.0.3 (which aren't that particularly interesting really- nls, integer-datetimes, debug, disable-rpath, tcl, perl, python, pam, krb5, openssl, gnu-ld, enable-thread-safety). If it'd be useful for the test I can adjust whichever parameters are appropriate (work_mem, cache_size, etc). There's absolutely nothing else going on except for these test, a few shells, top, etc, on the box. Enjoy, Stephen signature.asc Description: Digital signature
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Qingqing Zhou [EMAIL PROTECTED] writes: Some changes are based on tests and heuristics, so can we make them into the configure script so the choice could be made automatically? It's a bit premature to propose that, when we don't yet know if the suggested changes are a win or loss under any conditions, let alone how to automatically tell the difference. Right now we are in the find-out-the-facts mode ... regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Tom Lane wrote: I attach two proposed patches: the first removes the cmpb/jne from the x86 TAS assembly code, and the second one does the s_lock changes enumerated as points #2, #3, #4. The first one in particular needs more testing to see if it hurts performance on any non-Opteron x86 chips. (If so, we'd just make it conditional to x86_64.) 2x PIII 1G 2G Freebsd 6.0Beta4 8.1beta1 (2005-08-28): N runtime: 1 85s 2 139s 4 220s 8.1beta1 (2005-08-28) + patch 1 (s_lock.h only) N runtime: 1 89s 2 137s 4 229s 8.1beta1 (2005-08-28) + patch 2 N runtime: 1 84s 2 108s 4 214s Observe the interesting little speed improvement for patch 2 with 2 processes (seems to be repeatable). Let me know if you want to see vmstat output for any of these. regards Mark ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org