Re: [PERFORM] Horribly slow hash join
On Sat, 17 Apr 2004, Tom Lane wrote: *some* set of inputs. (Also, I have been harboring some notions of supporting cross-type hash joins for integer types, which will not work unless small int8 values hash the same as int4 etc.) The simple solution would be to always extend integers to 64 bits (or whatever the biggest integer is) before calculating the hash. It makes the hash function a little slower for smaller types, but it's mostly an operation in the cpu and no memory involved, so it's probably not noticable. -- /Dennis Björklund ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] Horribly slow hash join
Tom Lane [EMAIL PROTECTED] writes: Greg Stark [EMAIL PROTECTED] writes: Tom Lane [EMAIL PROTECTED] writes: (Also, I have been harboring some notions of supporting cross-type hash joins for integer types, which will not work unless small int8 values hash the same as int4 etc.) The obvious way to modify the hash function is to xor the high 32 bits with the low 32 bits. That maintains the property you need No it doesn't ... Eh? Oh, negative numbers? So low^high^sign. I wonder if it makes sense to have check the hash distribution after generating the table and if it's bad then throw it away and try again with a different hash function. The different hash function would probably just be a seed value changing. Probably way overkill though. -- greg ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [PERFORM] Horribly slow hash join
Greg Stark [EMAIL PROTECTED] writes: Eh? Oh, negative numbers? So low^high^sign. [ thinks about it... ] Yeah, that would work. We can't backpatch it without breaking existing hash indexes on int8, but it'd be reasonable to change for 7.5 (since at the rate things are going, we won't have pg_upgrade for 7.5 anyway...) I wonder if it makes sense to have check the hash distribution after generating the table and if it's bad then throw it away and try again with a different hash function. The different hash function would probably just be a seed value changing. Probably way overkill though. Yeah, it'd be a pain trying to get all the type-specific hash functions doing that. I'm also unconvinced that a simple change of seed value would necessarily make the distribution better. In the worst case, if the real problem is that all the input values are identical, you can reseed all day long and it won't fix it. regards, tom lane ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [PERFORM] Horribly slow hash join
Dennis Bjorklund [EMAIL PROTECTED] writes: On Sat, 17 Apr 2004, Tom Lane wrote: *some* set of inputs. (Also, I have been harboring some notions of supporting cross-type hash joins for integer types, which will not work unless small int8 values hash the same as int4 etc.) The simple solution would be to always extend integers to 64 bits (or whatever the biggest integer is) before calculating the hash. That creates portability issues though. We do not depend on there being a 64-bit-int type for anything except int8 itself, and I don't want to start doing so. regards, tom lane ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [PERFORM] Horribly slow hash join
Dennis Bjorklund [EMAIL PROTECTED] writes: What do you mean? int8 is supported on all platformas No it isn't. regards, tom lane ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [PERFORM] Horribly slow hash join
On Sun, 18 Apr 2004, Tom Lane wrote: That creates portability issues though. We do not depend on there being a 64-bit-int type for anything except int8 itself, and I don't want to start doing so. What do you mean? int8 is supported on all platformas and if the hasfunction would convert all numbers to int8 before making the hash it would work. I don't see any portability problems. -- /Dennis Björklund ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [PERFORM] Horribly slow hash join
On Sun, 18 Apr 2004, Tom Lane wrote: What do you mean? int8 is supported on all platformas No it isn't. So on platforms where it isn't you would use int4 as the biggest int then. I don't really see that as a problem. As long as you calculate the hash on the biggest int on that platform it should work. -- /Dennis Björklund ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html
Re: [PERFORM] Toooo many context switches (maybe SLES8?)
Isn't this a linux kernel issue ? My understanding is that the scheduler doesn't know that 2 of the CPU's are actually the same underlying hardware and sometimes two contexts end up fighting for the same underlying chip? --dc-- On Thu, 2004-04-15 at 16:37, Josh Berkus wrote: Folks, I am currently chasing what seems to be the same issue: massive context swapping on a dual Xeon system. I tried back-patching the above-mentioned patch ... it helps a little but by no means solves the problem ... BTW, I'm currently pursuing the possibility that this has something to do with the ServerWorks chipset on those motherboards. If anyone knows a high-end hardware+linux kernel geek I can corner, I'd appreciate it. Maybe I should contact OSDL ... -- Dave Cramer 519 939 0336 ICQ # 14675561 ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [PERFORM] sunquery and estimated rows
, 17.04.2004, 01:45, Tom Lane : The planner sees that as where scope = some complicated expression and falls back to a default estimate. It won't simplify a sub-select to a constant. (Some people consider that a feature ;-).) Why? Thanks -- Markus Bertheau [EMAIL PROTECTED] ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html
Re: [PERFORM] Wierd context-switching issue on Xeon
After some further digging I think I'm starting to understand what's up here, and the really fundamental answer is that a multi-CPU Xeon MP box sucks for running Postgres. I did a bunch of oprofile measurements on a machine belonging to one of Josh's clients, using a test case that involved heavy concurrent access to a relatively small amount of data (little enough to fit into Postgres shared buffers, so that no I/O or kernel calls were really needed once the test got going). I found that by nearly any measure --- elapsed time, bus transactions, or machine-clear events --- the spinlock acquisitions associated with grabbing and releasing the BufMgrLock took an unreasonable fraction of the time. I saw about 15% of elapsed time, 40% of bus transactions, and nearly 100% of pipeline-clear cycles going into what is essentially two instructions out of the entire backend. (Pipeline clears occur when the cache coherency logic detects a memory write ordering problem.) I am not completely clear on why this machine-level bottleneck manifests as a lot of context swaps at the OS level. I think what is happening is that because SpinLockAcquire is so slow, a process is much more likely than you'd normally expect to arrive at SpinLockAcquire while another process is also acquiring the spinlock. This puts the two processes into a lockstep condition where the second process is nearly certain to observe the BufMgrLock as locked, and be forced to suspend itself, even though the time the first process holds the BufMgrLock is not really very long at all. If you google for Xeon and cache coherency you'll find quite a bit of suggestive information about why this might be more true on the Xeon setup than others. A couple of interesting hits: http://www.theinquirer.net/?article=10797 says that Xeon MP uses a *slower* FSB than Xeon DP. This would translate directly to more time needed to transfer a dirty cache line from one processor to the other, which is the basic operation that we're talking about here. http://www.aceshardware.com/Spades/read.php?article_id=3187 says that Opterons use a different cache coherency protocol that is fundamentally superior to the Xeon's, because dirty cache data can be transferred directly between two processor caches without waiting for main memory. So in the short term I think we have to tell people that Xeon MP is not the most desirable SMP platform to run Postgres on. (Josh thinks that the specific motherboard chipset being used in these machines might share some of the blame too. I don't have any evidence for or against that idea, but it's certainly possible.) In the long run, however, CPUs continue to get faster than main memory and the price of cache contention will continue to rise. So it seems that we need to give up the assumption that SpinLockAcquire is a cheap operation. In the presence of heavy contention it won't be. One thing we probably have got to do soon is break up the BufMgrLock into multiple finer-grain locks so that there will be less contention. However I am wary of doing this incautiously, because if we do it in a way that makes for a significant rise in the number of locks that have to be acquired to access a buffer, we might end up with a net loss. I think Neil Conway was looking into how the bufmgr might be restructured to reduce lock contention, but if he had come up with anything he didn't mention exactly what. Neil? regards, tom lane ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [PERFORM] sunquery and estimated rows
Markus Bertheau [EMAIL PROTECTED] writes: В Сбт, 17.04.2004, в 01:45, Tom Lane пишет: The planner sees that as where scope = some complicated expression and falls back to a default estimate. It won't simplify a sub-select to a constant. (Some people consider that a feature ;-).) Why? It's the only way to prevent it from simplifying when you don't want it to. regards, tom lane ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] Wierd context-switching issue on Xeon
So the the kernel/OS is irrelevant here ? this happens on any dual xeon? What about hypterthreading does it still happen if HTT is turned off ? Dave On Sun, 2004-04-18 at 17:47, Tom Lane wrote: After some further digging I think I'm starting to understand what's up here, and the really fundamental answer is that a multi-CPU Xeon MP box sucks for running Postgres. I did a bunch of oprofile measurements on a machine belonging to one of Josh's clients, using a test case that involved heavy concurrent access to a relatively small amount of data (little enough to fit into Postgres shared buffers, so that no I/O or kernel calls were really needed once the test got going). I found that by nearly any measure --- elapsed time, bus transactions, or machine-clear events --- the spinlock acquisitions associated with grabbing and releasing the BufMgrLock took an unreasonable fraction of the time. I saw about 15% of elapsed time, 40% of bus transactions, and nearly 100% of pipeline-clear cycles going into what is essentially two instructions out of the entire backend. (Pipeline clears occur when the cache coherency logic detects a memory write ordering problem.) I am not completely clear on why this machine-level bottleneck manifests as a lot of context swaps at the OS level. I think what is happening is that because SpinLockAcquire is so slow, a process is much more likely than you'd normally expect to arrive at SpinLockAcquire while another process is also acquiring the spinlock. This puts the two processes into a lockstep condition where the second process is nearly certain to observe the BufMgrLock as locked, and be forced to suspend itself, even though the time the first process holds the BufMgrLock is not really very long at all. If you google for Xeon and cache coherency you'll find quite a bit of suggestive information about why this might be more true on the Xeon setup than others. A couple of interesting hits: http://www.theinquirer.net/?article=10797 says that Xeon MP uses a *slower* FSB than Xeon DP. This would translate directly to more time needed to transfer a dirty cache line from one processor to the other, which is the basic operation that we're talking about here. http://www.aceshardware.com/Spades/read.php?article_id=3187 says that Opterons use a different cache coherency protocol that is fundamentally superior to the Xeon's, because dirty cache data can be transferred directly between two processor caches without waiting for main memory. So in the short term I think we have to tell people that Xeon MP is not the most desirable SMP platform to run Postgres on. (Josh thinks that the specific motherboard chipset being used in these machines might share some of the blame too. I don't have any evidence for or against that idea, but it's certainly possible.) In the long run, however, CPUs continue to get faster than main memory and the price of cache contention will continue to rise. So it seems that we need to give up the assumption that SpinLockAcquire is a cheap operation. In the presence of heavy contention it won't be. One thing we probably have got to do soon is break up the BufMgrLock into multiple finer-grain locks so that there will be less contention. However I am wary of doing this incautiously, because if we do it in a way that makes for a significant rise in the number of locks that have to be acquired to access a buffer, we might end up with a net loss. I think Neil Conway was looking into how the bufmgr might be restructured to reduce lock contention, but if he had come up with anything he didn't mention exactly what. Neil? regards, tom lane ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED]) !DSPAM:4082feb7326901956819835! -- Dave Cramer 519 939 0336 ICQ # 14675561 ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [PERFORM] Wierd context-switching issue on Xeon
Tom Lane [EMAIL PROTECTED] writes: So in the short term I think we have to tell people that Xeon MP is not the most desirable SMP platform to run Postgres on. (Josh thinks that the specific motherboard chipset being used in these machines might share some of the blame too. I don't have any evidence for or against that idea, but it's certainly possible.) In the long run, however, CPUs continue to get faster than main memory and the price of cache contention will continue to rise. So it seems that we need to give up the assumption that SpinLockAcquire is a cheap operation. In the presence of heavy contention it won't be. There's nothing about the way Postgres spinlocks are coded that affects this? Is it something the kernel could help with? I've been wondering whether there's any benefits postgres is missing out on by using its own hand-rolled locking instead of using the pthreads infrastructure that the kernel is often involved in. -- greg ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [PERFORM] sunquery and estimated rows
Rod Taylor [EMAIL PROTECTED] writes: It's the only way to prevent it from simplifying when you don't want it to. I'm having a difficult time coming up with a circumstance where that is beneficial except when stats are out of whack. Try trawling the archives --- I recall several cases in which people were using sub-selects for this purpose. In any case, I don't see the value of having the planner check to see if a sub-select is just a trivial arithmetic expression. The cases where people write that and expect it to be simplified are so few and far between that I can't believe it'd be a good use of planner cycles. regards, tom lane ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html
Re: [PERFORM] Wierd context-switching issue on Xeon
Dave Cramer [EMAIL PROTECTED] writes: So the the kernel/OS is irrelevant here ? this happens on any dual xeon? I believe so. The context-switch behavior might possibly be a little more pleasant on other kernels, but the underlying spinlock problem is not dependent on the kernel. What about hypterthreading does it still happen if HTT is turned off ? The problem comes from keeping the caches synchronized between multiple physical CPUs. AFAICS enabling HTT wouldn't make it worse, because a hyperthreaded processor still only has one cache. regards, tom lane ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [PERFORM] Horribly slow hash join
On Sun, Apr 18, 2004 at 18:27:09 +0200, Dennis Bjorklund [EMAIL PROTECTED] wrote: On Sun, 18 Apr 2004, Tom Lane wrote: What do you mean? int8 is supported on all platformas No it isn't. So on platforms where it isn't you would use int4 as the biggest int then. I don't really see that as a problem. As long as you calculate the hash on the biggest int on that platform it should work. Another option would be to put the numbers into two int4s. For int4 or smaller types one of these would be zero. int8s would be split between the two. The hash function would then be defined on the two int4s. ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [PERFORM] Wierd context-switching issue on Xeon
Greg Stark [EMAIL PROTECTED] writes: There's nothing about the way Postgres spinlocks are coded that affects this? No. AFAICS our spinlock sequences are pretty much equivalent to the way the Linux kernel codes its spinlocks, so there's no deep dark knowledge to be mined there. We could possibly use some more-efficient blocking mechanism than semop() once we've decided we have to block (it's a shame Linux still doesn't have cross-process POSIX semaphores). But the striking thing I learned from looking at the oprofile results is that most of the inefficiency comes at the very first TAS() operation, before we've even spun let alone decided we have to block. The s_lock() subroutine does not account for more than a few percent of the runtime in these tests, compared to 15% at the inline TAS() operations in LWLockAcquire and LWLockRelease. I interpret this to mean that once it's acquired ownership of the cache line, a Xeon can get through the spinning loop in s_lock() mighty quickly. regards, tom lane ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] Wierd context-switching issue on Xeon
What about hypterthreading does it still happen if HTT is turned off ? The problem comes from keeping the caches synchronized between multiple physical CPUs. AFAICS enabling HTT wouldn't make it worse, because a hyperthreaded processor still only has one cache. Also, I forgot to say that the numbers I'm quoting *are* with HTT off. regards, tom lane ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [PERFORM] Horribly slow hash join
On Sun, 18 Apr 2004, Bruno Wolff III wrote: Another option would be to put the numbers into two int4s. For int4 or smaller types one of these would be zero. int8s would be split between the two. The hash function would then be defined on the two int4s. Sure, this is an internal calculation in the hash function. The only important thing is that the number 7 (for example) gives the same hash value no matter if it is an int2 or an int8 and that the hash function works well also for int8 numbers (which is does not today). At least that was the properties I understood that we wanted. We got side tracked into talking about what datatype exists in all platforms, that's not an issue at all. -- /Dennis Björklund ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match