Re: [PERFORM] Horribly slow hash join

2004-04-18 Thread Dennis Bjorklund
On Sat, 17 Apr 2004, Tom Lane wrote:

 *some* set of inputs.  (Also, I have been harboring some notions of
 supporting cross-type hash joins for integer types, which will not work
 unless small int8 values hash the same as int4 etc.)

The simple solution would be to always extend integers to 64 bits (or
whatever the biggest integer is) before calculating the hash. It makes the
hash function a little slower for smaller types, but it's mostly an
operation in the cpu and no memory involved, so it's probably not
noticable.

-- 
/Dennis Björklund


---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] Horribly slow hash join

2004-04-18 Thread Greg Stark
Tom Lane [EMAIL PROTECTED] writes:

 Greg Stark [EMAIL PROTECTED] writes:
  Tom Lane [EMAIL PROTECTED] writes:
  (Also, I have been harboring some notions of supporting cross-type hash
  joins for integer types, which will not work unless small int8 values hash
  the same as int4 etc.)
 
  The obvious way to modify the hash function is to xor the high 32 bits with
  the low 32 bits. That maintains the property you need
 
 No it doesn't ...

Eh? Oh, negative numbers? So low^high^sign.


I wonder if it makes sense to have check the hash distribution after
generating the table and if it's bad then throw it away and try again with a
different hash function. The different hash function would probably just be
a seed value changing. Probably way overkill though.

-- 
greg


---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [PERFORM] Horribly slow hash join

2004-04-18 Thread Tom Lane
Greg Stark [EMAIL PROTECTED] writes:
 Eh? Oh, negative numbers? So low^high^sign.

[ thinks about it... ]  Yeah, that would work.  We can't backpatch it
without breaking existing hash indexes on int8, but it'd be reasonable
to change for 7.5 (since at the rate things are going, we won't have
pg_upgrade for 7.5 anyway...)

 I wonder if it makes sense to have check the hash distribution after
 generating the table and if it's bad then throw it away and try again with a
 different hash function. The different hash function would probably just be
 a seed value changing. Probably way overkill though.

Yeah, it'd be a pain trying to get all the type-specific hash functions
doing that.  I'm also unconvinced that a simple change of seed value
would necessarily make the distribution better.  In the worst case, if
the real problem is that all the input values are identical, you can
reseed all day long and it won't fix it.

regards, tom lane

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [PERFORM] Horribly slow hash join

2004-04-18 Thread Tom Lane
Dennis Bjorklund [EMAIL PROTECTED] writes:
 On Sat, 17 Apr 2004, Tom Lane wrote:
 *some* set of inputs.  (Also, I have been harboring some notions of
 supporting cross-type hash joins for integer types, which will not work
 unless small int8 values hash the same as int4 etc.)

 The simple solution would be to always extend integers to 64 bits (or
 whatever the biggest integer is) before calculating the hash.

That creates portability issues though.  We do not depend on there being
a 64-bit-int type for anything except int8 itself, and I don't want to
start doing so.

regards, tom lane

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [PERFORM] Horribly slow hash join

2004-04-18 Thread Tom Lane
Dennis Bjorklund [EMAIL PROTECTED] writes:
 What do you mean? int8 is supported on all platformas

No it isn't.

regards, tom lane

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [PERFORM] Horribly slow hash join

2004-04-18 Thread Dennis Bjorklund
On Sun, 18 Apr 2004, Tom Lane wrote:

 That creates portability issues though.  We do not depend on there being
 a 64-bit-int type for anything except int8 itself, and I don't want to
 start doing so.

What do you mean? int8 is supported on all platformas and if the 
hasfunction would convert all numbers to int8 before making the hash it 
would work.

I don't see any portability problems.

-- 
/Dennis Björklund


---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [PERFORM] Horribly slow hash join

2004-04-18 Thread Dennis Bjorklund
On Sun, 18 Apr 2004, Tom Lane wrote:

  What do you mean? int8 is supported on all platformas
 
 No it isn't.

So on platforms where it isn't you would use int4 as the biggest int then. 
I don't really see that as a problem. As long as you calculate the hash on 
the biggest int on that platform it should work.

-- 
/Dennis Björklund


---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html


Re: [PERFORM] Toooo many context switches (maybe SLES8?)

2004-04-18 Thread Dave Cramer
Isn't this a linux kernel issue ?

My understanding is that the scheduler doesn't know that 2 of the CPU's
are actually the same underlying hardware and sometimes two contexts end
up fighting for the same underlying chip?

--dc--

On Thu, 2004-04-15 at 16:37, Josh Berkus wrote:
 Folks,
 
  I am currently chasing what seems to be the same issue: massive context
  swapping on a dual Xeon system.  I tried back-patching the above-mentioned
  patch ... it helps a little but by no means solves the problem ...
 
 BTW, I'm currently pursuing the possibility that this has something to do with 
 the ServerWorks chipset on those motherboards.   If anyone knows a high-end 
 hardware+linux kernel geek I can corner, I'd appreciate it.
 
 Maybe I should contact OSDL ...
-- 
Dave Cramer
519 939 0336
ICQ # 14675561


---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PERFORM] sunquery and estimated rows

2004-04-18 Thread Markus Bertheau
 , 17.04.2004,  01:45, Tom Lane :

 The planner sees that as where scope = some complicated expression
 and falls back to a default estimate.  It won't simplify a sub-select
 to a constant.  (Some people consider that a feature ;-).)

Why?

Thanks

-- 
Markus Bertheau [EMAIL PROTECTED]


---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html


Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-18 Thread Tom Lane
After some further digging I think I'm starting to understand what's up
here, and the really fundamental answer is that a multi-CPU Xeon MP box
sucks for running Postgres.

I did a bunch of oprofile measurements on a machine belonging to one of
Josh's clients, using a test case that involved heavy concurrent access
to a relatively small amount of data (little enough to fit into Postgres
shared buffers, so that no I/O or kernel calls were really needed once
the test got going).  I found that by nearly any measure --- elapsed
time, bus transactions, or machine-clear events --- the spinlock
acquisitions associated with grabbing and releasing the BufMgrLock took
an unreasonable fraction of the time.  I saw about 15% of elapsed time,
40% of bus transactions, and nearly 100% of pipeline-clear cycles going
into what is essentially two instructions out of the entire backend.
(Pipeline clears occur when the cache coherency logic detects a memory
write ordering problem.)

I am not completely clear on why this machine-level bottleneck manifests
as a lot of context swaps at the OS level.  I think what is happening is
that because SpinLockAcquire is so slow, a process is much more likely
than you'd normally expect to arrive at SpinLockAcquire while another
process is also acquiring the spinlock.  This puts the two processes
into a lockstep condition where the second process is nearly certain
to observe the BufMgrLock as locked, and be forced to suspend itself,
even though the time the first process holds the BufMgrLock is not
really very long at all.

If you google for Xeon and cache coherency you'll find quite a bit of
suggestive information about why this might be more true on the Xeon
setup than others.  A couple of interesting hits:

http://www.theinquirer.net/?article=10797
says that Xeon MP uses a *slower* FSB than Xeon DP.  This would
translate directly to more time needed to transfer a dirty cache line
from one processor to the other, which is the basic operation that we're
talking about here.

http://www.aceshardware.com/Spades/read.php?article_id=3187
says that Opterons use a different cache coherency protocol that is
fundamentally superior to the Xeon's, because dirty cache data can be
transferred directly between two processor caches without waiting for
main memory.

So in the short term I think we have to tell people that Xeon MP is not
the most desirable SMP platform to run Postgres on.  (Josh thinks that
the specific motherboard chipset being used in these machines might
share some of the blame too.  I don't have any evidence for or against
that idea, but it's certainly possible.)

In the long run, however, CPUs continue to get faster than main memory
and the price of cache contention will continue to rise.  So it seems
that we need to give up the assumption that SpinLockAcquire is a cheap
operation.  In the presence of heavy contention it won't be.

One thing we probably have got to do soon is break up the BufMgrLock
into multiple finer-grain locks so that there will be less contention.
However I am wary of doing this incautiously, because if we do it in a
way that makes for a significant rise in the number of locks that have
to be acquired to access a buffer, we might end up with a net loss.

I think Neil Conway was looking into how the bufmgr might be
restructured to reduce lock contention, but if he had come up with
anything he didn't mention exactly what.  Neil?

regards, tom lane

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [PERFORM] sunquery and estimated rows

2004-04-18 Thread Tom Lane
Markus Bertheau [EMAIL PROTECTED] writes:
 В Сбт, 17.04.2004, в 01:45, Tom Lane пишет:
 The planner sees that as where scope = some complicated expression
 and falls back to a default estimate.  It won't simplify a sub-select
 to a constant.  (Some people consider that a feature ;-).)

 Why?

It's the only way to prevent it from simplifying when you don't want it
to.

regards, tom lane

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-18 Thread Dave Cramer
So the the kernel/OS is irrelevant here ? this happens on any dual xeon?

What about hypterthreading does it still happen if HTT is turned off ?

Dave
On Sun, 2004-04-18 at 17:47, Tom Lane wrote:
 After some further digging I think I'm starting to understand what's up
 here, and the really fundamental answer is that a multi-CPU Xeon MP box
 sucks for running Postgres.
 
 I did a bunch of oprofile measurements on a machine belonging to one of
 Josh's clients, using a test case that involved heavy concurrent access
 to a relatively small amount of data (little enough to fit into Postgres
 shared buffers, so that no I/O or kernel calls were really needed once
 the test got going).  I found that by nearly any measure --- elapsed
 time, bus transactions, or machine-clear events --- the spinlock
 acquisitions associated with grabbing and releasing the BufMgrLock took
 an unreasonable fraction of the time.  I saw about 15% of elapsed time,
 40% of bus transactions, and nearly 100% of pipeline-clear cycles going
 into what is essentially two instructions out of the entire backend.
 (Pipeline clears occur when the cache coherency logic detects a memory
 write ordering problem.)
 
 I am not completely clear on why this machine-level bottleneck manifests
 as a lot of context swaps at the OS level.  I think what is happening is
 that because SpinLockAcquire is so slow, a process is much more likely
 than you'd normally expect to arrive at SpinLockAcquire while another
 process is also acquiring the spinlock.  This puts the two processes
 into a lockstep condition where the second process is nearly certain
 to observe the BufMgrLock as locked, and be forced to suspend itself,
 even though the time the first process holds the BufMgrLock is not
 really very long at all.
 
 If you google for Xeon and cache coherency you'll find quite a bit of
 suggestive information about why this might be more true on the Xeon
 setup than others.  A couple of interesting hits:
 
 http://www.theinquirer.net/?article=10797
 says that Xeon MP uses a *slower* FSB than Xeon DP.  This would
 translate directly to more time needed to transfer a dirty cache line
 from one processor to the other, which is the basic operation that we're
 talking about here.
 
 http://www.aceshardware.com/Spades/read.php?article_id=3187
 says that Opterons use a different cache coherency protocol that is
 fundamentally superior to the Xeon's, because dirty cache data can be
 transferred directly between two processor caches without waiting for
 main memory.
 
 So in the short term I think we have to tell people that Xeon MP is not
 the most desirable SMP platform to run Postgres on.  (Josh thinks that
 the specific motherboard chipset being used in these machines might
 share some of the blame too.  I don't have any evidence for or against
 that idea, but it's certainly possible.)
 
 In the long run, however, CPUs continue to get faster than main memory
 and the price of cache contention will continue to rise.  So it seems
 that we need to give up the assumption that SpinLockAcquire is a cheap
 operation.  In the presence of heavy contention it won't be.
 
 One thing we probably have got to do soon is break up the BufMgrLock
 into multiple finer-grain locks so that there will be less contention.
 However I am wary of doing this incautiously, because if we do it in a
 way that makes for a significant rise in the number of locks that have
 to be acquired to access a buffer, we might end up with a net loss.
 
 I think Neil Conway was looking into how the bufmgr might be
 restructured to reduce lock contention, but if he had come up with
 anything he didn't mention exactly what.  Neil?
 
   regards, tom lane
 
 ---(end of broadcast)---
 TIP 2: you can get off all lists at once with the unregister command
 (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
 
 
 
 !DSPAM:4082feb7326901956819835!
 
 
-- 
Dave Cramer
519 939 0336
ICQ # 14675561


---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-18 Thread Greg Stark

Tom Lane [EMAIL PROTECTED] writes:

 So in the short term I think we have to tell people that Xeon MP is not
 the most desirable SMP platform to run Postgres on.  (Josh thinks that
 the specific motherboard chipset being used in these machines might
 share some of the blame too.  I don't have any evidence for or against
 that idea, but it's certainly possible.)
 
 In the long run, however, CPUs continue to get faster than main memory
 and the price of cache contention will continue to rise.  So it seems
 that we need to give up the assumption that SpinLockAcquire is a cheap
 operation.  In the presence of heavy contention it won't be.

There's nothing about the way Postgres spinlocks are coded that affects this?

Is it something the kernel could help with? I've been wondering whether
there's any benefits postgres is missing out on by using its own hand-rolled
locking instead of using the pthreads infrastructure that the kernel is often
involved in.

-- 
greg


---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [PERFORM] sunquery and estimated rows

2004-04-18 Thread Tom Lane
Rod Taylor [EMAIL PROTECTED] writes:
 It's the only way to prevent it from simplifying when you don't want it
 to.

 I'm having a difficult time coming up with a circumstance where that is
 beneficial except when stats are out of whack.

Try trawling the archives --- I recall several cases in which people
were using sub-selects for this purpose.

In any case, I don't see the value of having the planner check to see if
a sub-select is just a trivial arithmetic expression.  The cases where
people write that and expect it to be simplified are so few and far
between that I can't believe it'd be a good use of planner cycles.

regards, tom lane

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html


Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-18 Thread Tom Lane
Dave Cramer [EMAIL PROTECTED] writes:
 So the the kernel/OS is irrelevant here ? this happens on any dual xeon?

I believe so.  The context-switch behavior might possibly be a little
more pleasant on other kernels, but the underlying spinlock problem is
not dependent on the kernel.

 What about hypterthreading does it still happen if HTT is turned off ?

The problem comes from keeping the caches synchronized between multiple
physical CPUs.  AFAICS enabling HTT wouldn't make it worse, because a
hyperthreaded processor still only has one cache.

regards, tom lane

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [PERFORM] Horribly slow hash join

2004-04-18 Thread Bruno Wolff III
On Sun, Apr 18, 2004 at 18:27:09 +0200,
  Dennis Bjorklund [EMAIL PROTECTED] wrote:
 On Sun, 18 Apr 2004, Tom Lane wrote:
 
   What do you mean? int8 is supported on all platformas
  
  No it isn't.
 
 So on platforms where it isn't you would use int4 as the biggest int then. 
 I don't really see that as a problem. As long as you calculate the hash on 
 the biggest int on that platform it should work.

Another option would be to put the numbers into two int4s. For int4 or
smaller types one of these would be zero. int8s would be split between
the two. The hash function would then be defined on the two int4s.

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-18 Thread Tom Lane
Greg Stark [EMAIL PROTECTED] writes:
 There's nothing about the way Postgres spinlocks are coded that affects this?

No.  AFAICS our spinlock sequences are pretty much equivalent to the way
the Linux kernel codes its spinlocks, so there's no deep dark knowledge
to be mined there.

We could possibly use some more-efficient blocking mechanism than semop()
once we've decided we have to block (it's a shame Linux still doesn't
have cross-process POSIX semaphores).  But the striking thing I learned
from looking at the oprofile results is that most of the inefficiency
comes at the very first TAS() operation, before we've even spun let
alone decided we have to block.  The s_lock() subroutine does not
account for more than a few percent of the runtime in these tests,
compared to 15% at the inline TAS() operations in LWLockAcquire and
LWLockRelease.  I interpret this to mean that once it's acquired
ownership of the cache line, a Xeon can get through the spinning
loop in s_lock() mighty quickly.

regards, tom lane

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] Wierd context-switching issue on Xeon

2004-04-18 Thread Tom Lane
 What about hypterthreading does it still happen if HTT is turned off ?

 The problem comes from keeping the caches synchronized between multiple
 physical CPUs.  AFAICS enabling HTT wouldn't make it worse, because a
 hyperthreaded processor still only has one cache.

Also, I forgot to say that the numbers I'm quoting *are* with HTT off.

regards, tom lane

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [PERFORM] Horribly slow hash join

2004-04-18 Thread Dennis Bjorklund
On Sun, 18 Apr 2004, Bruno Wolff III wrote:

 Another option would be to put the numbers into two int4s. For int4 or
 smaller types one of these would be zero. int8s would be split between
 the two. The hash function would then be defined on the two int4s.

Sure, this is an internal calculation in the hash function. The only 
important thing is that the number 7 (for example) gives the same hash 
value no matter if it is an int2 or an int8 and that the hash function 
works well also for int8 numbers (which is does not today).

At least that was the properties I understood that we wanted.

We got side tracked into talking about what datatype exists in all 
platforms, that's not an issue at all.

-- 
/Dennis Björklund


---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match