Re: [PERFORM] Horribly slow hash join

2004-05-04 Thread Greg Stark

Tom Lane [EMAIL PROTECTED] writes:

 Modding by a *non* power of 2 (esp. a prime) mixes the bits quite well,
 and is likely faster than any multiple-instruction way to do the same.
 
 The quoted article seems to be by someone who has spent a lot of time
 counting assembly cycles and none at all reading the last thirty years
 worth of CS literature.  Knuth's treatment of hashing has some actual
 math to it... 

[Incidentally, I just noticed that the quoted article was independently found by
Bruce Momjian, who found it convincing enough to convert the hash_any code
over to it two years ago.]

I just reviewed Knuth as well as C.L.R. and several papers from CACM and
SIGMOD.

It seems we have three options:

mod():

 Pro: empirical research shows it to be the best algorithm for avoiding collisions

 Con: requires that the hash table size be a prime, far from a power of two.
 This is inconvenient to arrange for dynamic tables as used in postgres.

multiplication method (floor(tablesize * frac(x * golden ratio)))

 Pro: works with any table size

 Con: I'm not clear if it can be done without floating point arithmetic,
  though it seems like it ought to be possible (an integer-only sketch
  follows this list).

Universal hashing:

 Pro: won't create gotcha situations where the distribution of the data suddenly
  and unexpectedly causes performance problems.

 Con: more complex. 
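
On the multiplication method's Con: with a power-of-two bucket count the whole
thing collapses to one integer multiply and a shift, so floating point is not
actually needed. A minimal sketch in C (the names here are illustrative, not
anything in the postgres source):

    #include <stdint.h>

    /*
     * Knuth's multiplicative method in pure integer arithmetic.
     * 2654435769 is floor(2^32 * (sqrt(5)-1)/2); the low 32 bits of the
     * product are the fractional part of key * golden-ratio scaled by
     * 2^32, so the top nbucket_log2 bits give floor(nbuckets * frac(...)).
     * Assumes 1 <= nbucket_log2 <= 32.
     */
    static inline uint32_t
    mult_hash(uint32_t key, unsigned nbucket_log2)
    {
        return (key * UINT32_C(2654435769)) >> (32 - nbucket_log2);
    }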

It would be trivial to switch the implementation from mod() to the
multiplicative method, which is better suited to postgres's needs. However,
universal hashing has some appeal: I hate the idea that a hash join could
perform perfectly well one day and suddenly become pessimal when new data is
loaded.

In a sense universal hashing is less predictable. For a DSS system that could
be bad. A query that runs fine every day might fail one day in an
unpredictable way even though the data is unchanged.

But in another sense it would be more predictable, in that if you run the query
a thousand times the average performance would be very close to the expected
value regardless of what the data is, whereas more traditional algorithms have
some patterns of data that will consistently perform badly.
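
To make the universal-hashing option concrete, here is a sketch of the
multiply-shift family (Dietzfelbinger et al.), one simple near-universal
scheme: the multiplier is drawn at random once per hash table, so the
collision guarantees hold in expectation over that choice rather than
depending on the data. The names and seeding below are assumptions for the
sketch, not postgres code:

    #include <stdint.h>
    #include <stdlib.h>

    static uint32_t hash_a;    /* random odd multiplier, chosen once per table */

    static void
    universal_seed(void)
    {
        hash_a = ((uint32_t) random() << 1) | 1;    /* any odd value will do */
    }

    /* assumes nbuckets = 2^nbucket_log2, 1 <= nbucket_log2 <= 32 */
    static inline uint32_t
    universal_hash(uint32_t key, unsigned nbucket_log2)
    {
        return (hash_a * key) >> (32 - nbucket_log2);
    }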

It's probably not worth it, but postgres could maybe even be tricky and pretest
the parameters against the common values in the statistics table, generating
new ones if they fail to produce a nice distribution. That doesn't really
guarantee anything, though, except that those common values would at least be
well distributed to start with.

-- 
greg




Re: [PERFORM] Horribly slow hash join

2004-04-20 Thread Jim C. Nasby
Dammit, I somehow deleted a bunch of replies to this.

Did a TODO ever come out of this?
-- 
Jim C. Nasby, Database Consultant  [EMAIL PROTECTED]
Member: Triangle Fraternity, Sports Car Club of America
Give your computer some brain candy! www.distributed.net Team #1828

Windows: Where do you want to go today?
Linux: Where do you want to go tomorrow?
FreeBSD: Are you guys coming, or what?



Re: [PERFORM] Horribly slow hash join

2004-04-19 Thread Greg Stark
Tom Lane [EMAIL PROTECTED] writes:

 Greg Stark [EMAIL PROTECTED] writes:
  If the hash tables were made a power of two then it would be possible to mix
  the bits of the 32 bit value and just mask off the unneeded bits. I've found
  one page via google that mentions mixing bits in a hash function, but I would
  look for a more serious treatment somewhere.


 Modding by a *non* power of 2 (esp. a prime) mixes the bits quite well,
 and is likely faster than any multiple-instruction way to do the same.

Well, a) any bucket count that has any factors of two fails to mix in some
bits, and such numbers are far more common than exact powers of two; b) the
postgres code makes no attempt to make the number of buckets prime; and c) even
if the number of buckets were prime, it seems it would still be too easy to
find real-world data where all the values have that prime as a factor. As it
is, they only need to share common factors to lose.
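
To make the mix-and-mask idea concrete, here is one avalanche-style mixer
followed by a power-of-two mask. The constants are MurmurHash3's 32-bit
finalizer, used purely as an illustration; this is not what postgres's
hash_any does:

    #include <stdint.h>

    /* xor-shift/multiply mixer: every input bit can affect every output bit */
    static inline uint32_t
    mix32(uint32_t h)
    {
        h ^= h >> 16;
        h *= UINT32_C(0x85ebca6b);
        h ^= h >> 13;
        h *= UINT32_C(0xc2b2ae35);
        h ^= h >> 16;
        return h;
    }

    /* bucket = mix32(key) & (nbuckets - 1), with nbuckets a power of two */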

 The quoted article seems to be by someone who has spent a lot of time
 counting assembly cycles and none at all reading the last thirty years
 worth of CS literature.  

Yes, well I did note that.

-- 
greg




Re: [PERFORM] Horribly slow hash join

2004-04-19 Thread Dave Cramer
Here's an interesting link that suggests that hyperthreading would be
much worse.

http://groups.google.com/groups?q=hyperthreading+dual+xeon+idlestart=10hl=enlr=ie=UTF-8c2coff=1selm=aukkonen-FE5275.21093624062003%40shawnews.gv.shawcable.netrnum=16

FWIW, I have anecdotal evidence suggesting this is the case: one of my clients
was seeing very high context-switch rates with HTT turned on, and things were
much better without it.

Dave
On Mon, 2004-04-19 at 02:09, Tom Lane wrote:
 Greg Stark [EMAIL PROTECTED] writes:
  If the hash tables were made a power of two then it would be possible to mix
  the bits of the 32 bit value and just mask off the unneeded bits. I've found
  one page via google that mentions mixing bits in a hash function, but I would
  look for a more serious treatment somewhere.
   http://burtleburtle.net/bob/hash/doobs.html
  Incidentally, this text claims mod is extremely slow compared to bit
  manipulations.
 
 Modding by a *non* power of 2 (esp. a prime) mixes the bits quite well,
 and is likely faster than any multiple-instruction way to do the same.
 
 The quoted article seems to be by someone who has spent a lot of time
 counting assembly cycles and none at all reading the last thirty years
 worth of CS literature.  Knuth's treatment of hashing has some actual
 math to it... 
 
   regards, tom lane
 
 
-- 
Dave Cramer
519 939 0336
ICQ # 14675561




Re: [PERFORM] Horribly slow hash join

2004-04-19 Thread Greg Stark

Dave Cramer [EMAIL PROTECTED] writes:

 Here's an interesting link that suggests that hyperthreading would be
 much worse.

Uh, this is the wrong thread.

-- 
greg




Re: [PERFORM] Horribly slow hash join

2004-04-18 Thread Dennis Bjorklund
On Sat, 17 Apr 2004, Tom Lane wrote:

 *some* set of inputs.  (Also, I have been harboring some notions of
 supporting cross-type hash joins for integer types, which will not work
 unless small int8 values hash the same as int4 etc.)

The simple solution would be to always extend integers to 64 bits (or
whatever the biggest integer is) before calculating the hash. That makes the
hash function a little slower for smaller types, but it's purely CPU work with
no memory access involved, so it's probably not noticeable.
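
A sketch of what widen-first might look like; hash64_sketch is a stand-in
64-bit-to-32-bit mixer invented for the example, not an existing postgres
function:

    #include <stdint.h>

    /* toy 64-bit Fibonacci-style mixer, purely for illustration */
    static inline uint32_t
    hash64_sketch(uint64_t x)
    {
        x *= UINT64_C(0x9E3779B97F4A7C15);   /* floor(2^64 / golden ratio) */
        return (uint32_t) (x >> 32);
    }

    /* every integer width is sign-extended to 64 bits first, so
       7::int2, 7::int4 and 7::int8 all hash identically */
    static inline uint32_t
    hash_int_widened(int64_t val)
    {
        return hash64_sketch((uint64_t) val);
    }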

-- 
/Dennis Björklund




Re: [PERFORM] Horribly slow hash join

2004-04-18 Thread Greg Stark
Tom Lane [EMAIL PROTECTED] writes:

 Greg Stark [EMAIL PROTECTED] writes:
  Tom Lane [EMAIL PROTECTED] writes:
  (Also, I have been harboring some notions of supporting cross-type hash
  joins for integer types, which will not work unless small int8 values hash
  the same as int4 etc.)
 
  The obvious way to modify the hash function is to xor the high 32 bits with
  the low 32 bits. That maintains the property you need
 
 No it doesn't ...

Eh? Oh, negative numbers? So low^high^sign.
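
Spelled out, that fold might look like the sketch below (illustrative only,
not the committed code). For any value that fits in int4 the high word equals
the sign extension, so the two cancel and the fold is exactly the low word;
small int8 values therefore hash the same as the equivalent int4, while large
values still feed all 64 bits into the hash:

    #include <stdint.h>

    static inline uint32_t
    fold_int8(int64_t val)
    {
        uint32_t lo   = (uint32_t) val;                        /* low 32 bits   */
        uint32_t hi   = (uint32_t) ((uint64_t) val >> 32);     /* high 32 bits  */
        uint32_t sign = (val < 0) ? UINT32_C(0xFFFFFFFF) : 0;  /* sign extension */

        /* the existing int4 hash would then be applied to this result */
        return lo ^ hi ^ sign;
    }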


I wonder if it makes sense to check the hash distribution after generating the
table, and if it's bad then throw it away and try again with a different hash
function. The different hash function would probably just be a change of seed
value. Probably way overkill though.
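
A rough sketch of that idea, assuming a seedable hash and a crude skew test;
every name and threshold here is made up for illustration:

    #include <stdint.h>

    /* true if no bucket holds grossly more than its share (8x is arbitrary) */
    static int
    distribution_ok(const uint32_t *bucket_counts, uint32_t nbuckets,
                    uint32_t ntuples)
    {
        uint32_t max = 0;

        for (uint32_t i = 0; i < nbuckets; i++)
            if (bucket_counts[i] > max)
                max = bucket_counts[i];

        return max <= 8 * (ntuples / nbuckets + 1);
    }

    /*
     * Caller sketch:
     *     do {
     *         seed = choose_new_seed();
     *         rebuild_hash_table(seed);
     *     } while (!distribution_ok(counts, nbuckets, ntuples) && ++tries < 3);
     */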

-- 
greg




Re: [PERFORM] Horribly slow hash join

2004-04-18 Thread Tom Lane
Greg Stark [EMAIL PROTECTED] writes:
 Eh? Oh, negative numbers? So low^high^sign.

[ thinks about it... ]  Yeah, that would work.  We can't backpatch it
without breaking existing hash indexes on int8, but it'd be reasonable
to change for 7.5 (since at the rate things are going, we won't have
pg_upgrade for 7.5 anyway...)

 I wonder if it makes sense to check the hash distribution after generating the
 table, and if it's bad then throw it away and try again with a different hash
 function. The different hash function would probably just be a change of seed
 value. Probably way overkill though.

Yeah, it'd be a pain trying to get all the type-specific hash functions
doing that.  I'm also unconvinced that a simple change of seed value
would necessarily make the distribution better.  In the worst case, if
the real problem is that all the input values are identical, you can
reseed all day long and it won't fix it.

regards, tom lane



Re: [PERFORM] Horribly slow hash join

2004-04-18 Thread Tom Lane
Dennis Bjorklund [EMAIL PROTECTED] writes:
 On Sat, 17 Apr 2004, Tom Lane wrote:
 *some* set of inputs.  (Also, I have been harboring some notions of
 supporting cross-type hash joins for integer types, which will not work
 unless small int8 values hash the same as int4 etc.)

 The simple solution would be to always extend integers to 64 bits (or
 whatever the biggest integer is) before calculating the hash.

That creates portability issues though.  We do not depend on there being
a 64-bit-int type for anything except int8 itself, and I don't want to
start doing so.

regards, tom lane



Re: [PERFORM] Horribly slow hash join

2004-04-18 Thread Tom Lane
Dennis Bjorklund [EMAIL PROTECTED] writes:
 What do you mean? int8 is supported on all platforms

No it isn't.

regards, tom lane



Re: [PERFORM] Horribly slow hash join

2004-04-18 Thread Dennis Bjorklund
On Sun, 18 Apr 2004, Tom Lane wrote:

 That creates portability issues though.  We do not depend on there being
 a 64-bit-int type for anything except int8 itself, and I don't want to
 start doing so.

What do you mean? int8 is supported on all platforms, and if the
hash function converted all numbers to int8 before making the hash it
would work.

I don't see any portability problems.

-- 
/Dennis Björklund




Re: [PERFORM] Horribly slow hash join

2004-04-18 Thread Dennis Bjorklund
On Sun, 18 Apr 2004, Tom Lane wrote:

  What do you mean? int8 is supported on all platforms
 
 No it isn't.

So on platforms where it isn't you would use int4 as the biggest int then. 
I don't really see that as a problem. As long as you calculate the hash on 
the biggest int on that platform it should work.

-- 
/Dennis Björklund




Re: [PERFORM] Horribly slow hash join

2004-04-18 Thread Bruno Wolff III
On Sun, Apr 18, 2004 at 18:27:09 +0200,
  Dennis Bjorklund [EMAIL PROTECTED] wrote:
 On Sun, 18 Apr 2004, Tom Lane wrote:
 
   What do you mean? int8 is supported on all platforms
  
  No it isn't.
 
 So on platforms where it isn't you would use int4 as the biggest int then. 
 I don't really see that as a problem. As long as you calculate the hash on 
 the biggest int on that platform it should work.

Another option would be to put the numbers into two int4s. For int4 or
smaller types one of these would be zero. int8s would be split between
the two. The hash function would then be defined on the two int4s.



Re: [PERFORM] Horribly slow hash join

2004-04-18 Thread Dennis Bjorklund
On Sun, 18 Apr 2004, Bruno Wolff III wrote:

 Another option would be to put the numbers into two int4s. For int4 or
 smaller types one of these would be zero. int8s would be split between
 the two. The hash function would then be defined on the two int4s.

Sure, this is an internal calculation in the hash function. The only
important thing is that the number 7 (for example) gives the same hash
value no matter if it is an int2 or an int8, and that the hash function
works well for int8 numbers too (which it does not today).

At least those were the properties I understood we wanted.

We got sidetracked into talking about which datatypes exist on all
platforms; that's not an issue at all.

-- 
/Dennis Björklund




Re: [PERFORM] Horribly slow hash join

2004-04-17 Thread Greg Stark

Tom Lane [EMAIL PROTECTED] writes:

 We could change the hash function, perhaps, but then we'd just have
 different cases where there's a problem ... hashing will always fail on
 *some* set of inputs.

Sure, but completely ignoring part of the input seems like an unfortunate
choice of hash function.

 (Also, I have been harboring some notions of supporting cross-type hash
 joins for integer types, which will not work unless small int8 values hash
 the same as int4 etc.)

The obvious way to modify the hash function is to xor the high 32 bits with
the low 32 bits. That maintains the property you need and at least ensures
that all the bits are taken into account.

-- 
greg




Re: [PERFORM] Horribly slow hash join

2004-04-17 Thread Tom Lane
Greg Stark [EMAIL PROTECTED] writes:
 Tom Lane [EMAIL PROTECTED] writes:
 (Also, I have been harboring some notions of supporting cross-type hash
 joins for integer types, which will not work unless small int8 values hash
 the same as int4 etc.)

 The obvious way to modify the hash function is to xor the high 32 bits with
 the low 32 bits. That maintains the property you need

No it doesn't ...

regards, tom lane



Re: [PERFORM] Horribly slow hash join

2004-04-16 Thread Tom Lane
Jim C. Nasby [EMAIL PROTECTED] writes:
 Note the time for the hash join step:

Have you ANALYZEd these tables lately?

It looks to me like it's hashing on some column that has only a small
number of distinct values, so that the hash doesn't actually help to
avoid comparisons.  The planner should know better than to choose such
a plan, but if it's working with obsolete stats ...

regards, tom lane
