Re: [PATCHES] Hash function for numeric (WIP)
On Sun, 2007-06-05 at 21:30 -0400, Tom Lane wrote: It'd be a good idea if you repeat the previous number-of-collisions experiment on this code. I repeated the same experiment, and got essentially the same number of collisions (829 collisions on ~2 million randomly generated numerics, with 273 duplicates). Since the modified hash still uses hash_any() and really only differs when there are leading/trailing zeros, this is consistent with what I'd expect. Patch applied to HEAD. -Neil ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PATCHES] Hash function for numeric (WIP)
Neil Conway [EMAIL PROTECTED] writes: On Thu, 2007-03-05 at 23:57 -0400, Tom Lane wrote: Hm, but apply hash_any() to the remaining digits? That might work, if you are careful about how you factor the weight into it (or just not try to use the weight in the hash). Attached is a patch that implements this idea. Since leading or trailing zeroes are far from the common case, I think we should still include the weight in the hash when possible: the patch does so when it doesn't find a leading zero in the Numeric. You can do it always if you simply decrement the weight for each leading zero removed. (Of course, if you end up with no nonzero digits, you need to return a fixed hashcode such as 0, regardless of weight. The patch as given is wrong since it makes the test for no-digits before instead of after removing zeroes.) It'd be a good idea if you repeat the previous number-of-collisions experiment on this code. regards, tom lane ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PATCHES] Hash function for numeric (WIP)
On Mon, 2007-30-04 at 00:04 -0400, Tom Lane wrote: I'm still not very comfortable with that. You're proposing to add a pretty obvious failure mechanism --- any numeric-returning function that failed to normalize its output would now create a subtle, hard-to-find bug. What about teaching hash_numeric() to explicitly ignore leading and trailing zero digits? Perhaps a suitable test would be to compare the number of hash collisions in a large set of randomly-chosen-but-distinct numeric values. Okay, I did a little testing. I created a table with ~2 million numerics, generated with equal probability by one of the two expressions: random()::numeric * ((random() * 100) ^ 20) * 100 random()::numeric * -100; There were 251 duplicate inputs that I didn't bother eliminating; of course, they should have effected both hash functions identically. Results: neilc=# create temp table r1 as select hash_numeric(a), count(*) from x group by hash_numeric(a); neilc=# create temp table r2 as select old_hash_numeric(a), count(*) from x group by old_hash_numeric(a); neilc=# select count(*) from r1 where count 1; count 123342 (1 row) neilc=# select count(*) from r2 where count 1; count --- 756 (1 row) old_hash_numeric() is the hash_any()-based hash, hash_numeric() is my coding of your suggested hash function (see attached patch). So it seems we need a better hash if we're not going to use hash_any(). The analysis required to write a robust hash function from scratch is precisely why I'd prefer to use hash_any() if possible. -Neil Index: src/backend/utils/adt/numeric.c === RCS file: /home/neilc/postgres/cvs_root/pgsql/src/backend/utils/adt/numeric.c,v retrieving revision 1.101 diff -c -p -r1.101 numeric.c *** src/backend/utils/adt/numeric.c 27 Feb 2007 23:48:08 - 1.101 --- src/backend/utils/adt/numeric.c 3 May 2007 20:59:55 - *** *** 26,31 --- 26,32 #include limits.h #include math.h + #include access/hash.h #include catalog/pg_type.h #include libpq/pqformat.h #include utils/array.h *** cmp_numerics(Numeric num1, Numeric num2) *** 1149,1154 --- 1150,1213 return result; } + Datum + old_hash_numeric(PG_FUNCTION_ARGS) + { + Numeric key = PG_GETARG_NUMERIC(0); + Datum digit_hash; + Datum result; + + /* If it's NaN, don't try to hash the rest of the fields */ + if (NUMERIC_IS_NAN(key)) + PG_RETURN_UINT32(0); + + /* If n_weight is zero, it is a zero, regardless of any other fields */ + if (NUMERIC_NDIGITS(key) == 0) + PG_RETURN_UINT32(-1); + + /* + * XXX: we don't hash on the Numeric's scale, since two numerics + * can compare equal but have different scales. We also don't hash + * on the sign, although we easily could (since a sign difference + * means inequality, it shouldn't affect correctness). + */ + digit_hash = hash_any((unsigned char *) NUMERIC_DIGITS(key), + NUMERIC_NDIGITS(key) * sizeof(NumericDigit)); + + /* Mix in the weight, via XOR */ + result = digit_hash ^ key-n_weight; + PG_RETURN_DATUM(result); + } + + Datum + hash_numeric(PG_FUNCTION_ARGS) + { + Numeric key = PG_GETARG_NUMERIC(0); + NumericDigit *digits; + uint32 hash; + int32 shift; + int i; + + /* If it's NaN, don't try to hash the rest of the fields */ + if (NUMERIC_IS_NAN(key)) + PG_RETURN_UINT32(0); + + hash = 0; + shift = 3 * key-n_weight; + digits = NUMERIC_DIGITS(key); + + for (i = 0; i NUMERIC_NDIGITS(key); i++) + { + int32 this_shift = (shift 31); + hash |= ((uint32) digits[i]) this_shift; + if (this_shift 0) + hash |= ((uint32) digits[i]) (32 - this_shift); + shift -= 3; + } + + PG_RETURN_UINT32(hash); + } + /* -- * Index: src/include/catalog/pg_amop.h === RCS file: /home/neilc/postgres/cvs_root/pgsql/src/include/catalog/pg_amop.h,v retrieving revision 1.80 diff -c -p -r1.80 pg_amop.h *** src/include/catalog/pg_amop.h 2 Apr 2007 03:49:40 - 1.80 --- src/include/catalog/pg_amop.h 3 May 2007 19:02:44 - *** DATA(insert ( 2232 19 19 1 f 2334 405 *** 568,573 --- 568,575 DATA(insert ( 2235 1033 1033 1 f 974 405 )); /* uuid_ops */ DATA(insert ( 2969 2950 2950 1 f 2972 405 )); + /* numeric_ops */ + DATA(insert ( 1998 1700 1700 1 f 1752 405 )); /* Index: src/include/catalog/pg_amproc.h === RCS file: /home/neilc/postgres/cvs_root/pgsql/src/include/catalog/pg_amproc.h,v retrieving revision 1.64 diff -c -p -r1.64 pg_amproc.h *** src/include/catalog/pg_amproc.h 2 Apr 2007 03:49:40 - 1.64 --- src/include/catalog/pg_amproc.h 3 May 2007 19:02:44 - *** DATA(insert ( 1990 26 26 1 453 )); *** 148,153 --- 148,154
Re: [PATCHES] Hash function for numeric (WIP)
Neil Conway [EMAIL PROTECTED] writes: On Mon, 2007-30-04 at 00:04 -0400, Tom Lane wrote: I'm still not very comfortable with that. You're proposing to add a pretty obvious failure mechanism --- any numeric-returning function that failed to normalize its output would now create a subtle, hard-to-find bug. What about teaching hash_numeric() to explicitly ignore leading and trailing zero digits? Hm, but apply hash_any() to the remaining digits? That might work, if you are careful about how you factor the weight into it (or just not try to use the weight in the hash). Perhaps a suitable test would be to compare the number of hash collisions in a large set of randomly-chosen-but-distinct numeric values. Okay, I did a little testing. [ test that totally destroys my proposed hash function ] OK, so *that* idea doesn't work. How about yours above? regards, tom lane ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PATCHES] Hash function for numeric (WIP)
Neil Conway [EMAIL PROTECTED] writes: On Fri, 2007-04-27 at 10:02 -0400, Tom Lane wrote: Perhaps a sufficiently robust way would be to form the hash as the XOR of each supplied digit, circular-shifted by say 3 times the digit's weight. The only objection I have to this is that it means we need to have another hash function in the backend. The Jenkins hash we use in hash_any() has been studied and we can have at least some confidence in its collision-resistance, etc. I'm still not very comfortable with that. You're proposing to add a pretty obvious failure mechanism --- any numeric-returning function that failed to normalize its output would now create a subtle, hard-to-find bug. Even if you can promise that all the functions in numeric.c get it right, what of user-written add-ons? And the only return for taking this risk is speculation that the performance of the hash function might be better. I think if you want to go this way, you should at least provide some evidence that we get a hashing performance benefit in exchange for adding a new restriction on numeric-value validity. Perhaps a suitable test would be to compare the number of hash collisions in a large set of randomly-chosen-but-distinct numeric values. regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PATCHES] Hash function for numeric (WIP)
On Fri, 2007-04-27 at 04:09 -0400, Tom Lane wrote: I feel uncomfortable about this proposal because it will compute different hashes for values that differ only in having different numbers of trailing zeroes. Now the numeric.c code is supposed to suppress extra trailing zeroes on output, but that's never been a correctness property ... are we willing to make it one? I don't think that is such an onerous requirement: we could easily add code to enforce this invariant (that might even be worth doing regardless, to verify that the comments remain consistent with reality). There are various related cases involving unstripped leading zeroes. numeric.h claims that leading zeros will also be stripped -- is that not correct? Another point is that sign = NUMERIC_NAN makes it a NAN regardless of any other fields; ignoring the sign does not get the right result here. I believe the patch was actually correct for NUMERIC_NAN (it already special-cased it). On Fri, 2007-04-27 at 10:02 -0400, Tom Lane wrote: Something else I just remembered is that ndigits = 0 makes it a zero regardless of the weight. Good point, fixed. Perhaps a sufficiently robust way would be to form the hash as the XOR of each supplied digit, circular-shifted by say 3 times the digit's weight. -Neil ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PATCHES] Hash function for numeric (WIP)
Sorry for fat-fingering the previous reply -- I wanted to add: On Fri, 2007-04-27 at 10:02 -0400, Tom Lane wrote: Perhaps a sufficiently robust way would be to form the hash as the XOR of each supplied digit, circular-shifted by say 3 times the digit's weight. The only objection I have to this is that it means we need to have another hash function in the backend. The Jenkins hash we use in hash_any() has been studied and we can have at least some confidence in its collision-resistance, etc. Anyway, attached is a revised version of the hash_any()-based patch. -Neil Index: src/backend/utils/adt/numeric.c === RCS file: /home/neilc/postgres/cvs_root/pgsql/src/backend/utils/adt/numeric.c,v retrieving revision 1.101 diff -c -p -r1.101 numeric.c *** src/backend/utils/adt/numeric.c 27 Feb 2007 23:48:08 - 1.101 --- src/backend/utils/adt/numeric.c 28 Apr 2007 22:28:54 - *** *** 26,31 --- 26,32 #include limits.h #include math.h + #include access/hash.h #include catalog/pg_type.h #include libpq/pqformat.h #include utils/array.h *** cmp_numerics(Numeric num1, Numeric num2) *** 1149,1154 --- 1150,1184 return result; } + Datum + hash_numeric(PG_FUNCTION_ARGS) + { + Numeric key = PG_GETARG_NUMERIC(0); + Datum digit_hash; + Datum result; + + /* If it's NaN, don't try to hash the rest of the fields */ + if (NUMERIC_IS_NAN(key)) + PG_RETURN_UINT32(0); + + /* If n_weight is zero, it is a zero, regardless of any other fields */ + if (NUMERIC_NDIGITS(0) == 0) + PG_RETURN_UINT32(-1); + + /* + * XXX: we don't hash on the Numeric's scale, since two numerics + * can compare equal but have different scales. We also don't hash + * on the sign, although we easily could (since a sign difference + * means inequality, it shouldn't affect correctness). + */ + digit_hash = hash_any((unsigned char *) NUMERIC_DIGITS(key), + NUMERIC_NDIGITS(key) * sizeof(NumericDigit)); + + /* Mix in the weight, via XOR */ + result = digit_hash ^ key-n_weight; + PG_RETURN_DATUM(result); + } + /* -- * Index: src/include/catalog/pg_amop.h === RCS file: /home/neilc/postgres/cvs_root/pgsql/src/include/catalog/pg_amop.h,v retrieving revision 1.80 diff -c -p -r1.80 pg_amop.h *** src/include/catalog/pg_amop.h 2 Apr 2007 03:49:40 - 1.80 --- src/include/catalog/pg_amop.h 28 Apr 2007 22:11:42 - *** DATA(insert ( 2232 19 19 1 f 2334 405 *** 568,573 --- 568,575 DATA(insert ( 2235 1033 1033 1 f 974 405 )); /* uuid_ops */ DATA(insert ( 2969 2950 2950 1 f 2972 405 )); + /* numeric_ops */ + DATA(insert ( 1998 1700 1700 1 f 1752 405 )); /* Index: src/include/catalog/pg_amproc.h === RCS file: /home/neilc/postgres/cvs_root/pgsql/src/include/catalog/pg_amproc.h,v retrieving revision 1.64 diff -c -p -r1.64 pg_amproc.h *** src/include/catalog/pg_amproc.h 2 Apr 2007 03:49:40 - 1.64 --- src/include/catalog/pg_amproc.h 28 Apr 2007 22:11:42 - *** DATA(insert ( 1990 26 26 1 453 )); *** 148,153 --- 148,154 DATA(insert ( 1992 30 30 1 457 )); DATA(insert ( 1995 25 25 1 400 )); DATA(insert ( 1997 1083 1083 1 452 )); + DATA(insert ( 1998 1700 1700 1 432 )); DATA(insert ( 1999 1184 1184 1 452 )); DATA(insert ( 2001 1266 1266 1 1696 )); DATA(insert ( 2040 1114 1114 1 452 )); Index: src/include/catalog/pg_opclass.h === RCS file: /home/neilc/postgres/cvs_root/pgsql/src/include/catalog/pg_opclass.h,v retrieving revision 1.75 diff -c -p -r1.75 pg_opclass.h *** src/include/catalog/pg_opclass.h 2 Apr 2007 03:49:40 - 1.75 --- src/include/catalog/pg_opclass.h 28 Apr 2007 22:11:42 - *** DATA(insert ( 405 macaddr_ops PGNSP P *** 129,134 --- 129,135 DATA(insert ( 403 name_ops PGNSP PGUID 1986 19 t 0 )); DATA(insert ( 405 name_ops PGNSP PGUID 1987 19 t 0 )); DATA(insert ( 403 numeric_ops PGNSP PGUID 1988 1700 t 0 )); + DATA(insert ( 405 numeric_ops PGNSP PGUID 1998 1700 t 0 )); DATA(insert OID = 1981 ( 403 oid_ops PGNSP PGUID 1989 26 t 0 )); #define OID_BTREE_OPS_OID 1981 DATA(insert ( 405 oid_opsPGNSP PGUID 1990 26 t 0 )); Index: src/include/catalog/pg_operator.h === RCS file: /home/neilc/postgres/cvs_root/pgsql/src/include/catalog/pg_operator.h,v retrieving revision 1.151 diff -c -p -r1.151 pg_operator.h *** src/include/catalog/pg_operator.h 2 Apr 2007 03:49:40 - 1.151 --- src/include/catalog/pg_operator.h 28 Apr 2007 22:11:42 - *** DATA(insert OID = 1630 ( !~~* PGNSP *** 675,681
Re: [PATCHES] Hash function for numeric (WIP)
Neil Conway [EMAIL PROTECTED] writes: For any two numerics that compare equal, we need to compute the same hash value for both datums, even if their bit patterns differ. I feel uncomfortable about this proposal because it will compute different hashes for values that differ only in having different numbers of trailing zeroes. Now the numeric.c code is supposed to suppress extra trailing zeroes on output, but that's never been a correctness property ... are we willing to make it one? There are various related cases involving unstripped leading zeroes. Another point is that sign = NUMERIC_NAN makes it a NAN regardless of any other fields; ignoring the sign does not get the right result here. Lastly, calling hashint2 seems like overkill, you might as well just XOR the weight into the digit_hash. regards, tom lane ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PATCHES] Hash function for numeric (WIP)
I wrote: I feel uncomfortable about this proposal because it will compute different hashes for values that differ only in having different numbers of trailing zeroes. Now the numeric.c code is supposed to suppress extra trailing zeroes on output, but that's never been a correctness property ... are we willing to make it one? There are various related cases involving unstripped leading zeroes. Another point is that sign = NUMERIC_NAN makes it a NAN regardless of any other fields; ignoring the sign does not get the right result here. Something else I just remembered is that ndigits = 0 makes it a zero regardless of the weight. Perhaps a sufficiently robust way would be to form the hash as the XOR of each supplied digit, circular-shifted by say 3 times the digit's weight. This is insensitive to leading/trailing zeroes: if (is NAN) return -1; // or any other fixed value hash = 0; shift = 3 * weight; for (i = 0; i ndigits; i++) { thisshift = (shift 31); hash |= ((uint32) digit[i]) thisshift; if (thisshift 0) hash |= ((uint32) digit[i]) (32 - thisshift); shift -= 3; } return hash; That might look pretty ugly, but then again hash_any isn't especially cheap. regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend