Re: [PATCHES] Hash function for numeric (WIP)

2007-05-08 Thread Neil Conway
On Sun, 2007-06-05 at 21:30 -0400, Tom Lane wrote:
 It'd be a good idea if you repeat the previous number-of-collisions
 experiment on this code.

I repeated the same experiment, and got essentially the same number of
collisions (829 collisions on ~2 million randomly generated numerics,
with 273 duplicates). Since the modified hash still uses hash_any() and
really only differs when there are leading/trailing zeros, this is
consistent with what I'd expect.

Patch applied to HEAD.

-Neil



---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PATCHES] Hash function for numeric (WIP)

2007-05-06 Thread Tom Lane
Neil Conway [EMAIL PROTECTED] writes:
 On Thu, 2007-03-05 at 23:57 -0400, Tom Lane wrote:
 Hm, but apply hash_any() to the remaining digits?  That might work, if
 you are careful about how you factor the weight into it (or just not try
 to use the weight in the hash).

 Attached is a patch that implements this idea. Since leading or trailing
 zeroes are far from the common case, I think we should still include the
 weight in the hash when possible: the patch does so when it doesn't find
 a leading zero in the Numeric.

You can do it always if you simply decrement the weight for each leading
zero removed.  (Of course, if you end up with no nonzero digits, you
need to return a fixed hashcode such as 0, regardless of weight.  The
patch as given is wrong since it makes the test for no-digits before
instead of after removing zeroes.)

It'd be a good idea if you repeat the previous number-of-collisions
experiment on this code.

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PATCHES] Hash function for numeric (WIP)

2007-05-03 Thread Neil Conway
On Mon, 2007-30-04 at 00:04 -0400, Tom Lane wrote:
 I'm still not very comfortable with that.  You're proposing to add a
 pretty obvious failure mechanism --- any numeric-returning function
 that failed to normalize its output would now create a subtle,
 hard-to-find bug.

What about teaching hash_numeric() to explicitly ignore leading and
trailing zero digits?

 Perhaps a suitable test would be to compare the number of
 hash collisions in a large set of randomly-chosen-but-distinct
 numeric values.

Okay, I did a little testing. I created a table with ~2 million
numerics, generated with equal probability by one of the two
expressions:

random()::numeric * ((random() * 100) ^ 20) * 100
random()::numeric * -100;

There were 251 duplicate inputs that I didn't bother eliminating; of
course, they should have effected both hash functions identically.
Results:

neilc=# create temp table r1 as select hash_numeric(a), count(*) from x
group by hash_numeric(a);
neilc=# create temp table r2 as select old_hash_numeric(a), count(*)
from x group by old_hash_numeric(a);

neilc=# select count(*) from r1 where count  1;
 count  

 123342
(1 row)

neilc=# select count(*) from r2 where count  1;
 count 
---
   756
(1 row)

old_hash_numeric() is the hash_any()-based hash, hash_numeric() is my
coding of your suggested hash function (see attached patch).

So it seems we need a better hash if we're not going to use hash_any().
The analysis required to write a robust hash function from scratch is
precisely why I'd prefer to use hash_any() if possible.

-Neil

Index: src/backend/utils/adt/numeric.c
===
RCS file: /home/neilc/postgres/cvs_root/pgsql/src/backend/utils/adt/numeric.c,v
retrieving revision 1.101
diff -c -p -r1.101 numeric.c
*** src/backend/utils/adt/numeric.c	27 Feb 2007 23:48:08 -	1.101
--- src/backend/utils/adt/numeric.c	3 May 2007 20:59:55 -
***
*** 26,31 
--- 26,32 
  #include limits.h
  #include math.h
  
+ #include access/hash.h
  #include catalog/pg_type.h
  #include libpq/pqformat.h
  #include utils/array.h
*** cmp_numerics(Numeric num1, Numeric num2)
*** 1149,1154 
--- 1150,1213 
  	return result;
  }
  
+ Datum
+ old_hash_numeric(PG_FUNCTION_ARGS)
+ {
+ 	Numeric 	key = PG_GETARG_NUMERIC(0);
+ 	Datum 		digit_hash;
+ 	Datum 		result;
+ 
+ 	/* If it's NaN, don't try to hash the rest of the fields */
+ 	if (NUMERIC_IS_NAN(key))
+ 		PG_RETURN_UINT32(0);
+ 
+ 	/* If n_weight is zero, it is a zero, regardless of any other fields */
+ 	if (NUMERIC_NDIGITS(key) == 0)
+ 		PG_RETURN_UINT32(-1);
+ 
+ 	/*
+ 	 * XXX: we don't hash on the Numeric's scale, since two numerics
+ 	 * can compare equal but have different scales. We also don't hash
+ 	 * on the sign, although we easily could (since a sign difference
+ 	 * means inequality, it shouldn't affect correctness).
+ 	 */
+ 	digit_hash = hash_any((unsigned char *) NUMERIC_DIGITS(key),
+ 		  NUMERIC_NDIGITS(key) * sizeof(NumericDigit));
+ 
+ 	/* Mix in the weight, via XOR */
+ 	result = digit_hash ^ key-n_weight;
+ 	PG_RETURN_DATUM(result);
+ }
+ 
+ Datum
+ hash_numeric(PG_FUNCTION_ARGS)
+ {
+ 	Numeric			 key = PG_GETARG_NUMERIC(0);
+ 	NumericDigit	*digits;
+ 	uint32			 hash;
+ 	int32			 shift;
+ 	int i;
+ 
+ 	/* If it's NaN, don't try to hash the rest of the fields */
+ 	if (NUMERIC_IS_NAN(key))
+ 		PG_RETURN_UINT32(0);
+ 
+ 	hash   = 0;
+ 	shift  = 3 * key-n_weight;
+ 	digits = NUMERIC_DIGITS(key);
+ 
+ 	for (i = 0; i  NUMERIC_NDIGITS(key); i++)
+ 	{
+ 		int32 this_shift = (shift  31);
+ 		hash |= ((uint32) digits[i])  this_shift;
+ 		if (this_shift  0)
+ 			hash |= ((uint32) digits[i])  (32 - this_shift);
+ 		shift -= 3;
+ 	}
+ 
+ 	PG_RETURN_UINT32(hash);
+ }
+ 
  
  /* --
   *
Index: src/include/catalog/pg_amop.h
===
RCS file: /home/neilc/postgres/cvs_root/pgsql/src/include/catalog/pg_amop.h,v
retrieving revision 1.80
diff -c -p -r1.80 pg_amop.h
*** src/include/catalog/pg_amop.h	2 Apr 2007 03:49:40 -	1.80
--- src/include/catalog/pg_amop.h	3 May 2007 19:02:44 -
*** DATA(insert (	2232   19 19 1 f 2334	405 
*** 568,573 
--- 568,575 
  DATA(insert (	2235   1033 1033 1 f  974	405 ));
  /* uuid_ops */ 
  DATA(insert (	2969   2950 2950 1 f 2972 405 ));
+ /* numeric_ops */
+ DATA(insert (	1998   1700 1700 1 f 1752 405 ));
  
  
  /*
Index: src/include/catalog/pg_amproc.h
===
RCS file: /home/neilc/postgres/cvs_root/pgsql/src/include/catalog/pg_amproc.h,v
retrieving revision 1.64
diff -c -p -r1.64 pg_amproc.h
*** src/include/catalog/pg_amproc.h	2 Apr 2007 03:49:40 -	1.64
--- src/include/catalog/pg_amproc.h	3 May 2007 19:02:44 -
*** DATA(insert (	1990   26 26 1 453 ));
*** 148,153 
--- 148,154 
 

Re: [PATCHES] Hash function for numeric (WIP)

2007-05-03 Thread Tom Lane
Neil Conway [EMAIL PROTECTED] writes:
 On Mon, 2007-30-04 at 00:04 -0400, Tom Lane wrote:
 I'm still not very comfortable with that.  You're proposing to add a
 pretty obvious failure mechanism --- any numeric-returning function
 that failed to normalize its output would now create a subtle,
 hard-to-find bug.

 What about teaching hash_numeric() to explicitly ignore leading and
 trailing zero digits?

Hm, but apply hash_any() to the remaining digits?  That might work, if
you are careful about how you factor the weight into it (or just not try
to use the weight in the hash).

 Perhaps a suitable test would be to compare the number of
 hash collisions in a large set of randomly-chosen-but-distinct
 numeric values.

 Okay, I did a little testing.
 [ test that totally destroys my proposed hash function ]

OK, so *that* idea doesn't work.  How about yours above?

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PATCHES] Hash function for numeric (WIP)

2007-04-29 Thread Tom Lane
Neil Conway [EMAIL PROTECTED] writes:
 On Fri, 2007-04-27 at 10:02 -0400, Tom Lane wrote:
 Perhaps a sufficiently robust way would be to form the hash as the
 XOR of each supplied digit, circular-shifted by say 3 times the
 digit's weight. 

 The only objection I have to this is that it means we need to have
 another hash function in the backend. The Jenkins hash we use in
 hash_any() has been studied and we can have at least some confidence in
 its collision-resistance, etc.

I'm still not very comfortable with that.  You're proposing to add a
pretty obvious failure mechanism --- any numeric-returning function
that failed to normalize its output would now create a subtle,
hard-to-find bug.  Even if you can promise that all the functions in
numeric.c get it right, what of user-written add-ons?  And the only
return for taking this risk is speculation that the performance of the
hash function might be better.

I think if you want to go this way, you should at least provide some
evidence that we get a hashing performance benefit in exchange for
adding a new restriction on numeric-value validity.  Perhaps a suitable
test would be to compare the number of hash collisions in a large set of
randomly-chosen-but-distinct numeric values.

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [PATCHES] Hash function for numeric (WIP)

2007-04-28 Thread Neil Conway
On Fri, 2007-04-27 at 04:09 -0400, Tom Lane wrote:
 I feel uncomfortable about this proposal because it will compute
 different hashes for values that differ only in having different
 numbers of trailing zeroes.  Now the numeric.c code is supposed to
 suppress extra trailing zeroes on output, but that's never been a
 correctness property ... are we willing to make it one?

I don't think that is such an onerous requirement: we could easily add
code to enforce this invariant (that might even be worth doing
regardless, to verify that the comments remain consistent with reality).

 There are various related cases involving unstripped leading zeroes.

numeric.h claims that leading zeros will also be stripped -- is that not
correct?

 Another point is that sign = NUMERIC_NAN makes it a NAN regardless
 of any other fields; ignoring the sign does not get the right result
 here.

I believe the patch was actually correct for NUMERIC_NAN (it already
special-cased it).

On Fri, 2007-04-27 at 10:02 -0400, Tom Lane wrote:
 Something else I just remembered is that ndigits = 0 makes it a zero
 regardless of the weight.

Good point, fixed.


  Perhaps a sufficiently robust way would be to form the hash as the
  XOR of each supplied digit, circular-shifted by say 3 times the
  digit's weight.

-Neil



---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [PATCHES] Hash function for numeric (WIP)

2007-04-28 Thread Neil Conway
Sorry for fat-fingering the previous reply -- I wanted to add:

On Fri, 2007-04-27 at 10:02 -0400, Tom Lane wrote:
 Perhaps a sufficiently robust way would be to form the hash as the
 XOR of each supplied digit, circular-shifted by say 3 times the
 digit's weight. 

The only objection I have to this is that it means we need to have
another hash function in the backend. The Jenkins hash we use in
hash_any() has been studied and we can have at least some confidence in
its collision-resistance, etc.

Anyway, attached is a revised version of the hash_any()-based patch.

-Neil

Index: src/backend/utils/adt/numeric.c
===
RCS file: /home/neilc/postgres/cvs_root/pgsql/src/backend/utils/adt/numeric.c,v
retrieving revision 1.101
diff -c -p -r1.101 numeric.c
*** src/backend/utils/adt/numeric.c	27 Feb 2007 23:48:08 -	1.101
--- src/backend/utils/adt/numeric.c	28 Apr 2007 22:28:54 -
***
*** 26,31 
--- 26,32 
  #include limits.h
  #include math.h
  
+ #include access/hash.h
  #include catalog/pg_type.h
  #include libpq/pqformat.h
  #include utils/array.h
*** cmp_numerics(Numeric num1, Numeric num2)
*** 1149,1154 
--- 1150,1184 
  	return result;
  }
  
+ Datum
+ hash_numeric(PG_FUNCTION_ARGS)
+ {
+ 	Numeric 	key = PG_GETARG_NUMERIC(0);
+ 	Datum 		digit_hash;
+ 	Datum 		result;
+ 
+ 	/* If it's NaN, don't try to hash the rest of the fields */
+ 	if (NUMERIC_IS_NAN(key))
+ 		PG_RETURN_UINT32(0);
+ 
+ 	/* If n_weight is zero, it is a zero, regardless of any other fields */
+ 	if (NUMERIC_NDIGITS(0) == 0)
+ 		PG_RETURN_UINT32(-1);
+ 
+ 	/*
+ 	 * XXX: we don't hash on the Numeric's scale, since two numerics
+ 	 * can compare equal but have different scales. We also don't hash
+ 	 * on the sign, although we easily could (since a sign difference
+ 	 * means inequality, it shouldn't affect correctness).
+ 	 */
+ 	digit_hash = hash_any((unsigned char *) NUMERIC_DIGITS(key),
+ 		  NUMERIC_NDIGITS(key) * sizeof(NumericDigit));
+ 
+ 	/* Mix in the weight, via XOR */
+ 	result = digit_hash ^ key-n_weight;
+ 	PG_RETURN_DATUM(result);
+ }
+ 
  
  /* --
   *
Index: src/include/catalog/pg_amop.h
===
RCS file: /home/neilc/postgres/cvs_root/pgsql/src/include/catalog/pg_amop.h,v
retrieving revision 1.80
diff -c -p -r1.80 pg_amop.h
*** src/include/catalog/pg_amop.h	2 Apr 2007 03:49:40 -	1.80
--- src/include/catalog/pg_amop.h	28 Apr 2007 22:11:42 -
*** DATA(insert (	2232   19 19 1 f 2334	405 
*** 568,573 
--- 568,575 
  DATA(insert (	2235   1033 1033 1 f  974	405 ));
  /* uuid_ops */ 
  DATA(insert (	2969   2950 2950 1 f 2972 405 ));
+ /* numeric_ops */
+ DATA(insert (	1998   1700 1700 1 f 1752 405 ));
  
  
  /*
Index: src/include/catalog/pg_amproc.h
===
RCS file: /home/neilc/postgres/cvs_root/pgsql/src/include/catalog/pg_amproc.h,v
retrieving revision 1.64
diff -c -p -r1.64 pg_amproc.h
*** src/include/catalog/pg_amproc.h	2 Apr 2007 03:49:40 -	1.64
--- src/include/catalog/pg_amproc.h	28 Apr 2007 22:11:42 -
*** DATA(insert (	1990   26 26 1 453 ));
*** 148,153 
--- 148,154 
  DATA(insert (	1992   30 30 1 457 ));
  DATA(insert (	1995   25 25 1 400 ));
  DATA(insert (	1997   1083 1083 1 452 ));
+ DATA(insert (	1998   1700 1700 1 432 ));
  DATA(insert (	1999   1184 1184 1 452 ));
  DATA(insert (	2001   1266 1266 1 1696 ));
  DATA(insert (	2040   1114 1114 1 452 ));
Index: src/include/catalog/pg_opclass.h
===
RCS file: /home/neilc/postgres/cvs_root/pgsql/src/include/catalog/pg_opclass.h,v
retrieving revision 1.75
diff -c -p -r1.75 pg_opclass.h
*** src/include/catalog/pg_opclass.h	2 Apr 2007 03:49:40 -	1.75
--- src/include/catalog/pg_opclass.h	28 Apr 2007 22:11:42 -
*** DATA(insert (	405		macaddr_ops			PGNSP P
*** 129,134 
--- 129,135 
  DATA(insert (	403		name_ops			PGNSP PGUID 1986   19 t 0 ));
  DATA(insert (	405		name_ops			PGNSP PGUID 1987   19 t 0 ));
  DATA(insert (	403		numeric_ops			PGNSP PGUID 1988 1700 t 0 ));
+ DATA(insert (	405		numeric_ops			PGNSP PGUID 1998 1700 t 0 ));
  DATA(insert OID = 1981 ( 403	oid_ops		PGNSP PGUID 1989   26 t 0 ));
  #define OID_BTREE_OPS_OID 1981
  DATA(insert (	405		oid_opsPGNSP PGUID 1990   26 t 0 ));
Index: src/include/catalog/pg_operator.h
===
RCS file: /home/neilc/postgres/cvs_root/pgsql/src/include/catalog/pg_operator.h,v
retrieving revision 1.151
diff -c -p -r1.151 pg_operator.h
*** src/include/catalog/pg_operator.h	2 Apr 2007 03:49:40 -	1.151
--- src/include/catalog/pg_operator.h	28 Apr 2007 22:11:42 -
*** DATA(insert OID = 1630 (  !~~*  PGNSP 
*** 675,681 
  

Re: [PATCHES] Hash function for numeric (WIP)

2007-04-27 Thread Tom Lane
Neil Conway [EMAIL PROTECTED] writes:
 For any two numerics that compare equal, we need to compute the same
 hash value for both datums, even if their bit patterns differ.

I feel uncomfortable about this proposal because it will compute
different hashes for values that differ only in having different
numbers of trailing zeroes.  Now the numeric.c code is supposed to
suppress extra trailing zeroes on output, but that's never been a
correctness property ... are we willing to make it one?

There are various related cases involving unstripped leading zeroes.

Another point is that sign = NUMERIC_NAN makes it a NAN regardless
of any other fields; ignoring the sign does not get the right result
here.

Lastly, calling hashint2 seems like overkill, you might as well
just XOR the weight into the digit_hash.

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PATCHES] Hash function for numeric (WIP)

2007-04-27 Thread Tom Lane
I wrote:
 I feel uncomfortable about this proposal because it will compute
 different hashes for values that differ only in having different
 numbers of trailing zeroes.  Now the numeric.c code is supposed to
 suppress extra trailing zeroes on output, but that's never been a
 correctness property ... are we willing to make it one?

 There are various related cases involving unstripped leading zeroes.

 Another point is that sign = NUMERIC_NAN makes it a NAN regardless
 of any other fields; ignoring the sign does not get the right result
 here.

Something else I just remembered is that ndigits = 0 makes it a zero
regardless of the weight.

Perhaps a sufficiently robust way would be to form the hash as the
XOR of each supplied digit, circular-shifted by say 3 times the
digit's weight.  This is insensitive to leading/trailing zeroes:

if (is NAN)
return -1;  // or any other fixed value
hash = 0;
shift = 3 * weight;
for (i = 0; i  ndigits; i++)
{
thisshift = (shift  31);
hash |= ((uint32) digit[i])  thisshift;
if (thisshift  0)
hash |= ((uint32) digit[i])  (32 - thisshift);
shift -= 3;
}
return hash;

That might look pretty ugly, but then again hash_any isn't especially
cheap.

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend