Re: [SQL] hash: 64-bits and seeding

2019-03-07 Thread Huon.Wilson
://github.com/apache/spark/pull/24019.

- Huon

From: Reynold Xin
Date: Thursday, 7 March 2019 at 6:33 pm
To: "Wilson, Huon (Data61, Eveleigh ATP)"
Cc: "dev@spark.apache.org"
Subject: Re: [SQL] hash: 64-bits and seeding

Rather than calling it hash64, it'd be better to j

Re: [SQL] hash: 64-bits and seeding

2019-03-06 Thread Reynold Xin
Rather than calling it hash64, it'd be better to just call it xxhash64. The reason is that ten years from now we will probably look back and laugh at any specific hash implementation, so it'd be better to just name the expression what it is.

On Wed, Mar 06, 2019 at 7:59 PM, <

[SQL] hash: 64-bits and seeding

2019-03-06 Thread Huon.Wilson
Hi,

I’m working on something that requires deterministic randomness, i.e. a row gets the same “random” value no matter the order of the DataFrame. A seeded hash seems to be the perfect way to do this, but the existing hashes have various limitations:

- hash: 32-bit output (only 4 billion
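
To illustrate the idea of deterministic randomness from a seeded hash, here is a minimal sketch in plain Python. It is not Spark's `hash` or `xxhash64` implementation: it substitutes keyed BLAKE2b from the standard library as a stand-in 64-bit seeded hash, and the helper names `seeded_hash64` and `unit_interval` are hypothetical, introduced only for this example.

```python
import hashlib
import struct


def seeded_hash64(value: str, seed: int) -> int:
    """Hypothetical helper: 64-bit hash of `value`, keyed by `seed`.

    Assumption: keyed BLAKE2b stands in for a seeded 64-bit hash;
    this is not the algorithm Spark uses.
    """
    key = struct.pack("<Q", seed & 0xFFFFFFFFFFFFFFFF)  # seed as 8 key bytes
    digest = hashlib.blake2b(value.encode("utf-8"), digest_size=8, key=key).digest()
    return int.from_bytes(digest, "little")


def unit_interval(value: str, seed: int) -> float:
    """Map the 64-bit hash to a float in [0, 1) using its top 53 bits."""
    return (seeded_hash64(value, seed) >> 11) / float(1 << 53)


rows = ["alice", "bob", "carol"]

# The value depends only on the row content and the seed, so processing
# the rows in a different order assigns each row the same value.
forward = {r: unit_interval(r, seed=42) for r in rows}
backward = {r: unit_interval(r, seed=42) for r in reversed(rows)}
```

Because each value is a pure function of the row's content and the seed, shuffling or repartitioning the data does not change which "random" value a row receives, and choosing a different seed yields an independent-looking assignment.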