Re: [SQL] 64-bit hash function, and seeding

2019-03-05 Thread Huon.Wilson
Hi Nicolas,

On 6/3/19, 7:48 am, "Nicolas Paris"  wrote:

> Hi Huon

> Good catch. A 64-bit hash is definitely a useful function.

>> the birthday paradox implies a >50% chance of at least one for tables larger than 77000 rows

> Do you know how many rows are needed for a 50% chance of a collision with a 64-bit hash?

About 5 billion: the 50% threshold is roughly the square root of the total number of possible hash values. There's a detailed table at https://en.wikipedia.org/wiki/Birthday_problem#Probability_table .
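
For concreteness, the threshold for a b-bit hash is about sqrt(2 * 2^b * ln 2). A quick back-of-envelope check in plain Scala (nothing Spark-specific; fiftyPercentRows is just an illustrative name):

// rows at which a b-bit hash has a ~50% chance of at least one
// collision (birthday bound)
def fiftyPercentRows(bits: Int): Double =
  math.sqrt(2 * math.pow(2, bits) * math.log(2))

fiftyPercentRows(32)  // ≈ 77163, the 77000-row figure
fiftyPercentRows(64)  // ≈ 5.06e9, the 5 billion above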

Note, for my application a few collisions are fine. There are a few ways of trying to quantify this, and one is the maximum number of items that all hash to a single particular hash value: if one has 4 billion rows with a 32-bit hash, the size of this largest set is likely to be 14 (and there are going to be many other smaller sets of colliding values). With a 64-bit hash, it is likely to be 2, and the table size can be as large as ~8 trillion before the expected maximum exceeds 3.
(https://en.wikipedia.org/wiki/Balls_into_bins_problem#Random_allocation)

Another way is the expected total number of collisions: for the three cases above it is 1.6 billion (32-bit hash, 4 billion rows), 0.5 (64-bit, 4 billion rows), and 2.1 million (64-bit, 8 trillion rows).
(http://matt.might.net/articles/counting-hash-collisions/)
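
Those figures follow from E[collisions] ≈ m - n(1 - e^(-m/n)) for m rows and n = 2^bits possible hash values, reading "4 billion" and "~8 trillion" as 2^32 and 2^43. Again a back-of-envelope check in plain Scala (expectedCollisions is just an illustrative name):

def expectedCollisions(m: Double, bits: Int): Double = {
  val n = math.pow(2, bits)
  // -expm1(-x) computes 1 - exp(-x) without catastrophic cancellation
  // when x is tiny
  m - n * -math.expm1(-m / n)
}

expectedCollisions(math.pow(2, 32), 32)  // ≈ 1.6e9
expectedCollisions(math.pow(2, 32), 64)  // ≈ 0.5
expectedCollisions(math.pow(2, 43), 64)  // ≈ 2.1e6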

> About the seed column, to me there is no need for such an argument: you can just add an integer as a regular column.
You are correct that this works, but it increases the amount of computation (doubles it, when hashing just a single column). For multiple columns col1, col2, ..., colN, the `hash` function works approximately like this (in pseudo-scala, simplified from Spark's actual implementation):

val InitialSeed = 42L

def hash(col1, col2, ..., colN) = {
  // each column is hashed with the running value as its seed,
  // starting from a fixed constant
  var value = InitialSeed
  value = hashColumn(col1, seed = value)
  value = hashColumn(col2, seed = value)
  ...
  value = hashColumn(colN, seed = value)
  value
}

If that starting value can be customized, then a call like `hash(lit(mySeed), column)` (which has to do the work of hashing two columns) can instead just start at `mySeed` and hash only one column. That said, for the hashes Spark uses (xxHash and MurmurHash3), the hashing operation isn't too expensive, especially for ints/longs.
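
In the same pseudo-scala, a seeded variant would simply let the caller replace the constant (hypothetical, mirroring the sketch above):

def hash(seed, col1, col2, ..., colN) = {
  var value = seed  // caller-supplied, instead of InitialSeed
  value = hashColumn(col1, seed = value)
  ...
  value = hashColumn(colN, seed = value)
  value
}

so hash(mySeed, column) performs a single hashColumn call, where hash(lit(mySeed), column) has to perform two.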

Huon
 





Re: [SQL] 64-bit hash function, and seeding

2019-03-05 Thread Nicolas Paris
Hi Huon

Good catch. A 64-bit hash is definitely a useful function.

> the birthday paradox implies a >50% chance of at least one for tables larger than 77000 rows

Do you know how many rows are needed for a 50% chance of a collision with a 64-bit hash?


About the seed column, to me there is no need for such an argument: you can just add an integer as a regular column.

About the process for pull requests, I cannot help much.




-- 
nicolas




[SQL] 64-bit hash function, and seeding

2019-03-04 Thread Huon.Wilson
Hi,

I’m working on something that requires deterministic randomness, i.e. a row 
gets the same “random” value no matter the order of the DataFrame. A seeded 
hash seems to be the perfect way to do this, but the existing hashes have 
various limitations:

- hash: 32-bit output (only 4 billion possibilities will result in a lot of collisions for many tables: the birthday paradox implies a >50% chance of at least one for tables larger than 77000 rows)
- sha1/sha2/md5: single binary column input, string output

It seems there’s already support for a 64-bit hash function that can work with 
an arbitrary number of arbitrary-typed columns: XxHash64, and exposing this for 
DataFrames seems like it’s essentially one line in sql/functions.scala to match 
`hash` (plus docs, tests, function registry etc.):

def hash64(cols: Column*): Column = withExpr { new XxHash64(cols.map(_.expr)) }

For my use case, this can then be used to get a 64-bit “random” column like 

val seed = rng.nextLong()
hash64(lit(seed), col1, col2)
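
To illustrate the "no matter the order" point, here's a sketch assuming the proposed hash64 and a hypothetical DataFrame df with columns col1 and col2. The output is a pure function of the seed and the column values, so shuffling the rows doesn't change it:

import org.apache.spark.sql.functions.{col, lit}

val a = df.withColumn("r", hash64(lit(seed), col("col1"), col("col2")))
val b = df.repartition(16).withColumn("r", hash64(lit(seed), col("col1"), col("col2")))
// a and b assign the same "r" to every row; rand(seed) makes no such
// guarantee once partitioning or row order changes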

I’ve created a (hopefully) complete patch by mimicking ‘hash’ at 
https://github.com/apache/spark/compare/master...huonw:hash64; should I open a 
JIRA and submit it as a pull request?

Additionally, both hash and the new hash64 already have support for being 
seeded, but this isn’t exposed directly and instead requires something like the 
`lit` above. Would it make sense to add overloads like the following?

def hash(seed: Int, cols: Column*) = …
def hash64(seed: Long, cols: Column*) = …

Though, it does seem a bit unfortunate to be forced to pass the seed first.
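
The ordering is forced by Scala itself: a varargs parameter such as `cols: Column*` must be the last parameter, so the seed cannot go after the columns:

def hash64(seed: Long, cols: Column*): Column = …   // allowed
// def hash64(cols: Column*, seed: Long): Column = …  // rejected: a vararg
// parameter must come last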

- Huon

 

