Hi Huon,

Good catch. A 64-bit hash is definitely a useful function.

> the birthday paradox implies >50% chance of at least one for tables larger
> than 77000 rows

Do you know how many rows it takes to reach a 50% chance of a collision with
a 64-bit hash?

As for the seed, I don't see a need for such an argument: you can just add an
integer as a regular column.

As for the pull request process, I cannot help much.

On Tue, Mar 05, 2019 at 04:30:31AM +0000, huon.wil...@data61.csiro.au wrote:
> Hi,
>
> I’m working on something that requires deterministic randomness, i.e. a row
> gets the same “random” value no matter the order of the DataFrame. A seeded
> hash seems to be the perfect way to do this, but the existing hashes have
> various limitations:
>
> - hash: 32-bit output (only 4 billion possibilities will result in a lot of
>   collisions for many tables: the birthday paradox implies >50% chance of at
>   least one for tables larger than 77000 rows)
> - sha1/sha2/md5: single binary column input, string output
>
> It seems there’s already support for a 64-bit hash function that can work
> with an arbitrary number of arbitrary-typed columns: XxHash64, and exposing
> this for DataFrames seems like it’s essentially one line in
> sql/functions.scala to match `hash` (plus docs, tests, function registry
> etc.):
>
>   def hash64(cols: Column*): Column = withExpr {
>     new XxHash64(cols.map(_.expr))
>   }
>
> For my use case, this can then be used to get a 64-bit “random” column like
>
>   val seed = rng.nextLong()
>   hash64(lit(seed), col1, col2)
>
> I’ve created a (hopefully) complete patch by mimicking ‘hash’ at
> https://github.com/apache/spark/compare/master...huonw:hash64; should I open
> a JIRA and submit it as a pull request?
>
> Additionally, both hash and the new hash64 already have support for being
> seeded, but this isn’t exposed directly and instead requires something like
> the `lit` above. Would it make sense to add overloads like the following?
>
>   def hash(seed: Int, cols: Column*) = …
>   def hash64(seed: Long, cols: Column*) = …
>
> Though, it does seem a bit unfortunate to be forced to pass the seed first.
>
> - Huon
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org

-- 
nicolas
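
On the 64-bit question above, here is a rough estimate rather than an exact
figure. It assumes hash values are uniformly distributed and uses the usual
birthday approximation n ≈ sqrt(2 * N * ln(1 / (1 - p))), where N is the
number of possible hash values. The names BirthdayBound and
rowsForCollisionProbability are made up for this sketch and are not part of
Spark.

```scala
// Birthday-bound estimate: the number of rows n at which the chance of at
// least one collision among uniformly distributed b-bit hashes reaches p.
// Approximation used: n ≈ sqrt(2 * 2^b * ln(1 / (1 - p))).
// BirthdayBound and rowsForCollisionProbability are hypothetical helpers.
object BirthdayBound {
  def rowsForCollisionProbability(bits: Int, p: Double): Double = {
    val space = math.pow(2.0, bits) // number of possible hash values
    math.sqrt(2.0 * space * math.log(1.0 / (1.0 - p)))
  }

  def main(args: Array[String]): Unit = {
    val rows32 = rowsForCollisionProbability(32, 0.5)
    val rows64 = rowsForCollisionProbability(64, 0.5)
    println(f"32-bit: $rows32%.0f rows") // about 77 thousand
    println(f"64-bit: $rows64%.3e rows") // about 5 billion
  }
}
```

Under those assumptions the 32-bit case reproduces the ~77000 rows quoted
above, and the 64-bit case comes out at roughly 5 billion rows for a 50%
chance of at least one collision.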