[ https://issues.apache.org/jira/browse/SPARK-32109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147358#comment-17147358 ]

koert kuipers edited comment on SPARK-32109 at 6/28/20, 2:58 PM:
-----------------------------------------------------------------

the issue is that Row here isn't really a sequence. it represents an object.

if you have, say, an object Person(name: String, nickname: String), you would
not want Person("john", null) and Person(null, "john") to have the same hashCode.

see for example the suggested hashCode implementations in Effective Java by
Joshua Bloch. they do something similar to what you suggest to solve this
problem. so unfortunately i think our current implementation is flawed :(
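
for reference, a minimal sketch of that recipe (plain Java with a hypothetical Person class, not Spark code): each field is folded in order with a multiplier, so a null contributes a fixed value but its position still shifts the result.

```java
// hypothetical Person class sketching the Effective Java hashCode recipe:
// fields are folded in a fixed order with a multiplier, so a null field
// contributes 0 but its position still changes the final result.
class Person {
    final String name;
    final String nickname;

    Person(String name, String nickname) {
        this.name = name;
        this.nickname = nickname;
    }

    @Override
    public int hashCode() {
        int result = 17;
        // null contributes 0, but the multiplication by 31 means the
        // slot the null occupies still affects the final hash
        result = 31 * result + (name == null ? 0 : name.hashCode());
        result = 31 * result + (nickname == null ? 0 : nickname.hashCode());
        return result;
    }
}
```

with this, Person("john", null) and Person(null, "john") hash differently, because the null still occupies its slot in the 31 * result fold.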

p.s. even for pure sequences i do not think this implementation as it stands
is acceptable. but that is less of a worry than the object representation of
Row.



> SQL hash function handling of nulls makes collision too likely
> --------------------------------------------------------------
>
>                 Key: SPARK-32109
>                 URL: https://issues.apache.org/jira/browse/SPARK-32109
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: koert kuipers
>            Priority: Minor
>
> this ticket is about org.apache.spark.sql.functions.hash and Spark's handling 
> of nulls when hashing sequences.
> {code:java}
> scala> spark.sql("SELECT hash('bar', null)").show()
> +---------------+
> |hash(bar, NULL)|
> +---------------+
> |    -1808790533|
> +---------------+
> scala> spark.sql("SELECT hash(null, 'bar')").show()
> +---------------+
> |hash(NULL, bar)|
> +---------------+
> |    -1808790533|
> +---------------+
>  {code}
> these are different sequences. e.g. these could be positions 0 and 1 in a 
> dataframe, which are different columns with entirely different meanings. the 
> hashes should not be the same.
> another example:
> {code:java}
> scala> Seq(("john", null), (null, "john")).toDF("name", 
> "alias").withColumn("hash", hash(col("name"), col("alias"))).show
> +----+-----+---------+
> |name|alias|     hash|
> +----+-----+---------+
> |john| null|487839701|
> |null| john|487839701|
> +----+-----+---------+ {code}
> instead of ignoring nulls, each null should apply a transform to the hash so 
> that the order of elements, including the nulls, matters for the outcome.
>  
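
as an illustration of what "each null should apply a transform" could mean, here is a sketch (plain Java; OrderedHash and the sentinel constant are made up for this example, and this is not Spark's actual Murmur3-based implementation): instead of skipping a null and leaving the running hash unchanged, fold a fixed sentinel into it, so the position of the null affects the result.

```java
// sketch only, NOT Spark's implementation: an order-dependent fold where
// a null mixes a fixed sentinel into the running hash instead of being
// skipped, so hash("bar", null) and hash(null, "bar") differ.
class OrderedHash {
    static int hash(Object... values) {
        int h = 42; // Spark's hash() also starts from seed 42
        for (Object v : values) {
            // the key change: a null still advances the hash state,
            // via an arbitrary sentinel chosen for this sketch
            int k = (v == null) ? 0x9e3779b9 : v.hashCode();
            h = 31 * h + k;
        }
        return h;
    }
}
```

with the current behavior a null leaves the accumulated hash untouched, which is why hash('bar', null) and hash(null, 'bar') collide above; folding in a sentinel for each null removes that collision while keeping the hash order-dependent.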



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
