Ming Beckwith created SPARK-17913:
-------------------------------------

             Summary: Filter/join expressions can return incorrect results when comparing strings to longs
                 Key: SPARK-17913
                 URL: https://issues.apache.org/jira/browse/SPARK-17913
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.0.0, 1.6.2
            Reporter: Ming Beckwith


Reproducer:

{code}
  import org.apache.spark.SparkContext
  import org.apache.spark.sql.SQLContext

  case class E(subject: Long, predicate: String, objectNode: String)

  def test(sc: SparkContext) = {
    val sqlContext: SQLContext = new SQLContext(sc)
    import sqlContext.implicits._

    val broken = List(
      (19157170390056969L, "right", 19157170390056969L),
      (19157170390056973L, "wrong", 19157170390056971L),
      (19157190254313477L, "wrong", 19157190254313475L),
      (19157180859056133L, "wrong", 19157180859056131L),
      (19157170390056969L, "number", 161),
      (19157170390056971L, "string", "a string"),
      (19157190254313475L, "string", "another string"),
      (19157180859056131L, "number", 191)
    )

    val brokenDF = sc.parallelize(broken).map(b => E(b._1, b._2, b._3.toString)).toDF()
    val brokenFilter = brokenDF.filter($"subject" === $"objectNode")
    val fixed = brokenDF.filter(brokenDF("subject").cast("string") === brokenDF("objectNode"))

    // show() and explain() print directly and return Unit, so they are not wrapped in println
    println("***** incorrect filter results *****")
    brokenFilter.show()
    println("***** correct filter results *****")
    fixed.show()

    println("***** both sides cast to double *****")
    brokenFilter.explain()
  }
{code}

Broken filter returns:

{code}
+-----------------+---------+-----------------+
|          subject|predicate|       objectNode|
+-----------------+---------+-----------------+
|19157170390056969|    right|19157170390056969|
|19157170390056973|    wrong|19157170390056971|
|19157190254313477|    wrong|19157190254313475|
|19157180859056133|    wrong|19157180859056131|
+-----------------+---------+-----------------+
{code}

The physical plan shows that both sides of the expression are cast to Double 
before evaluation. Comparing a long to a numeric string therefore appears to 
work in many cases, but a double cannot represent every integer above 2^53. 
When the values are that large and close together, distinct numbers round to 
the same double and the filter returns incorrect matches. 
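
The rounding can be seen without Spark. The following is a minimal sketch in 
plain Scala, using one of the "wrong" rows from the reproducer. The object and 
method names are made up for illustration, and it assumes that Spark's cast to 
double rounds the same way as the JVM's own long-to-double and string-to-double 
conversions. These values lie between 2^54 and 2^55, where adjacent doubles 
are 4 apart.

{code}
object DoublePrecisionDemo {
  // Roughly what cast(... as double) does to each side of the comparison.
  def equalAsDouble(subject: Long, objectNode: String): Boolean =
    subject.toDouble == objectNode.toDouble

  // Compares the exact decimal text, as in the cast("string") workaround.
  def equalAsString(subject: Long, objectNode: String): Boolean =
    subject.toString == objectNode

  def main(args: Array[String]): Unit = {
    val subject = 19157170390056973L
    val objectNode = "19157170390056971"

    println(equalAsDouble(subject, objectNode)) // true: both round to the same double
    println(equalAsString(subject, objectNode)) // false: the values really differ
    println(subject.toDouble.toLong)            // 19157170390056972
  }
}
{code}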

{code}
== Physical Plan ==
Filter (cast(subject#0L as double) = cast(objectNode#2 as double))
{code}

After casting the left side to a string, the filter returns the expected 
result:

{code}
+-----------------+---------+-----------------+
|          subject|predicate|       objectNode|
+-----------------+---------+-----------------+
|19157170390056969|    right|19157170390056969|
+-----------------+---------+-----------------+
{code}

Expected behavior in this case is probably to choose one side and cast the 
other to its type (compare string to string or long to long) instead of 
casting both sides to a data type with less precision. 
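
Until then, either comparison can be requested explicitly. This is a sketch 
only: the object and method names are hypothetical, and it assumes the column 
names from the reproducer above.

{code}
import org.apache.spark.sql.DataFrame

object ExactComparison {
  // String to string: compares the exact decimal text of the long.
  def matchAsString(df: DataFrame): DataFrame =
    df.filter(df("subject").cast("string") === df("objectNode"))

  // Long to long: parses the string side, so no precision is lost.
  def matchAsLong(df: DataFrame): DataFrame =
    df.filter(df("subject") === df("objectNode").cast("long"))
}
{code}

The two are not quite equivalent. The string comparison treats "007" and 7 as 
different, while the long comparison treats them as equal. As far as I can 
tell, in the affected versions a non-numeric string such as "a string" casts 
to null, so those rows simply drop out of the long comparison. 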
