Laurens Versluis created SPARK-41266:
----------------------------------------

             Summary: Spark does not parse timestamp strings when using the IN 
operator
                 Key: SPARK-41266
                 URL: https://issues.apache.org/jira/browse/SPARK-41266
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.2.1
         Environment: Windows 10, Spark 3.2.1 with Java 11
            Reporter: Laurens Versluis


Likely affects more versions, tested only with 3.2.1.

 

Summary:

Spark will convert a timestamp string to a timestamp when using the equal 
operator (=), yet won't do this when using the IN operator.

 

Details:

While debugging an issue why we got no results on a query, we found out that 
when using the equal symbol `=` in the WHERE clause combined with a 
TimeStampType column that Spark will convert the string to a timestamp and 
filter.

However, when using the IN operator (our query), it will not do so, and perform 
a cast to string. We expected the behavior to be similar, or at least that 
Spark realizes the IN clause operates on a TimeStampType column and thus 
attempts to convert to timestamp first before falling back to string comparison.

 

*Minimal reproducible example:*

Suppose we have a one-line dataset with the follow contents and schema:

 
{noformat}
+----------------------------+
|starttime                   |
+----------------------------+
|2019-08-11 19:33:05         |
+----------------------------+
root
 |-- starttime: timestamp (nullable = true){noformat}
Then if we fire the following queries, we will not get results for the 
IN-clause one using a timestamp string with timezone information:

 

 
{code:java}
// Works - Spark casts the argument to a string and the internal representation 
of the time seems to match it...
singleCol.filter("starttime IN ('2019-08-11 19:33:05')").show();
// Works
singleCol.filter("starttime = '2019-08-11 19:33:05'").show();
// Works
singleCol.filter("starttime = '2019-08-11T19:33:05Z'").show();
// Doesn't work
singleCol.filter("starttime IN ('2019-08-11T19:33:05Z')").show();
//Works
singleCol.filter("starttime IN (to_timestamp('2019-08-11T19:33:05Z'))").show(); 
{code}
 

We can see from the output that a cast to string is taking place:
{noformat}
[...] isnotnull(starttime#59),(cast(starttime#59 as string) = 2019-08-11 
19:33:05){noformat}
Since the = operator does work, it would be consistent if operators such as the 
IN operator would have similar, consistent behavior.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to