[ 
https://issues.apache.org/jira/browse/SPARK-22840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lior Chaga resolved SPARK-22840.
--------------------------------
    Resolution: Not A Bug

> Incorrect results when using distinct on window
> -----------------------------------------------
>
>                 Key: SPARK-22840
>                 URL: https://issues.apache.org/jira/browse/SPARK-22840
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Lior Chaga
>         Attachments: sample.parquet.zip
>
>
> Given the following schema:
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- start_time: long (nullable = true)
>  |-- stats: array (nullable = true)
>  |    |-- element: struct (containsNull = true)
>  |    |    |-- calibratedRecsHistory: double (nullable = true)
>  |    |    |-- eventTime: long (nullable = true)
>  |    |    |-- itemId: long (nullable = true)
>  |    |    |-- recsHistory: long (nullable = true)
> {code}
> Data contains multiple rows per id and start_time, with all stats elements 
> for a specific id and start_time is identical in all rows, I've noticed 
> inconsistent results when using window with FIRST(stats) DESC, and 
> LAST(stats) ASC.
> Specifically, the latter (LAST with ASC) produces more results.
> This is the query for seeing that:
> {code}
> SELECT DISTINCT
>         id ,
>         LAST(stats) over w
>         FROM sample
>         WINDOW w AS (PARTITION BY id  SORT BY start_time DESC)
> except        
> SELECT DISTINCT
>         id ,
>         FIRST(stats) over w 
>         FROM sample
>         WINDOW w AS (PARTITION BY id  SORT BY start_time ASC)
> {code}
> Each of the subqueries should return the stats for the latest start_time, 
> partitioned by id.
> Changing the order of the subqueries returns nothing...
> The query with FIRST and ASC produces correct results.
> the data for sample is attached in [^sample.parquet.zip]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to