[ 
https://issues.apache.org/jira/browse/SPARK-17297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17297.
-------------------------------
       Resolution: Fixed
    Fix Version/s: 2.1.0
                   2.0.1

Issue resolved by pull request 15142
[https://github.com/apache/spark/pull/15142]

> Clarify window/slide duration as absolute time, not relative to a calendar
> --------------------------------------------------------------------------
>
>                 Key: SPARK-17297
>                 URL: https://issues.apache.org/jira/browse/SPARK-17297
>             Project: Spark
>          Issue Type: Improvement
>          Components: Documentation
>    Affects Versions: 2.0.0
>            Reporter: Pete Baker
>            Assignee: Sean Owen
>            Priority: Minor
>             Fix For: 2.0.1, 2.1.0
>
>
> In Spark 2.0.0, the {{startTime}} parameter of the {{window(Column 
> timeColumn, String windowDuration, String slideDuration, String startTime)}} 
> function is documented as follows:
> {quote}
> startTime - The offset with respect to 1970-01-01 00:00:00 UTC with which to 
> start window intervals. For example, in order to have hourly tumbling windows 
> that start 15 minutes past the hour, e.g. 12:15-13:15, 13:15-14:15... provide 
> startTime as 15 minutes.
> {quote}
> Given a {{windowDuration}} of {{1 day}} and a {{startTime}} of {{0h}}, I'd 
> expect events from each day to fall into that day's bucket. This does not 
> always happen, because of the way this feature interacts with timestamp and 
> timezone support.
> Using a fixed UTC reference assumes that all days are the same length 
> ({{1 day === 24 h}}). This does not hold for the many timezones that observe 
> daylight saving time, where the offset from UTC changes by 1h for part of 
> the year. In those timezones, on the days that clocks go forward or back, 
> one day is 23h long and another is 25h long.
> As a result, while daylight saving time is in effect, some rows within 1h 
> of midnight are aggregated into the wrong day.
> Options for a fix are:
> # This is the expected behavior: the {{window()}} function should not be 
> used for this type of aggregation with a long window length, and its 
> documentation should be updated to say so; or
> # The {{window}} function should respect timezones and work on the 
> assumption that {{1 day !== 24 h}}. The {{startTime}} should snap to the 
> local timezone rather than UTC; or
> # Support for both absolute and relative window lengths could be added.
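
The mismatch described in the issue can be reproduced without Spark. The following is a minimal Python sketch, not Spark's implementation: `fixed_window_start` and `local_day_start` are hypothetical helper names, and the fixed offsets `GMT` and `BST` are an assumed stand-in for a timezone such as Europe/London in winter and summer. It compares a 24h window aligned to 1970-01-01 00:00:00 UTC with the local calendar day of the same event.

```python
from datetime import datetime, timedelta, timezone

DAY_SECONDS = 24 * 60 * 60

# Assumption: fixed offsets standing in for one zone's winter and summer time.
GMT = timezone(timedelta(hours=0), "GMT")
BST = timezone(timedelta(hours=1), "BST")


def fixed_window_start(ts: datetime, start_offset_s: int = 0) -> datetime:
    """Start of the fixed 24h window containing ts.

    Windows are aligned to the Unix epoch in UTC, shifted by start_offset_s,
    so every window is exactly 24h long regardless of any calendar.
    """
    epoch_s = int(ts.timestamp())
    start_s = epoch_s - (epoch_s - start_offset_s) % DAY_SECONDS
    return datetime.fromtimestamp(start_s, tz=timezone.utc)


def local_day_start(ts: datetime, tz: timezone) -> datetime:
    """Local midnight of the calendar day that contains ts in tz."""
    local = ts.astimezone(tz)
    return local.replace(hour=0, minute=0, second=0, microsecond=0)


# Two events at 00:30 local time, one in winter and one in summer.
winter_event = datetime(2016, 1, 15, 0, 30, tzinfo=GMT)
summer_event = datetime(2016, 7, 1, 0, 30, tzinfo=BST)

for event, tz in ((winter_event, GMT), (summer_event, BST)):
    window = fixed_window_start(event)
    print(
        "event:", event.isoformat(),
        "| fixed window starts:", window.astimezone(tz).isoformat(),
        "| local day starts:", local_day_start(event, tz).isoformat(),
    )
```

With these inputs, the winter event falls in the window that starts at local midnight of its own day, while the summer event (23:30 UTC on the previous day) falls in the window that started on 2016-06-30, one local calendar day early. Passing a `start_offset_s` of -3600 would align the summer windows with local midnight but misalign the winter ones, which is why a single fixed offset cannot track calendar days across a clock change.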



