[jira] [Assigned] (SPARK-17297) Clarify window/slide duration as absolute time, not relative to a calendar
[ https://issues.apache.org/jira/browse/SPARK-17297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17297: Assignee: Sean Owen (was: Apache Spark) > Clarify window/slide duration as absolute time, not relative to a calendar > -- > > Key: SPARK-17297 > URL: https://issues.apache.org/jira/browse/SPARK-17297 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.0.0 >Reporter: Pete Baker >Assignee: Sean Owen >Priority: Minor > > In Spark 2.0.0, the {{window(Column timeColumn, String windowDuration, String > slideDuration, String startTime)}} function {{startTime}} parameter behaves > as follows: > {quote} > startTime - The offset with respect to 1970-01-01 00:00:00 UTC with which to > start window intervals. For example, in order to have hourly tumbling windows > that start 15 minutes past the hour, e.g. 12:15-13:15, 13:15-14:15... provide > startTime as 15 minutes. > {quote} > Given a {{windowDuration}} of {{1 day}} and a {{startTime}} of {{0h}}, I'd > expect to see events from each day fall into the correct day's bucket. This > doesn't happen as expected in every case, however, due to the way that this > feature and timestamp / timezone support interact. > Using a fixed UTC reference, there is an assumption that all days are the > same length ({{1 day === 24 h}}}). This is not the case for most timezones > where the offset from UTC changes by 1h for 6 months out of the year. In > this case, on the days that clocks go forward/back, one day is 23h long, 1 > day is 25h long. > The result of this is that, for daylight savings time, some rows within 1h of > midnight are aggregated to the wrong day. > Options for a fix are: > # This is the expected behavior, and the window() function should not be used > for this type of aggregation with a long window length and the {{window}} > function documentation should be updated as such, or > # The window function should respect timezones and work on the assumption > that {{1 day !== 24 h}}. The {{startTime}} should be updated to snap to the > local timezone, rather than UTC. > # Support for both absolute and relative window lengths may be added -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17297) Clarify window/slide duration as absolute time, not relative to a calendar
[ https://issues.apache.org/jira/browse/SPARK-17297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17297: Assignee: Apache Spark (was: Sean Owen) > Clarify window/slide duration as absolute time, not relative to a calendar > -- > > Key: SPARK-17297 > URL: https://issues.apache.org/jira/browse/SPARK-17297 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.0.0 >Reporter: Pete Baker >Assignee: Apache Spark >Priority: Minor > > In Spark 2.0.0, the {{window(Column timeColumn, String windowDuration, String > slideDuration, String startTime)}} function {{startTime}} parameter behaves > as follows: > {quote} > startTime - The offset with respect to 1970-01-01 00:00:00 UTC with which to > start window intervals. For example, in order to have hourly tumbling windows > that start 15 minutes past the hour, e.g. 12:15-13:15, 13:15-14:15... provide > startTime as 15 minutes. > {quote} > Given a {{windowDuration}} of {{1 day}} and a {{startTime}} of {{0h}}, I'd > expect to see events from each day fall into the correct day's bucket. This > doesn't happen as expected in every case, however, due to the way that this > feature and timestamp / timezone support interact. > Using a fixed UTC reference, there is an assumption that all days are the > same length ({{1 day === 24 h}}}). This is not the case for most timezones > where the offset from UTC changes by 1h for 6 months out of the year. In > this case, on the days that clocks go forward/back, one day is 23h long, 1 > day is 25h long. > The result of this is that, for daylight savings time, some rows within 1h of > midnight are aggregated to the wrong day. > Options for a fix are: > # This is the expected behavior, and the window() function should not be used > for this type of aggregation with a long window length and the {{window}} > function documentation should be updated as such, or > # The window function should respect timezones and work on the assumption > that {{1 day !== 24 h}}. The {{startTime}} should be updated to snap to the > local timezone, rather than UTC. > # Support for both absolute and relative window lengths may be added -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org