[GitHub] spark pull request #17062: [SPARK-17495] [SQL] Support date, timestamp and i...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17062 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17062: [SPARK-17495] [SQL] Support date, timestamp and i...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/17062#discussion_r105570634

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HashExpressionsSuite.scala ---
@@ -168,6 +170,208 @@ class HashExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
     // scalastyle:on nonascii
   }
 
+  test("hive-hash for date type") {
+    def checkHiveHashForDateType(dateString: String, expected: Long): Unit = {
+      checkHiveHash(
+        DateTimeUtils.stringToDate(UTF8String.fromString(dateString)).get,
+        DateType,
+        expected)
+    }
+
+    // basic case
+    checkHiveHashForDateType("2017-01-01", 17167)
+
+    // boundary cases
+    checkHiveHashForDateType("0000-01-01", -719530)
+    checkHiveHashForDateType("9999-12-31", 2932896)
+
+    // epoch
+    checkHiveHashForDateType("1970-01-01", 0)
+
+    // before epoch
+    checkHiveHashForDateType("1800-01-01", -62091)
+
+    // Invalid input: bad date string. Hive returns 0 for such cases
+    intercept[NoSuchElementException](checkHiveHashForDateType("0-0-0", 0))
+    intercept[NoSuchElementException](checkHiveHashForDateType("-1212-01-01", 0))
+    intercept[NoSuchElementException](checkHiveHashForDateType("2016-99-99", 0))
+
+    // Invalid input: Empty string. Hive returns 0 for this case
+    intercept[NoSuchElementException](checkHiveHashForDateType("", 0))
+
+    // Invalid input: February 30th for a leap year. Hive supports this but Spark doesn't
+    intercept[NoSuchElementException](checkHiveHashForDateType("2016-02-30", 16861))
+  }
+
+  test("hive-hash for timestamp type") {
+    def checkHiveHashForTimestampType(
+        timestamp: String,
+        expected: Long,
+        timeZone: TimeZone = TimeZone.getTimeZone("UTC")): Unit = {
+      checkHiveHash(
+        DateTimeUtils.stringToTimestamp(UTF8String.fromString(timestamp), timeZone).get,
+        TimestampType,
+        expected)
+    }
+
+    // basic case
+    checkHiveHashForTimestampType("2017-02-24 10:56:29", 1445725271)
+
+    // with higher precision
+    checkHiveHashForTimestampType("2017-02-24 10:56:29.11", 1353936655)
+
+    // with different timezone
+    checkHiveHashForTimestampType("2017-02-24 10:56:29", 1445732471,
+      TimeZone.getTimeZone("US/Pacific"))
+
+    // boundary cases
+    checkHiveHashForTimestampType("0001-01-01 00:00:00", 1645926784)
+    checkHiveHashForTimestampType("9999-01-01 00:00:00", -1081818240)
+
+    // epoch
+    checkHiveHashForTimestampType("1970-01-01 00:00:00", 0)
+
+    // before epoch
+    checkHiveHashForTimestampType("1800-01-01 03:12:45", -267420885)
+
+    // Invalid input: bad timestamp string. Hive returns 0 for such cases
+    intercept[NoSuchElementException](checkHiveHashForTimestampType("0-0-0 0:0:0", 0))
+    intercept[NoSuchElementException](checkHiveHashForTimestampType("-99-99-99 99:99:45", 0))
+    intercept[NoSuchElementException](checkHiveHashForTimestampType("55-5-", 0))
+
+    // Invalid input: Empty string. Hive returns 0 for this case
+    intercept[NoSuchElementException](checkHiveHashForTimestampType("", 0))
+
+    // Invalid input: February 30th for a leap year. Hive supports this but Spark doesn't
+    intercept[NoSuchElementException](checkHiveHashForTimestampType("2016-02-30 00:00:00", 0))
+
+    // Invalid input: Hive accepts up to 9 decimal places of precision but Spark uses up to 6
+    intercept[TestFailedException](checkHiveHashForTimestampType("2017-02-24 10:56:29.", 0))
+  }
+
+  test("hive-hash for CalendarInterval type") {
+    def checkHiveHashForIntervalType(interval: String, expected: Long): Unit = {
+      checkHiveHash(CalendarInterval.fromString(interval), CalendarIntervalType, expected)
+    }
+
+    // - MICROSEC -
+
+    // basic case
+    checkHiveHashForIntervalType("interval 1 microsecond", 24273)
+
+    // negative
+    checkHiveHashForIntervalType("interval -1 microsecond", 22273)
+
+    // edge / boundary cases
+    checkHiveHashForIntervalType("interval 0 microsecond", 23273)
+    checkHiveHashForIntervalType("interval 999 microsecond", 1022273)
+    checkHiveHashForIntervalType("interval -999 microsecond", -975727)
+
+    // - MILLISEC -
+
+    // basic case
+    checkHiveHashForIntervalType("interval 1 millisecond", 1023273)
+
+    // negative
+    checkHiveHashForIntervalType("interval -1 millisecond", -976727)
+
+    // edge / boundary cases
+    checkHiveHashForIntervalType("interval 0
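The date-type expectations in the diff above are internally consistent: the hive-hash of a DateType value is simply its day offset from the Unix epoch (2017-01-01 is day 17167, the epoch hashes to 0, and 1800-01-01 is 62091 days before the epoch). A minimal Python check of the quoted vectors, assuming only that interpretation:

```python
from datetime import date

EPOCH = date(1970, 1, 1)

def hive_hash_date(d: date) -> int:
    # hive-hash of a date is the number of days since the Unix epoch,
    # which is exactly what the expected values in the test above encode
    return (d - EPOCH).days

print(hive_hash_date(date(2017, 1, 1)))  # 17167, the "basic case" above
```

Dates before the epoch (e.g. 1800-01-01 → -62091) fall out of the same subtraction, since Python's `date` arithmetic is proleptic Gregorian.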
[GitHub] spark pull request #17062: [SPARK-17495] [SQL] Support date, timestamp and i...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/17062#discussion_r105570430 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HashExpressionsSuite.scala (quotes the same test hunk shown above)
[GitHub] spark pull request #17062: [SPARK-17495] [SQL] Support date, timestamp and i...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17062#discussion_r105570229 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HashExpressionsSuite.scala (quotes the same test hunk shown above)
[GitHub] spark pull request #17062: [SPARK-17495] [SQL] Support date, timestamp and i...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17062#discussion_r105569790 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HashExpressionsSuite.scala (quotes the same test hunk shown above)
[GitHub] spark pull request #17062: [SPARK-17495] [SQL] Support date, timestamp and i...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/17062#discussion_r105265211 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HashExpressionsSuite.scala (quotes the same test hunk shown above, at the "hive-hash for CalendarInterval type" test) --- End diff --

Hive queries for all the tests below. Outputs were generated by running them against Hive 1.2.1:

```
// - MICROSEC -
SELECT HASH(interval_day_time("0 0:0:0.000001"));
SELECT HASH(interval_day_time("-0 0:0:0.000001"));
SELECT HASH(interval_day_time("0 0:0:0.000000"));
SELECT HASH(interval_day_time("0 0:0:0.000999"));
SELECT HASH(interval_day_time("-0 0:0:0.000999"));

// - MILLISEC -
SELECT HASH(interval_day_time("0 0:0:0.001"));
SELECT HASH(interval_day_time("-0 0:0:0.001"));
SELECT HASH(interval_day_time("0 0:0:0.000"));
SELECT HASH(interval_day_time("0 0:0:0.999"));
SELECT HASH(interval_day_time("-0 0:0:0.999"));

// - SECOND -
SELECT HASH(INTERVAL '1' SECOND);
SELECT HASH(INTERVAL '-1' SECOND);
SELECT HASH(INTERVAL '0' SECOND);
SELECT HASH(INTERVAL '2147483647' SECOND);
SELECT HASH(INTERVAL '-2147483648' SECOND);

// - MINUTE -
SELECT HASH(
```
[GitHub] spark pull request #17062: [SPARK-17495] [SQL] Support date, timestamp and i...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/17062#discussion_r105261953 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HashExpressionsSuite.scala (quotes an earlier revision of the test hunk shown above, ending with:)

+  test("hive-hash for CalendarInterval type") {
+    def checkHiveHashForTimestampType(interval: String, expected: Long): Unit = {
+      checkHiveHash(CalendarInterval.fromString(interval), CalendarIntervalType, expected)
+    }
+
+    checkHiveHashForTimestampType("interval 1 day", 3220073)
+    checkHiveHashForTimestampType("interval 6 day 15 hour", 21202073)
+    checkHiveHashForTimestampType("interval -23 day 56 hour -113 minute 9898989 second",
+      -2128468593)

--- End diff --

added
[GitHub] spark pull request #17062: [SPARK-17495] [SQL] Support date, timestamp and i...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/17062#discussion_r105261981

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala ---
@@ -732,6 +741,38 @@ object HiveHashFunction extends InterpretedHashFunction {
     HiveHasher.hashUnsafeBytes(base, offset, len)
   }
 
+  /**
+   * Mimics TimestampWritable.hashCode() in Hive
+   */
+  def hashTimestamp(timestamp: Long): Long = {
+    val timestampInSeconds = timestamp / 1000000
+    val nanoSecondsPortion = (timestamp % 1000000) * 1000
+
+    var result = timestampInSeconds
+    result <<= 30 // the nanosecond part fits in 30 bits
+    result |= nanoSecondsPortion
+    ((result >>> 32) ^ result).toInt
+  }
+
+  /**
+   * Hive allows input intervals to be defined using the units below, but the units
+   * have to be from the same category:
+   * - year, month (stored as HiveIntervalYearMonth)
+   * - day, hour, minute, second, nanosecond (stored as HiveIntervalDayTime)
+   *
+   * eg. (INTERVAL '30' YEAR + INTERVAL '-23' DAY) fails in Hive
+   *
+   * This method mimics HiveIntervalDayTime.hashCode() in Hive. If the `INTERVAL` is backed by
+   * HiveIntervalYearMonth in Hive, this method will not produce a Hive-compatible result,
+   * because Spark's representation of calendar intervals is unified and has no such categories.
+   */
+  def hashCalendarInterval(calendarInterval: CalendarInterval): Long = {
+    val totalSeconds = calendarInterval.milliseconds() / 1000

--- End diff --

Spark's CalendarInterval has precision up to microseconds while Hive can have precision up to nanoseconds, so there is no way for us to support that in the hashing function. I have documented this in the PR.
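The quoted hashTimestamp splits a microsecond timestamp into whole seconds and a nanosecond remainder, packs them as `seconds << 30 | nanos`, and XOR-folds the two 32-bit halves of the resulting long. A Python sketch of that fold (for non-negative timestamps; the 64-bit long wraparound and the final `.toInt` are emulated with masks) reproduces the test vectors from the suite above:

```python
def hive_hash_timestamp(micros: int) -> int:
    """Sketch of the hashTimestamp fold for a non-negative timestamp
    given as microseconds since the Unix epoch."""
    seconds = micros // 1_000_000             # whole seconds
    nanos = (micros % 1_000_000) * 1000       # remainder, in nanoseconds
    packed = ((seconds << 30) | nanos) & ((1 << 64) - 1)  # 64-bit long wrap
    h = ((packed >> 32) ^ packed) & 0xFFFFFFFF            # (x >>> 32) ^ x
    return h - (1 << 32) if h >= (1 << 31) else h         # Scala .toInt

# 2017-02-24 10:56:29 UTC is 1487933789 seconds since the epoch
print(hive_hash_timestamp(1_487_933_789 * 1_000_000))  # 1445725271
```

The epoch maps to 0 because both the seconds and nanos halves are zero, matching the `"1970-01-01 00:00:00"` case in the test.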
[GitHub] spark pull request #17062: [SPARK-17495] [SQL] Support date, timestamp and i...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17062#discussion_r104282934 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala (quotes the same hashTimestamp / hashCalendarInterval hunk shown above, at `val totalSeconds = calendarInterval.milliseconds() / 1000`) --- End diff --

How does Hive deal with nanoseconds, if we divide it by 1000?
[GitHub] spark pull request #17062: [SPARK-17495] [SQL] Support date, timestamp and i...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17062#discussion_r104282564 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HashExpressionsSuite.scala ---

    checkHiveHashForTimestampType("interval 1 day", 3220073)
    checkHiveHashForTimestampType("interval 6 day 15 hour", 21202073)
    checkHiveHashForTimestampType("interval -23 day 56 hour -113 minute 9898989 second",
      -2128468593)

--- End diff --

Could you add more test cases?

```
checkHiveHashForTimestampType("interval 0 day 0 hour 0 minute 0 second", 23273)
checkHiveHashForTimestampType("interval 0 day 0 hour", 23273)
checkHiveHashForTimestampType("interval -1 day", 3220036)
```
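The interval constants in this thread (3220073 for "interval 1 day", 21202073 for "interval 6 day 15 hour", 23273 for a zero-length interval, 3220036 for "interval -1 day") are all reproduced by the scheme in Hive's `HiveIntervalDayTime.hashCode()`: a Commons-Lang `HashCodeBuilder` with the default (17, 37) seed and multiplier applied to the interval's total seconds and its nanosecond remainder. A minimal sketch reconstructed from those constants (my reading, not Spark's actual implementation):

```java
// Sketch of the hive-hash value for CalendarIntervalType, reconstructed from the
// expected constants in this thread. It matches HiveIntervalDayTime.hashCode(),
// i.e. a Commons-Lang HashCodeBuilder with the default (17, 37) seed/multiplier
// applied to the interval's total seconds and its nanosecond remainder.
class HiveIntervalHashSketch {
    static int hashInterval(long totalSeconds, int nanos) {
        int h = 17;
        h = h * 37 + (int) (totalSeconds ^ (totalSeconds >>> 32));  // append(long)
        h = h * 37 + nanos;                                         // append(int)
        return h;
    }

    public static void main(String[] args) {
        System.out.println(hashInterval(86400L, 0));   // "interval 1 day"         -> 3220073
        System.out.println(hashInterval(572400L, 0));  // "interval 6 day 15 hour" -> 21202073
        System.out.println(hashInterval(0L, 0));       // zero-length interval     -> 23273
        System.out.println(hashInterval(-86400L, 0));  // "interval -1 day"        -> 3220036
    }
}
```

The XOR-fold on the `long` is what makes "interval -1 day" hash to 3220036 rather than a mirror of the positive case: the sign-extended high word folds into the low word.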
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/17062#discussion_r103357272 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HashExpressionsSuite.scala ---

    checkHiveHashForTimestampType("interval 1 day", 3220073)

--- End diff --

Corresponding Hive query:

```
SELECT HASH ( INTERVAL '1' DAY );
```
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/17062#discussion_r103357588 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HashExpressionsSuite.scala ---

    checkHiveHashForTimestampType("interval -23 day 56 hour -113 minute 9898989 second",
      -2128468593)

--- End diff --

Corresponding Hive query:

```
SELECT HASH ( INTERVAL '-23' DAY + INTERVAL '56' HOUR + INTERVAL '-113' MINUTE + INTERVAL '9898989' SECOND );
```
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/17062#discussion_r103300592 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HashExpressionsSuite.scala ---

    // Invalid input: bad timestamp string. Hive returns 0 for such cases

--- End diff --

Same as `Date`: invalid timestamp values are not allowed in Spark, so it will fail. Hive does not fail but falls back to `null` and returns `0` as the hash value.
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/17062#discussion_r103281696 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HashExpressionsSuite.scala ---

    // basic case
    checkHiveHashForDateType("2017-01-01", 17167)

--- End diff --

Expected values were computed with Hive 1.2 using:

```
SELECT HASH( CAST( "2017-01-01" AS DATE) )
```
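The expected values in the date test line up with the hash for `DateType` being simply the day count since the Unix epoch: 17167 days from 1970-01-01 to 2017-01-01, 0 for the epoch itself, and -62091 for 1800-01-01. A minimal sketch of that reading, reconstructed from the test constants rather than taken from Hive's source:

```java
import java.time.LocalDate;

// Sketch of the hive-hash value for DateType, reconstructed from the expected
// constants in this test: the hash equals the day count since 1970-01-01.
class HiveDateHashSketch {
    static int hashDate(LocalDate date) {
        // LocalDate.toEpochDay() is negative for dates before the epoch.
        return (int) date.toEpochDay();
    }

    public static void main(String[] args) {
        System.out.println(hashDate(LocalDate.of(2017, 1, 1)));  // 17167
        System.out.println(hashDate(LocalDate.of(1970, 1, 1)));  // 0
        System.out.println(hashDate(LocalDate.of(1800, 1, 1)));  // -62091
    }
}
```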
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/17062#discussion_r103300013 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HashExpressionsSuite.scala ---

    // Invalid input: bad date string. Hive returns 0 for such cases

--- End diff --

Spark does not allow creating a `Date` that does not fit its spec and throws an exception. Hive does not fail but falls back to `null` and returns `0` as the hash value.
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/17062#discussion_r103357472 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HashExpressionsSuite.scala ---

    checkHiveHashForTimestampType("interval 6 day 15 hour", 21202073)

--- End diff --

Corresponding Hive query:

```
SELECT HASH ( INTERVAL '6' DAY + INTERVAL '15' HOUR );
```
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/17062#discussion_r103300293 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HashExpressionsSuite.scala ---

    // basic case
    checkHiveHashForTimestampType("2017-02-24 10:56:29", 1445725271)

--- End diff --

Corresponding Hive query:

```
select HASH(CAST("2017-02-24 10:56:29" AS TIMESTAMP));
```

Note that this is with the system timezone set to UTC (`export TZ=/usr/share/zoneinfo/UTC`).
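The `TimestampWritable.hashCode()` scheme the PR references packs the seconds-since-epoch into the upper bits of a `long`, ORs the nanosecond part into the low 30 bits, and XOR-folds the two 32-bit halves. Under the UTC setting noted above, "2017-02-24 10:56:29" is 1487933789 epoch seconds, and this reconstruction reproduces the expected 1445725271; the US/Pacific variant (1487962589 epoch seconds) gives 1445732471. A sketch derived from the whole-second test constants, not Spark's actual code:

```java
// Sketch of the hive-hash value for TimestampType, following the shape of
// Hive's TimestampWritable.hashCode(): shift the epoch seconds left so the
// nanosecond part fits in the low 30 bits, then XOR-fold the 32-bit halves.
class HiveTimestampHashSketch {
    static int hashTimestamp(long epochSeconds, int nanos) {
        long v = epochSeconds << 30;  // nanoseconds fit in 30 bits
        v |= nanos;
        return (int) ((v >>> 32) ^ v);
    }

    public static void main(String[] args) {
        // 2017-02-24 10:56:29 UTC == 1487933789 epoch seconds
        System.out.println(hashTimestamp(1487933789L, 0));  // 1445725271
        // same wall-clock time in US/Pacific (UTC-8) == 1487962589 epoch seconds
        System.out.println(hashTimestamp(1487962589L, 0));  // 1445732471
        System.out.println(hashTimestamp(0L, 0));           // 0 (epoch)
    }
}
```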
GitHub user tejasapatil opened a pull request: https://github.com/apache/spark/pull/17062

[SPARK-17495] [SQL] Support date, timestamp and interval types in Hive hash

## What changes were proposed in this pull request?

- Timestamp hashing is done as per [TimestampWritable.hashCode()](https://github.com/apache/hive/blob/ff67cdda1c538dc65087878eeba3e165cf3230f4/serde/src/java/org/apache/hadoop/hive/serde2/io/TimestampWritable.java#L406) in Hive.
- Interval hashing is done as per [HiveIntervalDayTime.hashCode()](https://github.com/apache/hive/blob/ff67cdda1c538dc65087878eeba3e165cf3230f4/storage-api/src/java/org/apache/hadoop/hive/common/type/HiveIntervalDayTime.java#L178). Note that there are inherent differences in how Hive and Spark store intervals under the hood, which limits the ability to be completely in sync with Hive's hashing function. I have explained this in the method doc.
- Date type was already supported. This PR adds tests for it.

## How was this patch tested?

Added unit tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tejasapatil/spark SPARK-17495_time_related_types

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17062.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17062

commit cc359fc45547b7ba3fd4c1d11d3dcfbaf71ea66a
Author: Tejas Patil
Date: 2017-02-25T00:18:03Z
[SPARK-17495] [SQL] Support date, timestamp datatypes in Hive hash

commit 332475c1641f61080aa41dda9f1ceec237351d75
Author: Tejas Patil
Date: 2017-02-25T02:23:41Z
minor refac