[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user tedyu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10435#discussion_r48869485

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
    @@ -1863,6 +1863,17 @@ object functions extends LegacyFunctions {
        */
       def crc32(e: Column): Column = withExpr { Crc32(e.expr) }
    +  /**
    +   * Calculates the hash code of given columns, and returns the result as a int column.
    +   *
    +   * @group misc_funcs
    +   * @since 2.0
    +   */
    +  @scala.annotation.varargs
    +  def hash(col: Column, cols: Column*): Column = withExpr {
    --- End diff --

    You can use the following form:

    (firstarg:Int)(more:Int*)

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
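For illustration only, the two-parameter-list form suggested above might look like the sketch below. The object name `CurriedVarargs` and the use of `Int` in place of `Column` are assumptions made to keep the example self-contained; this is not the signature that was merged.

```scala
object CurriedVarargs {
  // Hypothetical stand-in for the PR's hash(col, cols*): the first argument
  // is mandatory, and the second parameter list takes zero or more extras.
  def hash(first: Int)(more: Int*): Seq[Int] = first +: more
}
```

A call then needs both argument lists, for example `CurriedVarargs.hash(1)(2, 3)`, or `CurriedVarargs.hash(1)()` when there are no extra arguments.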
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user nongli commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-168801707 LGTM
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user nongli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10435#discussion_r48778819

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala ---
    @@ -177,3 +179,44 @@ case class Crc32(child: Expression) extends UnaryExpression with ImplicitCastInp
         })
       }
     }
    +
    +/**
    + * A function that calculates hash value for a group of expressions.
    + *
    + * Internally this function will write arguments into an [[UnsafeRow]], and calculate hash code of
    + * the unsafe row using murmur3 hasher with a seed.
    + * We should use this hash function for both shuffle and bucket, so that we can guarantee shuffle
    + * and bucketing have same data distribution.
    + */
    +case class Murmur3Hash(children: Seq[Expression], seed: Int) extends Expression {
    +  def this(arguments: Seq[Expression]) = this(arguments, 42)
    --- End diff --

    I think this is fine. Can you file a follow-up JIRA to look at this again? I think we want to remove the projection to unsafe row soon (before we ship this and persist metadata that way). This should ideally be decoupled from unsafe row.

    For example, if the row is (int, double, string), the generated hash function should be something like:

    int hash = seed;
    hash = murmur3(getInt(0), hash)
    hash = murmur3(getDouble(1), hash)
    hash = murmur3(getString(2), hash)
    return hash

    This is likely not the currently computed hash value, so we can't defer this for too long.
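As a rough, runnable sketch of the per-field chaining described above: each field is hashed with the running hash as its seed. The object and method names here are hypothetical, and the code uses the Scala standard library's `scala.util.hashing.MurmurHash3` rather than Spark's `Murmur3_x86_32`, so the values it produces are not claimed to match anything Spark computes.

```scala
import java.nio.charset.StandardCharsets
import scala.util.hashing.MurmurHash3

object ChainedHashSketch {
  // Hash one 32-bit value, using the running hash as the seed.
  def hashInt(value: Int, seed: Int): Int =
    MurmurHash3.finalizeHash(MurmurHash3.mix(seed, value), 4)

  // Hash a 64-bit value as two 32-bit words (low word first).
  def hashLong(value: Long, seed: Int): Int = {
    val low = value.toInt
    val high = (value >>> 32).toInt
    MurmurHash3.finalizeHash(MurmurHash3.mix(MurmurHash3.mix(seed, low), high), 8)
  }

  // Doubles are hashed through their IEEE 754 bit pattern.
  def hashDouble(value: Double, seed: Int): Int =
    hashLong(java.lang.Double.doubleToLongBits(value), seed)

  // Strings are hashed through their UTF-8 bytes.
  def hashString(value: String, seed: Int): Int =
    MurmurHash3.bytesHash(value.getBytes(StandardCharsets.UTF_8), seed)

  // Row of (int, double, string): thread the hash through each field in order.
  // The default seed of 42 mirrors the value used in the diff.
  def hashRow(i: Int, d: Double, s: String, seed: Int = 42): Int = {
    var hash = seed
    hash = hashInt(i, hash)
    hash = hashDouble(d, hash)
    hash = hashString(s, hash)
    hash
  }
}
```

Because nothing here goes through a row format, the result depends only on the field values, their order, and the seed, which is the decoupling the comment asks for.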
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-168614902 **[Test build #48644 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48644/consoleFull)** for PR 10435 at commit [`aa57583`](https://github.com/apache/spark/commit/aa575834877bfa8856aa4cea6f6149fe08cb18b5). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-168614906 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-168614909 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48644/ Test FAILed.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-168652917 **[Test build #48656 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48656/consoleFull)** for PR 10435 at commit [`2c1e963`](https://github.com/apache/spark/commit/2c1e963d5d1e0c286f968ae6278785b5d9586b11).
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-168614140 **[Test build #48644 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48644/consoleFull)** for PR 10435 at commit [`aa57583`](https://github.com/apache/spark/commit/aa575834877bfa8856aa4cea6f6149fe08cb18b5).
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10435#discussion_r48808255

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
    @@ -1863,6 +1863,17 @@ object functions extends LegacyFunctions {
        */
       def crc32(e: Column): Column = withExpr { Crc32(e.expr) }
    +  /**
    +   * Calculates the hash code of given columns, and returns the result as a int column.
    +   *
    +   * @group misc_funcs
    +   * @since 2.0
    +   */
    +  @scala.annotation.varargs
    +  def hash(col: Column, cols: Column*): Column = withExpr {
    --- End diff --

    The hash function should take at least one parameter. Does `@scala.annotation.varargs` support this?
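To make the trade-off in this exchange concrete, here is a small sketch of the two shapes under discussion. The object and method names are hypothetical, and `String` stands in for `Column`; it only illustrates where the "at least one argument" rule gets enforced, not how Spark implements `hash`.

```scala
import scala.annotation.varargs

object VarargsShapes {
  // First-plus-rest: a call with no arguments does not compile,
  // so the "at least one" rule is enforced by the signature itself.
  @varargs
  def atLeastOne(first: String, rest: String*): Seq[String] = first +: rest

  // Single vararg: a call with no arguments compiles,
  // so emptiness has to be checked at runtime instead.
  @varargs
  def anyNumber(all: String*): Seq[String] = {
    require(all.nonEmpty, "at least one column is required")
    all
  }
}
```

With the first shape, `VarargsShapes.atLeastOne()` is rejected by the compiler; with the second, `VarargsShapes.anyNumber()` compiles and fails with an `IllegalArgumentException` when it runs.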
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-168877190 I've merged this. You can address the API comment in the next pull request. Thanks.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10435#discussion_r48808153

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
    @@ -1863,6 +1863,17 @@ object functions extends LegacyFunctions {
        */
       def crc32(e: Column): Column = withExpr { Crc32(e.expr) }
    +  /**
    +   * Calculates the hash code of given columns, and returns the result as a int column.
    +   *
    +   * @group misc_funcs
    +   * @since 2.0
    +   */
    +  @scala.annotation.varargs
    +  def hash(col: Column, cols: Column*): Column = withExpr {
    --- End diff --

    Shouldn't this just be a single column vararg, rather than one column followed by a vararg?
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-168856535 **[Test build #48699 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48699/consoleFull)** for PR 10435 at commit [`9a978c4`](https://github.com/apache/spark/commit/9a978c411e9799be1f3b76e6ea20dba281efc4e7).
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-168876234 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48699/ Test PASSed.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-168876055 **[Test build #48699 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48699/consoleFull)** for PR 10435 at commit [`9a978c4`](https://github.com/apache/spark/commit/9a978c411e9799be1f3b76e6ea20dba281efc4e7). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-168876233 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/10435
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-168674438 **[Test build #48656 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48656/consoleFull)** for PR 10435 at commit [`2c1e963`](https://github.com/apache/spark/commit/2c1e963d5d1e0c286f968ae6278785b5d9586b11). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-168674649 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-168674651 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48656/ Test PASSed.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user cloud-fan commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-168700418 I think we should open another PR to use this hash expression in `Exchange`, as it will break a lot of tests and make this one harder to review.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10435#discussion_r48705398

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala ---
    @@ -278,6 +278,7 @@ object FunctionRegistry {
         // misc functions
         expression[Crc32]("crc32"),
         expression[Md5]("md5"),
    +    expression[Murmur3Hash]("murmur3_hash"),
    --- End diff --

    sounds good to me, let me try it out.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10435#discussion_r48703350

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala ---
    @@ -278,6 +278,7 @@ object FunctionRegistry {
         // misc functions
         expression[Crc32]("crc32"),
         expression[Md5]("md5"),
    +    expression[Murmur3Hash]("murmur3_hash"),
    --- End diff --

    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48241/consoleFull

    Most of them are testing something else but coincidentally include a hash expression.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10435#discussion_r48705356

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala ---
    @@ -278,6 +278,7 @@ object FunctionRegistry {
         // misc functions
         expression[Crc32]("crc32"),
         expression[Md5]("md5"),
    +    expression[Murmur3Hash]("murmur3_hash"),
    --- End diff --

    maybe we can have a flag to control this -- when in hive compatibility test, fall back to Hive's, and otherwise our own?
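One way to picture the flag idea is a name lookup that chooses an implementation based on a compatibility switch. Everything in this sketch is hypothetical: the object, the `hiveCompatible` parameter, and both placeholder hash functions, neither of which reproduces Hive's or Spark's actual hash. It is not how `FunctionRegistry` works.

```scala
import scala.util.hashing.MurmurHash3

object HashNameResolution {
  type HashFn = Seq[Any] => Int

  // Placeholder implementations for illustration only.
  val hiveStyle: HashFn = args => args.foldLeft(0)((h, a) => 31 * h + a.##)
  val murmurStyle: HashFn = args => MurmurHash3.seqHash(args)

  // "hash" resolves to one of the two implementations depending on the flag;
  // "murmur3_hash" always resolves to the murmur-based one.
  def resolve(name: String, hiveCompatible: Boolean): Option[HashFn] = name match {
    case "hash"         => Some(if (hiveCompatible) hiveStyle else murmurStyle)
    case "murmur3_hash" => Some(murmurStyle)
    case _              => None
  }
}
```

Under this scheme the compatibility tests would run with the flag set, and everything else would get the new implementation under the shorter name.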
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10435#discussion_r48676826

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala ---
    @@ -278,6 +278,7 @@ object FunctionRegistry {
         // misc functions
         expression[Crc32]("crc32"),
         expression[Md5]("md5"),
    +    expression[Murmur3Hash]("murmur3_hash"),
    --- End diff --

    Can you give me a list? I think we should consider just blacklisting them ...
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-168125552 **[Test build #48542 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48542/consoleFull)** for PR 10435 at commit [`b95e64e`](https://github.com/apache/spark/commit/b95e64ee0f539770534fe302321e677e3fad4a7e).
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-168134147 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48542/ Test PASSed.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-168134146 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-168123375 **[Test build #48540 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48540/consoleFull)** for PR 10435 at commit [`61783e7`](https://github.com/apache/spark/commit/61783e7bb9ea9c6f622d18db06dc85d71e6443a8).
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10435#discussion_r48646962

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala ---
    @@ -278,6 +278,7 @@ object FunctionRegistry {
         // misc functions
         expression[Crc32]("crc32"),
         expression[Md5]("md5"),
    +    expression[Murmur3Hash]("murmur3_hash"),
    --- End diff --

    35 tests
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10435#discussion_r48646170

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala ---
    @@ -278,6 +278,7 @@ object FunctionRegistry {
         // misc functions
         expression[Crc32]("crc32"),
         expression[Md5]("md5"),
    +    expression[Murmur3Hash]("murmur3_hash"),
    --- End diff --

    I didn't use `hash` for the name here, as it would break a lot of Hive compatibility tests.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-168128676 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48540/ Test PASSed.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-168128648 **[Test build #48540 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48540/consoleFull)** for PR 10435 at commit [`61783e7`](https://github.com/apache/spark/commit/61783e7bb9ea9c6f622d18db06dc85d71e6443a8). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class Murmur3Hash(children: Seq[Expression], seed: Int) extends Expression `
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-168128675 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10435#discussion_r48646183

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala ---
    @@ -177,3 +179,44 @@ case class Crc32(child: Expression) extends UnaryExpression with ImplicitCastInp
         })
       }
     }
    +
    +/**
    + * A function that calculates hash value for a group of expressions.
    + *
    + * Internally this function will write arguments into an [[UnsafeRow]], and calculate hash code of
    + * the unsafe row using murmur3 hasher with a seed.
    + * We should use this hash function for both shuffle and bucket, so that we can guarantee shuffle
    + * and bucketing have same data distribution.
    + */
    +case class Murmur3Hash(children: Seq[Expression], seed: Int) extends Expression {
    +  def this(arguments: Seq[Expression]) = this(arguments, 42)
    --- End diff --

    This uses 42 as the default seed, which is the same as `UnsafeRow.hashCode`. Should we make `42` a named constant in `Murmur3_x86_32`?
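A minimal sketch of what a shared seed constant could look like follows. The names `SeedConstantSketch`, `DefaultSeed`, and `Murmur3HashSpec` are hypothetical, and `String` stands in for `Expression`; the thread proposes putting the constant in `Murmur3_x86_32`, which is not modelled here.

```scala
object SeedConstantSketch {
  // Hypothetical single home for the default seed, so that every caller
  // refers to the same named value instead of repeating the literal 42.
  val DefaultSeed: Int = 42

  final case class Murmur3HashSpec(children: Seq[String], seed: Int) {
    // Auxiliary constructor mirroring the one in the diff, but using the constant.
    def this(children: Seq[String]) = this(children, DefaultSeed)
  }
}
```

Callers that omit the seed, as in `new SeedConstantSketch.Murmur3HashSpec(Seq("a"))`, then pick up the constant, and any other code that needs to agree on the seed can reference `DefaultSeed` directly.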
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10435#discussion_r48646861

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala ---
    @@ -278,6 +278,7 @@ object FunctionRegistry {
         // misc functions
         expression[Crc32]("crc32"),
         expression[Md5]("md5"),
    +    expression[Murmur3Hash]("murmur3_hash"),
    --- End diff --

    How many does it break?
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-168047956 On the contrary, I think we should consider not using hash code on an object, and always use a hash code expression, for three reasons: 1. We still need a hash function. 2. We get codegen by using an expression. 3. It is easier to control (being able to pass a seed, or use it for Bloom filters).
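The third reason can be illustrated with a minimal sketch: when the hash function takes a seed, a single implementation yields a whole family of hash functions, which is what a Bloom filter needs. The sketch assumes `scala.util.hashing.MurmurHash3` from the Scala standard library as a stand-in for Spark's hasher; `SeededHashSketch`, `bloomPositions`, and the parameter names are hypothetical and do not come from the pull request.

```scala
import scala.util.hashing.MurmurHash3

object SeededHashSketch {
  // Bit positions to set (or test) for one key in a Bloom filter of `numBits` bits.
  // Each seed in 0 until numHashes acts as a separate hash function.
  def bloomPositions(key: String, numHashes: Int, numBits: Int): Seq[Int] =
    (0 until numHashes).map { seed =>
      val h = MurmurHash3.stringHash(key, seed)
      // Non-negative modulo, so negative hash codes still map to a valid bit.
      ((h % numBits) + numBits) % numBits
    }
}
```

Calling `SeededHashSketch.bloomPositions("spark", 3, 1024)` returns three positions in `[0, 1024)`, and the same key always maps to the same positions.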
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user nongli commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-168072275 It makes sense to still have a Hash expression (called, more specifically, Murmur3Hash) that does what this patch originally intended. I think this will be a useful primitive. The underlying implementation can just use UnsafeRow.hashCode for now.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
GitHub user cloud-fan reopened a pull request: https://github.com/apache/spark/pull/10435 [SPARK-12480][SQL] add Hash expression that can calculate hash value for a group of expressions The hash algorithm is based on https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/rows.scala#L130-L158 Also use this expression in `Exchange`. You can merge this pull request into a Git repository by running: $ git pull https://github.com/cloud-fan/spark hash-expr Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10435.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10435 commit c8b0ea65b147c898467c00d551455abae74eddf0 Author: Wenchen Fan Date: 2015-12-22T15:22:59Z add hash expression commit 8a89287ed42aadde3d51d95c389ddb1e98c3ccf2 Author: Wenchen Fan Date: 2015-12-23T11:48:37Z address comments commit 1cdb2bcc1ede58fdd9c1e98bff4b5544b8a6e74e Author: Wenchen Fan Date: 2015-12-24T01:28:54Z fix hash code commit 8703b1a127235c49614d326334548f125b81383b Author: Wenchen Fan Date: 2015-12-30T01:58:54Z add comment to explain the benefit of hash expression
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-168134068 **[Test build #48542 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48542/consoleFull)** for PR 10435 at commit [`b95e64e`](https://github.com/apache/spark/commit/b95e64ee0f539770534fe302321e677e3fad4a7e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user cloud-fan commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-168017519 Closing, will open another PR to use `UnsafeRow.hashCode` for shuffle and fix tests.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user cloud-fan closed the pull request at: https://github.com/apache/spark/pull/10435
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/10435#discussion_r48554681 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala --- @@ -176,3 +179,223 @@ case class Crc32(child: Expression) extends UnaryExpression with ImplicitCastInp }) } } + +/** + * A function that calculates hash value for a group of expressions. + * + * The hash value for an expression depends on its type: + * - null: 0 + * - boolean:0 for true, 1 for false. --- End diff -- Let's also add comments to explain the benefit of this function.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-167918200 **[Test build #48442 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48442/consoleFull)** for PR 10435 at commit [`8703b1a`](https://github.com/apache/spark/commit/8703b1a127235c49614d326334548f125b81383b).
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user nongli commented on a diff in the pull request: https://github.com/apache/spark/pull/10435#discussion_r48588221 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala --- @@ -176,3 +179,229 @@ case class Crc32(child: Expression) extends UnaryExpression with ImplicitCastInp }) } } + +/** + * A function that calculates hash value for a group of expressions. + * + * The hash value for an expression depends on its type: + * - null: 0 + * - boolean:0 for true, 1 for false. + * - byte, short, int: the input itself. + * - long: input XOR (input >>> 32) + * - float: java.lang.Float.floatToIntBits(input) + * - double: l = java.lang.Double.doubleToLongBits(input); l XOR (l >>> 32) + * - binary: java.util.Arrays.hashCode(input) + * - array: recursively calculate hash value for each element, and aggregate them by + *`result = result * 37 + elementHash` with an initial value `result = 37`. + * - map:recursively calculate hash value for each key-value pair, and aggregate + *them by `result += keyHash XOR valueHash`. + * - struct: similar to array, calculate hash value for each field and aggregate them. + * - other type: input.hashCode(). + *e.g. calculate hash value for string type by `UTF8String.hashCode()`. + * Finally we aggregate the hash values for each expression by the same way of array. + * + * This hash algorithm is basically same with `GenericInternalRow.hashCode`, but using this hash + * expression is better as it can produce consistent hash values between safe and unsafe data + * structure, and can be slightly faster by codegen. + * It's also the hash function for both shuffle and bucketing, so that we can guarantee shuffle and + * bucketing have same data distribution. + */ +case class Hash(children: Seq[Expression]) extends Expression { --- End diff -- What if we just turned this into Murmur3Hash instead?
This would just do `UnsafeProjection.create()` and then `project(input).hashCode()`. Murmur3 will give us much nicer hashing properties. The current hash function can be bad in reasonable cases. For example, suppose the long column is a timestamp in millis from a source that samples every second. Most of the low digits will be similar (e.g. values are 1000, 2002, 2999, etc., with very few ending in 500). The hash function does a very bad job of breaking this up, and this will generate some very skewed partitions.
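The skew described here can be reproduced with a small, self-contained sketch. It applies the `long` rule from the doc comment (`input XOR (input >>> 32)`) and, as a stand-in for Spark's Murmur3 implementation, the mixing primitives of `scala.util.hashing.MurmurHash3` from the Scala standard library. The object name, the partition count of 200, and the sample timestamps are assumptions made for illustration; the Murmur3 values will not match what `UnsafeRow.hashCode` returns.

```scala
import scala.util.hashing.MurmurHash3

object SkewSketch {
  // The `long` rule from the doc comment: input XOR (input >>> 32).
  def foldHash(l: Long): Int = (l ^ (l >>> 32)).toInt

  // Seeded Murmur3 over the two 32-bit halves of the long.
  // Stand-in only: this is not Spark's Murmur3_x86_32.
  def murmurHash(l: Long): Int = {
    val seed = 42
    val h1 = MurmurHash3.mix(seed, l.toInt)
    val h2 = MurmurHash3.mixLast(h1, (l >>> 32).toInt)
    MurmurHash3.finalizeHash(h2, 2)
  }

  // Non-negative modulo, so negative hash codes still map to a valid partition.
  def partition(hash: Int, numPartitions: Int): Int =
    ((hash % numPartitions) + numPartitions) % numPartitions

  // Number of keys that land in each partition.
  def histogram(keys: Seq[Long], hash: Long => Int, numPartitions: Int): Vector[Int] = {
    val counts = Array.fill(numPartitions)(0)
    keys.foreach(k => counts(partition(hash(k), numPartitions)) += 1)
    counts.toVector
  }

  // Hypothetical timestamps in milliseconds, sampled exactly once per second.
  val timestamps: Seq[Long] = (0L until 10000L).map(i => 1450000000000L + i * 1000L)
}
```

Comparing `SkewSketch.histogram(SkewSketch.timestamps, SkewSketch.foldHash, 200).count(_ > 0)` with the same call using `SkewSketch.murmurHash` shows how many partitions actually receive rows. Over this range the upper 32 bits of every key are identical and the keys differ by multiples of 1000, so the fold hash concentrates all rows in a small number of partitions, while the Murmur3 stand-in spreads them much more widely.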
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/10435#discussion_r48589312 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala --- @@ -176,3 +179,229 @@ case class Crc32(child: Expression) extends UnaryExpression with ImplicitCastInp }) } } + +/** + * A function that calculates hash value for a group of expressions. + * + * The hash value for an expression depends on its type: + * - null: 0 + * - boolean:0 for true, 1 for false. + * - byte, short, int: the input itself. + * - long: input XOR (input >>> 32) + * - float: java.lang.Float.floatToIntBits(input) + * - double: l = java.lang.Double.doubleToLongBits(input); l XOR (l >>> 32) + * - binary: java.util.Arrays.hashCode(input) + * - array: recursively calculate hash value for each element, and aggregate them by + *`result = result * 37 + elementHash` with an initial value `result = 37`. + * - map:recursively calculate hash value for each key-value pair, and aggregate + *them by `result += keyHash XOR valueHash`. + * - struct: similar to array, calculate hash value for each field and aggregate them. + * - other type: input.hashCode(). + *e.g. calculate hash value for string type by `UTF8String.hashCode()`. + * Finally we aggregate the hash values for each expression by the same way of array. + * + * This hash algorithm is basically same with `GenericInternalRow.hashCode`, but using this hash + * expression is better as it can produce consistent hash values between safe and unsafe data + * structure, and can be slightly faster by codegen. + * It's also the hash function for both shuffle and bucketing, so that we can guarantee shuffle and + * bucketing have same data distribution. + */ +case class Hash(children: Seq[Expression]) extends Expression { --- End diff -- Good point! Since we decided not to follow Hive, I agree Murmur3Hash is a better choice.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user nongli commented on a diff in the pull request: https://github.com/apache/spark/pull/10435#discussion_r48587902 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala --- @@ -176,3 +179,229 @@ case class Crc32(child: Expression) extends UnaryExpression with ImplicitCastInp }) } } + +/** + * A function that calculates hash value for a group of expressions. + * + * The hash value for an expression depends on its type: + * - null: 0 --- End diff -- If we're going to change this, we should use a different value for null. Pick a large random number instead. 0 will be computed as the hash for more reasonable data (e.g. more likely an int column contains 0 than a large prime) and we can cheaply reduce some collisions.
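A minimal sketch of the collision in question, assuming nullable int columns combined with the `result = result * 37 + hash` rule from the doc comment. `NullHashSketch`, `rowHash`, and the `Sentinel` value are hypothetical; the sentinel is an arbitrary constant chosen for illustration, not a value proposed in the review.

```scala
object NullHashSketch {
  // Combine per-column hashes as in the doc comment:
  // start at 37, then result = result * 37 + columnHash.
  // An int hashes to itself; a null hashes to `nullHash`.
  def rowHash(values: Seq[Option[Int]], nullHash: Int): Int =
    values.foldLeft(37) { (result, v) => result * 37 + v.getOrElse(nullHash) }

  // Arbitrary illustrative constant standing in for "a large random number".
  val Sentinel: Int = 0x5bd1e995
}
```

With `nullHash = 0`, the rows `(null, 5)` and `(0, 5)` hash to the same value. With `nullHash = Sentinel` they differ, and a null now collides only with a column that happens to contain the sentinel itself, which is far less likely in real data than a column containing 0.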
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-167932291 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48442/ Test PASSed.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-167932253 **[Test build #48442 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48442/consoleFull)** for PR 10435 at commit [`8703b1a`](https://github.com/apache/spark/commit/8703b1a127235c49614d326334548f125b81383b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-167932288 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-167747465 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-167747467 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48404/ Test PASSed.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-167747137 **[Test build #48404 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48404/consoleFull)** for PR 10435 at commit [`1cdb2bc`](https://github.com/apache/spark/commit/1cdb2bcc1ede58fdd9c1e98bff4b5544b8a6e74e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/10435#discussion_r48527967 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala --- @@ -176,3 +179,223 @@ case class Crc32(child: Expression) extends UnaryExpression with ImplicitCastInp }) } } + +/** + * A function that calculates hash value for a group of expressions. + * + * The hash value for an expression depends on its type: + * - null: 0 + * - boolean:0 for true, 1 for false. --- End diff -- Looks like in Hive hash value of boolean is 1 for true and 0 for false?
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/10435#discussion_r48528123 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala --- @@ -176,3 +179,223 @@ case class Crc32(child: Expression) extends UnaryExpression with ImplicitCastInp }) } } + +/** + * A function that calculates hash value for a group of expressions. + * + * The hash value for an expression depends on its type: + * - null: 0 + * - boolean:0 for true, 1 for false. + * - byte, short, int: the input itself. + * - long: input XOR (input >>> 32) + * - float: java.lang.Float.floatToIntBits(input) + * - double: l = java.lang.Double.doubleToLongBits(input); l XOR (l >>> 32) + * - binary: java.util.Arrays.hashCode(input) + * - array: recursively calculate hash value for each element, and aggregate them by + *`result = result * 37 + elementHash` with an initial value `result = 37`. + * - map:recursively calculate hash value for each key-value pair, and aggregate + *them by `result += keyHash XOR valueHash`. + * - struct: similar to array, calculate hash value for each field and aggregate them. + * - other type: input.hashCode(). + *e.g. calculate hash value for string type by `UTF8String.hashCode()`. + * Finally we aggregate the hash values for each expression by the same way of array. 
+ */ +case class Hash(children: Seq[Expression]) extends Expression { + + override def dataType: DataType = IntegerType + + override def foldable: Boolean = children.forall(_.foldable) + + override def nullable: Boolean = false + + override def checkInputDataTypes(): TypeCheckResult = { +if (children.isEmpty) { + TypeCheckResult.TypeCheckFailure("input to function hash cannot be empty") +} else { + TypeCheckResult.TypeCheckSuccess +} + } + + override def eval(input: InternalRow): Any = { +var result = 37 +for (e <- children) { + val hashValue = computeHash(e.eval(input), e.dataType) + result = result * 37 + hashValue +} +result + } + + private def computeHash(v: Any, dataType: DataType): Int = v match { +case null => 0 +case b: Boolean => if (b) 0 else 1 +case b: Byte => b.toInt +case s: Short => s.toInt +case i: Int => i +case l: Long => (l ^ (l >>> 32)).toInt +case f: Float => java.lang.Float.floatToIntBits(f) +case d: Double => + val b = java.lang.Double.doubleToLongBits(d) + (b ^ (b >>> 32)).toInt +case a: Array[Byte] => java.util.Arrays.hashCode(a) + +case array: ArrayData => + val elementType = dataType.asInstanceOf[ArrayType].elementType + var result = 0 + var i = 0 + while (i < array.numElements()) { +val hashValue = computeHash(array.get(i, elementType), elementType) +result = result * 37 + hashValue --- End diff -- 37? Looks like Hive uses 31.
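As a rough illustration of why the multiplier matters for compatibility, the sketch below applies the aggregation rule `result = result * m + elementHash` with the two parameter sets that appear in this thread: multiplier 31 with initial value 0 (the Hive-following revision's doc comment) and multiplier 37 with initial value 37 (this revision's doc comment). `CombineSketch` and `combine` are hypothetical names.

```scala
object CombineSketch {
  // Polynomial aggregation of element hashes:
  // result starts at `init`, then result = result * multiplier + elementHash.
  // Int arithmetic wraps on overflow, as it does in the diff.
  def combine(elementHashes: Seq[Int], init: Int, multiplier: Int): Int =
    elementHashes.foldLeft(init) { (result, h) => result * multiplier + h }
}
```

For the element hashes 1, 2, 3 the first variant yields 1026 and the second 1875607, so rows bucketed under one scheme cannot be located with the other.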
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/10435#discussion_r48537829 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala --- @@ -176,3 +179,223 @@ case class Crc32(child: Expression) extends UnaryExpression with ImplicitCastInp }) } } + +/** + * A function that calculates hash value for a group of expressions. + * + * The hash value for an expression depends on its type: + * - null: 0 + * - boolean:0 for true, 1 for false. --- End diff -- Oh sorry I forgot to update the PR description. According to the [discussion](https://github.com/apache/spark/pull/10435#discussion_r48482588), we decided to not follow hive for performance concerns. So here I followed the original hash code of internal row so that we don't need to fix a lot of tests.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-167698232 **[Test build #48388 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48388/consoleFull)** for PR 10435 at commit [`6311aa7`](https://github.com/apache/spark/commit/6311aa75a7a41fee8464ee96e5949ccad3e7d7a5).
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user cloud-fan commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-167583523 I need some help with the R test; I can't see the expected result in the error log...
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/10435#discussion_r48482588 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala --- @@ -176,3 +178,221 @@ case class Crc32(child: Expression) extends UnaryExpression with ImplicitCastInp }) } } + +/** + * A function that calculates hash value for a group of expressions. + * + * The hash value for an expression depends on its type: + * - null: 0 + * - boolean:1 for true, 0 for false. + * - byte, short, int: the input itself. + * - long: input XOR (input >>> 32) + * - float: java.lang.Float.floatToIntBits(input) + * - double: l = java.lang.Double.doubleToLongBits(input); l XOR (l >>> 32) + * - binary: java.util.Arrays.hashCode(input) + * - array: recursively calculate hash value for each element, and aggregate them by + *`result = result * 31 + elementHash` with an initial value `result = 0`. + * - map:recursively calculate hash value for each key-value pair, and aggregate + *them by `result += keyHash XOR valueHash`. + * - struct: similar to array, calculate hash value for each field and aggregate them. + * - other type: input.hashCode(). + *e.g. calculate hash value for string type by `UTF8String.hashCode()`. + * Finally we aggregate the hash values for each expression by `result = result * 31 + exprHash`. + * + * This hash algorithm follows hive's bucketing hash function, so that our bucketing function can + * be compatible with hive's, e.g. we can benefit from bucketing even the data source is mixed with + * hive tables. 
+ */ +case class Hash(children: Seq[Expression]) extends Expression { + + override def dataType: DataType = IntegerType + + override def foldable: Boolean = children.forall(_.foldable) + + override def nullable: Boolean = false + + override def eval(input: InternalRow): Any = { +var result = 0 +for (e <- children) { + val hashValue = computeHash(e.eval(input), e.dataType) + result = result * 31 + hashValue +} +result + } + + private def computeHash(v: Any, dataType: DataType): Int = v match { +case null => 0 +case b: Boolean => if (b) 1 else 0 +case b: Byte => b.toInt +case s: Short => s.toInt +case i: Int => i +case l: Long => (l ^ (l >>> 32)).toInt +case f: Float => java.lang.Float.floatToIntBits(f) +case d: Double => + val b = java.lang.Double.doubleToLongBits(d) + (b ^ (b >>> 32)).toInt +case a: Array[Byte] => java.util.Arrays.hashCode(a) +case s: UTF8String => s.toString.hashCode --- End diff -- To follow Hive, I convert `UTF8String` to `String` first and then call `hashCode`, but I'm a little worried about this: * This is definitely slower than just `UTF8String.hashCode`, and it is on a critical path that runs for every row during `Exchange`. Will it hurt performance? * Hive has string, varchar, and char types, and they have [different hash codes](https://github.com/apache/hive/blob/release-1.2.0/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorUtils.java#L527-L541), but we only have `StringType`, which makes it hard to match all of them. cc @nongli @yhuai @marmbrus @rxin
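The first concern can be made concrete with a small sketch. `java.lang.String.hashCode` is defined over UTF-16 chars, so reproducing it from UTF-8 bytes requires decoding into a new `String` for every value, whereas a hash computed directly on the bytes needs no allocation. `StringHashSketch` and both method names are hypothetical, and the byte-level polynomial is an illustration only; it is not claimed to match `UTF8String.hashCode`.

```scala
import java.nio.charset.StandardCharsets.UTF_8

object StringHashSketch {
  // Route taken in the diff: decode the UTF-8 bytes into a java.lang.String
  // (one allocation per call), then use String.hashCode.
  def decodeThenHash(utf8: Array[Byte]): Int = new String(utf8, UTF_8).hashCode

  // Allocation-free alternative: a 31-polynomial over the raw bytes.
  def hashBytes(utf8: Array[Byte]): Int = {
    var h = 0
    var i = 0
    while (i < utf8.length) {
      h = 31 * h + (utf8(i) & 0xff)
      i += 1
    }
    h
  }
}
```

For pure ASCII input such as `"spark"` the two functions agree, because each char is a single byte. For `"é"` (one char, two UTF-8 bytes) they diverge, which is why matching a `String`-based hash exactly forces the decoding step onto the per-row path.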
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-167614804 I see
```
1. Failure (at test_sparkSQL.R#1160): group by, agg functions --
30 not equal to collect(max(gd))[1, 2]
30 - 19 == 11

2. Failure (at test_sparkSQL.R#1635): crosstab() on a DataFrame
expected is not identical to ordered. Differences:
Names: 2 string mismatches
Component 2: Mean relative difference: 2
Component 3: Mean relative difference: 2
Error: Test failures
Execution halted
Had test failures; see logs.
```
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/10435#discussion_r48500689

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala ---
    @@ -176,3 +178,221 @@ case class Crc32(child: Expression) extends UnaryExpression with ImplicitCastInp
         })
       }
     }
    +
    +/**
    + * A function that calculates hash value for a group of expressions.
    + *
    + * The hash value for an expression depends on its type:
    + *  - null: 0
    + *  - boolean: 1 for true, 0 for false.
    + *  - byte, short, int: the input itself.
    + *  - long: input XOR (input >>> 32)
    + *  - float: java.lang.Float.floatToIntBits(input)
    + *  - double: l = java.lang.Double.doubleToLongBits(input); l XOR (l >>> 32)
    + *  - binary: java.util.Arrays.hashCode(input)
    + *  - array: recursively calculate hash value for each element, and aggregate them by
    + *    `result = result * 31 + elementHash` with an initial value `result = 0`.
    + *  - map: recursively calculate hash value for each key-value pair, and aggregate
    + *    them by `result += keyHash XOR valueHash`.
    + *  - struct: similar to array, calculate hash value for each field and aggregate them.
    + *  - other type: input.hashCode().
    + *    e.g. calculate hash value for string type by `UTF8String.hashCode()`.
    + * Finally we aggregate the hash values for each expression by `result = result * 31 + exprHash`.
    + *
    + * This hash algorithm follows hive's bucketing hash function, so that our bucketing function can
    + * be compatible with hive's, e.g. we can benefit from bucketing even the data source is mixed with
    + * hive tables.
    + */
    +case class Hash(children: Seq[Expression]) extends Expression {
    +
    +  override def dataType: DataType = IntegerType
    +
    +  override def foldable: Boolean = children.forall(_.foldable)
    +
    +  override def nullable: Boolean = false
    +
    +  override def eval(input: InternalRow): Any = {
    +    var result = 0
    +    for (e <- children) {
    +      val hashValue = computeHash(e.eval(input), e.dataType)
    +      result = result * 31 + hashValue
    +    }
    +    result
    +  }
    +
    +  private def computeHash(v: Any, dataType: DataType): Int = v match {
    +    case null => 0
    +    case b: Boolean => if (b) 1 else 0
    +    case b: Byte => b.toInt
    +    case s: Short => s.toInt
    +    case i: Int => i
    +    case l: Long => (l ^ (l >>> 32)).toInt
    +    case f: Float => java.lang.Float.floatToIntBits(f)
    +    case d: Double =>
    +      val b = java.lang.Double.doubleToLongBits(d)
    +      (b ^ (b >>> 32)).toInt
    +    case a: Array[Byte] => java.util.Arrays.hashCode(a)
    +    case s: UTF8String => s.toString.hashCode
    --- End diff --

Definitely don't do toString. Any particular reason why we need to match Hive's hash code?
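One way to read the "don't do toString" remark is to hash the UTF-8 bytes directly instead of decoding them into a `String` first. The following is a minimal sketch under that assumption; `Utf8BytesHash` and `hashBytes` are hypothetical names for illustration and are not part of Spark or of this patch. For pure-ASCII input the result coincides with `String.hashCode`; for other input it generally differs, so whether it matches Hive's string hash would have to be checked against Hive itself.

```scala
import java.nio.charset.StandardCharsets.UTF_8

object Utf8BytesHash {
  // Hypothetical helper: 31-based polynomial hash over the raw UTF-8 bytes,
  // so no intermediate String has to be allocated.
  def hashBytes(bytes: Array[Byte]): Int = {
    var result = 0
    var i = 0
    while (i < bytes.length) {
      // JVM bytes are signed, so bytes >= 0x80 contribute negative values.
      result = result * 31 + bytes(i)
      i += 1
    }
    result
  }
}
```

For example, `Utf8BytesHash.hashBytes("spark".getBytes(UTF_8))` gives the same value as `"spark".hashCode`, because every character in that string encodes to a single non-negative byte.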
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/10435#discussion_r48500881

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala ---
    @@ -176,3 +178,221 @@ case class Crc32(child: Expression) extends UnaryExpression with ImplicitCastInp
         })
       }
     }
    +
    +/**
    + * A function that calculates hash value for a group of expressions.
    + *
    + * The hash value for an expression depends on its type:
    + *  - null: 0
    + *  - boolean: 1 for true, 0 for false.
    + *  - byte, short, int: the input itself.
    + *  - long: input XOR (input >>> 32)
    + *  - float: java.lang.Float.floatToIntBits(input)
    + *  - double: l = java.lang.Double.doubleToLongBits(input); l XOR (l >>> 32)
    + *  - binary: java.util.Arrays.hashCode(input)
    + *  - array: recursively calculate hash value for each element, and aggregate them by
    + *    `result = result * 31 + elementHash` with an initial value `result = 0`.
    + *  - map: recursively calculate hash value for each key-value pair, and aggregate
    + *    them by `result += keyHash XOR valueHash`.
    + *  - struct: similar to array, calculate hash value for each field and aggregate them.
    + *  - other type: input.hashCode().
    + *    e.g. calculate hash value for string type by `UTF8String.hashCode()`.
    + * Finally we aggregate the hash values for each expression by `result = result * 31 + exprHash`.
    + *
    + * This hash algorithm follows hive's bucketing hash function, so that our bucketing function can
    + * be compatible with hive's, e.g. we can benefit from bucketing even the data source is mixed with
    + * hive tables.
    + */
    +case class Hash(children: Seq[Expression]) extends Expression {
    +
    +  override def dataType: DataType = IntegerType
    +
    +  override def foldable: Boolean = children.forall(_.foldable)
    +
    +  override def nullable: Boolean = false
    +
    +  override def eval(input: InternalRow): Any = {
    +    var result = 0
    +    for (e <- children) {
    +      val hashValue = computeHash(e.eval(input), e.dataType)
    +      result = result * 31 + hashValue
    +    }
    +    result
    +  }
    +
    +  private def computeHash(v: Any, dataType: DataType): Int = v match {
    +    case null => 0
    +    case b: Boolean => if (b) 1 else 0
    +    case b: Byte => b.toInt
    +    case s: Short => s.toInt
    +    case i: Int => i
    +    case l: Long => (l ^ (l >>> 32)).toInt
    +    case f: Float => java.lang.Float.floatToIntBits(f)
    +    case d: Double =>
    +      val b = java.lang.Double.doubleToLongBits(d)
    +      (b ^ (b >>> 32)).toInt
    +    case a: Array[Byte] => java.util.Arrays.hashCode(a)
    +    case s: UTF8String => s.toString.hashCode
    --- End diff --

We have to match Hive's hashcode if we want to be able to join data Hive has bucketed with our own data. +1 to avoiding toString. We should also avoid boxing and runtime type reflection for the hash code (which this function is doing).
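To illustrate what avoiding boxing and runtime type matching could look like, here is a minimal sketch that picks a hasher per column once, from the schema, and then reads primitives through typed getters. All names (`TypedHash`, `ColType`, `Row`, `ColumnHasher`, `compile`) are hypothetical stand-ins; in Spark the corresponding pieces would be `DataType`, `InternalRow` and generated code, and this is not how the patch implements it.

```scala
object TypedHash {
  // Hypothetical column types, standing in for Spark's DataType.
  sealed trait ColType
  case object IntCol extends ColType
  case object LongCol extends ColType
  case object DoubleCol extends ColType

  // Hypothetical row with primitive getters, standing in for InternalRow.
  trait Row {
    def getInt(ordinal: Int): Int
    def getLong(ordinal: Int): Long
    def getDouble(ordinal: Int): Double
  }

  // A non-generic interface, so the Int result is never boxed.
  trait ColumnHasher {
    def hash(row: Row): Int
  }

  def hashLong(l: Long): Int = (l ^ (l >>> 32)).toInt

  // Dispatch on the declared column type, once per column.
  private def hasherFor(t: ColType, i: Int): ColumnHasher = t match {
    case IntCol => new ColumnHasher { def hash(row: Row): Int = row.getInt(i) }
    case LongCol => new ColumnHasher { def hash(row: Row): Int = hashLong(row.getLong(i)) }
    case DoubleCol => new ColumnHasher {
      def hash(row: Row): Int = hashLong(java.lang.Double.doubleToLongBits(row.getDouble(i)))
    }
  }

  // Build the row hasher up front; per-row work is a loop over typed getters.
  def compile(schema: Seq[ColType]): ColumnHasher = {
    val hashers: Array[ColumnHasher] =
      schema.zipWithIndex.map { case (t, i) => hasherFor(t, i) }.toArray
    new ColumnHasher {
      def hash(row: Row): Int = {
        var result = 0
        var i = 0
        while (i < hashers.length) {
          result = result * 31 + hashers(i).hash(row)
          i += 1
        }
        result
      }
    }
  }
}
```

The type dispatch happens once in `compile`, so the per-row path contains no pattern match on `Any` and no boxed values. Generated code would go further by inlining the getters, which this sketch does not attempt.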
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/10435#discussion_r48502287

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala ---
    @@ -176,3 +178,221 @@ case class Crc32(child: Expression) extends UnaryExpression with ImplicitCastInp
         })
       }
     }
    +
    +/**
    + * A function that calculates hash value for a group of expressions.
    + *
    + * The hash value for an expression depends on its type:
    + *  - null: 0
    + *  - boolean: 1 for true, 0 for false.
    + *  - byte, short, int: the input itself.
    + *  - long: input XOR (input >>> 32)
    + *  - float: java.lang.Float.floatToIntBits(input)
    + *  - double: l = java.lang.Double.doubleToLongBits(input); l XOR (l >>> 32)
    + *  - binary: java.util.Arrays.hashCode(input)
    + *  - array: recursively calculate hash value for each element, and aggregate them by
    + *    `result = result * 31 + elementHash` with an initial value `result = 0`.
    + *  - map: recursively calculate hash value for each key-value pair, and aggregate
    + *    them by `result += keyHash XOR valueHash`.
    + *  - struct: similar to array, calculate hash value for each field and aggregate them.
    + *  - other type: input.hashCode().
    + *    e.g. calculate hash value for string type by `UTF8String.hashCode()`.
    + * Finally we aggregate the hash values for each expression by `result = result * 31 + exprHash`.
    + *
    + * This hash algorithm follows hive's bucketing hash function, so that our bucketing function can
    + * be compatible with hive's, e.g. we can benefit from bucketing even the data source is mixed with
    + * hive tables.
    + */
    +case class Hash(children: Seq[Expression]) extends Expression {
    +
    +  override def dataType: DataType = IntegerType
    +
    +  override def foldable: Boolean = children.forall(_.foldable)
    +
    +  override def nullable: Boolean = false
    +
    +  override def eval(input: InternalRow): Any = {
    +    var result = 0
    +    for (e <- children) {
    +      val hashValue = computeHash(e.eval(input), e.dataType)
    +      result = result * 31 + hashValue
    +    }
    +    result
    +  }
    +
    +  private def computeHash(v: Any, dataType: DataType): Int = v match {
    +    case null => 0
    +    case b: Boolean => if (b) 1 else 0
    +    case b: Byte => b.toInt
    +    case s: Short => s.toInt
    +    case i: Int => i
    +    case l: Long => (l ^ (l >>> 32)).toInt
    +    case f: Float => java.lang.Float.floatToIntBits(f)
    +    case d: Double =>
    +      val b = java.lang.Double.doubleToLongBits(d)
    +      (b ^ (b >>> 32)).toInt
    +    case a: Array[Byte] => java.util.Arrays.hashCode(a)
    +    case s: UTF8String => s.toString.hashCode
    --- End diff --

It seems to me we don't need to follow Hive's hash code, but we should make sure we don't pick the plan if it's a table bucketed by Hive.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-167710232 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-167710215

**[Test build #48388 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48388/consoleFull)** for PR 10435 at commit [`6311aa7`](https://github.com/apache/spark/commit/6311aa75a7a41fee8464ee96e5949ccad3e7d7a5).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-167710233 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48388/ Test FAILed.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user viirya commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-167713829 Actually, I already created a PR, #9883, for this a long time ago...
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-167727788 **[Test build #48399 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48399/consoleFull)** for PR 10435 at commit [`84902bb`](https://github.com/apache/spark/commit/84902bb2b9488212e0006bf359135073b8a9e496).
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/10435#discussion_r48516214

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala ---
    @@ -176,3 +178,221 @@ case class Crc32(child: Expression) extends UnaryExpression with ImplicitCastInp
         })
       }
     }
    +
    +/**
    + * A function that calculates hash value for a group of expressions.
    + *
    + * The hash value for an expression depends on its type:
    + *  - null: 0
    + *  - boolean: 1 for true, 0 for false.
    + *  - byte, short, int: the input itself.
    + *  - long: input XOR (input >>> 32)
    + *  - float: java.lang.Float.floatToIntBits(input)
    + *  - double: l = java.lang.Double.doubleToLongBits(input); l XOR (l >>> 32)
    + *  - binary: java.util.Arrays.hashCode(input)
    + *  - array: recursively calculate hash value for each element, and aggregate them by
    + *    `result = result * 31 + elementHash` with an initial value `result = 0`.
    + *  - map: recursively calculate hash value for each key-value pair, and aggregate
    + *    them by `result += keyHash XOR valueHash`.
    + *  - struct: similar to array, calculate hash value for each field and aggregate them.
    + *  - other type: input.hashCode().
    + *    e.g. calculate hash value for string type by `UTF8String.hashCode()`.
    + * Finally we aggregate the hash values for each expression by `result = result * 31 + exprHash`.
    + *
    + * This hash algorithm follows hive's bucketing hash function, so that our bucketing function can
    + * be compatible with hive's, e.g. we can benefit from bucketing even the data source is mixed with
    + * hive tables.
    + */
    +case class Hash(children: Seq[Expression]) extends Expression {
    +
    +  override def dataType: DataType = IntegerType
    +
    +  override def foldable: Boolean = children.forall(_.foldable)
    +
    +  override def nullable: Boolean = false
    +
    +  override def eval(input: InternalRow): Any = {
    +    var result = 0
    +    for (e <- children) {
    +      val hashValue = computeHash(e.eval(input), e.dataType)
    +      result = result * 31 + hashValue
    +    }
    +    result
    +  }
    +
    +  private def computeHash(v: Any, dataType: DataType): Int = v match {
    +    case null => 0
    +    case b: Boolean => if (b) 1 else 0
    +    case b: Byte => b.toInt
    +    case s: Short => s.toInt
    +    case i: Int => i
    +    case l: Long => (l ^ (l >>> 32)).toInt
    +    case f: Float => java.lang.Float.floatToIntBits(f)
    +    case d: Double =>
    +      val b = java.lang.Double.doubleToLongBits(d)
    +      (b ^ (b >>> 32)).toInt
    +    case a: Array[Byte] => java.util.Arrays.hashCode(a)
    +    case s: UTF8String => s.toString.hashCode
    --- End diff --

@marmbrus I think we will only use the codegen version. Should we remove this branch and throw an exception if it's called?
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/10435#discussion_r48516332

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala ---
    @@ -176,3 +178,221 @@ case class Crc32(child: Expression) extends UnaryExpression with ImplicitCastInp
         })
       }
     }
    +
    +/**
    + * A function that calculates hash value for a group of expressions.
    + *
    + * The hash value for an expression depends on its type:
    + *  - null: 0
    + *  - boolean: 1 for true, 0 for false.
    + *  - byte, short, int: the input itself.
    + *  - long: input XOR (input >>> 32)
    + *  - float: java.lang.Float.floatToIntBits(input)
    + *  - double: l = java.lang.Double.doubleToLongBits(input); l XOR (l >>> 32)
    + *  - binary: java.util.Arrays.hashCode(input)
    + *  - array: recursively calculate hash value for each element, and aggregate them by
    + *    `result = result * 31 + elementHash` with an initial value `result = 0`.
    + *  - map: recursively calculate hash value for each key-value pair, and aggregate
    + *    them by `result += keyHash XOR valueHash`.
    + *  - struct: similar to array, calculate hash value for each field and aggregate them.
    + *  - other type: input.hashCode().
    + *    e.g. calculate hash value for string type by `UTF8String.hashCode()`.
    + * Finally we aggregate the hash values for each expression by `result = result * 31 + exprHash`.
    + *
    + * This hash algorithm follows hive's bucketing hash function, so that our bucketing function can
    + * be compatible with hive's, e.g. we can benefit from bucketing even the data source is mixed with
    + * hive tables.
    + */
    +case class Hash(children: Seq[Expression]) extends Expression {
    +
    +  override def dataType: DataType = IntegerType
    +
    +  override def foldable: Boolean = children.forall(_.foldable)
    +
    +  override def nullable: Boolean = false
    +
    +  override def eval(input: InternalRow): Any = {
    +    var result = 0
    +    for (e <- children) {
    +      val hashValue = computeHash(e.eval(input), e.dataType)
    +      result = result * 31 + hashValue
    +    }
    +    result
    +  }
    +
    +  private def computeHash(v: Any, dataType: DataType): Int = v match {
    +    case null => 0
    +    case b: Boolean => if (b) 1 else 0
    +    case b: Byte => b.toInt
    +    case s: Short => s.toInt
    +    case i: Int => i
    +    case l: Long => (l ^ (l >>> 32)).toInt
    +    case f: Float => java.lang.Float.floatToIntBits(f)
    +    case d: Double =>
    +      val b = java.lang.Double.doubleToLongBits(d)
    +      (b ^ (b >>> 32)).toInt
    +    case a: Array[Byte] => java.util.Arrays.hashCode(a)
    +    case s: UTF8String => s.toString.hashCode
    --- End diff --

Oh, I see. It's probably fine to have this as a fallback.
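For reference, the per-type rules listed in the quoted doc comment can be written out as a standalone interpreted function. This is a minimal sketch that uses plain Scala values and collections in place of Catalyst's internal data (`Seq` stands in for both array and struct); `HiveStyleHash`, `hashValue` and `hashRow` are hypothetical names. Like the `eval` path under review, it matches on boxed values, so it only suits the role of a fallback.

```scala
object HiveStyleHash {
  // Hash of a single value, following the rules in the doc comment.
  def hashValue(v: Any): Int = v match {
    case null => 0
    case b: Boolean => if (b) 1 else 0
    case b: Byte => b.toInt
    case s: Short => s.toInt
    case i: Int => i
    case l: Long => (l ^ (l >>> 32)).toInt
    case f: Float => java.lang.Float.floatToIntBits(f)
    case d: Double =>
      val bits = java.lang.Double.doubleToLongBits(d)
      (bits ^ (bits >>> 32)).toInt
    case a: Array[Byte] => java.util.Arrays.hashCode(a)
    // array / struct: result = result * 31 + elementHash, starting from 0
    case xs: Seq[_] => xs.foldLeft(0)((acc, x) => acc * 31 + hashValue(x))
    // map: result += keyHash XOR valueHash
    case m: Map[_, _] =>
      m.foldLeft(0) { case (acc, (k, value)) => acc + (hashValue(k) ^ hashValue(value)) }
    case other => other.hashCode()
  }

  // Combine the hashes of all expressions in a row: result = result * 31 + exprHash
  def hashRow(values: Seq[Any]): Int =
    values.foldLeft(0)((acc, v) => acc * 31 + hashValue(v))
}
```

For instance, `HiveStyleHash.hashRow(Seq(true, 5))` evaluates to `(0 * 31 + 1) * 31 + 5`, i.e. 36.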
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-167737270 **[Test build #48404 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48404/consoleFull)** for PR 10435 at commit [`1cdb2bc`](https://github.com/apache/spark/commit/1cdb2bcc1ede58fdd9c1e98bff4b5544b8a6e74e).
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-167730354 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48399/ Test FAILed.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-167730331

**[Test build #48399 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48399/consoleFull)** for PR 10435 at commit [`84902bb`](https://github.com/apache/spark/commit/84902bb2b9488212e0006bf359135073b8a9e496).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-167730352 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-167399420 **[Test build #48352 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48352/consoleFull)** for PR 10435 at commit [`a629e75`](https://github.com/apache/spark/commit/a629e754be3c087466de9f3b0aa8634e8d640ea0).
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-167403074

**[Test build #48352 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48352/consoleFull)** for PR 10435 at commit [`a629e75`](https://github.com/apache/spark/commit/a629e754be3c087466de9f3b0aa8634e8d640ea0).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `case class Hash(children: Seq[Expression]) extends Expression`
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-167403091 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48352/ Test FAILed.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-167403090 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user cloud-fan commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-167395236 retest this please.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-167396600 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-167396598

**[Test build #48349 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48349/consoleFull)** for PR 10435 at commit [`c130097`](https://github.com/apache/spark/commit/c130097e3f12a0d45eed2e2d5eb6682d65f11c9a).
 * This patch **fails Python style tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `case class Hash(children: Seq[Expression]) extends Expression`
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-167396601 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48349/ Test FAILed.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-167396554 **[Test build #48349 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48349/consoleFull)** for PR 10435 at commit [`c130097`](https://github.com/apache/spark/commit/c130097e3f12a0d45eed2e2d5eb6682d65f11c9a).
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-167396643 **[Test build #48350 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48350/consoleFull)** for PR 10435 at commit [`c130097`](https://github.com/apache/spark/commit/c130097e3f12a0d45eed2e2d5eb6682d65f11c9a).
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-167396684

**[Test build #48350 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48350/consoleFull)** for PR 10435 at commit [`c130097`](https://github.com/apache/spark/commit/c130097e3f12a0d45eed2e2d5eb6682d65f11c9a).
 * This patch **fails Python style tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `case class Hash(children: Seq[Expression]) extends Expression`
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-167396686 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48350/ Test FAILed.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-167396685 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-167447980 **[Test build #48356 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48356/consoleFull)** for PR 10435 at commit [`655800c`](https://github.com/apache/spark/commit/655800cd72e5fabc256cd64a82f6ccd60b491f92).
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-167452745 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48356/ Test FAILed.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-167452744 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-167452730 **[Test build #48356 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48356/consoleFull)** for PR 10435 at commit [`655800c`](https://github.com/apache/spark/commit/655800cd72e5fabc256cd64a82f6ccd60b491f92). * This patch **fails SparkR unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class Hash(children: Seq[Expression]) extends Expression`
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-167338506 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-167338507 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48346/ Test FAILed.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-167338499 **[Test build #48346 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48346/consoleFull)** for PR 10435 at commit [`303b69b`](https://github.com/apache/spark/commit/303b69b243fb2ac5f79c948c0008592b8c57fc25). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class Hash(children: Seq[Expression]) extends Expression`
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-167328998 **[Test build #48346 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48346/consoleFull)** for PR 10435 at commit [`303b69b`](https://github.com/apache/spark/commit/303b69b243fb2ac5f79c948c0008592b8c57fc25).
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-167329447 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48345/ Test FAILed.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-167329445 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-167246248 **[Test build #48325 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48325/consoleFull)** for PR 10435 at commit [`04a7301`](https://github.com/apache/spark/commit/04a730154283a6125f05ed984115adf2e455ac60).
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-167251759 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48325/ Test FAILed.
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-167251754 **[Test build #48325 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48325/consoleFull)** for PR 10435 at commit [`04a7301`](https://github.com/apache/spark/commit/04a730154283a6125f05ed984115adf2e455ac60). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class Hash(children: Seq[Expression]) extends Expression`
[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10435#issuecomment-167251758 Merged build finished. Test FAILed.