[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2016-01-05 Thread tedyu
Github user tedyu commented on a diff in the pull request:

https://github.com/apache/spark/pull/10435#discussion_r48869485
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -1863,6 +1863,17 @@ object functions extends LegacyFunctions {
*/
   def crc32(e: Column): Column = withExpr { Crc32(e.expr) }
 
+  /**
+   * Calculates the hash code of given columns, and returns the result as 
a int column.
+   *
+   * @group misc_funcs
+   * @since 2.0
+   */
+  @scala.annotation.varargs
+  def hash(col: Column, cols: Column*): Column = withExpr {
--- End diff --

You can use the following form:
(firstarg:Int)(more:Int*)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2016-01-04 Thread nongli
Github user nongli commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-168801707
  
LGTM


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2016-01-04 Thread nongli
Github user nongli commented on a diff in the pull request:

https://github.com/apache/spark/pull/10435#discussion_r48778819
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala
 ---
@@ -177,3 +179,44 @@ case class Crc32(child: Expression) extends 
UnaryExpression with ImplicitCastInp
 })
   }
 }
+
+/**
+ * A function that calculates hash value for a group of expressions.
+ *
+ * Internally this function will write arguments into an [[UnsafeRow]], 
and calculate hash code of
+ * the unsafe row using murmur3 hasher with a seed.
+ * We should use this hash function for both shuffle and bucket, so that 
we can guarantee shuffle
+ * and bucketing have same data distribution.
+ */
+case class Murmur3Hash(children: Seq[Expression], seed: Int) extends 
Expression {
+  def this(arguments: Seq[Expression]) = this(arguments, 42)
--- End diff --

I think this is fine.

Can you file a follow up jira to look at this again? I think we want to 
remove the projection to unsafe row soon (before we ship this and persist 
metadata that way). This should be decoupled from unsafe row ideally. For 
example, if the row is (int, double, string): the generated hash function 
shoudl be something like

int hash = seed;
hash = murmur3(getInt(0), hash)
hash = murmur3(getDouble(1), hash)
hash = murmur3(getString(2), hash)
return hash

This is likely not the currently computed hash value so can't defer this 
for too long.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2016-01-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-168614902
  
**[Test build #48644 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48644/consoleFull)**
 for PR 10435 at commit 
[`aa57583`](https://github.com/apache/spark/commit/aa575834877bfa8856aa4cea6f6149fe08cb18b5).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2016-01-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-168614906
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2016-01-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-168614909
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48644/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2016-01-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-168652917
  
**[Test build #48656 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48656/consoleFull)**
 for PR 10435 at commit 
[`2c1e963`](https://github.com/apache/spark/commit/2c1e963d5d1e0c286f968ae6278785b5d9586b11).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2016-01-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-168614140
  
**[Test build #48644 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48644/consoleFull)**
 for PR 10435 at commit 
[`aa57583`](https://github.com/apache/spark/commit/aa575834877bfa8856aa4cea6f6149fe08cb18b5).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2016-01-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/10435#discussion_r48808255
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -1863,6 +1863,17 @@ object functions extends LegacyFunctions {
*/
   def crc32(e: Column): Column = withExpr { Crc32(e.expr) }
 
+  /**
+   * Calculates the hash code of given columns, and returns the result as 
a int column.
+   *
+   * @group misc_funcs
+   * @since 2.0
+   */
+  @scala.annotation.varargs
+  def hash(col: Column, cols: Column*): Column = withExpr {
--- End diff --

the hash function should take at least one parameter, is 
`@scala.annotation.varargs` support this?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2016-01-04 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-168877190
  
I've merged this. You can address the API comment in the next pull request. 
Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2016-01-04 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/10435#discussion_r48808153
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -1863,6 +1863,17 @@ object functions extends LegacyFunctions {
*/
   def crc32(e: Column): Column = withExpr { Crc32(e.expr) }
 
+  /**
+   * Calculates the hash code of given columns, and returns the result as 
a int column.
+   *
+   * @group misc_funcs
+   * @since 2.0
+   */
+  @scala.annotation.varargs
+  def hash(col: Column, cols: Column*): Column = withExpr {
--- End diff --

this should just be  a single column vararg, rather than one followed by 
vararg?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2016-01-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-168856535
  
**[Test build #48699 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48699/consoleFull)**
 for PR 10435 at commit 
[`9a978c4`](https://github.com/apache/spark/commit/9a978c411e9799be1f3b76e6ea20dba281efc4e7).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2016-01-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-168876234
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48699/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2016-01-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-168876055
  
**[Test build #48699 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48699/consoleFull)**
 for PR 10435 at commit 
[`9a978c4`](https://github.com/apache/spark/commit/9a978c411e9799be1f3b76e6ea20dba281efc4e7).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2016-01-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-168876233
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2016-01-04 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/10435


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2016-01-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-168674438
  
**[Test build #48656 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48656/consoleFull)**
 for PR 10435 at commit 
[`2c1e963`](https://github.com/apache/spark/commit/2c1e963d5d1e0c286f968ae6278785b5d9586b11).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2016-01-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-168674649
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2016-01-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-168674651
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48656/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2016-01-04 Thread cloud-fan
Github user cloud-fan commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-168700418
  
I think we should open another PR to use this hash expression in 
`Exchange`, as it will break a lof of tests and make it harder to review.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2016-01-03 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/10435#discussion_r48705398
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
 ---
@@ -278,6 +278,7 @@ object FunctionRegistry {
 // misc functions
 expression[Crc32]("crc32"),
 expression[Md5]("md5"),
+expression[Murmur3Hash]("murmur3_hash"),
--- End diff --

sounds good to me, let me try it out.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2016-01-03 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/10435#discussion_r48703350
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
 ---
@@ -278,6 +278,7 @@ object FunctionRegistry {
 // misc functions
 expression[Crc32]("crc32"),
 expression[Md5]("md5"),
+expression[Murmur3Hash]("murmur3_hash"),
--- End diff --


https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48241/consoleFull

most of them is testing something else but coincidently include hash 
expression.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2016-01-03 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/10435#discussion_r48705356
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
 ---
@@ -278,6 +278,7 @@ object FunctionRegistry {
 // misc functions
 expression[Crc32]("crc32"),
 expression[Md5]("md5"),
+expression[Murmur3Hash]("murmur3_hash"),
--- End diff --

maybe we can have a flag to control this -- when in hive compatibility 
test, fall back to Hive's, and otherwise our own?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2016-01-01 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/10435#discussion_r48676826
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
 ---
@@ -278,6 +278,7 @@ object FunctionRegistry {
 // misc functions
 expression[Crc32]("crc32"),
 expression[Md5]("md5"),
+expression[Murmur3Hash]("murmur3_hash"),
--- End diff --

can you give me a list? i think we should consider just blacklisting them 
...



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-168125552
  
**[Test build #48542 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48542/consoleFull)**
 for PR 10435 at commit 
[`b95e64e`](https://github.com/apache/spark/commit/b95e64ee0f539770534fe302321e677e3fad4a7e).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-168134147
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48542/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-168134146
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-168123375
  
**[Test build #48540 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48540/consoleFull)**
 for PR 10435 at commit 
[`61783e7`](https://github.com/apache/spark/commit/61783e7bb9ea9c6f622d18db06dc85d71e6443a8).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-30 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/10435#discussion_r48646962
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
 ---
@@ -278,6 +278,7 @@ object FunctionRegistry {
 // misc functions
 expression[Crc32]("crc32"),
 expression[Md5]("md5"),
+expression[Murmur3Hash]("murmur3_hash"),
--- End diff --

35 tests


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-30 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/10435#discussion_r48646170
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
 ---
@@ -278,6 +278,7 @@ object FunctionRegistry {
 // misc functions
 expression[Crc32]("crc32"),
 expression[Md5]("md5"),
+expression[Murmur3Hash]("murmur3_hash"),
--- End diff --

Here I didn't use `hash` for the name, as it will break a lot of hive 
compatibility tests.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-168128676
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48540/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-168128648
  
**[Test build #48540 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48540/consoleFull)**
 for PR 10435 at commit 
[`61783e7`](https://github.com/apache/spark/commit/61783e7bb9ea9c6f622d18db06dc85d71e6443a8).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class Murmur3Hash(children: Seq[Expression], seed: Int) extends 
Expression `


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-168128675
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-30 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/10435#discussion_r48646183
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala
 ---
@@ -177,3 +179,44 @@ case class Crc32(child: Expression) extends 
UnaryExpression with ImplicitCastInp
 })
   }
 }
+
+/**
+ * A function that calculates hash value for a group of expressions.
+ *
+ * Internally this function will write arguments into an [[UnsafeRow]], 
and calculate hash code of
+ * the unsafe row using murmur3 hasher with a seed.
+ * We should use this hash function for both shuffle and bucket, so that 
we can guarantee shuffle
+ * and bucketing have same data distribution.
+ */
+case class Murmur3Hash(children: Seq[Expression], seed: Int) extends 
Expression {
+  def this(arguments: Seq[Expression]) = this(arguments, 42)
--- End diff --

use 42 as default seed, which is same with `UnsafeRow.hashCode`, should we 
make `42` a constant variable in `Murmur3_x86_32`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-30 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/10435#discussion_r48646861
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
 ---
@@ -278,6 +278,7 @@ object FunctionRegistry {
 // misc functions
 expression[Crc32]("crc32"),
 expression[Md5]("md5"),
+expression[Murmur3Hash]("murmur3_hash"),
--- End diff --

How many does it break?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-30 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-168047956
  
On the contrary I think we should consider not using hash code on an 
object, and always use hash code expression for two reasons:

1. We still need a hash function

2. We get code gen using an expression

3. It is easier to control (being able to pass a seed or use it for bloom 
filters)



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-30 Thread nongli
Github user nongli commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-168072275
  
It makes sense to still have a Hash expression (called more specifically, 
Mumur3Hash) that does what this patch originally intended. I think this will be 
a useful primitive. The underlying implementation can just use 
UnsafeRow.hashCode for now.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-30 Thread cloud-fan
GitHub user cloud-fan reopened a pull request:

https://github.com/apache/spark/pull/10435

[SPARK-12480][SQL] add Hash expression that can calculate hash value for a 
group of expressions

The hash algorithm is based on 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/rows.scala#L130-L158
Also use this expression in `Exchange`.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cloud-fan/spark hash-expr

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/10435.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #10435


commit c8b0ea65b147c898467c00d551455abae74eddf0
Author: Wenchen Fan 
Date:   2015-12-22T15:22:59Z

add hash expression

commit 8a89287ed42aadde3d51d95c389ddb1e98c3ccf2
Author: Wenchen Fan 
Date:   2015-12-23T11:48:37Z

address comments

commit 1cdb2bcc1ede58fdd9c1e98bff4b5544b8a6e74e
Author: Wenchen Fan 
Date:   2015-12-24T01:28:54Z

fix hash code

commit 8703b1a127235c49614d326334548f125b81383b
Author: Wenchen Fan 
Date:   2015-12-30T01:58:54Z

add comment to explain the benefit of hash expression




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-168134068
  
**[Test build #48542 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48542/consoleFull)**
 for PR 10435 at commit 
[`b95e64e`](https://github.com/apache/spark/commit/b95e64ee0f539770534fe302321e677e3fad4a7e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-30 Thread cloud-fan
Github user cloud-fan commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-168017519
  
Closing, will open another PR to use `UnsafeRow.hashCode` for shuffle and 
fix tests.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-30 Thread cloud-fan
Github user cloud-fan closed the pull request at:

https://github.com/apache/spark/pull/10435


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-29 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10435#discussion_r48554681
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala
 ---
@@ -176,3 +179,223 @@ case class Crc32(child: Expression) extends 
UnaryExpression with ImplicitCastInp
 })
   }
 }
+
+/**
+ * A function that calculates hash value for a group of expressions.
+ *
+ * The hash value for an expression depends on its type:
+ *  - null:   0
+ *  - boolean:0 for true, 1 for false.
--- End diff --

Let's also add comments to explain the benefit of this function.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-167918200
  
**[Test build #48442 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48442/consoleFull)**
 for PR 10435 at commit 
[`8703b1a`](https://github.com/apache/spark/commit/8703b1a127235c49614d326334548f125b81383b).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-29 Thread nongli
Github user nongli commented on a diff in the pull request:

https://github.com/apache/spark/pull/10435#discussion_r48588221
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala
 ---
@@ -176,3 +179,229 @@ case class Crc32(child: Expression) extends 
UnaryExpression with ImplicitCastInp
 })
   }
 }
+
+/**
+ * A function that calculates hash value for a group of expressions.
+ *
+ * The hash value for an expression depends on its type:
+ *  - null:   0
+ *  - boolean:0 for true, 1 for false.
+ *  - byte, short, int:   the input itself.
+ *  - long:   input XOR (input >>> 32)
+ *  - float:  java.lang.Float.floatToIntBits(input)
+ *  - double: l = java.lang.Double.doubleToLongBits(input); l 
XOR (l >>> 32)
+ *  - binary: java.util.Arrays.hashCode(input)
+ *  - array:  recursively calculate hash value for each 
element, and aggregate them by
+ *`result = result * 37 + elementHash` with an 
initial value `result = 37`.
+ *  - map:recursively calculate hash value for each 
key-value pair, and aggregate
+ *them by `result += keyHash XOR valueHash`.
+ *  - struct: similar to array, calculate hash value for each 
field and aggregate them.
+ *  - other type: input.hashCode().
+ *e.g. calculate hash value for string type by 
`UTF8String.hashCode()`.
+ * Finally we aggregate the hash values for each expression by the same 
way of array.
+ *
+ * This hash algorithm is basically same with 
`GenericInternalRow.hashCode`, but using this hash
+ * expression is better as it can produce consistent hash values between 
safe and unsafe data
+ * structure, and can be slightly faster by codegen.
+ * It's also the hash function for both shuffle and bucketing, so that we 
can guarantee shuffle and
+ * bucketing have same data distribution.
+ */
+case class Hash(children: Seq[Expression]) extends Expression {
--- End diff --

What if we just turned this into Mumur3Hash instead?

This would just do UnsafeProjection.create()
project(input).hashCode()

Murmur3 will give us much nicer hashing properties. The current hash 
function can be bad in reasonable cases. 

For example, if the long column is a timestamp in milis from a source that 
samples every second. Most of the low digits will be similar (e.g. values are 
1000, 2002, 2999, etc. Very few that end in 500). The hash function does a very 
bad job of breaking this up and this will generate some very skewed partitions.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-29 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/10435#discussion_r48589312
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala
 ---
@@ -176,3 +179,229 @@ case class Crc32(child: Expression) extends 
UnaryExpression with ImplicitCastInp
 })
   }
 }
+
+/**
+ * A function that calculates hash value for a group of expressions.
+ *
+ * The hash value for an expression depends on its type:
+ *  - null:   0
+ *  - boolean:0 for true, 1 for false.
+ *  - byte, short, int:   the input itself.
+ *  - long:   input XOR (input >>> 32)
+ *  - float:  java.lang.Float.floatToIntBits(input)
+ *  - double: l = java.lang.Double.doubleToLongBits(input); l 
XOR (l >>> 32)
+ *  - binary: java.util.Arrays.hashCode(input)
+ *  - array:  recursively calculate hash value for each 
element, and aggregate them by
+ *`result = result * 37 + elementHash` with an 
initial value `result = 37`.
+ *  - map:recursively calculate hash value for each 
key-value pair, and aggregate
+ *them by `result += keyHash XOR valueHash`.
+ *  - struct: similar to array, calculate hash value for each 
field and aggregate them.
+ *  - other type: input.hashCode().
+ *e.g. calculate hash value for string type by 
`UTF8String.hashCode()`.
+ * Finally we aggregate the hash values for each expression by the same 
way of array.
+ *
+ * This hash algorithm is basically same with 
`GenericInternalRow.hashCode`, but using this hash
+ * expression is better as it can produce consistent hash values between 
safe and unsafe data
+ * structure, and can be slightly faster by codegen.
+ * It's also the hash function for both shuffle and bucketing, so that we 
can guarantee shuffle and
+ * bucketing have same data distribution.
+ */
+case class Hash(children: Seq[Expression]) extends Expression {
--- End diff --

good point!
after decided to not follow hive, I agree Mumur3Hash is a better choice.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-29 Thread nongli
Github user nongli commented on a diff in the pull request:

https://github.com/apache/spark/pull/10435#discussion_r48587902
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala
 ---
@@ -176,3 +179,229 @@ case class Crc32(child: Expression) extends 
UnaryExpression with ImplicitCastInp
 })
   }
 }
+
+/**
+ * A function that calculates hash value for a group of expressions.
+ *
+ * The hash value for an expression depends on its type:
+ *  - null:   0
--- End diff --

If we're going to change this, we should use a different value for null. 
Pick a large random number instead.

0 will be computed as the hash for more reasonable data (e.g. more likely 
an int column contains 0 than a large prime) and we can cheaply reduce some 
collisions.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-167932291
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48442/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-167932253
  
**[Test build #48442 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48442/consoleFull)**
 for PR 10435 at commit 
[`8703b1a`](https://github.com/apache/spark/commit/8703b1a127235c49614d326334548f125b81383b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-167932288
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-167747465
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-167747467
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48404/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-167747137
  
**[Test build #48404 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48404/consoleFull)**
 for PR 10435 at commit 
[`1cdb2bc`](https://github.com/apache/spark/commit/1cdb2bcc1ede58fdd9c1e98bff4b5544b8a6e74e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-29 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/10435#discussion_r48527967
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala
 ---
@@ -176,3 +179,223 @@ case class Crc32(child: Expression) extends 
UnaryExpression with ImplicitCastInp
 })
   }
 }
+
+/**
+ * A function that calculates hash value for a group of expressions.
+ *
+ * The hash value for an expression depends on its type:
+ *  - null:   0
+ *  - boolean:0 for true, 1 for false.
--- End diff --

Looks like in Hive hash value of boolean is 1 for true and 0 for false?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-29 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/10435#discussion_r48528123
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala
 ---
@@ -176,3 +179,223 @@ case class Crc32(child: Expression) extends 
UnaryExpression with ImplicitCastInp
 })
   }
 }
+
+/**
+ * A function that calculates hash value for a group of expressions.
+ *
+ * The hash value for an expression depends on its type:
+ *  - null:   0
+ *  - boolean:0 for true, 1 for false.
+ *  - byte, short, int:   the input itself.
+ *  - long:   input XOR (input >>> 32)
+ *  - float:  java.lang.Float.floatToIntBits(input)
+ *  - double: l = java.lang.Double.doubleToLongBits(input); l 
XOR (l >>> 32)
+ *  - binary: java.util.Arrays.hashCode(input)
+ *  - array:  recursively calculate hash value for each 
element, and aggregate them by
+ *`result = result * 37 + elementHash` with an 
initial value `result = 37`.
+ *  - map:recursively calculate hash value for each 
key-value pair, and aggregate
+ *them by `result += keyHash XOR valueHash`.
+ *  - struct: similar to array, calculate hash value for each 
field and aggregate them.
+ *  - other type: input.hashCode().
+ *e.g. calculate hash value for string type by 
`UTF8String.hashCode()`.
+ * Finally we aggregate the hash values for each expression by the same 
way of array.
+ */
+case class Hash(children: Seq[Expression]) extends Expression {
+
+  override def dataType: DataType = IntegerType
+
+  override def foldable: Boolean = children.forall(_.foldable)
+
+  override def nullable: Boolean = false
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+if (children.isEmpty) {
+  TypeCheckResult.TypeCheckFailure("input to function hash cannot be 
empty")
+} else {
+  TypeCheckResult.TypeCheckSuccess
+}
+  }
+
+  override def eval(input: InternalRow): Any = {
+var result = 37
+for (e <- children) {
+  val hashValue = computeHash(e.eval(input), e.dataType)
+  result = result * 37 + hashValue
+}
+result
+  }
+
+  private def computeHash(v: Any, dataType: DataType): Int = v match {
+case null => 0
+case b: Boolean => if (b) 0 else 1
+case b: Byte => b.toInt
+case s: Short => s.toInt
+case i: Int => i
+case l: Long => (l ^ (l >>> 32)).toInt
+case f: Float => java.lang.Float.floatToIntBits(f)
+case d: Double =>
+  val b = java.lang.Double.doubleToLongBits(d)
+  (b ^ (b >>> 32)).toInt
+case a: Array[Byte] => java.util.Arrays.hashCode(a)
+
+case array: ArrayData =>
+  val elementType = dataType.asInstanceOf[ArrayType].elementType
+  var result = 0
+  var i = 0
+  while (i < array.numElements()) {
+val hashValue = computeHash(array.get(i, elementType), elementType)
+result = result * 37 + hashValue
--- End diff --

37? Looks like Hive uses 31.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-29 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/10435#discussion_r48537829
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala
 ---
@@ -176,3 +179,223 @@ case class Crc32(child: Expression) extends 
UnaryExpression with ImplicitCastInp
 })
   }
 }
+
+/**
+ * A function that calculates hash value for a group of expressions.
+ *
+ * The hash value for an expression depends on its type:
+ *  - null:   0
+ *  - boolean:0 for true, 1 for false.
--- End diff --

Oh sorry I forgot to update the PR description.
According to the 
[discussion](https://github.com/apache/spark/pull/10435#discussion_r48482588), 
we decided to not follow hive for performance concerns. So here I followed the 
original hash code of internal row so that we don't need to fix a lot of tests.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-28 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-167698232
  
**[Test build #48388 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48388/consoleFull)**
 for PR 10435 at commit 
[`6311aa7`](https://github.com/apache/spark/commit/6311aa75a7a41fee8464ee96e5949ccad3e7d7a5).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-28 Thread cloud-fan
Github user cloud-fan commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-167583523
  
need some help about the R test, I can't see the expected result through 
the error log...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-28 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/10435#discussion_r48482588
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala
 ---
@@ -176,3 +178,221 @@ case class Crc32(child: Expression) extends 
UnaryExpression with ImplicitCastInp
 })
   }
 }
+
+/**
+ * A function that calculates hash value for a group of expressions.
+ *
+ * The hash value for an expression depends on its type:
+ *  - null:   0
+ *  - boolean:1 for true, 0 for false.
+ *  - byte, short, int:   the input itself.
+ *  - long:   input XOR (input >>> 32)
+ *  - float:  java.lang.Float.floatToIntBits(input)
+ *  - double: l = java.lang.Double.doubleToLongBits(input); l 
XOR (l >>> 32)
+ *  - binary: java.util.Arrays.hashCode(input)
+ *  - array:  recursively calculate hash value for each 
element, and aggregate them by
+ *`result = result * 31 + elementHash` with an 
initial value `result = 0`.
+ *  - map:recursively calculate hash value for each 
key-value pair, and aggregate
+ *them by `result += keyHash XOR valueHash`.
+ *  - struct: similar to array, calculate hash value for each 
field and aggregate them.
+ *  - other type: input.hashCode().
+ *e.g. calculate hash value for string type by 
`UTF8String.hashCode()`.
+ * Finally we aggregate the hash values for each expression by `result = 
result * 31 + exprHash`.
+ *
+ * This hash algorithm follows hive's bucketing hash function, so that our 
bucketing function can
+ * be compatible with hive's, e.g. we can benefit from bucketing even the 
data source is mixed with
+ * hive tables.
+ */
+case class Hash(children: Seq[Expression]) extends Expression {
+
+  override def dataType: DataType = IntegerType
+
+  override def foldable: Boolean = children.forall(_.foldable)
+
+  override def nullable: Boolean = false
+
+  override def eval(input: InternalRow): Any = {
+var result = 0
+for (e <- children) {
+  val hashValue = computeHash(e.eval(input), e.dataType)
+  result = result * 31 + hashValue
+}
+result
+  }
+
+  private def computeHash(v: Any, dataType: DataType): Int = v match {
+case null => 0
+case b: Boolean => if (b) 1 else 0
+case b: Byte => b.toInt
+case s: Short => s.toInt
+case i: Int => i
+case l: Long => (l ^ (l >>> 32)).toInt
+case f: Float => java.lang.Float.floatToIntBits(f)
+case d: Double =>
+  val b = java.lang.Double.doubleToLongBits(d)
+  (b ^ (b >>> 32)).toInt
+case a: Array[Byte] => java.util.Arrays.hashCode(a)
+case s: UTF8String => s.toString.hashCode
--- End diff --

To follow hive, I turn `UTF8String` to `String` first and then call 
`hashCode`, but I'm a little worried about this:

* this is definitely slower than just `UTF8String.hashCode`, and it's a 
critical path that we will run it for every row during `Exchange`, will it hurt 
performance?
* hive has string type, varchar type and char type, and they have 
[different hash 
code](https://github.com/apache/hive/blob/release-1.2.0/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorUtils.java#L527-L541),
 but we only have `StringType`, which is hard to match all of them.

cc @nongli @yhuai @marmbrus @rxin 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-28 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-167614804
  
I see
```
1. Failure (at test_sparkSQL.R#1160): group by, agg functions 
--
30 not equal to collect(max(gd))[1, 2]
30 - 19 == 11

2. Failure (at test_sparkSQL.R#1635): crosstab() on a DataFrame 

expected is not identical to ordered. Differences: 
Names: 2 string mismatches
Component 2: Mean relative difference: 2
Component 3: Mean relative difference: 2
Error: Test failures
Execution halted
Had test failures; see logs.
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-28 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/10435#discussion_r48500689
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala
 ---
@@ -176,3 +178,221 @@ case class Crc32(child: Expression) extends 
UnaryExpression with ImplicitCastInp
 })
   }
 }
+
+/**
+ * A function that calculates hash value for a group of expressions.
+ *
+ * The hash value for an expression depends on its type:
+ *  - null:   0
+ *  - boolean:1 for true, 0 for false.
+ *  - byte, short, int:   the input itself.
+ *  - long:   input XOR (input >>> 32)
+ *  - float:  java.lang.Float.floatToIntBits(input)
+ *  - double: l = java.lang.Double.doubleToLongBits(input); l 
XOR (l >>> 32)
+ *  - binary: java.util.Arrays.hashCode(input)
+ *  - array:  recursively calculate hash value for each 
element, and aggregate them by
+ *`result = result * 31 + elementHash` with an 
initial value `result = 0`.
+ *  - map:recursively calculate hash value for each 
key-value pair, and aggregate
+ *them by `result += keyHash XOR valueHash`.
+ *  - struct: similar to array, calculate hash value for each 
field and aggregate them.
+ *  - other type: input.hashCode().
+ *e.g. calculate hash value for string type by 
`UTF8String.hashCode()`.
+ * Finally we aggregate the hash values for each expression by `result = 
result * 31 + exprHash`.
+ *
+ * This hash algorithm follows hive's bucketing hash function, so that our 
bucketing function can
+ * be compatible with hive's, e.g. we can benefit from bucketing even the 
data source is mixed with
+ * hive tables.
+ */
+case class Hash(children: Seq[Expression]) extends Expression {
+
+  override def dataType: DataType = IntegerType
+
+  override def foldable: Boolean = children.forall(_.foldable)
+
+  override def nullable: Boolean = false
+
+  override def eval(input: InternalRow): Any = {
+var result = 0
+for (e <- children) {
+  val hashValue = computeHash(e.eval(input), e.dataType)
+  result = result * 31 + hashValue
+}
+result
+  }
+
+  private def computeHash(v: Any, dataType: DataType): Int = v match {
+case null => 0
+case b: Boolean => if (b) 1 else 0
+case b: Byte => b.toInt
+case s: Short => s.toInt
+case i: Int => i
+case l: Long => (l ^ (l >>> 32)).toInt
+case f: Float => java.lang.Float.floatToIntBits(f)
+case d: Double =>
+  val b = java.lang.Double.doubleToLongBits(d)
+  (b ^ (b >>> 32)).toInt
+case a: Array[Byte] => java.util.Arrays.hashCode(a)
+case s: UTF8String => s.toString.hashCode
--- End diff --

Definitely don't do toString. Any particular reason why we need to match 
Hive's hash code?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-28 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/10435#discussion_r48500881
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala
 ---
@@ -176,3 +178,221 @@ case class Crc32(child: Expression) extends 
UnaryExpression with ImplicitCastInp
 })
   }
 }
+
+/**
+ * A function that calculates hash value for a group of expressions.
+ *
+ * The hash value for an expression depends on its type:
+ *  - null:   0
+ *  - boolean:1 for true, 0 for false.
+ *  - byte, short, int:   the input itself.
+ *  - long:   input XOR (input >>> 32)
+ *  - float:  java.lang.Float.floatToIntBits(input)
+ *  - double: l = java.lang.Double.doubleToLongBits(input); l 
XOR (l >>> 32)
+ *  - binary: java.util.Arrays.hashCode(input)
+ *  - array:  recursively calculate hash value for each 
element, and aggregate them by
+ *`result = result * 31 + elementHash` with an 
initial value `result = 0`.
+ *  - map:recursively calculate hash value for each 
key-value pair, and aggregate
+ *them by `result += keyHash XOR valueHash`.
+ *  - struct: similar to array, calculate hash value for each 
field and aggregate them.
+ *  - other type: input.hashCode().
+ *e.g. calculate hash value for string type by 
`UTF8String.hashCode()`.
+ * Finally we aggregate the hash values for each expression by `result = 
result * 31 + exprHash`.
+ *
+ * This hash algorithm follows hive's bucketing hash function, so that our 
bucketing function can
+ * be compatible with hive's, e.g. we can benefit from bucketing even the 
data source is mixed with
+ * hive tables.
+ */
+case class Hash(children: Seq[Expression]) extends Expression {
+
+  override def dataType: DataType = IntegerType
+
+  override def foldable: Boolean = children.forall(_.foldable)
+
+  override def nullable: Boolean = false
+
+  override def eval(input: InternalRow): Any = {
+var result = 0
+for (e <- children) {
+  val hashValue = computeHash(e.eval(input), e.dataType)
+  result = result * 31 + hashValue
+}
+result
+  }
+
+  private def computeHash(v: Any, dataType: DataType): Int = v match {
+case null => 0
+case b: Boolean => if (b) 1 else 0
+case b: Byte => b.toInt
+case s: Short => s.toInt
+case i: Int => i
+case l: Long => (l ^ (l >>> 32)).toInt
+case f: Float => java.lang.Float.floatToIntBits(f)
+case d: Double =>
+  val b = java.lang.Double.doubleToLongBits(d)
+  (b ^ (b >>> 32)).toInt
+case a: Array[Byte] => java.util.Arrays.hashCode(a)
+case s: UTF8String => s.toString.hashCode
--- End diff --

We have to match Hive's hashcode if we want to be able to join data Hive 
has bucketed with our own data.

+1 to avoiding toString.  We should also avoid boxing and runtime type 
reflection for the hash code (which this function is doing).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-28 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/10435#discussion_r48502287
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala
 ---
@@ -176,3 +178,221 @@ case class Crc32(child: Expression) extends 
UnaryExpression with ImplicitCastInp
 })
   }
 }
+
+/**
+ * A function that calculates hash value for a group of expressions.
+ *
+ * The hash value for an expression depends on its type:
+ *  - null:   0
+ *  - boolean:1 for true, 0 for false.
+ *  - byte, short, int:   the input itself.
+ *  - long:   input XOR (input >>> 32)
+ *  - float:  java.lang.Float.floatToIntBits(input)
+ *  - double: l = java.lang.Double.doubleToLongBits(input); l 
XOR (l >>> 32)
+ *  - binary: java.util.Arrays.hashCode(input)
+ *  - array:  recursively calculate hash value for each 
element, and aggregate them by
+ *`result = result * 31 + elementHash` with an 
initial value `result = 0`.
+ *  - map:recursively calculate hash value for each 
key-value pair, and aggregate
+ *them by `result += keyHash XOR valueHash`.
+ *  - struct: similar to array, calculate hash value for each 
field and aggregate them.
+ *  - other type: input.hashCode().
+ *e.g. calculate hash value for string type by 
`UTF8String.hashCode()`.
+ * Finally we aggregate the hash values for each expression by `result = 
result * 31 + exprHash`.
+ *
+ * This hash algorithm follows hive's bucketing hash function, so that our 
bucketing function can
+ * be compatible with hive's, e.g. we can benefit from bucketing even the 
data source is mixed with
+ * hive tables.
+ */
+case class Hash(children: Seq[Expression]) extends Expression {
+
+  override def dataType: DataType = IntegerType
+
+  override def foldable: Boolean = children.forall(_.foldable)
+
+  override def nullable: Boolean = false
+
+  override def eval(input: InternalRow): Any = {
+var result = 0
+for (e <- children) {
+  val hashValue = computeHash(e.eval(input), e.dataType)
+  result = result * 31 + hashValue
+}
+result
+  }
+
+  private def computeHash(v: Any, dataType: DataType): Int = v match {
+case null => 0
+case b: Boolean => if (b) 1 else 0
+case b: Byte => b.toInt
+case s: Short => s.toInt
+case i: Int => i
+case l: Long => (l ^ (l >>> 32)).toInt
+case f: Float => java.lang.Float.floatToIntBits(f)
+case d: Double =>
+  val b = java.lang.Double.doubleToLongBits(d)
+  (b ^ (b >>> 32)).toInt
+case a: Array[Byte] => java.util.Arrays.hashCode(a)
+case s: UTF8String => s.toString.hashCode
--- End diff --

It seems to me we don't need to follow Hive's hash code, but we should make 
sure we don't pick the plan if it's a table bucketed by HIve.





---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-167710232
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-28 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-167710215
  
**[Test build #48388 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48388/consoleFull)**
 for PR 10435 at commit 
[`6311aa7`](https://github.com/apache/spark/commit/6311aa75a7a41fee8464ee96e5949ccad3e7d7a5).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-167710233
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48388/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-28 Thread viirya
Github user viirya commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-167713829
  
Actually I have already created a pr #9883 for this long time ago...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-28 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-167727788
  
**[Test build #48399 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48399/consoleFull)**
 for PR 10435 at commit 
[`84902bb`](https://github.com/apache/spark/commit/84902bb2b9488212e0006bf359135073b8a9e496).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-28 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/10435#discussion_r48516214
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala
 ---
@@ -176,3 +178,221 @@ case class Crc32(child: Expression) extends 
UnaryExpression with ImplicitCastInp
 })
   }
 }
+
+/**
+ * A function that calculates hash value for a group of expressions.
+ *
+ * The hash value for an expression depends on its type:
+ *  - null:   0
+ *  - boolean:1 for true, 0 for false.
+ *  - byte, short, int:   the input itself.
+ *  - long:   input XOR (input >>> 32)
+ *  - float:  java.lang.Float.floatToIntBits(input)
+ *  - double: l = java.lang.Double.doubleToLongBits(input); l 
XOR (l >>> 32)
+ *  - binary: java.util.Arrays.hashCode(input)
+ *  - array:  recursively calculate hash value for each 
element, and aggregate them by
+ *`result = result * 31 + elementHash` with an 
initial value `result = 0`.
+ *  - map:recursively calculate hash value for each 
key-value pair, and aggregate
+ *them by `result += keyHash XOR valueHash`.
+ *  - struct: similar to array, calculate hash value for each 
field and aggregate them.
+ *  - other type: input.hashCode().
+ *e.g. calculate hash value for string type by 
`UTF8String.hashCode()`.
+ * Finally we aggregate the hash values for each expression by `result = 
result * 31 + exprHash`.
+ *
+ * This hash algorithm follows hive's bucketing hash function, so that our 
bucketing function can
+ * be compatible with hive's, e.g. we can benefit from bucketing even the 
data source is mixed with
+ * hive tables.
+ */
+case class Hash(children: Seq[Expression]) extends Expression {
+
+  override def dataType: DataType = IntegerType
+
+  override def foldable: Boolean = children.forall(_.foldable)
+
+  override def nullable: Boolean = false
+
+  override def eval(input: InternalRow): Any = {
+var result = 0
+for (e <- children) {
+  val hashValue = computeHash(e.eval(input), e.dataType)
+  result = result * 31 + hashValue
+}
+result
+  }
+
+  private def computeHash(v: Any, dataType: DataType): Int = v match {
+case null => 0
+case b: Boolean => if (b) 1 else 0
+case b: Byte => b.toInt
+case s: Short => s.toInt
+case i: Int => i
+case l: Long => (l ^ (l >>> 32)).toInt
+case f: Float => java.lang.Float.floatToIntBits(f)
+case d: Double =>
+  val b = java.lang.Double.doubleToLongBits(d)
+  (b ^ (b >>> 32)).toInt
+case a: Array[Byte] => java.util.Arrays.hashCode(a)
+case s: UTF8String => s.toString.hashCode
--- End diff --

@marmbrus  I think we will only use the codegen version, should we remove 
this branch and throw exception if it's called?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-28 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/10435#discussion_r48516332
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala
 ---
@@ -176,3 +178,221 @@ case class Crc32(child: Expression) extends 
UnaryExpression with ImplicitCastInp
 })
   }
 }
+
+/**
+ * A function that calculates hash value for a group of expressions.
+ *
+ * The hash value for an expression depends on its type:
+ *  - null:   0
+ *  - boolean:1 for true, 0 for false.
+ *  - byte, short, int:   the input itself.
+ *  - long:   input XOR (input >>> 32)
+ *  - float:  java.lang.Float.floatToIntBits(input)
+ *  - double: l = java.lang.Double.doubleToLongBits(input); l 
XOR (l >>> 32)
+ *  - binary: java.util.Arrays.hashCode(input)
+ *  - array:  recursively calculate hash value for each 
element, and aggregate them by
+ *`result = result * 31 + elementHash` with an 
initial value `result = 0`.
+ *  - map:recursively calculate hash value for each 
key-value pair, and aggregate
+ *them by `result += keyHash XOR valueHash`.
+ *  - struct: similar to array, calculate hash value for each 
field and aggregate them.
+ *  - other type: input.hashCode().
+ *e.g. calculate hash value for string type by 
`UTF8String.hashCode()`.
+ * Finally we aggregate the hash values for each expression by `result = 
result * 31 + exprHash`.
+ *
+ * This hash algorithm follows hive's bucketing hash function, so that our 
bucketing function can
+ * be compatible with hive's, e.g. we can benefit from bucketing even the 
data source is mixed with
+ * hive tables.
+ */
+case class Hash(children: Seq[Expression]) extends Expression {
+
+  override def dataType: DataType = IntegerType
+
+  override def foldable: Boolean = children.forall(_.foldable)
+
+  override def nullable: Boolean = false
+
+  override def eval(input: InternalRow): Any = {
+var result = 0
+for (e <- children) {
+  val hashValue = computeHash(e.eval(input), e.dataType)
+  result = result * 31 + hashValue
+}
+result
+  }
+
+  private def computeHash(v: Any, dataType: DataType): Int = v match {
+case null => 0
+case b: Boolean => if (b) 1 else 0
+case b: Byte => b.toInt
+case s: Short => s.toInt
+case i: Int => i
+case l: Long => (l ^ (l >>> 32)).toInt
+case f: Float => java.lang.Float.floatToIntBits(f)
+case d: Double =>
+  val b = java.lang.Double.doubleToLongBits(d)
+  (b ^ (b >>> 32)).toInt
+case a: Array[Byte] => java.util.Arrays.hashCode(a)
+case s: UTF8String => s.toString.hashCode
--- End diff --

Oh, I see.  Its probably fine to have this as a fallback.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-28 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-167737270
  
**[Test build #48404 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48404/consoleFull)**
 for PR 10435 at commit 
[`1cdb2bc`](https://github.com/apache/spark/commit/1cdb2bcc1ede58fdd9c1e98bff4b5544b8a6e74e).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-167730354
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48399/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-28 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-167730331
  
**[Test build #48399 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48399/consoleFull)**
 for PR 10435 at commit 
[`84902bb`](https://github.com/apache/spark/commit/84902bb2b9488212e0006bf359135073b8a9e496).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-167730352
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-167399420
  
**[Test build #48352 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48352/consoleFull)**
 for PR 10435 at commit 
[`a629e75`](https://github.com/apache/spark/commit/a629e754be3c087466de9f3b0aa8634e8d640ea0).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-167403074
  
**[Test build #48352 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48352/consoleFull)**
 for PR 10435 at commit 
[`a629e75`](https://github.com/apache/spark/commit/a629e754be3c087466de9f3b0aa8634e8d640ea0).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:\n  * 
`case class Hash(children: Seq[Expression]) extends Expression `\n


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-167403091
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48352/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-167403090
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-27 Thread cloud-fan
Github user cloud-fan commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-167395236
  
retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-167396600
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-167396598
  
**[Test build #48349 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48349/consoleFull)**
 for PR 10435 at commit 
[`c130097`](https://github.com/apache/spark/commit/c130097e3f12a0d45eed2e2d5eb6682d65f11c9a).
 * This patch **fails Python style tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:\n  * 
`case class Hash(children: Seq[Expression]) extends Expression `\n


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-167396601
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48349/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-167396554
  
**[Test build #48349 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48349/consoleFull)**
 for PR 10435 at commit 
[`c130097`](https://github.com/apache/spark/commit/c130097e3f12a0d45eed2e2d5eb6682d65f11c9a).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-167396643
  
**[Test build #48350 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48350/consoleFull)**
 for PR 10435 at commit 
[`c130097`](https://github.com/apache/spark/commit/c130097e3f12a0d45eed2e2d5eb6682d65f11c9a).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-167396684
  
**[Test build #48350 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48350/consoleFull)**
 for PR 10435 at commit 
[`c130097`](https://github.com/apache/spark/commit/c130097e3f12a0d45eed2e2d5eb6682d65f11c9a).
 * This patch **fails Python style tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:\n  * 
`case class Hash(children: Seq[Expression]) extends Expression `\n


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-167396686
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48350/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-167396685
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-167447980
  
**[Test build #48356 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48356/consoleFull)**
 for PR 10435 at commit 
[`655800c`](https://github.com/apache/spark/commit/655800cd72e5fabc256cd64a82f6ccd60b491f92).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-167452745
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48356/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-167452744
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-167452730
  
**[Test build #48356 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48356/consoleFull)**
 for PR 10435 at commit 
[`655800c`](https://github.com/apache/spark/commit/655800cd72e5fabc256cd64a82f6ccd60b491f92).
 * This patch **fails SparkR unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:\n  * 
`case class Hash(children: Seq[Expression]) extends Expression `\n


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-167338506
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-167338507
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48346/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-167338499
  
**[Test build #48346 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48346/consoleFull)**
 for PR 10435 at commit 
[`303b69b`](https://github.com/apache/spark/commit/303b69b243fb2ac5f79c948c0008592b8c57fc25).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:\n  * 
`case class Hash(children: Seq[Expression]) extends Expression `\n


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-167328998
  
**[Test build #48346 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48346/consoleFull)**
 for PR 10435 at commit 
[`303b69b`](https://github.com/apache/spark/commit/303b69b243fb2ac5f79c948c0008592b8c57fc25).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-167329447
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48345/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-167329445
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-167246248
  
**[Test build #48325 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48325/consoleFull)**
 for PR 10435 at commit 
[`04a7301`](https://github.com/apache/spark/commit/04a730154283a6125f05ed984115adf2e455ac60).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-167251759
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48325/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-167251754
  
**[Test build #48325 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48325/consoleFull)**
 for PR 10435 at commit 
[`04a7301`](https://github.com/apache/spark/commit/04a730154283a6125f05ed984115adf2e455ac60).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:\n  * 
`case class Hash(children: Seq[Expression]) extends Expression `\n


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

2015-12-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10435#issuecomment-167251758
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



  1   2   >