[GitHub] spark issue #19488: [SPARK-22266][SQL] The same aggregate function was evalu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19488 **[Test build #82722 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82722/testReport)** for PR 19488 at commit [`32bdf77`](https://github.com/apache/spark/commit/32bdf771fe70444ac23adf796702b5a26e085805).
[GitHub] spark issue #19488: [SPARK-22266][SQL] The same aggregate function was evalu...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19488 add to whitelist
[GitHub] spark issue #19488: SPARK-22266 The same aggregate function was evaluated mu...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19488 Update the title to `[SPARK-22266][SQL] The same aggregate function was evaluated multiple times`
[GitHub] spark issue #19488: SPARK-22266 The same aggregate function was evaluated mu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19488 Can one of the admins verify this patch?
[GitHub] spark pull request #19488: SPARK-22266 The same aggregate function was evalu...
GitHub user maryannxue opened a pull request: https://github.com/apache/spark/pull/19488 SPARK-22266 The same aggregate function was evaluated multiple times

## What changes were proposed in this pull request?

To let the same aggregate function that appears multiple times in an Aggregate be evaluated only once, we need to deduplicate the aggregate expressions. The original code tried to use a "distinct" call to get a set of aggregate expressions, but that did not work, since "distinct" does not compare semantic equality. And even if it did, further work would still be needed when rewriting the result expressions. In this PR, I changed the "set" to a map that maps the semantic identity of an aggregate expression to the expression itself. Thus, later on, when rewriting the result expressions (i.e., the output expressions), the aggregate expression references can be fixed.

## How was this patch tested?

Added a new test in SQLQuerySuite.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/maryannxue/spark spark-22266

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19488.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19488

commit 32bdf771fe70444ac23adf796702b5a26e085805
Author: maryannxue
Date: 2017-10-13T05:31:10Z

SPARK-22266 The same aggregate function was evaluated multiple times
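[Editorial illustration] To make the map-based deduplication concrete, here is a minimal, self-contained sketch of the idea described above; it is not the actual Catalyst patch. `AggExpr` and `semanticId` are invented stand-ins for Catalyst's `AggregateExpression` and its semantic identity (Spark's own notion of semantic equality is based on canonicalized expressions).
```scala
// Minimal sketch: deduplicate expressions by a semantic identity so that
// syntactically different but semantically equal aggregates collapse to one.
case class AggExpr(fn: String, child: String, alias: String) {
  // Semantic identity ignores cosmetic differences such as the alias.
  def semanticId: (String, String) = (fn, child)
}

object DedupAggregates {
  def apply(aggs: Seq[AggExpr]): (Seq[AggExpr], Map[(String, String), AggExpr]) = {
    // Map each semantic identity to the first expression that carried it.
    val byId = aggs.foldLeft(Map.empty[(String, String), AggExpr]) { (m, a) =>
      if (m.contains(a.semanticId)) m else m + (a.semanticId -> a)
    }
    // Result expressions can later be rewritten to reference the single
    // deduplicated aggregate via a lookup in this map.
    (byId.values.toSeq, byId)
  }

  def main(args: Array[String]): Unit = {
    val aggs = Seq(
      AggExpr("max", "a", "x"),
      AggExpr("max", "a", "y"), // semantically equal to the first
      AggExpr("sum", "b", "z"))
    val (dedup, table) = DedupAggregates(aggs)
    assert(dedup.size == 2) // max(a) is evaluated only once
    println(table(AggExpr("max", "a", "y").semanticId).alias) // prints "x"
  }
}
```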
[GitHub] spark pull request #19475: [SPARK-22257][SQL]Reserve all non-deterministic e...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19475
[GitHub] spark issue #18460: [SPARK-21247][SQL] Type comparison should respect case-s...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/18460 Thank you, @gatorsmile . Sure, I agree.
[GitHub] spark issue #19475: [SPARK-22257][SQL]Reserve all non-deterministic expressi...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19475 Thanks! Merged to master
[GitHub] spark issue #19470: [SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use Spark s...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19470 One minor comment doesn't affect this. LGTM.
[GitHub] spark issue #19475: [SPARK-22257][SQL]Reserve all non-deterministic expressi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19475 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82714/ Test PASSed.
[GitHub] spark issue #19475: [SPARK-22257][SQL]Reserve all non-deterministic expressi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19475 Merged build finished. Test PASSed.
[GitHub] spark issue #19475: [SPARK-22257][SQL]Reserve all non-deterministic expressi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19475 **[Test build #82714 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82714/testReport)** for PR 19475 at commit [`60a0360`](https://github.com/apache/spark/commit/60a0360b946e9b9b3ee851d8ea5e68b498251d52). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #19470: [SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19470#discussion_r144469314

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2050,4 +2050,64 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
     }
   }
 }
+
+  Seq("orc", "parquet").foreach { format =>
+    test(s"SPARK-18355 Read data from a hive table with a new column - $format") {
+      val client = spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client
+
+      Seq("true", "false").foreach { value =>
+        withSQLConf(
+          HiveUtils.CONVERT_METASTORE_ORC.key -> value,
+          HiveUtils.CONVERT_METASTORE_PARQUET.key -> value) {
--- End diff --

Since you in fact separate orc and parquet into two tests, maybe you just need to test against one config at a time, i.e., orc -> HiveUtils.CONVERT_METASTORE_ORC.key, parquet -> HiveUtils.CONVERT_METASTORE_PARQUET.key.
[GitHub] spark issue #18460: [SPARK-21247][SQL] Type comparison should respect case-s...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18460 BTW, we are unable to merge this to Spark 2.2 although this is a bug fix.
[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19433 **[Test build #82721 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82721/testReport)** for PR 19433 at commit [`93e17fc`](https://github.com/apache/spark/commit/93e17fc74958d4fa8f3bea38731ecec662e4ca66).
[GitHub] spark issue #18460: [SPARK-21247][SQL] Type comparison should respect case-s...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18460 LGTM cc @cloud-fan
[GitHub] spark issue #19448: [SPARK-22217] [SQL] ParquetFileFormat to support arbitra...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19448 Will check it next time when I am not confident.
[GitHub] spark issue #19448: [SPARK-22217] [SQL] ParquetFileFormat to support arbitra...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19448 Ok. Next time, please check it with the committers who are familiar with Spark SQL.
[GitHub] spark issue #19448: [SPARK-22217] [SQL] ParquetFileFormat to support arbitra...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19448 I did this as I was confident it is a bug, because the doc says it should work but it actually does not, and the fix does not break the previous support.
[GitHub] spark issue #19470: [SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use Spark s...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19470 LGTM too.
[GitHub] spark issue #19448: [SPARK-22217] [SQL] ParquetFileFormat to support arbitra...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19448 This one has existed since at least Spark 1.5. If you are not confident whether this is a bug or not, please check before merging it.
[GitHub] spark issue #19470: [SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use Spark s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19470 **[Test build #82720 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82720/testReport)** for PR 19470 at commit [`8e7fe9b`](https://github.com/apache/spark/commit/8e7fe9b9e3fc6121686caad45dcfdb4ff08f0c4a).
[GitHub] spark pull request #19483: [SPARK-21165][SQL] FileFormatWriter should handle...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19483
[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19433 Merged build finished. Test PASSed.
[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19433 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82717/ Test PASSed.
[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19433 **[Test build #82717 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82717/testReport)** for PR 19433 at commit [`c9a8e01`](https://github.com/apache/spark/commit/c9a8e01cead78d2ea6eb6a9ffa007314cbbcf60a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19483: [SPARK-21165][SQL] FileFormatWriter should handle mismat...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19483 thanks for the review, merging to master!
[GitHub] spark pull request #19484: [SPARK-22252][SQL][2.2] FileFormatWriter should r...
Github user cloud-fan closed the pull request at: https://github.com/apache/spark/pull/19484
[GitHub] spark issue #19269: [SPARK-22026][SQL][WIP] data source v2 write path
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19269 **[Test build #82719 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82719/testReport)** for PR 19269 at commit [`ac3de3c`](https://github.com/apache/spark/commit/ac3de3c2c5c50df8e5f2c618a78a4fa4c6756797).
[GitHub] spark pull request #19470: [SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use ...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19470#discussion_r144466637

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2050,4 +2050,60 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
     }
   }
 }
+
+  test("SPARK-18355 Use Spark schema to read ORC table instead of ORC file schema") {
--- End diff --

Yep. I'll add `parquet`, too.
[GitHub] spark pull request #18692: [SPARK-21417][SQL] Infer join conditions using pr...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/18692#discussion_r144466472

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala ---
@@ -152,3 +152,71 @@ object EliminateOuterJoin extends Rule[LogicalPlan] with PredicateHelper {
     if (j.joinType == newJoinType) f else Filter(condition, j.copy(joinType = newJoinType))
   }
 }
+
+/**
+ * A rule that uses propagated constraints to infer join conditions. The optimization is applicable
+ * only to CROSS joins.
+ *
+ * For instance, if there is a CROSS join, where the left relation has 'a = 1' and the right
+ * relation has 'b = 1', then the rule infers 'a = b' as a join predicate.
+ */
+object InferJoinConditionsFromConstraints extends Rule[LogicalPlan] with PredicateHelper {
+
+  def apply(plan: LogicalPlan): LogicalPlan = {
+    if (SQLConf.get.constraintPropagationEnabled) {
+      inferJoinConditions(plan)
+    } else {
+      plan
+    }
+  }
+
+  private def inferJoinConditions(plan: LogicalPlan): LogicalPlan = plan transform {
+    case join @ Join(left, right, Cross, conditionOpt) =>
+      val leftConstraints = join.constraints.filter(_.references.subsetOf(left.outputSet))
+      val rightConstraints = join.constraints.filter(_.references.subsetOf(right.outputSet))
--- End diff --

I don't think we need to separate the constraints as left only and right only. The following case can infer `t1.col1 = t2.col1`:
```scala
Seq((1, 2)).toDF("col1", "col2").write.saveAsTable("t1")
Seq((1, 2)).toDF("col1", "col2").write.saveAsTable("t2")
val df = spark.sql("SELECT * FROM t1 CROSS JOIN t2 ON t1.col1 >= t2.col1 " +
  "WHERE t1.col1 = t1.col2 + t2.col2 and t2.col1 = t1.col2 + t2.col2")
```
[GitHub] spark issue #19448: [SPARK-22217] [SQL] ParquetFileFormat to support arbitra...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19448 How come fixing the behaviour to match the documentation is not a bug fix? That would basically mean we don't backport fixes for things not working as documented for other internal configurations, either. This does not extend the functionalities; it fixes the functionality to work as documented and expected, and I call that a bugfix.
[GitHub] spark issue #19483: [SPARK-21165][SQL] FileFormatWriter should handle mismat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19483 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82712/ Test PASSed.
[GitHub] spark pull request #19470: [SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19470#discussion_r144465324

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2050,4 +2050,60 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
     }
   }
 }
+
+  test("SPARK-18355 Use Spark schema to read ORC table instead of ORC file schema") {
--- End diff --

Improve the test case for checking the other formats?
[GitHub] spark issue #19484: [SPARK-22252][SQL][2.2] FileFormatWriter should respect ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19484 Thanks! Merged to 2.2. Could you close this?
[GitHub] spark pull request #19470: [SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19470#discussion_r144465575

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2050,4 +2050,60 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
     }
   }
 }
+
+  test("SPARK-18355 Use Spark schema to read ORC table instead of ORC file schema") {
--- End diff --

since it depends on the CONVERT_METASTORE_XXX conf, maybe also test parquet.
[GitHub] spark issue #19483: [SPARK-21165][SQL] FileFormatWriter should handle mismat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19483 Merged build finished. Test PASSed.
[GitHub] spark issue #19470: [SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use Spark s...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19470 LGTM, pending jenkins
[GitHub] spark issue #19483: [SPARK-21165][SQL] FileFormatWriter should handle mismat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19483 **[Test build #82712 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82712/testReport)** for PR 19483 at commit [`f4a7337`](https://github.com/apache/spark/commit/f4a7337b3c4c2b58931550afc5d57902fa98ba96). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19448: [SPARK-22217] [SQL] ParquetFileFormat to support arbitra...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19448 That conf is an internal one. The end users will not see it. This is not a bug fix. We should not extend the existing functions or introduce new behaviors/features in 2.2.x releases.
[GitHub] spark issue #19448: [SPARK-22217] [SQL] ParquetFileFormat to support arbitra...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19448 Since the risk is low, I did not revert it.
[GitHub] spark issue #19269: [SPARK-22026][SQL][WIP] data source v2 write path
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19269 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82713/ Test FAILed.
[GitHub] spark issue #19269: [SPARK-22026][SQL][WIP] data source v2 write path
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19269 Merged build finished. Test FAILed.
[GitHub] spark issue #19269: [SPARK-22026][SQL][WIP] data source v2 write path
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19269 **[Test build #82713 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82713/testReport)** for PR 19269 at commit [`d65f2c4`](https://github.com/apache/spark/commit/d65f2c4f7a1d25072f02b97d6454b1ee2670f1b1).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `case class WriteToDataSourceV2Command(writer: DataSourceV2Writer, query: LogicalPlan)`
   * `class RowToInternalRowDataWriterFactory(`
   * `class RowToInternalRowDataWriter(rowWriter: DataWriter[Row], encoder: ExpressionEncoder[Row])`
[GitHub] spark issue #18692: [SPARK-21417][SQL] Infer join conditions using propagate...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18692 cc @gengliangwang Review this?
[GitHub] spark issue #18460: [SPARK-21247][SQL] Type comparison should respect case-s...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/18460 Hi, @gatorsmile and @cloud-fan . Could you review this again, too?
[GitHub] spark issue #19470: [SPARK-14387][SPARK-18355][SQL] Use Spark schema to read...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19470 **[Test build #82718 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82718/testReport)** for PR 19470 at commit [`ef2123e`](https://github.com/apache/spark/commit/ef2123ecc516fce6feb2a4abe051b4ef862c51a0).
[GitHub] spark issue #19222: [SPARK-10399][CORE][SQL] Introduce multiple MemoryBlocks...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19222 @tejasapatil Thank you for your comment. I hope that the benchmark result is a response to the concern about virtual methods raised by @hvanhovell. @hvanhovell **What do you think?** As a long-term plan, it would be good if all (low-level) memory access (i.e., through `Platform`) were in the form of `MemoryBlock`, for two reasons: one is ease of understanding, the other is performance. This transition would proceed step by step. I prepared [another refactoring](https://github.com/apache/spark/pull/19472) for some classes.
[GitHub] spark issue #19470: [SPARK-14387][SPARK-18355][SQL] Use Spark schema to read...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19470 @cloud-fan . Thank you so much for the review! I updated the PR except for one point: if `fieldValue` is `null`, we would also need `setNull` again inside the `else` branch, so the current code is simpler.
```scala
if (fieldRef == null) {
  row.setNull...
} else {
  val fieldValue = ...
  ...
}
```
[GitHub] spark issue #19484: [SPARK-22252][SQL][2.2] FileFormatWriter should respect ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19484 Merged build finished. Test PASSed.
[GitHub] spark issue #19484: [SPARK-22252][SQL][2.2] FileFormatWriter should respect ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19484 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82711/ Test PASSed.
[GitHub] spark issue #19484: [SPARK-22252][SQL][2.2] FileFormatWriter should respect ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19484 **[Test build #82711 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82711/testReport)** for PR 19484 at commit [`e527540`](https://github.com/apache/spark/commit/e527540776f2016228dd7ae7ba4e45aff602401d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19448: [SPARK-22217] [SQL] ParquetFileFormat to support arbitra...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19448 I think this is a bug to fix, as the previous behaviour does not work as documented:
```
subclass of org.apache.hadoop.mapreduce.OutputCommitter...
```
and the fix does not change existing behaviour. Could you elaborate on why you think it is not eligible?
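[Editorial illustration] For context, the internal configuration under discussion is set like any other SQL conf; a rough sketch follows. The committer class named here is a hypothetical stand-in, not a real class.
```scala
// Illustrative only: the committer class named here is a stand-in and,
// per the documentation quoted above, must be a subclass of
// org.apache.hadoop.mapreduce.OutputCommitter.
spark.conf.set(
  "spark.sql.parquet.output.committer.class",
  "com.example.MyParquetOutputCommitter")
```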
[GitHub] spark issue #19222: [SPARK-10399][CORE][SQL] Introduce multiple MemoryBlocks...
Github user tejasapatil commented on the issue: https://github.com/apache/spark/pull/19222 At a high level, this idea is good and worth moving forward with. I still have to dig into your analysis in response to the concern raised by @hvanhovell. In terms of the PR itself, is the long-term plan to ensure that all memory access would be in the form of `MemoryBlock`? I mean there are still methods which do vanilla `Object`-based access, which I feel should be gotten rid of (not in this PR, as it will touch a lot more places).
[GitHub] spark issue #19222: [SPARK-10399][CORE][SQL] Introduce multiple MemoryBlocks...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19222 Sure, I agree with you. I will try performance evaluation for other methods like `getByte()`, and so on.
[GitHub] spark issue #19222: [SPARK-10399][CORE][SQL] Introduce multiple MemoryBlocks...
Github user tejasapatil commented on the issue: https://github.com/apache/spark/pull/19222 I pulled up the frequency of `UTF8String` methods being invoked from FB prod clusters and picked the top 25.
```
.writeToMemory()
.getBytes()
.toString()
.fromString()
.clone()
.toLong()
.fromString()
.translate()
.compare()
.getByte()
.compareTo()
.fromAddress()
.getBytes()
.fromBytes()
.<init>()
.substringSQL()
.substring()
.fromBytes()
.fromAddress()
.equals()
.compareTo()
.compare()
.toString()
.hashCode()
.toInt()
```
[GitHub] spark issue #19222: [SPARK-10399][CORE][SQL] Introduce multiple MemoryBlocks...
Github user tejasapatil commented on the issue: https://github.com/apache/spark/pull/19222 Apart from `UTF8String.trim`, can you try some other methods? If we have to eval perf, it's better to pick a method which is most frequently used... if I have to guess, `trim()` won't be one of the most called methods in that class.
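[Editorial illustration] As a concrete starting point for such an evaluation, below is a hand-rolled timing sketch for one of the frequently invoked methods from the list above (`compareTo`). This is an illustration, not the benchmark used in the PR; numbers from a loop like this are only indicative, and a real measurement should use JMH to control for warm-up and dead-code elimination.
```scala
import org.apache.spark.unsafe.types.UTF8String

// Quick-and-dirty timing loop for a frequently called UTF8String method.
object UTF8StringBench {
  def run(n: Int, a: UTF8String, b: UTF8String): Long = {
    var sink = 0L
    var i = 0
    while (i < n) { sink += a.compareTo(b); i += 1 }
    sink // returning the sink keeps the loop from being optimized away
  }

  def main(args: Array[String]): Unit = {
    val a = UTF8String.fromString("hello world")
    val b = UTF8String.fromString("hello spark")
    run(10000000, a, b) // warm up so the JIT compiles the loop first
    val start = System.nanoTime()
    val sink = run(100000000, a, b)
    println(s"compareTo: ${(System.nanoTime() - start) / 1e6} ms (sink=$sink)")
  }
}
```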
[GitHub] spark issue #19419: [SPARK-22188] [CORE] Adding security headers for prevent...
Github user krishna-pandey commented on the issue: https://github.com/apache/spark/pull/19419 @tgravescs These generic headers provide the available client-side protection for the application. Even if there is no sensitive data here to formulate an attack by itself, I think the exposed information can be used in conjunction with other data to target other ecosystem components. Also, in the future we may add an interface for data access, so now is the time to think of security first. Cross-site scripting is one of the most prevalent attack vectors and has been an OWASP Top 10 risk for web applications for decades. As the effort to have these headers in place is minimal, IMHO we should set them. As you rightly mentioned, deployment on the cloud can expand the attack surface pretty wide in the absence of the right firewall policy; also, let's not forget insider threats inside corporate networks. Going forward, maybe we will have enough insight to choose which headers need to be enabled by default and enforce them from the application side rather than leaving it to users.
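[Editorial illustration] The kind of client-side protection headers under discussion can be added with a standard servlet filter along these lines. This is a generic sketch, not the code in this PR, and the header values shown are common defaults rather than Spark's.
```scala
import javax.servlet.{Filter, FilterChain, FilterConfig, ServletRequest, ServletResponse}
import javax.servlet.http.HttpServletResponse

// Generic servlet filter that adds common client-side protection headers.
class SecurityHeadersFilter extends Filter {
  override def init(config: FilterConfig): Unit = {}
  override def destroy(): Unit = {}

  override def doFilter(req: ServletRequest, res: ServletResponse,
      chain: FilterChain): Unit = {
    val response = res.asInstanceOf[HttpServletResponse]
    // Ask browsers to enable their reflected-XSS protection.
    response.setHeader("X-XSS-Protection", "1; mode=block")
    // Prevent MIME sniffing of responses away from the declared type.
    response.setHeader("X-Content-Type-Options", "nosniff")
    chain.doFilter(req, res)
  }
}
```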
[GitHub] spark issue #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule
Github user rxin commented on the issue: https://github.com/apache/spark/pull/19451 Actually you already have it in the classdoc, so please just update the pr description with it.
[GitHub] spark pull request #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/19451#discussion_r144461898

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -1242,6 +1244,53 @@ object ReplaceIntersectWithSemiJoin extends Rule[LogicalPlan] {
 }

 /**
+ * If one or both of the datasets in the logical [[Except]] operator are purely transformed using
+ * [[Filter]], this rule will replace logical [[Except]] operator with a [[Filter]] operator by
+ * flipping the filter condition of the right child.
+ * {{{
+ *   SELECT a1, a2 FROM Tab1 WHERE a2 = 12 EXCEPT SELECT a1, a2 FROM Tab1 WHERE a1 = 5
+ *   ==> SELECT DISTINCT a1, a2 FROM Tab1 WHERE a2 = 12 AND a1 <> 5
+ * }}}
+ *
+ * Note:
+ * 1. We should combine all the [[Filter]] of the right node before flipping it using NOT operator.
+ */
+object ReplaceExceptWithFilter extends Rule[LogicalPlan] {
+
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+    case Except(left, right) if isEligible(left, right) =>
+      val filterCondition = combineFilters(right).asInstanceOf[Filter].condition
+      Distinct(
+        Filter(Not(replaceAttributesIn(filterCondition, left)), left)
+      )
+  }
+
+  def isEligible(left: LogicalPlan, right: LogicalPlan): Boolean = (left, right) match {
+    case (left, right: Filter) => nonFilterChild(left).sameResult(nonFilterChild(right))
+    case _ => false
+  }
+
+  def nonFilterChild(plan: LogicalPlan): LogicalPlan = plan.find(!_.isInstanceOf[Filter]).get
--- End diff --

it shouldn't be an AnalysisException, if there is a bug in catalyst.
[GitHub] spark pull request #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/19451#discussion_r144461913

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -1242,6 +1244,53 @@ object ReplaceIntersectWithSemiJoin extends Rule[LogicalPlan] {
 }

 /**
+ * If one or both of the datasets in the logical [[Except]] operator are purely transformed using
+ * [[Filter]], this rule will replace logical [[Except]] operator with a [[Filter]] operator by
+ * flipping the filter condition of the right child.
+ * {{{
+ *   SELECT a1, a2 FROM Tab1 WHERE a2 = 12 EXCEPT SELECT a1, a2 FROM Tab1 WHERE a1 = 5
+ *   ==> SELECT DISTINCT a1, a2 FROM Tab1 WHERE a2 = 12 AND a1 <> 5
+ * }}}
+ *
+ * Note:
+ * 1. We should combine all the [[Filter]] of the right node before flipping it using NOT operator.
+ */
+object ReplaceExceptWithFilter extends Rule[LogicalPlan] {
+
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+    case Except(left, right) if isEligible(left, right) =>
+      val filterCondition = combineFilters(right).asInstanceOf[Filter].condition
+      Distinct(
+        Filter(Not(replaceAttributesIn(filterCondition, left)), left)
+      )
+  }
+
+  def isEligible(left: LogicalPlan, right: LogicalPlan): Boolean = (left, right) match {
+    case (left, right: Filter) => nonFilterChild(left).sameResult(nonFilterChild(right))
+    case _ => false
+  }
+
+  def nonFilterChild(plan: LogicalPlan): LogicalPlan = plan.find(!_.isInstanceOf[Filter]).get
--- End diff --

but i agree we should throw a more meaningful error message in case there is a bug
[GitHub] spark pull request #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/19451#discussion_r144461813

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -1242,6 +1244,53 @@ object ReplaceIntersectWithSemiJoin extends Rule[LogicalPlan] {
 }

 /**
+ * If one or both of the datasets in the logical [[Except]] operator are purely transformed using
+ * [[Filter]], this rule will replace logical [[Except]] operator with a [[Filter]] operator by
+ * flipping the filter condition of the right child.
+ * {{{
+ *   SELECT a1, a2 FROM Tab1 WHERE a2 = 12 EXCEPT SELECT a1, a2 FROM Tab1 WHERE a1 = 5
+ *   ==> SELECT DISTINCT a1, a2 FROM Tab1 WHERE a2 = 12 AND a1 <> 5
+ * }}}
+ *
+ * Note:
+ * 1. We should combine all the [[Filter]] of the right node before flipping it using NOT operator.
+ */
+object ReplaceExceptWithFilter extends Rule[LogicalPlan] {
--- End diff --

I'd put this in a new file. the optimizer file is getting too long.
[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19433 **[Test build #82717 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82717/testReport)** for PR 19433 at commit [`c9a8e01`](https://github.com/apache/spark/commit/c9a8e01cead78d2ea6eb6a9ffa007314cbbcf60a). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule
Github user rxin commented on the issue: https://github.com/apache/spark/pull/19451 Can you update the pr description with an example plan before / after this optimization, and also put that example in the comment section of the doc.
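[Editorial illustration] For anyone reproducing the requested before/after plans, one quick way to print them from a spark-shell is sketched below; the table name `t1` is illustrative.
```scala
// Print the plan containing the Except operator, then the plan after the
// optimizer runs (where the proposed ReplaceExceptWithFilter rule would fire).
val df = spark.sql(
  "SELECT a1, a2 FROM t1 WHERE a2 = 12 EXCEPT SELECT a1, a2 FROM t1 WHERE a1 = 5")
println(df.queryExecution.analyzed)
println(df.queryExecution.optimizedPlan)
```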
[GitHub] spark issue #19487: [SPARK-21549][CORE] Respect OutputFormats with no/invali...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19487 Merged build finished. Test PASSed.
[GitHub] spark issue #19487: [SPARK-21549][CORE] Respect OutputFormats with no/invali...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19487 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82709/ Test PASSed.
[GitHub] spark issue #19464: [SPARK-22233] [core] Allow user to filter out empty spli...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19464 **[Test build #82716 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82716/testReport)** for PR 19464 at commit [`25f98d0`](https://github.com/apache/spark/commit/25f98d0d89e4566339d9ba7701975af4e175c918).
[GitHub] spark issue #19487: [SPARK-21549][CORE] Respect OutputFormats with no/invali...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19487 **[Test build #82709 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82709/testReport)** for PR 19487 at commit [`b15ebe4`](https://github.com/apache/spark/commit/b15ebe4b6721f9533150aa5831986bea081843e2). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19480#discussion_r144459033

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala ---
@@ -201,6 +201,23 @@ class CodeGenerationSuite extends SparkFunSuite with ExpressionEvalHelper {
     }
   }

+  test("SPARK-22226: group split expressions into one method per nested class") {
--- End diff --

@mgaido91 I can reproduce the issue with the following test case. You can check it:
```scala
test("SPARK-22226: too many split expressions should not exceed the constant pool limit") {
  withSQLConf((SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, "false")) {
    val colNumber = 1000
    val baseDF = spark.range(10).toDF()
    val newCols = (1 to colNumber).map { colIndex =>
      expr(s"id + $colIndex").as(s"_$colIndex")
    }
    val input = baseDF.select(newCols: _*)
    val aggs2 = (1 to colNumber).flatMap { colIndex =>
      val colName = s"_$colIndex"
      Seq(expr(s"stddev($colName)"),
        expr(s"stddev_samp($colName)"),
        expr(s"stddev_pop($colName)"),
        expr(s"variance($colName)"),
        expr(s"var_samp($colName)"),
        expr(s"var_pop($colName)"),
        expr(s"skewness($colName)"),
        expr(s"kurtosis($colName)"))
    }
    input.agg(aggs2.head, aggs2.tail: _*).collect()
  }
}
```
```
[info] Cause: org.codehaus.janino.JaninoRuntimeException: failed to compile: org.codehaus.janino.JaninoRuntimeException: Constant pool for class org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection has grown past JVM limit of 0xFFFF
[info]   at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1079)
```
[GitHub] spark issue #19448: [SPARK-22217] [SQL] ParquetFileFormat to support arbitra...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19448 This is not eligible for backporting. We should not do it next time.
[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...
Github user smurching commented on the issue: https://github.com/apache/spark/pull/19433 Sorry, realized I conflated feature subsampling and `subsampleWeights` (instance weights for training examples). IMO feature subsampling can be added in a follow-up PR, but `subsampleWeights` should go in this PR.
[GitHub] spark pull request #19476: [SPARK-22062][CORE] Spill large block to disk in ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19476#discussion_r144458346

--- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -426,4 +426,11 @@ package object config {
     .toSequence
     .createOptional
+
+  private[spark] val MAX_REMOTE_BLOCK_SIZE_TO_MEM =
--- End diff --

maybe just `spark.maxRemoteBlockSizeFetchToMemory`?
[GitHub] spark pull request #19476: [SPARK-22062][CORE] Spill large block to disk in ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19476#discussion_r144458322

--- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -426,4 +426,11 @@ package object config {
     .toSequence
     .createOptional
+
+  private[spark] val MAX_REMOTE_BLOCK_SIZE_TO_MEM =
--- End diff --

with `spark.storage.maxRemoteBlockSizeFetchToMemory` it is not very clear that it works for shuffle too...
[GitHub] spark pull request #19222: [SPARK-10399][CORE][SQL] Introduce multiple Memor...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/19222#discussion_r144458324

--- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/memory/MemoryBlock.java ---
@@ -17,47 +17,168 @@
 package org.apache.spark.unsafe.memory;

-import javax.annotation.Nullable;
-
 import org.apache.spark.unsafe.Platform;

+import javax.annotation.Nullable;
+
 /**
- * A consecutive block of memory, starting at a {@link MemoryLocation} with a fixed size.
+ * A declaration of interfaces of MemoryBlock classes .
  */
-public class MemoryBlock extends MemoryLocation {
+public abstract class MemoryBlock {
+  @Nullable
+  protected final Object obj;

-  private final long length;
+  protected final long offset;
+
+  protected final long length;

   /**
    * Optional page number; used when this MemoryBlock represents a page allocated by a
-   * TaskMemoryManager. This field is public so that it can be modified by the TaskMemoryManager,
-   * which lives in a different package.
+   * TaskMemoryManager. This field can be updated using setPageNumber method so that
--- End diff --

@tejasapatil While we make it private, the accessors (`setPageNumber` and `getPageNumber`) have the `final` attribute, so the JIT compiler will inline these methods. As a result, I think this version with `final` accessors can achieve similar performance to the approach with `public` fields. What do you think? BTW, I will fix the typo soon.
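[Editorial illustration] A tiny sketch of the pattern under discussion, with hypothetical names: the field stays private, and because the accessors are trivial and final, HotSpot is expected to inline them down to a plain field access.
```scala
// Hypothetical sketch of the private-field-plus-final-accessor pattern.
// The trivial final methods are inlining candidates for the JIT, so after
// compilation the accessor call should cost the same as a public field read.
abstract class MemoryBlockLike {
  private var pageNumber: Int = -1
  final def getPageNumber: Int = pageNumber
  final def setPageNumber(n: Int): Unit = { pageNumber = n }
}
```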
[GitHub] spark issue #19483: [SPARK-21165][SQL] FileFormatWriter should handle mismat...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19483 that will be great, thanks @tejasapatil !
[GitHub] spark issue #19470: [SPARK-14387][SPARK-18355][SQL] Use Spark schema to read...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19470 LGTM except some minor comments
[GitHub] spark pull request #19470: [SPARK-14387][SPARK-18355][SQL] Use Spark schema ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19470#discussion_r144458108

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2050,4 +2050,80 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
     }
   }
 }
+
+  test("SPARK-18355 Use Spark schema to read ORC table instead of ORC file schema") {
+    val client = spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client
+
+    Seq("true", "false").foreach { value =>
+      withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> value) {
+        withTempDatabase { db =>
+          client.runSqlHive(
+            s"""
+               |CREATE TABLE $db.t(
+               |  click_id string,
+               |  search_id string,
+               |  uid bigint)
+               |PARTITIONED BY (
+               |  ts string,
+               |  hour string)
+               |STORED AS ORC
+             """.stripMargin)
+
+          client.runSqlHive(
+            s"""
+               |INSERT INTO TABLE $db.t
+               |PARTITION (ts = '98765', hour = '01')
+               |VALUES (12, 2, 12345)
+             """.stripMargin
+          )
+
+          checkAnswer(
+            sql(s"SELECT * FROM $db.t"),
+            Row("12", "2", 12345, "98765", "01"))
+
+          client.runSqlHive(s"ALTER TABLE $db.t ADD COLUMNS (dummy string)")
+
+          checkAnswer(
+            sql(s"SELECT click_id, search_id FROM $db.t"),
+            Row("12", "2"))
+
+          checkAnswer(
+            sql(s"SELECT search_id, click_id FROM $db.t"),
+            Row("2", "12"))
+
+          checkAnswer(
+            sql(s"SELECT search_id FROM $db.t"),
+            Row("2"))
+
+          checkAnswer(
+            sql(s"SELECT dummy, click_id FROM $db.t"),
+            Row(null, "12"))
+
+          checkAnswer(
+            sql(s"SELECT * FROM $db.t"),
+            Row("12", "2", 12345, null, "98765", "01"))
+        }
+      }
+    }
+  }
+
+  // This test case is added to prevent regression.
+  test("SPARK-22267 Spark SQL incorrectly reads ORC files when column order is different") {
--- End diff --

it's weird to have a test verifying a bug, I think it's good enough to have a JIRA tracking this bug.
[GitHub] spark pull request #19470: [SPARK-14387][SPARK-18355][SQL] Use Spark schema ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19470#discussion_r144457977

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2050,4 +2050,80 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
     }
   }
 }
+
+  test("SPARK-18355 Use Spark schema to read ORC table instead of ORC file schema") {
+    val client = spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client
+
+    Seq("true", "false").foreach { value =>
+      withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> value) {
+        withTempDatabase { db =>
+          client.runSqlHive(
+            s"""
+               |CREATE TABLE $db.t(
+               |  click_id string,
+               |  search_id string,
+               |  uid bigint)
+               |PARTITIONED BY (
+               |  ts string,
+               |  hour string)
+               |STORED AS ORC
+             """.stripMargin)
+
+          client.runSqlHive(
+            s"""
+               |INSERT INTO TABLE $db.t
+               |PARTITION (ts = '98765', hour = '01')
+               |VALUES (12, 2, 12345)
+             """.stripMargin
+          )
+
+          checkAnswer(
+            sql(s"SELECT * FROM $db.t"),
+            Row("12", "2", 12345, "98765", "01"))
+
+          client.runSqlHive(s"ALTER TABLE $db.t ADD COLUMNS (dummy string)")
+
+          checkAnswer(
+            sql(s"SELECT click_id, search_id FROM $db.t"),
+            Row("12", "2"))
+
+          checkAnswer(
+            sql(s"SELECT search_id, click_id FROM $db.t"),
+            Row("2", "12"))
+
+          checkAnswer(
+            sql(s"SELECT search_id FROM $db.t"),
+            Row("2"))
+
+          checkAnswer(
+            sql(s"SELECT dummy, click_id FROM $db.t"),
+            Row(null, "12"))
+
+          checkAnswer(
+            sql(s"SELECT * FROM $db.t"),
--- End diff --

ditto
[GitHub] spark pull request #19470: [SPARK-14387][SPARK-18355][SQL] Use Spark schema ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19470#discussion_r144457932

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2050,4 +2050,80 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
     }
   }
 }
+
+  test("SPARK-18355 Use Spark schema to read ORC table instead of ORC file schema") {
+    val client = spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client
+
+    Seq("true", "false").foreach { value =>
+      withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> value) {
+        withTempDatabase { db =>
+          client.runSqlHive(
+            s"""
+               |CREATE TABLE $db.t(
+               |  click_id string,
+               |  search_id string,
+               |  uid bigint)
+               |PARTITIONED BY (
+               |  ts string,
+               |  hour string)
+               |STORED AS ORC
+             """.stripMargin)
+
+          client.runSqlHive(
+            s"""
+               |INSERT INTO TABLE $db.t
+               |PARTITION (ts = '98765', hour = '01')
+               |VALUES (12, 2, 12345)
+             """.stripMargin
+          )
+
+          checkAnswer(
+            sql(s"SELECT * FROM $db.t"),
--- End diff --

please list all columns here instead of `*`, to make the test more clear
[GitHub] spark issue #19452: [SPARK-22136][SS] Evaluate one-sided conditions early in...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19452 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82710/ Test PASSed.
[GitHub] spark pull request #19470: [SPARK-14387][SPARK-18355][SQL] Use Spark schema ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19470#discussion_r144457796

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
@@ -272,25 +272,35 @@ private[orc] object OrcRelation extends HiveInspectors {
   def unwrapOrcStructs(
       conf: Configuration,
       dataSchema: StructType,
+      requiredSchema: StructType,
       maybeStructOI: Option[StructObjectInspector],
       iterator: Iterator[Writable]): Iterator[InternalRow] = {
     val deserializer = new OrcSerde
-    val mutableRow = new SpecificInternalRow(dataSchema.map(_.dataType))
-    val unsafeProjection = UnsafeProjection.create(dataSchema)
+    val mutableRow = new SpecificInternalRow(requiredSchema.map(_.dataType))
+    val unsafeProjection = UnsafeProjection.create(requiredSchema)

     def unwrap(oi: StructObjectInspector): Iterator[InternalRow] = {
-      val (fieldRefs, fieldOrdinals) = dataSchema.zipWithIndex.map {
-        case (field, ordinal) => oi.getStructFieldRef(field.name) -> ordinal
+      val (fieldRefs, fieldOrdinals) = requiredSchema.zipWithIndex.map {
+        case (field, ordinal) =>
+          var ref = oi.getStructFieldRef(field.name)
+          if (ref == null) {
+            val maybeIndex = dataSchema.getFieldIndex(field.name)
+            if (maybeIndex.isDefined) {
+              ref = oi.getStructFieldRef("_col" + maybeIndex.get)
+            }
+          }
+          ref -> ordinal
       }.unzip

-      val unwrappers = fieldRefs.map(unwrapperFor)
+      val unwrappers = fieldRefs.map(r => if (r == null) null else unwrapperFor(r))

      iterator.map { value =>
        val raw = deserializer.deserialize(value)
        var i = 0
        val length = fieldRefs.length
        while (i < length) {
-          val fieldValue = oi.getStructFieldData(raw, fieldRefs(i))
+          val fieldRef = fieldRefs(i)
+          val fieldValue = if (fieldRef == null) null else oi.getStructFieldData(raw, fieldRefs(i))
           if (fieldValue == null) {
--- End diff --

nit:
```
if (fieldRef == null) {
  row.setNull...
} else {
  val fieldValue = ...
  ...
}
```
[GitHub] spark issue #19452: [SPARK-22136][SS] Evaluate one-sided conditions early in...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19452 Merged build finished. Test PASSed.
[GitHub] spark issue #19452: [SPARK-22136][SS] Evaluate one-sided conditions early in...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19452 **[Test build #82710 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82710/testReport)** for PR 19452 at commit [`6753825`](https://github.com/apache/spark/commit/67538255c00e01a8b1553c82c2c83b5ae0a7ddde). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #19470: [SPARK-14387][SPARK-18355][SQL] Use Spark schema ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19470#discussion_r144457643

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
@@ -138,8 +138,7 @@ class OrcFileFormat extends FileFormat with DataSourceRegister with Serializable
       if (maybePhysicalSchema.isEmpty) {
         Iterator.empty
       } else {
-        val physicalSchema = maybePhysicalSchema.get
-        OrcRelation.setRequiredColumns(conf, physicalSchema, requiredSchema)
+        OrcRelation.setRequiredColumns(conf, dataSchema, requiredSchema)
--- End diff --

oh i see, we only need to pass the required column indices to the orc reader.
[GitHub] spark pull request #19470: [SPARK-14387][SPARK-18355][SQL] Use Spark schema ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19470#discussion_r144457479

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
@@ -272,25 +272,35 @@ private[orc] object OrcRelation extends HiveInspectors {
   def unwrapOrcStructs(
       conf: Configuration,
       dataSchema: StructType,
+      requiredSchema: StructType,
       maybeStructOI: Option[StructObjectInspector],
       iterator: Iterator[Writable]): Iterator[InternalRow] = {
     val deserializer = new OrcSerde
-    val mutableRow = new SpecificInternalRow(dataSchema.map(_.dataType))
-    val unsafeProjection = UnsafeProjection.create(dataSchema)
+    val mutableRow = new SpecificInternalRow(requiredSchema.map(_.dataType))
+    val unsafeProjection = UnsafeProjection.create(requiredSchema)

     def unwrap(oi: StructObjectInspector): Iterator[InternalRow] = {
-      val (fieldRefs, fieldOrdinals) = dataSchema.zipWithIndex.map {
-        case (field, ordinal) => oi.getStructFieldRef(field.name) -> ordinal
+      val (fieldRefs, fieldOrdinals) = requiredSchema.zipWithIndex.map {
+        case (field, ordinal) =>
+          var ref = oi.getStructFieldRef(field.name)
+          if (ref == null) {
+            val maybeIndex = dataSchema.getFieldIndex(field.name)
--- End diff --

the `requiredSchema` is guaranteed to be contained in the `dataSchema`.
[GitHub] spark pull request #19470: [SPARK-14387][SPARK-18355][SQL] Use Spark schema ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19470#discussion_r144457357

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
@@ -138,8 +138,7 @@ class OrcFileFormat extends FileFormat with DataSourceRegister with Serializable
       if (maybePhysicalSchema.isEmpty) {
         Iterator.empty
       } else {
-        val physicalSchema = maybePhysicalSchema.get
-        OrcRelation.setRequiredColumns(conf, physicalSchema, requiredSchema)
+        OrcRelation.setRequiredColumns(conf, dataSchema, requiredSchema)
--- End diff --

does it work? seems here we lie to the orc reader about the physical schema.
[GitHub] spark pull request #19470: [SPARK-14387][SPARK-18355][SQL] Use Spark schema ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19470#discussion_r144457235

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
@@ -138,8 +138,7 @@ class OrcFileFormat extends FileFormat with DataSourceRegister with Serializable
       if (maybePhysicalSchema.isEmpty) {
--- End diff --

nit
```
val isEmptyFile = OrcFileOperator.readSchema(Seq(file.filePath), Some(conf)).isEmpty
if (isEmptyFile) {
  ...
} else ...
```
[GitHub] spark pull request #19222: [SPARK-10399][CORE][SQL] Introduce multiple Memor...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/19222#discussion_r144457118

--- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/memory/MemoryBlock.java ---
@@ -17,47 +17,168 @@
 package org.apache.spark.unsafe.memory;

-import javax.annotation.Nullable;
-
 import org.apache.spark.unsafe.Platform;

+import javax.annotation.Nullable;
+
 /**
- * A consecutive block of memory, starting at a {@link MemoryLocation} with a fixed size.
+ * A declaration of interfaces of MemoryBlock classes .
  */
-public class MemoryBlock extends MemoryLocation {
+public abstract class MemoryBlock {
+  @Nullable
+  protected final Object obj;

-  private final long length;
+  protected final long offset;
+
+  protected final long length;

   /**
    * Optional page number; used when this MemoryBlock represents a page allocated by a
-   * TaskMemoryManager. This field is public so that it can be modified by the TaskMemoryManager,
-   * which lives in a different package.
+   * TaskMemoryManager. This field can be updated using setPageNumber method so that
--- End diff --

typo: `TaskMemoryManager`
[GitHub] spark pull request #19222: [SPARK-10399][CORE][SQL] Introduce multiple Memor...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/19222#discussion_r144457126

--- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/memory/MemoryBlock.java ---
@@ -17,47 +17,168 @@
 package org.apache.spark.unsafe.memory;

-import javax.annotation.Nullable;
-
 import org.apache.spark.unsafe.Platform;

+import javax.annotation.Nullable;
+
 /**
- * A consecutive block of memory, starting at a {@link MemoryLocation} with a fixed size.
+ * A declaration of interfaces of MemoryBlock classes .
  */
-public class MemoryBlock extends MemoryLocation {
+public abstract class MemoryBlock {
+  @Nullable
+  protected final Object obj;

-  private final long length;
+  protected final long offset;
+
+  protected final long length;

   /**
    * Optional page number; used when this MemoryBlock represents a page allocated by a
-   * TaskMemoryManager. This field is public so that it can be modified by the TaskMemoryManager,
-   * which lives in a different package.
+   * TaskMemoryManager. This field can be updated using setPageNumber method so that
--- End diff --

I think keeping it public would be more performant vs. mutating the field through a method. I know that using a method is the better way in terms of design, but there are places in the codebase where we do this solely for perf. reasons.
[GitHub] spark pull request #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19451#discussion_r144456518

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -1242,6 +1244,53 @@ object ReplaceIntersectWithSemiJoin extends Rule[LogicalPlan] {
 }

 /**
+ * If one or both of the datasets in the logical [[Except]] operator are purely transformed using
+ * [[Filter]], this rule will replace logical [[Except]] operator with a [[Filter]] operator by
+ * flipping the filter condition of the right child.
+ * {{{
+ *   SELECT a1, a2 FROM Tab1 WHERE a2 = 12 EXCEPT SELECT a1, a2 FROM Tab1 WHERE a1 = 5
+ *   ==>  SELECT DISTINCT a1, a2 FROM Tab1 WHERE a2 = 12 AND a1 <> 5
+ * }}}
+ *
+ * Note:
+ * 1. We should combine all the [[Filter]] of the right node before flipping it using NOT operator.
+ */
+object ReplaceExceptWithFilter extends Rule[LogicalPlan] {
+
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+    case Except(left, right) if isEligible(left, right) =>
+      val filterCondition = combineFilters(right).asInstanceOf[Filter].condition
+      Distinct(
+        Filter(Not(replaceAttributesIn(filterCondition, left)), left)
+      )
+  }
+
+  def isEligible(left: LogicalPlan, right: LogicalPlan): Boolean = (left, right) match {
+    case (left, right: Filter) => nonFilterChild(left).sameResult(nonFilterChild(right))
+    case _ => false
+  }
+
+  def nonFilterChild(plan: LogicalPlan): LogicalPlan = plan.find(!_.isInstanceOf[Filter]).get
+
+  def combineFilters(plan: LogicalPlan): LogicalPlan = {
+    @tailrec
+    def fixedPoint(plan: LogicalPlan, acc: LogicalPlan): LogicalPlan = {
+      if (acc.fastEquals(plan)) acc else fixedPoint(acc, CombineFilters(acc))
+    }
+
+    fixedPoint(plan, CombineFilters(plan))
+  }
+
+  def replaceAttributesIn(condition: Expression, node: LogicalPlan): Expression = {
--- End diff --

We do not need this function. Could we inline this logic in the rule?
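[Editor's aside: one plausible way to inline the attribute rewrite into the rule body, sketched under the assumption that `replaceAttributesIn` (whose body is not shown in the hunk) rewrites the right child's attribute references to the left child's output position by position. Hypothetical, not the committed code.]

```scala
case Except(left, right) if isEligible(left, right) =>
  val filterCondition = combineFilters(right).asInstanceOf[Filter].condition
  // Pair the two outputs positionally and rewrite attribute references
  // from the right plan's output to the left plan's output.
  val rewrites = AttributeMap(right.output.zip(left.output))
  val rewrittenCondition = filterCondition.transform {
    case a: Attribute => rewrites.getOrElse(a, a)
  }
  Distinct(Filter(Not(rewrittenCondition), left))
```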
[GitHub] spark issue #19483: [SPARK-21165][SQL] FileFormatWriter should handle mismat...
Github user tejasapatil commented on the issue: https://github.com/apache/spark/pull/19483

> I'll refactor it later, to use `requiredChildOrdering` to do the sort.

The hive bucketing PR does that: https://github.com/apache/spark/pull/19001. I can isolate that piece and put out a PR.
[GitHub] spark pull request #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19451#discussion_r144456402

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -1242,6 +1244,53 @@ object ReplaceIntersectWithSemiJoin extends Rule[LogicalPlan] {
 }

 /**
+ * If one or both of the datasets in the logical [[Except]] operator are purely transformed using
+ * [[Filter]], this rule will replace logical [[Except]] operator with a [[Filter]] operator by
+ * flipping the filter condition of the right child.
+ * {{{
+ *   SELECT a1, a2 FROM Tab1 WHERE a2 = 12 EXCEPT SELECT a1, a2 FROM Tab1 WHERE a1 = 5
+ *   ==>  SELECT DISTINCT a1, a2 FROM Tab1 WHERE a2 = 12 AND a1 <> 5
+ * }}}
+ *
+ * Note:
+ * 1. We should combine all the [[Filter]] of the right node before flipping it using NOT operator.
+ */
+object ReplaceExceptWithFilter extends Rule[LogicalPlan] {
+
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+    case Except(left, right) if isEligible(left, right) =>
+      val filterCondition = combineFilters(right).asInstanceOf[Filter].condition
+      Distinct(
+        Filter(Not(replaceAttributesIn(filterCondition, left)), left)
+      )
+  }
+
+  def isEligible(left: LogicalPlan, right: LogicalPlan): Boolean = (left, right) match {
+    case (left, right: Filter) => nonFilterChild(left).sameResult(nonFilterChild(right))
+    case _ => false
+  }
+
+  def nonFilterChild(plan: LogicalPlan): LogicalPlan = plan.find(!_.isInstanceOf[Filter]).get
+
+  def combineFilters(plan: LogicalPlan): LogicalPlan = {
+    @tailrec
+    def fixedPoint(plan: LogicalPlan, acc: LogicalPlan): LogicalPlan = {
--- End diff --

The func name is confusing...
[GitHub] spark pull request #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19451#discussion_r144456109

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -1242,6 +1244,53 @@ object ReplaceIntersectWithSemiJoin extends Rule[LogicalPlan] {
 }

 /**
+ * If one or both of the datasets in the logical [[Except]] operator are purely transformed using
+ * [[Filter]], this rule will replace logical [[Except]] operator with a [[Filter]] operator by
+ * flipping the filter condition of the right child.
+ * {{{
+ *   SELECT a1, a2 FROM Tab1 WHERE a2 = 12 EXCEPT SELECT a1, a2 FROM Tab1 WHERE a1 = 5
+ *   ==>  SELECT DISTINCT a1, a2 FROM Tab1 WHERE a2 = 12 AND a1 <> 5
+ * }}}
+ *
+ * Note:
+ * 1. We should combine all the [[Filter]] of the right node before flipping it using NOT operator.
+ */
+object ReplaceExceptWithFilter extends Rule[LogicalPlan] {
+
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+    case Except(left, right) if isEligible(left, right) =>
+      val filterCondition = combineFilters(right).asInstanceOf[Filter].condition
+      Distinct(
+        Filter(Not(replaceAttributesIn(filterCondition, left)), left)
+      )
+  }
+
+  def isEligible(left: LogicalPlan, right: LogicalPlan): Boolean = (left, right) match {
+    case (left, right: Filter) => nonFilterChild(left).sameResult(nonFilterChild(right))
+    case _ => false
+  }
+
+  def nonFilterChild(plan: LogicalPlan): LogicalPlan = plan.find(!_.isInstanceOf[Filter]).get
--- End diff --

Although it should be impossible, please use
```Scala
plan.find(!_.isInstanceOf[Filter]).getOrElse {
  throw new AnalysisException("xyz")
}
```
[GitHub] spark pull request #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19451#discussion_r144455482

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -1242,6 +1244,53 @@ object ReplaceIntersectWithSemiJoin extends Rule[LogicalPlan] {
 }

 /**
+ * If one or both of the datasets in the logical [[Except]] operator are purely transformed using
+ * [[Filter]], this rule will replace logical [[Except]] operator with a [[Filter]] operator by
+ * flipping the filter condition of the right child.
+ * {{{
+ *   SELECT a1, a2 FROM Tab1 WHERE a2 = 12 EXCEPT SELECT a1, a2 FROM Tab1 WHERE a1 = 5
+ *   ==>  SELECT DISTINCT a1, a2 FROM Tab1 WHERE a2 = 12 AND a1 <> 5
+ * }}}
+ *
+ * Note:
+ * 1. We should combine all the [[Filter]] of the right node before flipping it using NOT operator.
+ */
+object ReplaceExceptWithFilter extends Rule[LogicalPlan] {
+
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+    case Except(left, right) if isEligible(left, right) =>
+      val filterCondition = combineFilters(right).asInstanceOf[Filter].condition
+      Distinct(
+        Filter(Not(replaceAttributesIn(filterCondition, left)), left)
+      )
+  }
+
+  def isEligible(left: LogicalPlan, right: LogicalPlan): Boolean = (left, right) match {
--- End diff --

Please add `private` to all the internal functions?
[GitHub] spark pull request #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19451#discussion_r144455603

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -1242,6 +1244,53 @@ object ReplaceIntersectWithSemiJoin extends Rule[LogicalPlan] {
 }

 /**
+ * If one or both of the datasets in the logical [[Except]] operator are purely transformed using
+ * [[Filter]], this rule will replace logical [[Except]] operator with a [[Filter]] operator by
+ * flipping the filter condition of the right child.
+ * {{{
+ *   SELECT a1, a2 FROM Tab1 WHERE a2 = 12 EXCEPT SELECT a1, a2 FROM Tab1 WHERE a1 = 5
+ *   ==>  SELECT DISTINCT a1, a2 FROM Tab1 WHERE a2 = 12 AND a1 <> 5
+ * }}}
+ *
+ * Note:
+ * 1. We should combine all the [[Filter]] of the right node before flipping it using NOT operator.
+ */
+object ReplaceExceptWithFilter extends Rule[LogicalPlan] {
+
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+    case Except(left, right) if isEligible(left, right) =>
+      val filterCondition = combineFilters(right).asInstanceOf[Filter].condition
+      Distinct(
+        Filter(Not(replaceAttributesIn(filterCondition, left)), left)
+      )
+  }
+
+  def isEligible(left: LogicalPlan, right: LogicalPlan): Boolean = (left, right) match {
+    case (left, right: Filter) => nonFilterChild(left).sameResult(nonFilterChild(right))
+    case _ => false
+  }
+
+  def nonFilterChild(plan: LogicalPlan): LogicalPlan = plan.find(!_.isInstanceOf[Filter]).get
--- End diff --

Nit: no need to add the return type `: LogicalPlan` for the private functions.
[GitHub] spark pull request #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19458#discussion_r144456908

--- Diff: core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala ---
@@ -100,7 +100,16 @@ private[spark] class DiskBlockManager(conf: SparkConf, deleteFilesOnStop: Boolea
   /** List all the blocks currently stored on disk by the disk manager. */
   def getAllBlocks(): Seq[BlockId] = {
-    getAllFiles().map(f => BlockId(f.getName))
+    getAllFiles().flatMap { f =>
+      val blockId = BlockId.guess(f.getName)
--- End diff --

+1 for using try-catch here.
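[Editor's aside: a self-contained toy of the shape being endorsed here, with stand-in names rather than the `DiskBlockManager` code: parse each file name inside a try and skip anything unrecognized, instead of failing the whole listing.]

```scala
import scala.util.Try

// Stand-in parser: accepts names that look like block ids, throws otherwise.
def parseBlockId(name: String): String = {
  require(name.startsWith("rdd_") || name.startsWith("shuffle_"), s"unrecognized: $name")
  name
}

val files = Seq("rdd_0_1", "temp_local_0a1b", "shuffle_3_0_0")

// try-catch style (via Try): the temp file is skipped rather than fatal.
val blocks = files.flatMap(f => Try(parseBlockId(f)).toOption)
assert(blocks == Seq("rdd_0_1", "shuffle_3_0_0"))
```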
[GitHub] spark pull request #19476: [SPARK-22062][CORE] Spill large block to disk in ...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19476#discussion_r144456817

--- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -426,4 +426,11 @@ package object config {
       .toSequence
       .createOptional

+  private[spark] val MAX_REMOTE_BLOCK_SIZE_TO_MEM =
--- End diff --

I would prefer to use `spark.storage.maxRemoteBlockSizeFetchToMemory`, since the driver-side block manager will also leverage this feature.
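[Editor's aside: for reference, a sketch of how the renamed entry might read under this proposal. Both the name and the default are hypothetical here, since they were still under discussion, and the `ConfigBuilder` DSL is assumed to be the one used elsewhere in `config/package.scala`.]

```scala
private[spark] val MAX_REMOTE_BLOCK_SIZE_FETCH_TO_MEM =
  ConfigBuilder("spark.storage.maxRemoteBlockSizeFetchToMemory")
    .doc("Remote blocks above this size are spilled to disk while being " +
      "fetched, instead of being buffered in memory.")
    .bytesConf(ByteUnit.BYTE)
    .createWithDefault(Long.MaxValue)
```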
[GitHub] spark issue #19263: [SPARK-22050][CORE] Allow BlockUpdated events to be opti...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19263

**[Test build #82715 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82715/testReport)** for PR 19263 at commit [`21b262a`](https://github.com/apache/spark/commit/21b262a4cf945b22ce1ace84acded29852bc73b5).
[GitHub] spark pull request #19476: [SPARK-22062][CORE] Spill large block to disk in ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19476#discussion_r144456522

--- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -426,4 +426,11 @@ package object config {
       .toSequence
       .createOptional

+  private[spark] val MAX_REMOTE_BLOCK_SIZE_TO_MEM =
--- End diff --

how about `MAX_REMOTE_BLOCK_SIZE_FETCH_TO_MEM` and `spark.executor.maxRemoteBlockSizeFetchToMemory`?
[GitHub] spark issue #19483: [SPARK-21165][SQL] FileFormatWriter should handle mismat...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19483

I'll refactor it later, to use `requiredChildOrdering` to do the sort. I just wanna make this bug fix as simple as possible.
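[Editor's aside: context on the mechanism referred to above. A physical plan node can declare the ordering it needs from its children, letting the planner insert the sort instead of the node doing it by hand; the node and column names in this sketch are hypothetical.]

```scala
// In a hypothetical SparkPlan node that needs sorted input: declaring the
// requirement lets EnsureRequirements place a SortExec below this node,
// instead of the node sorting its input manually.
override def requiredChildOrdering: Seq[Seq[SortOrder]] =
  Seq(bucketColumns.map(col => SortOrder(col, Ascending)))
```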